What is it about?

Biologists are intimately familiar with DNA, RNA, and proteins, three types of biological sequences that make life as we know it possible. Less familiar, though at least as important, are glycans, or complex carbohydrates. These chains of various sugars (technically called monosaccharides) can either occur by themselves, for instance constituting the capsules of bacterial, fungal, and plant cells, or adorn all kinds of other biomolecules such as proteins, lipids, or RNA. The specific glycan sequence physically attached to a protein fundamentally alters its properties and capabilities, fine-tuning stability, structure, and function. This results in a melange of incredibly complex interactions, in turn producing the exceedingly complex phenomenon we know as life.

And incredibly complex it is indeed: glycans boast an alphabet of hundreds of monosaccharides, compared to the rather paltry 20 amino acids for proteins and four nucleotides for DNA. Additionally, glycans are not only the sole nonlinear biological sequence, resulting in molecules with multiple branches, but also the sole non-templated one, being created by an interplay of dozens of specialized enzymes that depends intimately on the current state of the cell. All this makes glycans the most diverse biological sequence and also the most dynamic one, able to adjust on the fly without genetic mutations. On top of all this, glycans have been implicated in essentially all human diseases, from inflammatory disorders to cancer. Yet while there is a research area titled "glycobiology," the prominence of glycans in the life sciences is still modest compared to their relevance. And while deep learning has recently revolutionized the analysis of other complex biological sequences such as proteins or RNA, glycans have so far eluded machine learning.
The reason for this lies mostly in their nonlinearity, which has prohibited the application of standard natural language processing tools developed for linear sequences. Yet we were confident that, as the most diverse and complex biopolymer, glycans would profit most from state-of-the-art deep learning techniques that could finally facilitate a mapping from sequence to function and allow for a comprehensive approach to glycobiology. This is why, in our recent work published in the journal Cell Host & Microbe, we developed the first language model for glycans, which we dubbed SweetTalk. Before that, of course, we had to gather data: we collected as many glycan sequences as we could find, all 19,299 of them, in our own database, SugarBase. SweetTalk itself consists of a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) units. In essence, we trained SweetTalk to learn sequence dependencies in glycans by predicting the next monosaccharide or bond (which we collectively defined as glycoletters) given the preceding glycoletters. This equipped us with both a trained glycan language model and learned representations for each glycoletter that we could use for visualization purposes. To handle the nonlinear nature of glycan sequences, we processed them by extracting "glycowords": sets of five glycoletters that gave SweetTalk the opportunity to learn correct context relations between glycoletters.
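
To make the glycoword idea concrete, here is a minimal sketch of the extraction step. The function names and the tokenizer are illustrative: branch brackets are simply dropped, whereas the published pipeline resolves branch topology explicitly when forming glycowords.

```python
import re

def glycoletters(glycan: str) -> list[str]:
    """Split a bracket-notation glycan string into glycoletters
    (monosaccharides and bonds). Simplified: branch brackets are
    dropped rather than resolved into separate branches."""
    return [token for token in re.split(r"[()\[\]]", glycan) if token]

def glycowords(glycan: str, size: int = 5) -> list[tuple[str, ...]]:
    """Slide a window of `size` glycoletters over the sequence,
    yielding glycowords (monosaccharide-bond-...-monosaccharide runs)."""
    letters = glycoletters(glycan)
    return [tuple(letters[i:i + size]) for i in range(len(letters) - size + 1)]

# Branched N-glycan core in bracket notation
example = "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc"
print(glycowords(example))
```

Each five-glycoletter window then serves as one training context for the language model.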


Why is it important?

As the whole lining of our gut is covered in glycans (mucins, as in mucus), and every cellular surface is covered in glycans as well (including those of intestinal bacteria), we reasoned that interactions between our microbiome (or pathogens) and ourselves might offer a promising area of application for our methods. This is especially relevant as research has shown, for instance, that some bacteria mimic our glycans to fool our immune system. We thus wanted to see if we could predict whether specific glycans would be recognized by the human immune system. For this, we fine-tuned our pre-trained SweetTalk model on a smaller dataset of glycans labeled with their immunogenicity and achieved an accuracy of well over 90%, outperforming other machine learning methods such as random forests. This analysis supported the notion of molecular mimicry, with bacterial glycans that are similar to human glycans receiving a lower immunogenicity score. We also managed to train classifiers with an accuracy of close to 90% for predicting the pathogenicity of strains of the common bacterium Escherichia coli purely based on their glycans, clearly supporting our assumption that glycans are informative for properties such as pathogenicity. Importantly, we could even identify the sequence motifs that seem to be most predictive of pathogenicity, again strongly pointing to molecular mimicry of glycans. Another fascinating observation we made is that, despite working with binary labels of pathogenicity, our model predicted a continuum of pathogenicity, reflecting the dependence of pathogenicity on environmental circumstances and supporting earlier reports of this phenomenon. Lastly, we addressed the pressing issue of the lack of data relative to the high dimensionality of glycan sequences.
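
The transfer-learning recipe used for the immunogenicity task (keep the pretrained encoder, train a small classification head on the labeled set) can be sketched as follows. Everything here is a toy stand-in: the real encoder is the pretrained bidirectional LSTM, and the labels below are illustrative, not actual immunogenicity data.

```python
import math
import random

random.seed(0)

# Stand-in for the pretrained SweetTalk encoder: frozen, fixed
# per-glycoletter embeddings, averaged into one vector per glycan.
_embed: dict[str, list[float]] = {}

def encode(letters: list[str], dim: int = 8) -> list[float]:
    """Map a glycan (list of glycoletters) to a fixed-size vector
    by averaging frozen per-letter embeddings (hypothetical stand-in)."""
    for g in letters:
        if g not in _embed:
            _embed[g] = [random.gauss(0, 1) for _ in range(dim)]
    vecs = [_embed[g] for g in letters]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def fine_tune(data, dim: int = 8, lr: float = 0.5, epochs: int = 300):
    """Train a logistic-regression head on top of the frozen encoder;
    this is the transfer-learning step, with encoder weights fixed."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for letters, label in data:
            x = encode(letters, dim)
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy labeled set (0 = non-immunogenic, 1 = immunogenic; illustrative only)
train = [(["Man", "a1-3", "Man"], 0), (["Rha", "b1-4", "Rha"], 1)]
w, b = fine_tune(train)
```

The point of the sketch is the division of labor: the (frozen) encoder supplies representations learned from all 19,299 unlabeled sequences, so the head only has to fit the small labeled set.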
While our transfer learning scheme of pre-training an unsupervised language model already improved model outcomes, we set out to devise a data augmentation scheme to further enhance the efficiency of glycan-focused machine learning. For this, we designed a multiclass benchmark problem, which we dubbed SweetOrigins: predicting the taxonomic group a given glycan sequence came from. This spans a range of difficulties, from predicting one of four taxonomic domains up to predicting one of hundreds to thousands of species. For our data augmentation approach, we capitalized on the notoriously ambiguous notation used for glycan sequences, in which brackets indicate a branch. This apparent shortcoming allowed us to write multiple bracket-notation strings for the same glycan molecule, effectively multiplying the number of sequences. This made our models more robust and indeed led to marked performance increases on our benchmark task, with gains of up to 6% in absolute accuracy on the hardest tasks.
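
The augmentation trick can be sketched with a single branch swap: a bracketed branch and the main-chain branch written just before it describe the same molecule in either order. This simplified version handles one non-nested branch per swap; the full scheme enumerates all equivalent notations.

```python
import re

def swap_branches(glycan: str) -> str:
    """Produce an equivalent bracket-notation string by exchanging a
    bracketed branch with the branch written immediately before it.
    Sketch only: assumes simple, non-nested branches."""
    # 'Man(a1-3)[Man(a1-6)]Man...' -> 'Man(a1-6)[Man(a1-3)]Man...'
    pattern = re.compile(r"([A-Za-z0-9]+\([^()]*\))\[([^\[\]]+)\]")
    return pattern.sub(lambda m: f"{m.group(2)}[{m.group(1)}]", glycan)

def augment(glycan: str) -> set[str]:
    """Return the set of equivalent notations reachable by one swap."""
    return {glycan, swap_branches(glycan)}

g = "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc"
print(sorted(augment(g)))
```

Both strings denote the same molecule, so a model trained on all variants learns that branch order in the notation is irrelevant, which is exactly the robustness the augmentation buys.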

Read the Original

This page is a summary of: Deep-Learning Resources for Studying Glycan-Mediated Host-Microbe Interactions, Cell Host & Microbe, October 2020, Elsevier,
DOI: 10.1016/j.chom.2020.10.004.