Talk

How to generalize machine learning models to both canonical and non-canonical peptides

Abstract

Bioactive peptides are an important class of natural products with great functional diversity. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for canonical (non-modified) peptides is more abundant than for non-canonical (chemically modified) peptides. We explored whether current methods are sufficient to generalize from canonical data to non-canonical datasets. To do this, we first considered two critical aspects of the modeling problem: the choice of similarity function for guiding dataset partitioning and the choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are dissimilar from those used to fit the model. We demonstrate, across four peptide function prediction tasks, that chemical fingerprint-based similarity measures outperform traditional sequence alignment-based metrics for partitioning canonical peptide datasets, challenging standard practices. We also find that the deep-learned embeddings from Chemical Language Models (CLMs) generally outperform chemical fingerprints and other peptide-specific pre-trained models, performing best for non-canonical peptides and second best for canonical peptides. Despite this, models trained on only one of the two peptide classes fail to properly extrapolate to the other. However, by enriching the canonical datasets with a small proportion of non-canonical peptides, we are able to build robust joint models that generalize adequately to both canonical and non-canonical data. These insights are implemented in the AutoPeptideML open-source project and webserver for automatically constructing predictive models of peptide biophysical properties. All code and data necessary for reproducing the experiments are available on Zenodo and GitHub.
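
The idea of similarity-based dataset partitioning can be illustrated with a minimal Python sketch. This is not the AutoPeptideML implementation: the function names (`kmer_set`, `tanimoto`, `similarity_split`), the threshold, and the toy "fingerprint" (the set of 2-character substrings of a sequence) are all assumptions made for illustration. In practice the feature sets would come from a chemical fingerprint computed on the molecular structure. The sketch links every pair of items whose Tanimoto similarity reaches the threshold, groups linked items into clusters, and assigns whole clusters to either train or test, so that no test item is similar to any train item.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Toy stand-in for a fingerprint: the set of k-length substrings."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_split(items, threshold=0.5, test_fraction=0.2, k=2):
    """Split items so that no test item reaches `threshold` similarity
    to any train item (hypothetical helper, for illustration only)."""
    feats = [kmer_set(s, k) for s in items]
    parent = list(range(len(items)))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of items that is at least `threshold` similar.
    for i, j in combinations(range(len(items)), 2):
        if tanimoto(feats[i], feats[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect the resulting clusters (connected components).
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)

    # Assign whole clusters, smallest first, to test until the target is met.
    target = round(test_fraction * len(items))
    train, test = [], []
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return [items[i] for i in train], [items[i] for i in test]


peptides = ["ACDEFGHIK", "ACDEFGHIR", "LMNPQRSTV",
            "LMNPQRSTW", "WYWYWYWY", "GGGGSGGGG"]
train, test = similarity_split(peptides, threshold=0.5, test_fraction=0.34)
```

Because clusters are never divided, the test set may end up smaller than the requested fraction when the data contains a few large clusters. The all-pairs comparison is quadratic in the number of items, which is acceptable for a sketch but would need a more scalable strategy on large datasets.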