Publication
Bioinformatics
Paper

AutoPeptideML: A study on how to build more trustworthy peptide bioactivity predictors

Abstract

Motivation Automated machine learning (AutoML) solutions can bridge the gap between new computational advances and their real-world applications by enabling experimental scientists to build their own custom models. We examine different steps in the development life-cycle of peptide bioactivity binary predictors and identify key steps where automation can result not only in a more accessible method, but also in more robust and interpretable evaluation, leading to more trustworthy models.
Results We present a new method for drawing negative peptide subsets for model training that achieves a better balance between specificity and generalisation than current alternatives. We study the effect of introducing a homology-based partitioning algorithm for generating the training and testing subsets by comparing it with previous studies that applied no correction for homology between training and testing sequences. Model performance is overestimated when no homology correction is applied, which indicates that prior studies have tended to overestimate how well their models perform on new peptide sequences. We also conduct a systematic analysis of different protein language models as peptide representation methods and find that they serve as better descriptors than a naive alternative, but that there is no significant difference across models of different sizes or algorithms. Finally, we demonstrate that an ensemble of optimised traditional machine learning algorithms can compete with more complex neural network models while being more computationally efficient. We integrate these findings into AutoPeptideML, an easy-to-use AutoML tool that allows researchers without strong computational expertise to develop new predictive models for peptide bioactivity in a matter of minutes.
Availability Source code, documentation, and data are available at https://github.com/IBM/AutoPeptideML, and a dedicated webserver is available at http://peptide.ucd.ie/AutoPeptideML.
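
Illustration The abstract does not specify how the homology-based partitioning works, so the following is only a minimal sketch of the general idea, not the algorithm used in AutoPeptideML. It assumes that similar sequences are grouped by single-linkage clustering and that whole clusters are assigned to either the training or the testing subset. The similarity measure (Python's `difflib.SequenceMatcher`, a crude stand-in for alignment-based sequence identity), the 0.5 threshold, the function names, and the example peptides are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for alignment-based sequence identity (an assumption)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_by_similarity(seqs, threshold=0.5):
    """Single-linkage clustering: any pair at or above the threshold is merged."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def homology_aware_split(seqs, test_fraction=0.2, threshold=0.5):
    """Move whole clusters into the test set, smallest first, up to the target size."""
    clusters = sorted(cluster_by_similarity(seqs, threshold), key=len)
    target = max(1, round(test_fraction * len(seqs)))
    train, test = [], []
    for cluster in clusters:
        # A cluster is never divided between the two subsets.
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Hypothetical peptide sequences, for illustration only.
peptides = [
    "GLFDIVKKVV", "GLFDIVKKVL", "GLFDIIKKVV",
    "KWKLFKKIEK", "KWKLFKKIGK",
    "ACDEFGHIKL", "MNPQRSTVWY", "RRWWRRWWRR",
]
train_idx, test_idx = homology_aware_split(peptides)
print("train:", [peptides[i] for i in train_idx])
print("test:", [peptides[i] for i in test_idx])
```

Because clusters are never divided, no testing sequence reaches the similarity threshold with any training sequence, which is the property a random split does not guarantee. If even the smallest cluster exceeds the target size, this sketch returns an empty testing subset; a practical tool would need to handle that case.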