ISMB 2024

Effect of dataset partitioning strategies for evaluating out-of-distribution generalisation for predictive models in biochemistry


We have developed Hestia, a computational tool that provides a unified framework for introducing similarity correction techniques across different biochemical data types. We propose a new strategy for dividing a dataset into training and evaluation subsets (CCPart) and have compared it against other methods at different thresholds to explore the impact that these choices have on model generalisation evaluation, through the lens of overfitting diagnosis. We have trained molecular language models for protein sequences, DNA sequences, and small molecule string representations (SMILES) on the alternative splitting strategies for training and evaluation subsets. The effect of partitioning strategy and threshold depend both on the specific prediction task and the biochemical data type, for tasks for which homology is important, like enzymatic activity classification, being more sensitive to partitioning strategy than others, like subcellular localization. Overall, the best threshold for small molecules seems to lay between 0.4 and 0.5 in Tanimoto distance, for DNA between 0.4 and 0.5, and for proteins between 0.3 and 0.5, depending on the specific task. Similarity correction algorithms showed significantly better ability to diagnose overfitting in 11 out of 15 datasets with CCPart being more clearly dependent on the threshold than the alternative GraphPart, which showed more instability. The source code is freely available at The tool is also made available through a dedicated web-server at