Effect of dataset partitioning strategies for evaluating out-of-distribution generalisation for predictive models in biochemistry

Raúl Fernández Díaz; Lam Thanh Hoang; Vanessa Lopez; Denis Shields

MoML 2024

Poster

19 Jun 2024

Abstract

Evaluating the generalisation capabilities of predictive models in biochemistry is a nuanced problem that requires simulating out-of-distribution evaluation sets that can be used to estimate the real-world performance of the models. Generating these out-of-distribution evaluation sets requires minimising the similarity between entities in the training dataset and the hold-out evaluation set. There is currently no unified framework for creating similarity-based partitions for different biochemical data types.

We have developed Hestia, a computational tool that allows for the creation of independent training/evaluation partitions based on pairwise similarity measurements between all entities. We also introduce a new strategy for dividing a dataset into training and evaluation subsets (CCPart) and compare it against other methods at different similarity thresholds. CCPart leads to more challenging evaluation sets at lower similarity thresholds and to more stable behaviour across thresholds.
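To illustrate what a similarity-based partition looks like in practice, the sketch below groups entities into connected components of the graph whose edges join pairs with similarity at or above a threshold, then assigns whole components to either the training or the evaluation subset. This is an assumption-laden illustration, not Hestia's API or the CCPart algorithm itself: the abstract does not specify how CCPart works, and the function names, the `test_fraction` parameter, and the greedy smallest-first assignment are all hypothetical.

```python
from collections import defaultdict


def connected_components(n, similarity, threshold):
    """Group n entities into components linked by similarity >= threshold.

    `similarity` is a dict mapping index pairs (i, j) to a similarity score.
    Hypothetical helper for illustration; not part of Hestia.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in similarity.items():
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def similarity_partition(n, similarity, threshold, test_fraction=0.2):
    """Assign whole components to train or test (assumed greedy heuristic).

    Smallest components are placed in the evaluation set first, until adding
    another one would exceed the requested evaluation fraction.
    """
    clusters = sorted(connected_components(n, similarity, threshold), key=len)
    target = test_fraction * n
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


# Toy example: six entities with a few pairwise similarities.
pairs = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.2, (4, 5): 0.95}
train_idx, test_idx = similarity_partition(6, pairs, threshold=0.5, test_fraction=0.5)
```

Because components are never split, no training entity has a similarity at or above the threshold with any evaluation entity, which is the property the abstract asks of an out-of-distribution evaluation set.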

Furthermore, using this tool we propose a novel metric for measuring model generalisation: the area between the similarity-performance curve and the in-distribution performance. The similarity-performance curve describes model performance on evaluation sets built with decreasing similarity thresholds (i.e., increasingly out-of-distribution), whereas the in-distribution performance is calculated with random partitioning. The area between the curve and the in-distribution line therefore measures the difference in model performance between in-distribution and out-of-distribution data. Values closer to 0 indicate that the model behaves similarly in both settings, with 0 corresponding to perfect generalisation; values closer to 1 indicate that the model performs much better on in-distribution data than on out-of-distribution data, and thus point to a problem with model generalisation.
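One way to compute such an area is sketched below using the trapezoidal rule. The abstract does not give the exact formula, so several details here are assumptions: the performance metric is taken to lie in [0, 1], the area is divided by the width of the threshold range so that the result is comparable across threshold grids, and the gap is kept signed (an out-of-distribution score above the in-distribution score reduces the area). The function name `generalisation_gap_area` is hypothetical.

```python
def generalisation_gap_area(thresholds, ood_scores, in_distribution_score):
    """Area between the in-distribution line and the similarity-performance curve.

    thresholds: similarity thresholds used to build each evaluation set.
    ood_scores: model performance at each threshold (assumed to lie in [0, 1]).
    in_distribution_score: performance under random partitioning.
    Normalising by the threshold span is an assumption of this sketch.
    """
    if len(thresholds) != len(ood_scores) or len(thresholds) < 2:
        raise ValueError("need at least two (threshold, score) pairs")

    points = sorted(zip(thresholds, ood_scores))
    span = points[-1][0] - points[0][0]
    if span <= 0:
        raise ValueError("thresholds must cover a non-zero range")

    area = 0.0
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        gap0 = in_distribution_score - s0
        gap1 = in_distribution_score - s1
        # Trapezoid between two consecutive thresholds.
        area += 0.5 * (gap0 + gap1) * (t1 - t0)
    return area / span


# Toy example with made-up scores: performance drops as the threshold decreases.
gap = generalisation_gap_area([0.1, 0.5, 0.9], [0.5, 0.7, 0.9], in_distribution_score=0.9)
```

Under these assumptions, a model whose performance at every threshold matches its in-distribution performance yields an area of 0, and larger values reflect a wider gap between the two settings.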

We have trained molecular language models for protein sequences, DNA sequences, and small molecule string representations (SMILES) using the alternative strategies for splitting data into training and evaluation subsets. The effect of partitioning strategy and threshold depends on both the specific prediction task and the biochemical data type: tasks for which homology is important, like enzymatic activity classification, are more sensitive to partitioning strategy than others, like subcellular localization. The source code is freely available at https://github.com/IBM/Hestia. The tool is also made available through a dedicated web-server at http://peptide.ucd.ie/Hestia.
