Evaluation of partitioning algorithms for trustworthy out-of-distribution evaluation of machine learning models in biochemistry

Raúl Fernández Díaz; Lam Thanh Hoang; Vanessa Lopez; Denis Shields

VIBE 2025

Talk

08 Dec 2025

Abstract

Machine learning models in scientific discovery are expected to make predictions in new, unseen scenarios, i.e., out-of-distribution. Model evaluation is usually performed by dividing a dataset into two mutually exclusive subsets: training and testing. Model parameters are fitted to the training subset, and the model is then evaluated against the testing subset. The process of creating these subsets is called partitioning. Traditionally, the machine learning literature has relied on random partitioning. The problem with this approach is that random sampling draws the testing subset from the same distribution as the training subset, so it implicitly assumes that the prediction scenario will be in-distribution.
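The random partitioning described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `random_partition` and its parameters are hypothetical and do not come from the work itself.

```python
import random


def random_partition(items, test_fraction=0.2, seed=0):
    """Split items into mutually exclusive training and testing subsets.

    Hypothetical helper for illustration: every item has the same chance
    of landing in the testing subset, regardless of how similar it is to
    the training items.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # First n_test shuffled items form the testing subset, the rest training.
    return shuffled[n_test:], shuffled[:n_test]


train, test = random_partition(range(10), test_fraction=0.2)
```

Because the two subsets are drawn uniformly from the same pool, close analogues of a test item can easily end up in the training subset, which is the in-distribution assumption at issue.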

Recently, we introduced the concept of similarity partitioning as a method for correcting this assumption. Similarity partitioning algorithms ensure that the testing subset contains molecules that are dissimilar to those the model has been exposed to during training, and thus better simulate the real-world out-of-distribution scenario. However, it is not clear which algorithms are best suited for generating these testing subsets. We have therefore conducted a systematic benchmark of partitioning algorithms previously described in the literature and examined which ones generate the most challenging testing subsets. We also propose a new algorithm called CCPart.
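To make the idea of similarity partitioning concrete, the sketch below groups entities whose pairwise similarity reaches a threshold and then assigns whole groups to the testing subset. It is a generic illustration under stated assumptions, not the Butina, UMAP, or CCPart procedures evaluated in this work: the helper names (`jaccard`, `similarity_partition`), the toy character-set similarity, the threshold, and the example sequences are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Toy similarity (assumption): Jaccard index over the sets of characters."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def similarity_partition(items, similarity, threshold=0.5, test_fraction=0.3):
    """Assign whole similarity clusters to either training or testing.

    Clusters are the connected components of the graph that links any two
    items with similarity >= threshold, so no testing item reaches the
    threshold with any training item.
    """
    n = len(items)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Fill the testing subset with the smallest clusters that still fit.
    target = test_fraction * n
    test_idx = []
    for members in sorted(clusters.values(), key=len):
        if len(test_idx) + len(members) <= target:
            test_idx.extend(members)

    chosen = set(test_idx)
    train = [items[i] for i in range(n) if i not in chosen]
    test = [items[i] for i in test_idx]
    return train, test


# Hypothetical toy sequences, for illustration only.
items = ["AAAA", "AAAB", "AABB", "AAAC", "CCCC",
         "CCCD", "EEEE", "EEEF", "GGGG", "GGHH"]
train, test = similarity_partition(items, jaccard, threshold=0.5)
```

In this sketch the testing subset may come out smaller than requested, or even empty, when no cluster fits within the target size; a practical implementation would need a policy for that case, and the choice of similarity metric and threshold would depend on the kind of entity being partitioned.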

Our results show that the three best similarity partitioning algorithms are Butina, CCPart, and UMAP. UMAP is limited to small drug-like organic molecules, whereas Butina and CCPart can be applied to any type of entity (biosequences, 3D structures, small molecules, etc.). The results also show that the choice of partitioning algorithm is dataset-dependent and that a prior analysis of both algorithms and similarity metrics needs to be performed. These results open the way for a more trustworthy evaluation of machine learning models in the biochemical domain, one that better estimates their real-world performance.

Workshop paper