Conference paper

A new framework for evaluating model out-of-distribution generalisation for the biochemical domain

Abstract

Over the last decade, machine learning models have markedly accelerated scientific discovery. These models are frequently used to predict the properties of entities (drug candidates, materials, cells, etc.) that differ inherently from those in their training distribution. This deployment scenario is known as out-of-distribution (OOD), and it is particularly frequent within the biochemical domain, which encompasses both biological and chemical modelling.

In response to this gap, we first present a framework to study and quantify model generalisation to OOD data for biochemistry. Unlike existing algorithms, our novel dataset partitioning method is broadly applicable across various biochemistry contexts, including proteins and small molecules. Our approach is agnostic to the underlying data types; instead, it relies on defined similarity metrics between any two data instances.
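To illustrate what a data-type-agnostic, similarity-driven partition can look like, the following is a minimal sketch and not the partitioning method proposed in this paper. It assumes a generic stand-in technique: instances whose pairwise similarity meets a threshold are linked, the resulting connected components are treated as clusters, and whole clusters are assigned to either the training or the test set. The function name `similarity_split`, the `threshold` and `test_fraction` parameters, and the toy Jaccard similarity are all hypothetical choices made for this example.

```python
from itertools import combinations


def similarity_split(items, similarity, threshold, test_fraction=0.2):
    """Split item indices so that no test item reaches `threshold`
    similarity with any training item (illustrative stand-in only)."""
    n = len(items)
    parent = list(range(n))  # union-find forest over item indices

    def find(i):
        # Follow parent links to the cluster root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of items that is at least `threshold` similar.
    for i, j in combinations(range(n), 2):
        if similarity(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect the connected components (clusters).
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Greedily move the smallest clusters into the test set while the
    # target test size is not exceeded; everything else is training data.
    train, test = [], []
    target = test_fraction * n
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


def jaccard(a, b):
    # Toy similarity (an assumption): Jaccard index over character sets.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


sequences = ["AAAA", "AAAB", "CCCC", "CCCD", "EEEE"]
train_idx, test_idx = similarity_split(
    sequences, jaccard, threshold=0.5, test_fraction=0.4
)
```

Because only the `similarity` callable touches the data, the same routine would apply unchanged to proteins, small molecules, or any other instance type for which a pairwise similarity is defined.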

When several similarity metrics are of interest, we present a set of criteria to identify the best one for defining out-of-distribution generalisation with minimal reliance on domain knowledge. Additionally, we propose a statistical metric to compare the generalisation capabilities of n models against one another.
