Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug, materials, and molecular discovery
Abstract
Data-driven techniques for establishing quantitative structure-property relationships are a pillar of modern materials and molecular discovery. Fuelled by recent progress in deep learning methodology and the abundance of new algorithms, it is tempting to chase benchmarks and incrementally build ever more capable machine learning (ML) models. While model evaluation has made significant progress, the intrinsic limitations arising from the underlying experimental data are often overlooked. In the chemical sciences, where data collection is costly, datasets are small and experimental errors can be significant. These limitations affect the predictive power of such datasets, a fact that is rarely considered quantitatively. In this study, we analyse commonly used ML datasets for regression and classification from drug discovery (aggregated in the Therapeutics Data Commons) and materials discovery. We aim to establish realistic maximum performance bounds for these datasets by introducing noise based on estimated or actual experimental errors. We then compare the estimated performance bounds, expressed in commonly used evaluation metrics, to the reported performance of existing ML models in the literature. This comparison will help us understand whether current models are reaching dataset limitations, whether there is still room for model improvement, or whether models are fitting to noise. Additionally, we systematically examine how the range of data, the magnitude of experimental error, and the number of data points influence these maximum performance bounds. Alongside this paper, we plan to release a Python package and a web-based application for computing realistic performance bounds. This study and the resulting tools will help practitioners in the field understand the limitations of datasets and set realistic expectations for ML model performance. This work stands as a reference point, offering analysis and tools to guide the development of future ML models in the chemical sciences.
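To make the idea of a noise-derived performance bound concrete, the following is a minimal sketch for the regression case. It is not the package announced above. The function name `estimate_performance_bound`, its parameters, and the choice of additive Gaussian noise with a single standard deviation `sigma` are assumptions made here for illustration. The sketch treats the dataset labels as a stand-in for the true values, repeatedly draws noisy copies of them, and scores a hypothetical perfect predictor (one that returns the noise-free values) against each noisy copy.

```python
import numpy as np


def estimate_performance_bound(y, sigma, n_trials=1000, seed=0):
    """Hypothetical helper: Monte Carlo estimate of the best achievable
    R^2 and RMSE when labels carry Gaussian noise of std `sigma`.

    Assumption: `y` stands in for the noise-free values, and the
    experimental error is additive, Gaussian, and the same for every point.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    r2_values = np.empty(n_trials)
    rmse_values = np.empty(n_trials)

    for i in range(n_trials):
        # Simulated "measured" labels: true values plus experimental error.
        y_noisy = y + rng.normal(0.0, sigma, size=y.shape)
        # A perfect model predicts y; score it against the noisy labels.
        ss_res = np.sum((y_noisy - y) ** 2)
        ss_tot = np.sum((y_noisy - y_noisy.mean()) ** 2)
        r2_values[i] = 1.0 - ss_res / ss_tot
        rmse_values[i] = np.sqrt(ss_res / y.size)

    return {
        "r2_mean": float(r2_values.mean()),
        "r2_std": float(r2_values.std()),
        "rmse_mean": float(rmse_values.mean()),
    }


# Illustrative synthetic labels (not taken from any dataset in the study).
labels = np.linspace(0.0, 10.0, 200)
bound_small_error = estimate_performance_bound(labels, sigma=0.5)
bound_large_error = estimate_performance_bound(labels, sigma=2.0)
print(bound_small_error)
print(bound_large_error)
```

Under these assumptions, the bound tightens as the experimental error grows relative to the range of the data, which is the dependence on data range and error magnitude that the study examines. A reported model score above such a bound would suggest that the model is fitting noise.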