Combining high-quality, humanly curated data with language models: the dawn of on-demand machine learning models for digital chemistry
Abstract
With few exceptions, machine learning research on computer-assisted synthesis and forward reaction prediction has relied on datasets of chemical reaction records derived from USPTO patents. Numerous works demonstrate that the various USPTO datasets are characterized by the absence of relevant areas of chemistry, systematic inconsistencies, and chemically semantic errors that adversely affect the end-user experience when using models trained on these data. For example, many textbook reactions and reaction classes, such as pericyclic reactions or thermal/photochemical rearrangements, are rarely reported in patent documents. Human-curated datasets, on the other hand, are an untapped resource for customizing models to specific reaction classes, with the potential to facilitate the transition to an era of on-demand machine learning models for specific chemical reaction problems. We assess the performance of USPTO-baseline language models extended to high-quality, peer-curated datasets and report significant performance gains across all metrics for forward and single-step retrosynthesis prediction. At the same time, we identify chemical reaction domains in need of additional data curation efforts. We advocate for a cultural shift toward more dynamic practices for developing, maintaining, and disseminating datasets that, while respecting the intellectual property and privacy rights of data creators and data subjects, foster user trust and lead the data-driven chemical revolution.