Low-data regime yield predictions with uncertainty estimation using deep learning approaches
Artificial intelligence is driving one of the most important revolutions in organic chemistry. Multiple platforms, including machine learning-based tools for reaction prediction and synthesis planning, have become part of organic chemists' daily laboratory work, assisting with domain-specific synthetic problems. Unlike reaction prediction and retrosynthesis, the prediction of reaction yields has received little attention, despite the enormous potential of accurately predicting reaction conversion rates. Reaction yield models, which describe the percentage of the reactants converted to the desired products, could help chemists navigate reaction space, optimize reactions, and accelerate the design of more effective routes. Here, we investigate high-throughput experimentation data sets and show how data augmentation on chemical reactions can improve the accuracy of yield predictions, even when only small training sets are available. Previous work used molecular fingerprints or physics-based or categorical descriptors of the precursors. In our work, we fine-tune natural language processing-inspired reaction transformer models on different augmented data sets to predict yields using only a text-based representation of chemical reactions. When the augmented training sets contain 2.5% or more of the data, our models outperform previous models, including those using physics-based descriptors as inputs. Moreover, we demonstrate the use of test-time augmentation to generate uncertainty estimates, which correlate with the prediction errors.
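The idea of test-time augmentation mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names `permute_precursors` and `predict_with_uncertainty` are hypothetical, the yield model is passed in as an arbitrary callable `predict_yield`, and the only augmentation shown is shuffling the order of the precursors in a reaction SMILES string. The sketch assumes a `precursors>>product` layout with precursors separated by `.`, so multi-fragment species such as salts would be split into separate entries.

```python
import random
import statistics
from typing import Callable, List, Tuple


def permute_precursors(rxn_smiles: str, n_variants: int, seed: int = 0) -> List[str]:
    """Return up to n_variants distinct reaction SMILES with shuffled precursor order.

    Assumption: the input has the form 'precursors>>product', with the
    precursors separated by '.'.
    """
    precursors, product = rxn_smiles.split(">>")
    molecules = precursors.split(".")
    rng = random.Random(seed)

    variants = [rxn_smiles]  # always keep the original ordering
    seen = {rxn_smiles}
    attempts = 0
    # Bound the number of attempts: few precursors means few unique orderings.
    while len(variants) < n_variants and attempts < 20 * n_variants:
        attempts += 1
        shuffled = molecules[:]
        rng.shuffle(shuffled)
        candidate = ".".join(shuffled) + ">>" + product
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


def predict_with_uncertainty(
    rxn_smiles: str,
    predict_yield: Callable[[str], float],
    n_variants: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Predict on several equivalent inputs; return (mean, standard deviation).

    predict_yield is a stand-in for any trained yield model (hypothetical here).
    The spread over augmented inputs serves as the uncertainty estimate.
    """
    variants = permute_precursors(rxn_smiles, n_variants, seed)
    predictions = [predict_yield(v) for v in variants]
    return statistics.fmean(predictions), statistics.pstdev(predictions)
```

A call such as `predict_with_uncertainty(rxn, model_fn)` returns the averaged prediction together with its spread. A model that is insensitive to precursor order yields a spread of zero, while a large spread flags reactions for which the prediction depends strongly on how the input happens to be written.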