ACS Fall 2021

Inferring missing molecules in incomplete chemical equations


Deep-learning models applied to chemical reactions have received much attention in recent years, from algorithms for forward reaction prediction and retrosynthetic analysis that help chemists plan and execute chemical syntheses, to the generation of reaction fingerprints and the prediction of reaction classes [1], yields [2], activation energies [3], or sequences of experimental steps [4]. Several of these latter predictive models require all reagents to be specified, including solvents and catalysts. Unfortunately, neither algorithms nor chemists are guaranteed to produce complete chemical reaction equations. It is therefore desirable to infer the missing molecules, both to provide higher-quality data and to make the reactions usable by a larger class of machine-learning models. An algorithm fulfilling this task can also be used to curate reactions extracted from electronic notebooks or from the literature. Interestingly, the task of inferring missing compounds in a reaction equation generalizes forward prediction and single-step retrosynthetic prediction: in the former the missing compounds are the products, in the latter the precursors. As a consequence, a properly tuned algorithm that completes partial reaction equations is also a forward or retrosynthetic prediction model. We present a deep-learning model based on the transformer architecture that infers the missing molecules in partial reaction SMILES strings [5]. This model contains no chemical knowledge other than what it learns from the data during training. We illustrate its application to data curation, as well as its use for forward and retrosynthetic prediction.
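
To make the relation between reaction completion, forward prediction, and retrosynthesis concrete, the sketch below shows one way partial reaction SMILES could be built from a complete one. It is an illustration only, not the model or preprocessing described in this abstract: the placeholder token `?`, the helper names, and the example reaction (acid-catalysed esterification of acetic acid with ethanol) are assumptions made for the example. It also assumes the common `reactants>agents>products` layout with `.` separating molecules, so multi-fragment species such as salts would be split into separate fragments.

```python
MASK = "?"  # hypothetical placeholder token marking the missing molecule(s)


def split_reaction(rxn):
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected a reaction SMILES with exactly two '>' separators")
    return [part.split(".") if part else [] for part in parts]


def join_reaction(groups):
    """Inverse of split_reaction."""
    return ">".join(".".join(group) for group in groups)


def mask_molecule(rxn, group, index):
    """Hide one molecule; return (partial reaction, hidden molecule)."""
    groups = split_reaction(rxn)
    target = groups[group][index]
    groups[group][index] = MASK
    return join_reaction(groups), target


def as_forward_task(rxn):
    """Forward prediction as completion: all products are missing."""
    reactants, agents, products = split_reaction(rxn)
    return join_reaction([reactants, agents, [MASK]]), ".".join(products)


def as_retro_task(rxn):
    """Single-step retrosynthesis as completion: all precursors are missing."""
    reactants, agents, products = split_reaction(rxn)
    return join_reaction([[MASK], [], products]), ".".join(reactants + agents)


if __name__ == "__main__":
    # Acetic acid + ethanol, sulfuric acid as agent, ethyl acetate + water.
    rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"
    print(mask_molecule(rxn, 1, 0))  # missing catalyst (data-curation setting)
    print(as_forward_task(rxn))      # missing products
    print(as_retro_task(rxn))        # missing precursors
```

In each case the first element of the returned pair is the partial equation a completion model would receive, and the second is the text it would be expected to produce.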