Becoming fluent in a foreign language is, at its core, a mapping process: you learn how individual words, expressions and concepts connect to one another, and how their precise order corresponds to structures in your mother tongue.
Coming back to the language of organic chemistry, we asked ourselves two basic but very important questions: What if there was a possibility to visualize the mapped patterns that you’ve learned? What if the rules of a language could be extracted from these learned patterns?
It may be impossible to extract this information from the human brain, but we thought it possible when the learner is a neural network model, such as a reaction ‘Transformer.’ We let the model learn the language of chemical reactions by repeatedly showing it millions of examples of chemical reactions. We then unboxed the trained artificial intelligence model by visually inspecting the learned patterns, which revealed that the model had captured how atoms rearrange during reactions without supervision or labeling. From this atom rearrangement signal, we extracted the rules governing chemical reactions. We found that the rules were similar to the ones we learn in organic chemistry.
In 2018, we created a state-of-the-art online platform called RXN for Chemistry, which applies Natural Language Processing (NLP) architectures to synthetic chemistry to predict the outcome of chemical reactions. Specifically, we used the Molecular Transformer, in which chemical reactions are represented in a domain-specific language called SMILES (Simplified Molecular Input Line Entry System), a notation system for writing molecules and reactions as text.3 Back then, we framed chemical transformations as translations from reactants to products, similar to translating, say, English to German. The model architecture we used in this new work is very similar, which brought up another important question: why do Transformers work so well for chemistry?
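To make the "translation" framing concrete, here is a minimal sketch of how a reaction reads as a SMILES string. The esterification reaction shown is an illustrative example chosen for this post, not one drawn from our training data; the `>` characters separate reactants, reagents, and products.

```python
# A reaction SMILES encodes a reaction as plain text in the form
# reactants>reagents>products, with "." separating individual molecules.
# Illustrative example: acid-catalyzed esterification of acetic acid with ethanol.
rxn_smiles = "CC(=O)O.CCO>[H+]>CC(=O)OCC.O"

reactants, reagents, products = rxn_smiles.split(">")

print(reactants.split("."))  # ['CC(=O)O', 'CCO']   acetic acid + ethanol
print(products.split("."))   # ['CC(=O)OCC', 'O']   ethyl acetate + water
```

The Molecular Transformer treats the left-hand side as a "source sentence" and learns to emit the right-hand side, token by token, just as a translation model emits German given English.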
Transformer models are so powerful because they learn to represent inputs (atoms or words) in their context. If we take our example from Figure 1, “See” (German for lake) has an entirely different meaning than “see” in English, despite the same spelling. Similarly, in chemistry, an oxygen atom will not always carry the same meaning. Its meaning is dependent on the context or the surrounding atoms, i.e., on the atoms in the same molecule and the atoms it interacts with during a reaction.
Transformers are made of stacks of self-attention layers (Fig. 3). The attention mechanism is responsible for connecting concepts and making it possible to build meaningful representations based on the context of the atoms. Every self-attention layer consists of multiple ‘heads’ that can each learn to attend to the context differently. In human language, one head might focus on what the subject is doing, another head on why, while a third might focus on the punctuation in the sentence. Learning to attend to different information in the context is crucial to understanding how the different parts of a sentence are connected and to deciphering the correct meaning.
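The mechanics of a single head can be sketched in a few lines of NumPy. This is a simplified, randomly initialized illustration of scaled dot-product attention, not our trained model: each token (an atom, in our case) scores every other token in the context, and its new representation is a weighted mix of all of them. Separate projection matrices per head are what let different heads attend to different information.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, Wq, Wk, Wv):
    """One head: every token mixes information from all tokens in context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-to-token affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4          # e.g. 5 atom tokens of a SMILES string
X = rng.normal(size=(n_tokens, d_model))     # input token representations

# Two heads with independent projections can learn to focus on different context.
head_outputs = []
for _ in range(2):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    out, weights = self_attention_head(X, Wq, Wk, Wv)
    head_outputs.append(out)

concat = np.concatenate(head_outputs, axis=-1)  # heads merged back: shape (5, 8)
```

In the trained reaction Transformer, it is exactly these per-head weight matrices, inspected after training, that revealed the atom-rearrangement signal.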
We then used this atom-mapping signal to develop RXNMapper, the new state-of-the-art, open-source atom-mapping tool. According to a recent independent benchmark study,2 RXNMapper outperforms the commercially available alternatives. Considering that the atom-mapping signal was learned without supervision, this is a remarkable result.
What impact will it have on the work of chemists? High-quality atom-mapping is an extremely important component for computational chemists. Hence, RXNMapper is an essential tool for traditional downstream applications such as reaction prediction and synthesis planning. Now we can extract the ‘grammar’ and ‘rules’ of chemical reactions from atom-mapped reactions, allowing a consistent set of chemical reaction rules to be constructed within days, rather than years, as is the case with manual curation by humans. Our RXNMapper is not only accurate, but it is also incredibly fast, mapping reactions at ~7ms/reaction. This makes it possible to map huge data sets containing millions of reactions within a few hours.
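What an atom-mapped reaction actually looks like is worth seeing once. In mapped reaction SMILES, each atom carries a `:n` label, and a consistent mapping uses each label exactly once on both sides, which is what lets downstream tools track where every atom goes. The short check below uses a hand-written illustrative mapping (the same esterification example as above), not actual RXNMapper output:

```python
import re

# Hand-mapped illustrative reaction (esterification): each ":n" tag tracks
# one atom from the reactant side to the product side.
mapped_rxn = (
    "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][CH2:6][OH:7]"
    ">>"
    "[CH3:1][C:2](=[O:3])[O:7][CH2:6][CH3:5].[OH2:4]"
)

def atom_maps(smiles):
    """Collect the atom-map numbers used in a SMILES string."""
    return sorted(int(n) for n in re.findall(r":(\d+)\]", smiles))

reactants, products = mapped_rxn.split(">>")

# A consistent atom-mapping is a bijection between the two sides.
assert atom_maps(reactants) == atom_maps(products)
print(atom_maps(reactants))  # [1, 2, 3, 4, 5, 6, 7]
```

Reaction-rule extraction starts from exactly this kind of mapping: comparing which mapped atoms changed their bonds between the two sides isolates the reaction center, from which a rule template can be cut out automatically.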
RXNMapper may be far from being the Rosetta Stone of chemistry, but it unveils the grammar contained in a coherent set of chemical reaction data in a way that lets us experience full immersion. If organic chemistry isn’t a language, then tell us…what is it?
Give RXNMapper a try on our online demo, and make sure to star our open-source repo on GitHub.
Schwaller, P., Hoover, B., Reymond, J.-L., Strobelt, H. & Laino, T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv. 7, eabe4166 (2021).
Madzhidov, T. et al. Atom-to-Atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies. (2020)
Schwaller, P. et al. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 5, 1572–1583 (2019).