
IBM RXN for Chemistry: Unveiling the grammar of the organic chemistry language

In “Extraction of organic chemistry grammar from unsupervised learning of chemical reactions,” RXNMapper extracts the grammar of organic chemistry.

Talk to any organic chemist, and they will tell you that learning organic chemistry is like learning a new language, with its myriad chemical reaction rules playing the role of grammar. And it’s also about intuition and perception, much like acquiring a language as a child.

In our paper “Extraction of organic chemistry grammar from unsupervised learning of chemical reactions,” published in the peer-reviewed journal Science Advances, scientists from IBM Research Europe, the MIT-IBM Watson AI Lab, and the University of Bern for the first time extracted the ‘grammar’ of organic chemistry’s ‘language’ from a large number of organic chemistry reactions.1 For that, we used RXNMapper, a cutting-edge, open-source atom-mapping tool we developed. RXNMapper performs better than the current commercially available tools, and it learns without human supervision.2

Cracking the language code with the Rosetta Stone

In the 19th century, the Rosetta Stone provided the starting point for scholars to crack the code of hieroglyphics, the ancient Egyptian writing system that combines logographic, syllabic and alphabetic elements. While scholars were able to quickly translate the 54 lines of Greek and 32 lines of demotic inscribed on the stone, it took years to fully decipher the 14 lines of hieroglyphs. British scholar Thomas Young made a major breakthrough in 1814, but it was Frenchman Jean-François Champollion who delivered a full translation in 1822. Deciphering those 14 lines through translation mapping with the other two languages written on the Rosetta Stone was enough to reconstruct the grammar and give scholars a window into a flourishing period of Egyptian language and culture.

Fast forward to today, and the closest equivalent of the Rosetta Stone experience is traveling to a foreign country to learn the native language through total immersion. The more you as the ‘scholar’ interact with the locals, their dialect, culture, customs, even street signs, the more you begin to recognize and map the recurring patterns in the structure of the language, its colloquial phrases and pronunciations, without a formal language course. Spend enough time in Germany, for example, and you will begin to notice the similarities in vocabulary between English and German, as well as structural differences such as the placement of the unconjugated second verb at the end of a phrase, where modern English deviates from German despite its Germanic roots.

Figure 1: Top: A mapping between an English phrase and the German translation. Bottom: A mapping between reactants (methanol + benzoic acid) and a product molecule (methyl benzoate) in a chemical reaction represented with a text-based line notation called SMILES.
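
If you’re unfamiliar with the notation, the sketch below shows how the Figure 1 reaction reads as a machine-friendly string: reactants on the left of the ‘>>’ separator, products on the right. It assumes the open-source RDKit library (not part of this work) purely to check that each molecule parses:

```python
from rdkit import Chem

# Reaction SMILES for the esterification in Figure 1 (reagents and
# conditions omitted): methanol + benzoic acid -> methyl benzoate.
rxn_smiles = "CO.OC(=O)c1ccccc1>>COC(=O)c1ccccc1"

reactants, products = rxn_smiles.split(">>")
for smiles in reactants.split(".") + products.split("."):
    mol = Chem.MolFromSmiles(smiles)
    print(f"{smiles:>20}  {mol.GetNumHeavyAtoms()} heavy atoms")
```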

The natural process of language acquisition, or becoming fluent in a foreign language, is essentially the mapping of various linguistic elements: understanding how individual words, expressions and concepts connect, and how their precise order maps to your mother tongue.

Coming back to the language of organic chemistry, we asked ourselves two basic but very important questions: What if there was a possibility to visualize the mapped patterns that you’ve learned? What if the rules of a language could be extracted from these learned patterns?

It may be impossible to extract this information from the human brain, but we thought it possible when the learner is a neural network model, such as a reaction ‘Transformer.’ We let the model learn the language of chemical reactions by repeatedly showing it millions of examples of chemical reactions. We then unboxed the trained artificial intelligence model by visually inspecting the learned patterns, which revealed that the model had captured how atoms rearrange during reactions without supervision or labeling. From this atom rearrangement signal, we extracted the rules governing chemical reactions. We found that the rules were similar to the ones we learn in organic chemistry.
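
To give a flavor of what ‘unboxing’ the model looks like, here is a deliberately simplified sketch with made-up numbers, not the paper’s exact procedure: treat one learned attention map as a product-by-reactant matrix and read off, for each product atom, the reactant atom it attends to most strongly.

```python
import numpy as np

# Toy attention matrix: rows = product atom tokens, columns = reactant
# atom tokens. The values are illustrative, not from a trained model.
reactant_atoms = ["C", "O", "C", "O", "O"]   # methanol + part of benzoic acid
product_atoms = ["C", "O", "C", "O"]         # part of methyl benzoate
attention = np.array([
    [0.85, 0.05, 0.05, 0.03, 0.02],
    [0.04, 0.80, 0.06, 0.05, 0.05],
    [0.03, 0.04, 0.88, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.75, 0.15],
])

# Map each product atom to the reactant atom it attends to most.
for p_idx, r_idx in enumerate(attention.argmax(axis=1)):
    print(f"product atom {p_idx} ({product_atoms[p_idx]}) -> "
          f"reactant atom {r_idx} ({reactant_atoms[r_idx]})")
```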

Figure 2: Overview of the study and analogy between learning a new language and learning organic chemistry reactions.

The power of Transformer models

In 2018, we created a state-of-the-art online platform called RXN for Chemistry, using Natural Language Processing (NLP) architectures to predict the outcome of chemical reactions in synthetic chemistry. Specifically, we used the Molecular Transformer, where chemical reactions are represented in a domain-specific, text-based language called SMILES.3 Back then, we framed chemical transformations as translations from reactants to products, similar to translating, say, English to German. The model architecture we used in this new work is very similar, which brought up another important question: why do Transformers work so well for chemistry?
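
To make the translation framing concrete: before a reaction reaches the model, its SMILES string is split into tokens, much like words in a sentence. The sketch below uses the regular-expression tokenizer published with the Molecular Transformer work;3 treat it as illustrative rather than as the exact preprocessing of the new model.

```python
import re

# SMILES tokenization pattern from the Molecular Transformer work
# (Schwaller et al., 2019). Multi-character tokens such as Cl, Br and
# bracketed atoms are kept intact instead of being split per character.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES (or reaction SMILES) string into model tokens."""
    return SMILES_REGEX.findall(smiles)

print(tokenize("CO.OC(=O)c1ccccc1>>COC(=O)c1ccccc1"))
# ['C', 'O', '.', 'O', 'C', '(', '=', 'O', ')', 'c', '1', ...]
```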

Transformer models are so powerful because they learn to represent inputs (atoms or words) in their context. If we take our example from Figure 1, “See” (German for lake) has an entirely different meaning than “see” in English, despite the same spelling. Similarly, in chemistry, an oxygen atom will not always carry the same meaning. Its meaning is dependent on the context or the surrounding atoms, i.e., on the atoms in the same molecule and the atoms it interacts with during a reaction.

Figure 3: Reaction Transformer model consisting of self-attention layers, each containing multiple heads. Attention patterns learned by different heads in the model.

Transformers are made of stacks of self-attention layers (Fig. 3). The attention mechanism is responsible for connecting concepts and making it possible to build meaningful representations based on the context of the atoms. Every self-attention layer consists of multiple ‘heads’ that can all learn to attend to the context differently. In human language, one head might focus on what the subject is doing, another on why, while a third might focus on the punctuation in the sentence. Learning to attend to different information in the context is crucial to understanding how the different parts of a sentence are connected in order to decipher the correct meaning.
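
As a minimal illustration of that mechanism, the numpy sketch below implements one scaled dot-product self-attention head with random stand-in weights. In a trained model each head’s projections are learned, which is what lets different heads attend to different aspects of the context.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One self-attention head: each token (atom) builds its new
    representation as a weighted mixture of its entire context."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (tokens, tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # 6 tokens with 8-dimensional embeddings

# Two heads with independent projections: with training, each can
# specialize in attending to different information in the context.
for head in range(2):
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    _, weights = attention_head(X, Wq, Wk, Wv)
    print(f"head {head}: each row sums to", weights.sum(axis=1).round(3))
```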

RXNMapper – the ultimate atom-mapping tool

We then used this atom-mapping signal to develop RXNMapper, a new state-of-the-art, open-source atom-mapping tool. According to a recent independent benchmark study,2 RXNMapper outperforms the commercially available alternatives. Considering that the atom-mapping signal was learned without supervision, this is a remarkable result.

What impact will it have on the work of chemists? High-quality atom-mapping is an extremely important component of computational chemistry, so RXNMapper is an essential tool for traditional downstream applications such as reaction prediction and synthesis planning. Now we can extract the ‘grammar’ and ‘rules’ of chemical reactions from atom-mapped reactions, allowing a consistent set of chemical reaction rules to be constructed within days rather than years, as is the case with manual curation by humans. RXNMapper is not only accurate, it is also incredibly fast, mapping reactions at roughly 7 ms per reaction. This makes it possible to map huge data sets containing millions of reactions within a few hours.
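
Trying it yourself is straightforward. The snippet below mirrors the usage example in the project’s README at the time of writing (the exact result keys may evolve with the package):

```python
from rxnmapper import RXNMapper  # pip install rxnmapper

rxn_mapper = RXNMapper()

# The Figure 1 esterification, written as an unmapped reaction SMILES.
results = rxn_mapper.get_attention_guided_atom_maps(
    ["CO.OC(=O)c1ccccc1>>COC(=O)c1ccccc1"]
)
print(results[0]["mapped_rxn"])  # reaction SMILES with atom-map numbers
print(results[0]["confidence"])  # the model's confidence in the mapping
```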

Figure 4: RXNMapper atom-mapping illustration.

RXNMapper may be far from being the Rosetta Stone of chemistry, but it unveils the grammar contained in a coherent set of chemical reaction data in a way that lets us experience the language of reactions with full immersion. If organic chemistry isn’t a language, then tell us…what is it?

Give RXNMapper a try on our online demo, and make sure to star our open-source repo on GitHub.

Notes

  1. SMILES, or Simplified Molecular Input Line Entry System, is a notation system for representing molecules and reactions. ↩︎

References

  1. Schwaller, P., Hoover, B., Reymond, J.-L., Strobelt, H. & Laino, T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv. 7, eabe4166 (2021).

  2. Madzhidov, T. et al. Atom-to-Atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies. (2020).

  3. Schwaller, P. et al. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 5, 1572–1583 (2019).