Thieme trains IBM RXN for Chemistry with high-quality data
The collaboration between IBM Research Europe and Thieme Chemistry, which connected information from Science of Synthesis and Synfacts to IBM’s Molecular Transformer AI model, increased chemical reaction prediction accuracy by a factor of three for forward predictions and by a factor of nine for retrosynthesis.
The collaboration with Thieme, a leading supplier of information for developing medical and chemistry industry products and services, united the Molecular Transformer, the neural machine translation model behind IBM RXN for Chemistry, with Thieme’s human-curated data from Science of Synthesis (a full-text resource for synthetic organic chemistry) and the journal Synfacts. Together, these two sources include hundreds of thousands of curated reactions, covering a wide area of chemical space that complements the patent reactions already in the RXN platform.
When RXN for Chemistry (see Note 1), the AI behind RoboRXN, was launched in 2018, it was trained on more than 3 million chemical reactions derived from publicly available patents. Since then, the Molecular Transformer has outperformed all other data-driven models, achieving more than 90% accuracy on forward chemical reaction predictions (reactants + reagents to products).
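To make the accuracy figure concrete, here is a minimal sketch of how a top-k accuracy score for forward prediction could be computed. It is not IBM's evaluation code: the function name `top_k_accuracy` and the example data are hypothetical, and it assumes that predicted and reference products are already written as canonical SMILES strings, so that an exact string match means the same molecule.

```python
def top_k_accuracy(predictions, references, k=1):
    """Fraction of reactions whose reference product is among the top-k predictions.

    predictions: list of ranked candidate product strings, one list per reaction.
    references:  list of ground-truth product strings, one per reaction.
    Assumption: all strings are canonical SMILES, so string equality is a
    fair proxy for chemical identity.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1 for ranked, truth in zip(predictions, references) if truth in ranked[:k]
    )
    return hits / len(references)


# Hypothetical example: ranked candidate products for three reactions.
predictions = [
    ["CC(=O)OCC", "CC(=O)O"],
    ["c1ccccc1Br", "c1ccccc1Cl"],
    ["CCO", "CCN"],
]
references = ["CC(=O)OCC", "c1ccccc1Cl", "CCO"]

top1 = top_k_accuracy(predictions, references, k=1)  # 2 of 3 correct
top2 = top_k_accuracy(predictions, references, k=2)  # 3 of 3 correct
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In this toy data the second reaction is only recovered at rank two, which is why the top-2 score is higher than the top-1 score.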
While this is quite a success, the chemical reaction space is vast. Organic compounds can react with each other in hundreds of thousands of different ways. So, to begin tapping into the uncharted areas of chemical space and uncover new chemical reactions, more experimental knowledge is needed. Expanding the knowledge of the AI model will not only improve synthesis planning; it will also save organic chemists countless hours in the lab spent searching by trial and error for the right reactants to form a single new product.
This week, IBM Research Europe and Thieme Chemistry revealed the first results of their work together in the webinar Powering Molecular Transformers with High Quality Data. What’s more, the results were evaluated by seven eminent synthetic chemistry experts and their research groups from China, Germany, Switzerland, New Zealand, and the U.S.
The verdict: higher quality data is essential for improving the overall performance of the RXN for Chemistry platform.
Feeding information from high-quality sources such as Science of Synthesis and Synfacts into the Molecular Transformer increased prediction accuracy by a factor of three for forward predictions and by a factor of nine for retrosynthesis. A full analysis of the datasets, based on reaction fingerprints and clustering algorithms, revealed that the platform’s existing patent data and Thieme’s data are complementary in terms of reaction coverage. Hence, training on a combination of both datasets maximized the knowledge learned by the AI models.
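As a rough illustration of what "complementary in terms of reaction coverage" means, the sketch below compares which reaction clusters two datasets touch. It does not reproduce the fingerprint and clustering analysis described above: it assumes every reaction has already been assigned a cluster label by some upstream step, and the function name `coverage_report` and the labels themselves are hypothetical.

```python
def coverage_report(patent_clusters, curated_clusters):
    """Compare the sets of reaction clusters covered by two datasets.

    Each argument is an iterable of cluster labels, one per reaction.
    Assumption: labels come from an upstream fingerprint + clustering step
    that is not shown here.
    """
    patent = set(patent_clusters)
    curated = set(curated_clusters)
    shared = patent & curated
    combined = patent | curated
    return {
        "patent_only": len(patent - curated),
        "curated_only": len(curated - patent),
        "shared": len(shared),
        "combined": len(combined),
        # Jaccard overlap: low values mean the datasets cover different clusters.
        "jaccard": len(shared) / len(combined) if combined else 0.0,
    }


# Hypothetical cluster labels for a handful of reactions in each dataset.
patent_clusters = ["amide_coupling", "suzuki", "suzuki", "boc_deprotection"]
curated_clusters = ["suzuki", "aldol", "diels_alder", "aldol"]

report = coverage_report(patent_clusters, curated_clusters)
print(report)
```

A low Jaccard overlap together with a combined cluster count well above either dataset alone is the pattern one would expect from complementary datasets, which is the situation where training on both adds the most coverage.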
As the table below indicates, Science of Synthesis and Synfacts have higher quality chemical records, reflected by a larger percentage of usable records. This consistency in Thieme’s dataset is what facilitated the learning process of the AI model, resulting in more consistent predictions.
Chemical Collection | Usability for AI |
---|---|
Thieme’s Science of Synthesis (SOS) | 73% |
Synfacts | 87% |
Commercial Patent Dataset | ~35% |
The RXN model retrained with Science of Synthesis and Synfacts data achieved a chemical accuracy of about 70% on the prediction of complex chemical records and provides diverse retrosynthetic recommendations with suggested reactions closely related to the ones presented in Thieme’s data.
Essentially, a dataset must fulfill the function for which it is collected. So, can the newly retrained RXN model fulfill its duties if it can only predict complex chemical synthesis in about 70% of cases?
The answer to this question depends on which field of chemistry it is applied to, or the goal of the user. For example, chemists looking to refine polymer properties to make plastics more biodegradable could use the chemical reaction predictions from the RXN model to determine how to synthesize the needed monomers.
In any case, it’s an open discussion. IBM Research Europe and Thieme Chemistry are currently consulting experts in the chemistry community to address this question, as well as others that pertain to the role of AI in applied chemistry.
Before RXN for Chemistry, chemists relied on serendipity to discover new reactions. With the help of this AI tool, chemists can now find new approaches they might never have thought of using traditional discovery methods.
While the collaborative work between Thieme and IBM certainly demonstrates how high-quality chemical reaction data can boost the performance of predictive tools like RXN for Chemistry to unprecedented levels, the results are provisional. Efficiently implementing AI solutions in chemistry processes requires a dialogue among field experts, especially when it comes to collecting the right data. It will take an ecosystem of bright minds to determine how high-quality data can best be collected and managed so that it is beneficial when coupled with artificial intelligence.
Read more about IBM’s collaboration with Thieme to accelerate discovery in organic chemistry, here.
Notes
- Note 1: Try IBM RXN for Chemistry, the free AI Tool in the Cloud for Digital Chemistry.