In building and training our model, we took advantage of multitask transfer learning, an approach by which the model learns not only from a narrowly focused database of biocatalysed reactions, but also from a larger database containing all sorts of other chemical reactions. This latter database allows the model to learn more general features of chemistry. The model can then transfer this knowledge to the task of learning from the more-specific subset of biocatalyzed reactions. Think of transfer learning like how a person learning to play an instrument: Learning to play the guitar will help them if they then tried to learn a similar instrument like the bass.
Multitask transfer would be like learning the guitar and bass at the same time. And in the context of chemistry, it means that we trained the model simultaneously on the general and the specific data sets of enzymatic reactions, rather than sequentially. The simultaneous training proved beneficial for model performance, versus an approach in which the training was done in two subsequent steps.
Despite the paucity of data available for training, our model achieved a good accuracy level on prediction, and in some cases it even corrected some errors found in our ground truth — the portion of the dataset used to test the model — where the products of certain reactions were misstated.
Accelerating the discovery of novel materials is at the heart of IBM’s efforts to help invent what’s next in science and engineering. It’s the sort of work we’re doing with RoboRXN, an AI-powered, data-driven, cloud-based platform for the automation of chemical synthesis. With our new machine learning model, we are expanding RoboRXN’s capabilities to include a new tool to facilitate the use of enzymes for more environmentally friendly chemistry.
The trained model as well as the code are publicly available for anyone to use. We look forward to chemists using them in their research projects. You can download our enzyme-hunting code on GitHub, here, or you can start a project with a trained model on RXN for Chemistry, here.
Machine Learning: Machine learning uses data to teach AI systems to imitate the way that humans learn. They can find the signal in the noise of big data, helping businesses improve their operations.
Probst, D., Manica, M., Teukam, Y, et al. Biocatalysed Synthesis Planning using Data-driven Learning. Nat Commun 13, 964 (2022). ↩