Nature’s molecular machines could speed the development of environmentally friendly biochemical replacements for industrial processes.
Making the products we consume more sustainable is something the world desperately needs. For the manufacturing of everyday chemicals, the solution could lie in enzymes, the little molecular machines that speed up the chemical reactions keeping almost all living organisms alive, and that also catalyze many manufacturing processes. But their widespread adoption for industrial use is currently hampered by the difficulty of choosing the right enzyme for the right chemical reaction.
To solve this matching puzzle, we built a machine learning model that can help scientists predict which enzymes could be suitable replacements for a given reaction. This could move us closer to more sustainable and safer processes by harnessing the biological catalysts that have been optimized over nature's 3.5-billion-year evolutionary process.
Enzymes are the master accelerators of almost all processes in the human body, playing an instrumental role in everything from digestion to breaking down harmful toxins, and even DNA replication. But the importance of enzymes goes beyond biochemistry; they're also used to make industrial chemical processes more sustainable by lowering their energy consumption or the amount of polluting solvents they require. When manufacturing white paper for printing or use in notebooks, for example, the enzyme xylanase helps reduce the amount of chlorine-based bleach needed (see Note 1), and in baking, enzymes called proteases help make cookies crumbly by degrading gluten in wheat flour. But there aren't many industrial applications where enzymes are widely adopted yet, primarily because choosing the right enzyme is such a daunting task. It often requires a great deal of domain-specific knowledge that no chemist, or team of chemists, could ever fully grasp.
This is where our new data-driven AI model for biocatalyzed synthesis planning (Probst et al., 2022) comes in. The model is trained on publicly available USPTO data on enzymatic biocatalysis, which, in principle, eliminates the need for a human to be an expert in biocatalysis to select the right enzyme and substrate needed to obtain a desired chemical. In doing this, our model closes a knowledge gap that often prevents more sustainable biocatalyzed reactions from being used in industry.
For some subcategories of enzymes, the dearth of available data to train our model still significantly affects its accuracy. However, this can be mitigated by users with access to proprietary datasets on those specific subclasses of enzymatic reactions, which can be used to fine-tune our model and increase its predictive power.
In building and training our model, we took advantage of multitask transfer learning, an approach in which the model learns not only from a narrowly focused database of biocatalyzed reactions, but also from a larger database containing all sorts of other chemical reactions. This latter database allows the model to learn more general features of chemistry. The model can then transfer this knowledge to the task of learning from the more specific subset of biocatalyzed reactions. Think of transfer learning like a person learning to play an instrument: learning the guitar will help them if they later try to learn a similar instrument, like the bass.
Multitask transfer would be like learning the guitar and bass at the same time. In the context of chemistry, it means that we trained the model on the general dataset of chemical reactions and the specific dataset of enzymatic reactions simultaneously, rather than sequentially. The simultaneous training proved beneficial for model performance compared with an approach in which the training was done in two successive steps.
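To make the difference between the two training regimes concrete, here is a minimal sketch of how batches might be assembled in each case. It is an illustration under stated assumptions, not the code used for the model: the function names, the `(tag, reaction)` example format, and the fixed per-batch share of enzymatic examples are all hypothetical.

```python
from itertools import cycle, islice
from typing import Iterator, List, Tuple

# (dataset tag, reaction string); both fields are placeholders for illustration.
Example = Tuple[str, str]


def sequential_batches(general: List[Example], specific: List[Example],
                       batch_size: int) -> Iterator[List[Example]]:
    """Two-step transfer: every general batch first, then every specific batch."""
    for dataset in (general, specific):
        for start in range(0, len(dataset), batch_size):
            yield dataset[start:start + batch_size]


def multitask_batches(general: List[Example], specific: List[Example],
                      batch_size: int,
                      specific_per_batch: int) -> Iterator[List[Example]]:
    """Simultaneous training: every batch mixes general and specific examples.

    Assumption for this sketch: the smaller specific set is cycled (oversampled)
    so that each batch carries a fixed number of its examples.
    """
    if not specific:
        raise ValueError("specific dataset must not be empty")
    if not 0 < specific_per_batch < batch_size:
        raise ValueError("specific_per_batch must be between 1 and batch_size - 1")
    general_per_batch = batch_size - specific_per_batch
    specific_stream = cycle(specific)
    for start in range(0, len(general), general_per_batch):
        # Slicing creates a new list, so extending it leaves `general` untouched.
        batch = general[start:start + general_per_batch]
        batch += list(islice(specific_stream, specific_per_batch))
        yield batch
```

In the sequential schedule the model sees enzymatic reactions only after it has finished with the general data, whereas in the multitask schedule both kinds of reaction appear in every batch. The mixing ratio is a tunable choice in this sketch, not a value taken from the work described here.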
Despite the paucity of data available for training, our model achieved good prediction accuracy, and in some cases it even corrected errors found in our ground truth (the portion of the dataset used to test the model), where the products of certain reactions were misstated.
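As a rough illustration of how such cases can surface, the sketch below scores predictions against recorded products and lists the entries where the two disagree, so that a chemist can decide whether the model or the record is wrong. The function name and the sample strings are hypothetical, and the exact string comparison assumes both sides are written in the same canonical form.

```python
from typing import List, Tuple


def score_and_flag(predicted: List[str],
                   recorded: List[str]) -> Tuple[float, List[int]]:
    """Return top-1 accuracy and the indices where prediction and record differ."""
    if len(predicted) != len(recorded):
        raise ValueError("predicted and recorded must have the same length")
    if not recorded:
        raise ValueError("at least one test example is required")
    mismatches = [i for i, (p, r) in enumerate(zip(predicted, recorded)) if p != r]
    accuracy = (len(recorded) - len(mismatches)) / len(recorded)
    return accuracy, mismatches


# Hypothetical product strings, for illustration only.
predicted_products = ["CCO", "CC=O", "CC(=O)O", "CO"]
recorded_products = ["CCO", "CC=O", "CC(=O)O", "CCO"]
accuracy, to_review = score_and_flag(predicted_products, recorded_products)
```

A flagged index is only a candidate for review: the disagreement may come from a model error or from a misstated product in the test data.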
Accelerating the discovery of novel materials is at the heart of IBM’s efforts to help invent what’s next in science and engineering. It’s the sort of work we’re doing with RoboRXN, an AI-powered, data-driven, cloud-based platform for the automation of chemical synthesis. With our new machine learning model, we are expanding RoboRXN’s capabilities to include a new tool to facilitate the use of enzymes for more environmentally friendly chemistry.
The trained model and the code are publicly available for anyone to use. We look forward to chemists using them in their research projects. You can download our enzyme-hunting code on GitHub, or you can start a project with a trained model on RXN for Chemistry.
Date: 18 Feb 2022
- Note 1: Xylanase treatment in paper manufacturing has been shown to reduce chlorine consumption by 15% and to lower dangerous adsorbable organic halides (a chlorine byproduct) by 25%.
Probst, D., Manica, M., Teukam, Y., et al. Biocatalysed Synthesis Planning using Data-driven Learning. Nat Commun 13, 964 (2022).