An AI foundation model that learns the grammar of molecules

Meet MoLFormer-XL, a pretrained AI model that infers the structure of molecules from simple representations, making it faster and easier to screen molecules for new applications or create them from scratch.

Large pretrained models are fast becoming AI’s Swiss Army knife. Once limited to summarizing text and translating languages, they can now write code, compose music, and answer obscure questions at length.

Now there’s a new skill to add to their repertoire: the ability to infer the shapes and properties of molecules to predict how they might behave and to propose entirely new ones.

Most molecular models need estimates or measurements of a molecule’s 3D shape to accurately predict many of its properties. Chemists can extract this information through simulations or lab experiments, but it’s an imperfect, expensive process that can take months to years. Perhaps unsurprisingly, we have detailed structures for only a few million molecules out of the trillions upon trillions potentially out there.

But now, there could be a way to eliminate this bottleneck in the discovery process with the help of AI. Introducing MoLFormer-XL, the latest addition to the MoLFormer family of foundation models for molecular discovery. MoLFormer-XL has been pretrained on 1.1 billion molecules represented as machine-readable strings of text. From these simple and accessible chemical representations, it turns out that a transformer can extract enough information to infer a molecule’s form and function.

Instead of training a model on thousands to millions of examples of molecules with detailed structural and property information, we can leverage a dataset that is at least 1,000 times larger. From this mountain of examples, we found that MoLFormer-XL could more easily learn a variety of downstream property prediction tasks. We report our results in the latest issue of Nature Machine Intelligence. 1

We found that MoLFormer-XL could predict a molecule’s physical properties, like its solubility, its biophysical properties, like its anti-viral activity, and its physiological properties, like its ability to cross the blood-brain barrier. It could even predict quantum properties, like a molecule’s bandgap energies, an indicator of how well it converts sunlight to energy.

MoLFormer-XL outperformed other chemical language models at nearly a dozen molecular property benchmarks. It also did better than graph models trained on molecules with precise 3D geometric labels. MoLFormer-XL’s ability to efficiently learn the structures of such varied molecules could make it a powerful tool for discovering new molecules by their desired properties.

A large-scale, energy-efficient model for molecular discovery

Many molecular models today rely on graph neural network architectures that predict molecular behavior from a molecule’s 2D or 3D structure. But graph models often require extensive simulations or experiments, or use complex mechanisms, to capture atomic interactions within a molecule. As a result, most graph models are trained on datasets of only about 100,000 molecules, sharply limiting their ability to make broad predictions.

MoLFormer-XL, by contrast, rests on a foundation of more than 1.1 billion molecules, each represented by a compact snippet of text belonging to the SMILES notation system (short for Simplified Molecular Input Line Entry System). Each SMILES string describes how atoms in the types of organic small molecules targeted for drug and material discovery are bonded and arranged in a so-called molecular graph. At scale, this meager information contains a wealth of structural clues.
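To make the idea of a SMILES string concrete, here is a minimal Python sketch that splits one into atom, bond, and ring tokens. The regular expression is a pattern of the kind commonly used in chemical language modeling; whether MoLFormer-XL tokenizes its input exactly this way is an assumption, and the function name `tokenize_smiles` is purely illustrative.

```python
import re
from typing import List

# Assumed tokenization pattern: bracket atoms, two-letter halogens, organic-subset
# atoms, aromatic atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("Unrecognized characters in SMILES string: " + smiles)
    return tokens


# Aspirin: an acetyl group, an ester oxygen, an aromatic ring, and a carboxylic acid.
aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin_tokens)
```

Lowercase letters mark aromatic atoms, parentheses mark branches, and the paired digit `1` opens and closes the ring, so even this short string encodes the full bonding pattern of the molecule.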

To tap it, we trained MoLFormer-XL to focus on the interactions between atoms represented in each SMILES string through a new and improved type of rotary embedding. Instead of having the model encode the absolute position of each character in each string, we had it encode the character’s relative position. This additional molecular context seems to have primed the model to learn structural details that make learning downstream tasks much easier.
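The relative-position idea can be illustrated with a small Python sketch of a standard rotary embedding. This is the generic textbook formulation, not the modified variant described above, and the vectors, positions, and function names are made up for the example. Each pair of vector components is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends only on how far apart the two tokens are.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        # Each component pair gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Illustrative query and key vectors (not taken from any trained model).
query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.2]

# Both token pairs are 3 positions apart, at very different absolute positions.
score_near_start = dot(rotate(query, 5), rotate(key, 2))
score_far_along = dot(rotate(query, 105), rotate(key, 102))

# The two scores agree up to floating-point rounding.
print(abs(score_near_start - score_far_along) < 1e-9)
```

Because the attention score is unchanged when both tokens shift by the same amount, the model sees how far apart two characters are in the string rather than where they sit in absolute terms.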

The power of MoLFormer-XL lies in its size, which traditionally has come at enormous training costs in computation and energy. However, we took pains to conserve energy throughout. To pack more computation into each GPU, we chose an efficient linear-time attention mechanism and sorted our SMILES strings by length before feeding them to the model. Together, these techniques raised our per-GPU processing capacity from 50 molecules to 1,600 molecules, allowing us to get away with 16 GPUs instead of 1,000. By eliminating hundreds of unnecessary GPUs, we consumed 61 times less energy and still had a trained model in five days.
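Going from 50 to 1,600 molecules per GPU is a 32-fold increase. One reason sorting by length helps is that batches are usually padded to the length of their longest member, so mixing short and long strings wastes slots on padding. The Python sketch below illustrates only that effect, using hypothetical token counts and a hypothetical helper `padded_slots`; it is not the training pipeline itself and does not model the attention mechanism.

```python
def padded_slots(lengths, batch_size):
    """Total token slots used when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical SMILES token counts, deliberately interleaving short and long strings.
lengths = [12, 95, 18, 88, 15, 102, 20, 91]

real_tokens = sum(lengths)                                  # tokens that carry information
unsorted_cost = padded_slots(lengths, batch_size=4)         # batches in arrival order
sorted_cost = padded_slots(sorted(lengths), batch_size=4)   # batches after sorting by length

print(real_tokens, unsorted_cost, sorted_cost)  # prints: 441 788 488
```

In this toy case the unsorted batches spend 788 slots to hold 441 real tokens, while the sorted batches need only 488, so far less of each GPU's memory and compute goes to padding.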

A foundation model that can screen molecules and generate new ones

The more data that AI models ingest, and the more parameters they add, the better they seem to get at understanding underlying structures — be it natural language grammar or the way physical scenes are organized.

We see the same emergent behavior in MoLFormer-XL. We found that our model could distinguish molecules by their flavor, for example, or by their blood-brain barrier permeability, even though we never told the model about either property.

To understand how MoLFormer-XL pulled it off, we extracted an attention map that suggested the model focused closely on the relative position of atoms in a molecule. From this information, we think MoLFormer-XL learned each molecule’s structure and properties.

MoLFormer-XL’s ability to extract this essential information makes it an excellent tool for screening molecules for new applications or discovering new ones. In our paper we show that MoLFormer-XL can be deployed as an encoder, to identify molecules with similar structures and functions. This is especially useful in finding new applications for already approved drugs. We recently built a cloud-based platform that allows chemists to use MoLFormer to compare molecules of interest with existing databases of chemicals and already approved drugs.
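Using a model as an encoder for similarity search typically means turning each molecule into a vector and ranking candidates by how close their vectors are to a query. The Python sketch below shows that ranking step with cosine similarity. The four-dimensional vectors and compound names are placeholders, not real model outputs, and the functions are illustrative rather than part of any MoLFormer interface.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder embeddings standing in for encoder outputs (assumed, not real data).
library = {
    "compound_a": [0.9, 0.1, 0.0, 0.2],
    "compound_b": [0.1, 0.8, 0.3, 0.0],
    "compound_c": [0.8, 0.2, 0.1, 0.3],
}
query = [1.0, 0.0, 0.0, 0.25]

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(name, round(score, 3))
```

In a real screening workflow the library would hold embeddings for an entire database of chemicals or approved drugs, and the top-ranked entries would be the candidates worth a closer look.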

MoLFormer can also be used as a decoder, to generate molecules with desired properties. Early in the pandemic, before the world knew much about the new coronavirus, we and our colleagues at Oxford University and Diamond UK used a beta version of MoLFormer to generate candidate molecules that inhibit the SARS-CoV-2 virus. We used a similar approach to rapidly design a pair of new and potent, broad-spectrum antimicrobial peptides. We are currently expanding MoLFormer-XL into a generative version that will be able to suggest novel, unique, and chemically valid molecules.

What’s next

To make MoLFormer-XL more widely available, we are building out our platform to make it faster and easier to use. It’s part of IBM’s larger effort to innovate across the entire stack to make foundation models accessible to everyone, including researchers in academia and industry.

We are hopeful that MoLFormer-XL and the MoLFormer family of foundation models can make researchers more productive. We are eager to see if it can streamline the discovery of new drugs to fight emerging diseases or new materials that can speed the transition to clean, renewable energy. AI alone won’t solve our problems, but it can help guide us to new knowledge and solutions.