Introducing query-based molecular optimization (QMO), an AI framework that can help improve discovery workflows and accelerate the delivery of new molecules and materials.
Many of today’s most urgent problems demand new molecules and materials: antimicrobial drugs to fight superbugs, antivirals to treat novel pandemics, more sustainable photosensitive coatings for semiconductors, and next-generation polymers to capture carbon dioxide right at its source. We can design these from scratch, using AI [1] to expedite the otherwise expensive and slow process, or we can tweak existing molecules [2] to fine-tune the properties we care about, such as toxicity, activity, or stability. Starting from a known molecule gives us a head start on the design and production of candidates: we know it already has some of the characteristics we need, and we can use existing knowledge and manufacturing pipelines to synthesize and test its variants down the line.
The challenge in this process, called molecular optimization, is that tweaking an existing molecule can produce a huge number of variants. They won’t all have the desired properties, and evaluating them empirically to find those that do would take too much time and money to be feasible. For example, if we had 20 letters available, we could produce almost as many 60-letter sequences as the number of atoms in the known universe (roughly 10^80). And as the sequence length and the number of letters available increase, the number of possible variants grows combinatorially, creating an intractably large search space. The same is true in searching for the optimal molecules in a given situation.
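To make the arithmetic concrete, here is a small sketch of the count behind that comparison. The constant names are ours, chosen for illustration. With 20 letters and 60 positions there are 20^60 sequences, which is about 1.15 × 10^78, and each extra position multiplies the count by 20.

```python
import math

ALPHABET_SIZE = 20      # letters available at each position
SEQUENCE_LENGTH = 60    # positions in the sequence

# Each position can independently take any letter, so the count is 20**60.
num_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH
order_of_magnitude = math.log10(num_sequences)

# Adding one more position multiplies the number of sequences by the alphabet size.
num_sequences_61 = ALPHABET_SIZE ** (SEQUENCE_LENGTH + 1)
growth_factor = num_sequences_61 // num_sequences

print(f"{num_sequences:.3e} possible sequences")
print(f"order of magnitude: 10^{order_of_magnitude:.1f}")
print(f"one extra position multiplies the count by {growth_factor}")
```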
Our latest work [3], query-based molecular optimization (QMO), overcomes this challenge by using AI to select and find the best variants of a molecule. Given a lead molecule as a start, QMO learns a representation of all its variants and searches out those predicted to have one or more of the desired properties. QMO can lead us toward the molecules that are right for our biggest research tasks, which has the potential to be a critical step in accelerating scientific discovery in the future.
The components of QMO
Our approach combines two components. A deep generative autoencoder (a pair of encoder and decoder deep neural networks) represents the variants of a lead molecule and beyond. A search technique then identifies variants optimized for the desired properties, guided by external information about those properties from black-box evaluators such as physics-based simulators, informatics, experiments, or databases.
Representing molecular variants
In the QMO framework, we model a molecule as a sequence of characters representing chemicals or amino acids, depending on the type of molecule. We use an encoder to capture the sequence of the lead molecule and map it to a low-dimensional (or simplified) continuous representation of all its possible variants, each denoted mathematically by an embedding vector. We use a corresponding decoder to translate an embedding vector back to a sequence. Our approach decouples representation learning from optimization to reduce the complexity of the problem.
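The encode and decode roles can be illustrated with a toy stand-in. This is not the learned autoencoder described above: the hand-written `encode` and `decode` functions below are assumptions for illustration, the vector they produce is not low-dimensional, and the lead string is an arbitrary example. The sketch only shows the interface: a sequence maps to a continuous vector, and any nearby vector decodes back to a valid sequence.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters

def encode(sequence):
    """Toy encoder: one score per (position, letter) pair, flattened into a vector."""
    vector = []
    for char in sequence:
        vector.extend(1.0 if char == letter else 0.0 for letter in ALPHABET)
    return vector

def decode(vector):
    """Toy decoder: pick the top-scoring letter at each position."""
    size = len(ALPHABET)
    chars = []
    for start in range(0, len(vector), size):
        scores = vector[start:start + size]
        chars.append(ALPHABET[scores.index(max(scores))])
    return "".join(chars)

lead = "GIGKFLHSAK"          # arbitrary example sequence (assumption)
z = encode(lead)
assert decode(z) == lead     # decoding the embedding recovers the lead

# Moving to a nearby point in the continuous space yields a variant sequence.
rng = random.Random(0)
z_nearby = [value + rng.gauss(0.0, 0.4) for value in z]
variant = decode(z_nearby)
print(lead, "->", variant)
```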
QMO uses feedback from one or more evaluators that predict the molecular properties of any variant in the search space, which could be based on simulations, informatics, experiments, or databases. Property predictions are based on the sequence of the variant rather than on its representation, which allows us to exploit already existing evaluators. We can use separate evaluators to predict multiple desired properties and to impose constraints on other properties, such as similarity to a reference sequence, whether of the lead molecule or a different molecule of interest. This allows us to optimize for multiple characteristics simultaneously.
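As a rough sketch of how sequence-level evaluators and constraints might be combined, consider the example below. Everything in it is an assumption for illustration: `toy_toxicity` is an arbitrary letter count with no chemical meaning, `similarity` uses Python's `difflib` ratio as a stand-in for a real similarity measure, and the thresholds are made up. The point is that each evaluator takes a sequence, not an embedding, and that several properties can be folded into one score.

```python
from difflib import SequenceMatcher

def similarity(sequence, reference):
    """Stand-in similarity score in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, sequence, reference).ratio()

def toy_toxicity(sequence):
    """Hypothetical black-box evaluator: fraction of 'K' and 'R' letters."""
    return sum(char in "KR" for char in sequence) / len(sequence)

def constraint_loss(sequence, reference, max_toxicity=0.2, min_similarity=0.7):
    """Sum of constraint violations; zero means every constraint is satisfied."""
    toxicity_violation = max(0.0, toy_toxicity(sequence) - max_toxicity)
    similarity_violation = max(0.0, min_similarity - similarity(sequence, reference))
    return toxicity_violation + similarity_violation

reference = "KKLLKKLLKK"     # arbitrary example lead sequence (assumption)
print(constraint_loss(reference, reference))      # violates the toxicity threshold
print(constraint_loss("AALLAALLAK", reference))   # a candidate variant
```

Because the evaluators only see decoded sequences, an existing predictor can be swapped in for `toy_toxicity` without touching the representation.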
Searching for variants with optimal properties
Having learned a representation of the search space and incorporated evaluations of the variants’ molecular properties, we now search the space for the best matches. For this, QMO uses a novel query-based guided search method based on zeroth-order optimization, a technique that steers the search using only the predicted properties of queried candidates rather than the sequence changes that lead to them. We use this technique because the actual sequence changes that create the best molecules often cannot readily be calculated, either because there are too many possibilities or because the underlying data sources do not permit such calculations. We use random neighborhood sampling to select points around the candidate embedding vectors, query the properties of the decoded sequences, select the best matches, and repeat this feedback loop until we land on the best or desired variants.
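The search loop can be sketched in a few lines under heavy simplifying assumptions. The code below uses a standard random-direction zeroth-order gradient estimate, which may differ from the authors' exact implementation. The `encode` and `decode` functions are hand-written toys rather than a learned autoencoder, `black_box_loss` is an arbitrary stand-in evaluator, and the parameter names and values (`beta`, `step_size`, `queries`) are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def encode(sequence):
    """Toy encoder: one score per (position, letter) pair."""
    return [1.0 if c == a else 0.0 for c in sequence for a in ALPHABET]

def decode(z):
    """Toy decoder: top-scoring letter at each position."""
    n = len(ALPHABET)
    return "".join(
        ALPHABET[max(range(n), key=lambda j: z[i + j])]
        for i in range(0, len(z), n)
    )

def black_box_loss(sequence):
    """Hypothetical evaluator queried on sequences only (lower is better)."""
    return sequence.count("K") / len(sequence)

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zeroth_order_search(lead, steps=30, queries=20, beta=5.0, step_size=1.0, seed=0):
    rng = random.Random(seed)
    z = encode(lead)
    dim = len(z)
    best_seq, best_loss = lead, black_box_loss(lead)
    for _ in range(steps):
        current = decode(z)
        base = black_box_loss(current)
        if base < best_loss:
            best_seq, best_loss = current, base
        grad = [0.0] * dim
        for _ in range(queries):
            # Sample a random neighbor of z and query the decoded sequence.
            u = random_unit_vector(dim, rng)
            neighbor = decode([zi + beta * ui for zi, ui in zip(z, u)])
            loss = black_box_loss(neighbor)
            if loss < best_loss:
                best_seq, best_loss = neighbor, loss
            # Accumulate the finite-difference estimate along direction u.
            scale = dim * (loss - base) / (beta * queries)
            grad = [g + scale * ui for g, ui in zip(grad, u)]
        # Step against the estimated gradient; no true gradient is ever computed.
        z = [zi - step_size * g for zi, g in zip(z, grad)]
    return best_seq, best_loss

lead = "KKLLKKLLKK"   # arbitrary example lead sequence (assumption)
best_seq, best_loss = zeroth_order_search(lead)
print(best_seq, best_loss)
```

The only information the loop uses about the evaluator is the value it returns for each queried sequence, which is why the same pattern works with simulators, databases, or other predictors that expose no gradients.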
Use cases and performance of QMO
QMO addresses two practical use cases: optimizing sequence similarity while satisfying the desired chemical properties, and optimizing chemical properties while respecting sequence similarity constraints. In the first case, QMO aims to find the variant most similar to the reference sequence that still meets the desired property requirements. Conversely, in the second, it seeks the variant with the best properties that still stays sufficiently similar to the reference. We tested QMO on four tasks: two standard small-molecule optimization benchmarks and two real-world discovery problems.
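One way to write the two use cases down is as a pair of constrained optimization problems. The notation here is ours and is an assumption, not taken from the text above: z is an embedding vector, Dec is the decoder, s_ref is the reference sequence, sim is a similarity score, f and f_i are property evaluators, and tau_i and delta are thresholds.

```latex
% Use case 1: maximize similarity subject to property constraints
\max_{z} \; \mathrm{sim}\big(\mathrm{Dec}(z),\, s_{\mathrm{ref}}\big)
\quad \text{subject to} \quad f_i\big(\mathrm{Dec}(z)\big) \ge \tau_i, \; i = 1, \dots, k

% Use case 2: maximize a property subject to a similarity constraint
\max_{z} \; f\big(\mathrm{Dec}(z)\big)
\quad \text{subject to} \quad \mathrm{sim}\big(\mathrm{Dec}(z),\, s_{\mathrm{ref}}\big) \ge \delta
```

In both forms the objective and the constraints are evaluated on the decoded sequence, while the search itself moves through the continuous embedding space.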
Optimizing drug-likeness and solubility
Finding optimized molecules that are sufficiently like a reference sequence and have either improved drug-likeness or improved solubility is a common, and relatively simple, benchmark task in molecular optimization. We used the same set of 800 molecules as starting sequences for both tasks and compared the performance of QMO with other machine learning methods. QMO outperformed them, with a success rate of almost 93% in optimizing drug-likeness, at least 15% higher than other methods, and a roughly 30% relative improvement in optimizing solubility.
These benchmarks resemble the kinds of tasks involved in optimizing materials like food ingredients, agrochemicals, pesticides, drugs, catalysts, and waste chemicals, showing there are myriad practical applications. But these tasks don’t fully capture the complexity associated with molecular properties more relevant to real-world discovery efforts, such as binding affinity and toxicity. To see how QMO might perform in more complex optimization scenarios, we addressed two new molecular optimization tasks informed by pressing real-world discovery needs.
Improving binding affinity of SARS-CoV-2 inhibitors
During the rapid spread of a new virus like SARS-CoV-2, the virus that causes COVID-19, the search for effective drugs is urgent, and optimizing lead molecules can accelerate the discovery of those drugs. In the case of SARS-CoV-2, most efforts focused on molecules targeting the main protease (Mpro) as potential drug candidates. We applied QMO to the task of optimizing 23 existing inhibitors of Mpro by improving their binding affinity while retaining a high degree of sequence similarity. Our results show that QMO can find molecules with high similarity to existing lead molecules and improved in silico binding free energy for Mpro.
Lowering toxicity of antimicrobial peptides
Pathogen resistance to existing antibiotics is increasing at an alarming rate, so discovering new antibiotics is a global health priority. Known antimicrobial peptides (AMPs) are a promising source of lead molecules. Optimal AMP design requires balancing multiple, closely interacting properties, such as potency versus toxicity. We used QMO to find variants of 150 existing toxic AMPs with lower predicted toxicity but a high sequence similarity to the lead molecules. QMO showed a high success rate, optimizing nearly 72% of the lead molecules. As a validation step, we assessed the QMO-optimized sequences with state-of-the-art toxicity predictors not used in the QMO framework. The toxicity of the optimized sequences predicted by these tools closely matches what the QMO evaluators predicted.
What we can learn from QMO
We show that QMO can efficiently identify variants of lead molecules that are optimized for a range of desired properties. In these examples, we applied QMO to organic small molecules and peptides, but the approach could also be used for inorganic materials, like metal oxides. These are often used as catalysts, conductors, and anti-corrosion coatings in nanomaterials, circuits, sensors, and fuel cells. Similarly, QMO can be easily extended to optimize macromolecules like polymers or proteins. By quickly identifying optimal candidates prior to synthesis, QMO could replace the screening step in molecular generation to fast-track the discovery and development of high-performance materials.
We’ll continue to build QMO and explore the opportunities for discovery that it offers. We hope to synthesize and test optimized variants to see how they behave. We’re also planning to expand QMO beyond functional properties, like binding affinity and toxicity, to evaluate more difficult properties, such as three-dimensional molecular structure. Finally, we’re working toward integrating expert feedback into the QMO framework to enable human-AI collaboration. You can check out our work on GitHub.
As a generic AI framework, QMO has potential well beyond molecular optimization and could be applied to pipelines targeting many other scientific processes or design problems. This makes QMO a powerful and versatile tool for accelerating discoveries in our push to tackle the tough scientific challenges.
[1] Das, P., Sercu, T., Wadhawan, K., et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat Biomed Eng 5, 613–623 (2021).
[2] Chenthamarakshan, V., Das, P., Hoffman, S., et al. CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[3] Hoffman, S.C., Chenthamarakshan, V., Wadhawan, K., et al. Optimizing molecules using efficient queries from property evaluations. Nat Mach Intell 4, 21–31 (2022).