How generative AI models can fuel scientific discovery

By using generative models to come up with new ideas, we can dramatically accelerate the discovery of new molecules, materials, drugs, and more.

How can generative models fuel scientific discovery?

Throughout history, humanity has often made progress through a combination of curiosity and creativity. When we have problems that need overcoming, we try to understand why something is the case so that we can figure out a solution.

Many scientific discoveries were made as a result of trial and error. While methodical, this process can also be painstakingly slow. And in some fields of study, the impetus for solving problems can be extremely urgent, whether that’s developing new life-saving drugs, or finding new ways to mitigate the effects of climate change. It can take a decade to discover, test, and develop a new drug. In light of new realities like the COVID-19 pandemic, this is simply not fast enough.

We need to find new ways to spur our creativity and inspiration. No one person, or even a group of people, could possibly keep up with all the latest research in their field of study, let alone remember every iota of what they’ve read over their lifetimes. This, though, is an area where AI can greatly help us.

Today, there are already systems that can ingest large volumes of data, sift through it, and help find patterns in the noise. And there are newer, emerging streams of AI research that we work on and that we believe can accelerate the pace of discovery even more. One of these areas is generative models.

Generative models are a powerful AI tool that has crossed over into popular culture in recent years. We’ve seen AI tools that can mimic the styles of master painters, videos where one actor’s face is eerily plastered onto footage of another actor, and AI systems that take a user’s prompt for a picture or a short story and generate something entirely fictional based on the request.

These are the green shoots of the potential of generative models. They are probably our most powerful tool right now for leveraging the vast troves of data in science: we can use them to come up with starting points for designing and discovering new materials and drugs, to generate new knowledge, and to create new solutions to challenging problems in climate, sustainability, healthcare, life sciences, and more.

How generative models can accelerate the scientific method

In scientific discovery, we follow the scientific method: we start with a question, study it, come up with ideas, study some more, create a hypothesis, test it, assess the results, and report back. But in any discovery application, there are reams of information to consume and understand before we can come up with an idea. Scientists can spend years working on a single question and not find an answer.

That’s partly a result of the limits in our knowledge, but it’s also because the space of possible answers is simply too large to systematically search. In just the field of drug discovery, it’s believed that there are some 10⁶³ possible drug-like molecules in the universe. Trial and error can’t possibly get us through all those combinations.
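To get a feel for that number, here is a back-of-the-envelope calculation. It is only a sketch: the screening rate of one billion molecules per second is a figure we assume for illustration and does not come from any real system, and the age of the universe is rounded.

```python
# Back-of-the-envelope estimate of an exhaustive search.
# Only SEARCH_SPACE comes from the text; the rate is an assumed, illustrative figure.
SEARCH_SPACE = 10**63              # possible drug-like molecules, as cited above
RATE_PER_SECOND = 10**9            # hypothetical: one billion molecules evaluated per second
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
AGE_OF_UNIVERSE_YEARS = 1.38e10    # approximate


def years_to_enumerate(space: int, rate_per_second: int) -> float:
    """Return the years needed to evaluate every candidate once at a fixed rate."""
    return space / (rate_per_second * SECONDS_PER_YEAR)


years = years_to_enumerate(SEARCH_SPACE, RATE_PER_SECOND)
print(f"Exhaustive search: {years:.2e} years")
print(f"That is {years / AGE_OF_UNIVERSE_YEARS:.1e} times the age of the universe")
```

Even at this generous assumed rate, the search would take on the order of 10⁴⁶ years, which is why we need ways to propose promising candidates rather than enumerate them.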

This is where generative models can be our creative aid and help us find new ideas that we might not have thought to consider before. They help us break through the bottleneck of idea generation and create new eureka moments.

All scientific discovery involves a hypothesis, and until now hypotheses have been developed exclusively by humans. But AI systems that can learn from data and make novel and valuable suggestions can greatly augment human creativity and drastically cut the time it takes to find new ideas to test.


At IBM Research, we’ve been building a body of research exploring the development and application of generative models in discovery. Specifically, we created generative model-based AI systems to design molecules for a variety of materials discovery applications.

Our team developed one family of generative model algorithms that efficiently combines conditional generative models with reinforcement learning to design ligands¹ with desired activity against specific proteins and hit-like anticancer molecules² for specific omic profiles. We showed how generative models can support the initial design phases of the material discovery process, and demonstrated how they can be combined with data-driven chemical synthesis planning to swiftly produce candidates for wet-lab experiments.

Recently, my colleagues built a generative model that can propose new antimicrobial peptides³ (AMPs) with desired properties. AMPs are viewed as a “drug of last resort” against antimicrobial resistance, one of the biggest threats to global health and food security. Our generative model identified novel candidate molecules, and a second AI system filtered them using predicted properties such as toxicity and broad-spectrum activity. In the span of a few weeks, we were able to identify several dozen novel candidate molecules — a process that can normally take years.
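The propose-then-filter pattern described above can be sketched in a few lines of Python. This is a toy illustration and not the system our colleagues built: the "generative model" here is just a random sequence sampler, the two scoring functions are made-up placeholders with no chemical validity, and every name and threshold is hypothetical.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


@dataclass(frozen=True)
class Candidate:
    sequence: str
    toxicity: float  # toy predicted score in [0, 1]
    activity: float  # toy predicted score in [0, 1]


def propose_peptides(n: int, length: int, rng: random.Random) -> list[str]:
    """Stand-in for a generative model: sample n random peptide sequences."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(n)]


def toy_toxicity(sequence: str) -> float:
    """Placeholder predictor (assumption): fraction of residues in 'FILVW'."""
    return sum(sequence.count(residue) for residue in "FILVW") / len(sequence)


def toy_activity(sequence: str) -> float:
    """Placeholder predictor (assumption): fraction of residues in 'KR'."""
    return sum(sequence.count(residue) for residue in "KR") / len(sequence)


def screen(sequences: list[str], max_toxicity: float, min_activity: float) -> list[Candidate]:
    """Score every proposal, keep those within both thresholds, best activity first."""
    scored = [Candidate(s, toy_toxicity(s), toy_activity(s)) for s in sequences]
    kept = [c for c in scored if c.toxicity <= max_toxicity and c.activity >= min_activity]
    return sorted(kept, key=lambda c: c.activity, reverse=True)


rng = random.Random(0)  # fixed seed so the run is repeatable
proposals = propose_peptides(n=1000, length=20, rng=rng)
shortlist = screen(proposals, max_toxicity=0.2, min_activity=0.2)
print(f"{len(shortlist)} of {len(proposals)} proposals pass the filter")
```

The design point is the separation of roles: one component proposes many candidates cheaply, and a second, independent component narrows them down to a shortlist worth testing in the lab.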

Similarly, another team at IBM Research used generative models, along with several other AI and high-performance computing advances, to come up with a new photoacid generator (PAG), a material key to manufacturing semiconductors. That kind of discovery usually takes years; this one was completed in weeks.

Generative models, however, don’t have to be limited to just the hypothesis step of the scientific method. In the future, they can potentially help us figure out what questions we should even be asking before we try to find answers: Given everything we know about a field, what is the next question we should ask?

We can also potentially create generative models to help with questions where we don’t know where to start, such as how to find a new antiviral for an unknown protein, or whether we could make a catalyst for CO₂ in the atmosphere. We can potentially use generative models in testing, to help us determine what conditions we need to create for the most accurate results, and we can even use them to help us refine future tests after we’ve gotten our results.

Creating a scientific community of discovery

As part of our mission to accelerate discovery for IBM and its partners, we want to foster an open community around scientific discovery. Technologies like AI should be a tool that scientists and researchers use to carry out their research more quickly and effectively, rather than something that requires very specific domain knowledge to use.

To that end, we recently launched what we’re calling the Generative Toolkit for Scientific Discovery (GT4SD). It’s an open-source library, released under the MIT license, that accelerates hypothesis generation in the scientific discovery process by easing the adoption of state-of-the-art generative models. GT4SD includes models that can generate new molecule designs based on properties like target proteins, target omics profiles, scaffold distances, binding energies, and additional targets relevant for materials and drug discovery.


The GT4SD library provides an effective environment both for generating new hypotheses (inference) and for fine-tuning generative models for specific domains using custom data sets (retraining). It’s compatible with many popular deep learning frameworks and libraries, including PyTorch, PyTorch Lightning, HuggingFace Transformers, GuacaMol, and Moses. It serves a wide range of applications, from materials science to drug discovery.

GT4SD’s common framework makes generative models easily accessible to a broad community, including AI/ML practitioners who develop new generative models and want to deploy them with just a few lines of code. It gives scientists and students interested in using generative models in their research a centralized environment for accessing and exploring a variety of pretrained models, and it provides consistent commands and interfaces for inference and retraining, with customizable parameters, across the different generative models.
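To illustrate what a consistent interface across different generative models can look like, here is a minimal sketch. It is not GT4SD’s actual API: the `Generator` protocol, the two toy generator classes, the `REGISTRY` dictionary, and the `generate` helper are all hypothetical names we made up for this example, and the "models" are simple random samplers.

```python
import random
from typing import Iterator, Protocol


class Generator(Protocol):
    """Hypothetical common interface: every model exposes sample(n)."""

    def sample(self, n: int) -> Iterator[str]: ...


class RandomPeptideGenerator:
    """Toy model: yields random peptide sequences of a fixed length."""

    def __init__(self, length: int = 10, seed: int = 0) -> None:
        self.length = length
        self.rng = random.Random(seed)

    def sample(self, n: int) -> Iterator[str]:
        for _ in range(n):
            yield "".join(self.rng.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(self.length))


class RandomAlkaneGenerator:
    """Toy model: yields SMILES strings for linear alkanes of random chain length."""

    def __init__(self, max_carbons: int = 8, seed: int = 0) -> None:
        self.max_carbons = max_carbons
        self.rng = random.Random(seed)

    def sample(self, n: int) -> Iterator[str]:
        for _ in range(n):
            yield "C" * self.rng.randint(1, self.max_carbons)


# Hypothetical registry mapping a model name to its class.
REGISTRY: dict[str, type] = {
    "peptide": RandomPeptideGenerator,
    "alkane": RandomAlkaneGenerator,
}


def generate(name: str, n: int, **params) -> list[str]:
    """Look up a model by name, configure it, and draw n samples the same way for all models."""
    model: Generator = REGISTRY[name](**params)
    return list(model.sample(n))


print(generate("peptide", 3, length=12))
print(generate("alkane", 3, max_carbons=4))
```

Because every model is reached through the same call, swapping one generative model for another changes only the name and the parameters, not the surrounding code.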

Automatic workflows that allow retraining on a user’s own data, covering molecular structures and properties, make it possible to develop problem-specific intelligence. Replacing manual processes and reducing human bias in the discovery process has important effects on applications that rely on generative models, and helps expert knowledge grow faster.

The entirety of GT4SD is available on GitHub, and we encourage you to try it out for yourself. In the near term, we plan to continue expanding the toolkit’s portfolio and to release new algorithms, frameworks, and pretrained models. It is our hope that through partnerships and tools like GT4SD, we can build an open community of discovery that together accelerates scientific discovery for urgent problems and shortens the path to solutions that impact the world.

