Generative AI refers to deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on.
Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of ChatGPT seems to mark a turning point. OpenAI’s chatbot, powered by its latest large language model, can write poems, tell jokes, and churn out essays that look like a human created them. Prompt ChatGPT with a few words, and out come love poems in the form of Yelp reviews, or song lyrics in the style of Nick Cave.
The last time generative AI loomed this large, the breakthroughs were in computer vision. Selfies transformed into Renaissance-style portraits and prematurely aged faces filled social media feeds. Five years later, it’s the leap forward in natural language processing, and the ability of large language models to riff on just about any theme, that has seized the popular imagination. And it’s not just language: Generative models can also learn the grammar of software code, molecules, natural images, and a variety of other data types.
The applications for this technology are growing every day, and we’re just starting to explore the possibilities. At IBM Research, we’re working to help our customers use generative models to write high-quality software code faster, discover new molecules, and train trustworthy conversational chatbots grounded on enterprise data. We’re even using generative AI to create synthetic data to build more robust and trustworthy AI models and to stand in for real data protected by privacy and copyright laws.
As the field continues to evolve, we thought we’d take a step back and explain what we mean by generative AI, how we got here, and how these models work.
Generative AI refers to deep-learning models that can take raw data — say, all of Wikipedia or the collected works of Rembrandt — and “learn” to generate statistically probable outputs when prompted. At a high level, generative models encode a simplified representation of their training data and draw from it to create a new work that’s similar, but not identical, to the original data.
Generative models have been used for years in statistics to analyze numerical data. The rise of deep learning, however, made it possible to extend them to images, speech, and other complex data types. Among the first classes of models to achieve this cross-over feat were variational autoencoders, or VAEs, introduced in 2013. VAEs were the first deep-learning models to be widely used for generating realistic images and speech.
“VAEs opened the floodgates to deep generative modeling by making models easier to scale,” said Akash Srivastava, an expert on generative AI at the MIT-IBM Watson AI Lab. “Much of what we think of today as generative AI started here.”
Autoencoders work by encoding unlabeled data into a compressed representation, and then decoding the data back into its original form. “Plain” autoencoders were used for a variety of purposes, including reconstructing corrupted or blurry images. Variational autoencoders added the critical ability to not just reconstruct data, but to output variations on the original data.
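To make the difference between reconstructing and generating variations concrete, here is a minimal sketch. The encoder and decoder use hand-picked, hypothetical weights rather than anything learned, and the function names are our own; the only point is that a plain autoencoder decodes a single code, while a VAE-style model samples a code near it before decoding.

```python
import math
import random

# Hypothetical, hand-picked mappings standing in for a trained model.
def encode(x):
    """Map a 2-D point to the mean and log-variance of a 1-D latent code."""
    mean = 0.5 * x[0] + 0.5 * x[1]
    log_var = -2.0  # fixed small variance, an assumption for illustration
    return mean, log_var

def decode(z):
    """Map a 1-D latent code back to a 2-D point."""
    return [z, z]

def reconstruct(x):
    """Plain autoencoder behavior: decode the mean code deterministically."""
    mean, _ = encode(x)
    return decode(mean)

def generate_variation(x, rng):
    """VAE-style behavior: sample a latent code near the mean, then decode."""
    mean, log_var = encode(x)
    std = math.exp(0.5 * log_var)
    z = mean + std * rng.gauss(0.0, 1.0)  # z = mean + std * noise
    return decode(z)

rng = random.Random(0)
point = [1.0, 3.0]
print(reconstruct(point))              # identical on every call
print(generate_variation(point, rng))  # changes from call to call
```

In a real VAE both mappings are neural networks trained jointly, and the variance is predicted per input rather than fixed.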
This ability to generate novel data ignited a rapid-fire succession of new technologies, from generative adversarial networks (GANs) to diffusion models, capable of producing ever more realistic — but fake — images. In this way, VAEs set the stage for today’s generative AI.
VAEs are built out of blocks of encoders and decoders, an architecture that also underpins today’s large language models. Encoders compress a dataset into a dense representation, arranging similar data points closer together in an abstract space. Decoders sample from this space to create something new while preserving the dataset’s most important features.
Transformers, introduced by Google in 2017 in a landmark paper “Attention Is All You Need,” combined the encoder-decoder architecture with a text-processing mechanism called attention to change how language models were trained. An encoder converts raw unannotated text into representations known as embeddings; the decoder takes these embeddings together with previous outputs of the model, and successively predicts each word in a sentence.
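The attention mechanism mentioned above can be sketched in a few lines. This is a simplified, single-head version of scaled dot-product attention written with plain Python lists; the function names are ours, and production models add learned projections, multiple heads, and masking on top of this core step.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights come from the dot product of the query with every key,
    scaled by the square root of the key dimension.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Each output row is a blend of all value vectors, so every word's representation can draw on every other word in the sentence at once.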
Through fill-in-the-blank guessing games, the encoder learns how words and sentences relate to each other, building up a powerful representation of language without anyone having to label parts of speech and other grammatical features. Transformers, in fact, can be pre-trained at the outset without a particular task in mind. Once these powerful representations are learned, the models can later be specialized — with much less data — to perform a given task.
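As a rough illustration of the fill-in-the-blank setup, the sketch below turns a raw sentence into a training example by hiding some words and recording what was hidden. The `[MASK]` placeholder, the function name, and the masking rate are assumptions made for the example, not a description of any specific model's pipeline.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed for illustration

def make_masked_example(tokens, mask_prob, rng):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each hidden
    position to the original token the model must guess.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

rng = random.Random(0)
sentence = "the model learns how words relate to each other".split()
masked, targets = make_masked_example(sentence, 0.3, rng)
print(masked)
print(targets)
```

No human labeling is involved: the raw text supplies both the question (the masked sentence) and the answer (the hidden words).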
Several innovations made this possible. Transformers processed words in a sentence all at once, allowing text to be processed in parallel, speeding up training. Earlier techniques like recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks processed words one by one. Transformers also learned the positions of words and their relationships, context that allowed them to infer meaning and disambiguate words like “it” in long sentences.
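Because all words are processed at once, the model needs some explicit signal for word order. One option, the fixed sinusoidal scheme described in the original transformer paper, is sketched below; many later models learn their position vectors instead, so treat this as one illustrative choice. The function name is ours.

```python
import math

def positional_encoding(position, d_model):
    """Return a d_model-sized vector that encodes a word's position.

    Even indices use sine and odd indices use cosine, with wavelengths
    that grow geometrically across the vector.
    """
    enc = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Each position gets a distinct vector that is added to the word's embedding.
for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, 4)])
```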
By eliminating the need to define a task upfront, transformers made it practical to pre-train language models on vast amounts of raw text, allowing them to grow dramatically in size. Previously, people gathered and labeled data to train one model on a specific task. With transformers, you could train one model on a massive amount of data and then adapt it to multiple tasks by fine-tuning it on a small amount of labeled task-specific data.
Transformers have come to be known as foundation models for their versatility. “If you wanted to improve a classifier, you used to have to feed it more labeled data,” said Srivastava. “Now, with foundation models, you can feed the model large amounts of unlabeled data to learn a representation that generalizes well to many tasks.”
Language transformers today are used for non-generative tasks like classification and entity extraction as well as generative tasks like translation, summarization, and question answering. More recently, transformers have stunned the world with their capacity to generate convincing dialogue, essays, and other content.
Language transformers fall into three main categories: encoder-only models, decoder-only models, and encoder-decoder models.
Encoder-only models like BERT power search engines and customer-service chatbots, including IBM’s Watson Assistant. Encoder-only models are widely used for non-generative tasks like classifying customer feedback and extracting information from long documents. In a project with NASA, IBM is building an encoder-only model to mine millions of earth-science journals for new knowledge.
Decoder-only models like the GPT family of models are trained to predict the next word without an encoded representation. GPT-3, at 175 billion parameters, was the largest language model of its kind when OpenAI released it in 2020. Other massive models, including Google’s PaLM (540 billion parameters) and the open-access BLOOM (176 billion parameters), have since joined the scene.
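Next-word prediction can be illustrated with a deliberately tiny stand-in. The sketch below is not a transformer: it simply counts which word follows which in a toy corpus and then generates text one word at a time, feeding each prediction back in as context. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Repeatedly append the most frequent follower of the last word."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation
        out.append(followers.most_common(1)[0][0])
    return out

corpus = "the cat sat on the cat sat".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "the", 3)))
```

A GPT-style model follows the same generate-one-word-then-repeat loop, but conditions on the whole preceding text and samples from a learned probability distribution instead of a lookup table.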
Encoder-decoder models, like Google’s Text-to-Text Transfer Transformer, or T5, combine features of both BERT and GPT-style models. They can do many of the generative tasks that decoder-only models can, but their compact size makes them faster and cheaper to tune and serve.
Generative AI and large language models have been progressing at a dizzying pace, with new models, architectures, and innovations appearing almost daily.
The ability to harness unlabeled data was the key innovation that unlocked the power of generative AI. But human supervision has recently made a comeback and is now helping to drive large language models forward. AI developers are increasingly using supervised learning to shape our interactions with generative models and their powerful embedded representations.
Instruction-tuning, introduced with Google’s FLAN series of models, has enabled generative models to move beyond simple tasks to assist in a more interactive, generalized way. Feeding the model instructions paired with responses on a wide range of topics can prime it to generate not just statistically probable text, but humanlike answers to questions like, “What is the capital of France?” or requests like, “Sort the following list of numbers.”
By carefully engineering a set of prompts — the initial inputs fed to a foundation model — the model can be customized to perform a wide range of tasks. In some cases, no labeled data is required at all. You simply ask the model to perform a task, including those it hasn’t explicitly been trained to do. This completely data-free approach is called zero-shot learning, because it requires no examples. To improve the odds the model will produce what you’re looking for, you can also provide one or more examples in what’s known as one- or few-shot learning.
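The difference between zero-shot and few-shot prompting comes down to what goes into the prompt text. The sketch below assembles both kinds; the `Input:`/`Output:` layout and the function name are assumptions chosen for the example, and real models may respond better to other formats.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a prompt from an instruction, optional examples, and a query.

    With no examples this is a zero-shot prompt; with one or more
    (input, output) pairs it becomes a one- or few-shot prompt.
    """
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

instruction = "Classify the sentiment of the review as positive or negative."
zero_shot = build_prompt(instruction, "The battery died after a week.")
few_shot = build_prompt(
    instruction,
    "The battery died after a week.",
    examples=[("Works perfectly, love it.", "positive"),
              ("Arrived broken and late.", "negative")],
)
print(few_shot)
```

The resulting string would then be sent to whichever model you are using; that call is left out because it depends on the provider.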
Zero- and few-shot learning dramatically lower the time it takes to build an AI solution, since minimal data gathering is required to get a result. But as powerful as zero- and few-shot learning are, they come with a few limitations. First, many generative models are sensitive to how their instructions are formatted, which has inspired a new AI discipline known as prompt engineering. A good instruction prompt will deliver the desired results in one or two tries, but this often comes down to placing colons and carriage returns in the right place. While effective, prompt engineering can also be fiddly. A prompt that works beautifully on one model may not transfer to other models.
Another limitation of zero- and few-shot prompting for enterprises is the difficulty of incorporating proprietary data, often a key asset. If the generative model is large, fine-tuning it on enterprise data can become prohibitively expensive. Techniques like prompt-tuning and adaptors have emerged as alternatives. They allow you to adapt the model without having to adjust its billions to trillions of parameters. They work by distilling the user’s data and target task into a small number of parameters that are inserted into a frozen large model. There, they modulate the model’s behavior without directly changing it.
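The idea of steering a frozen model with a handful of new parameters can be shown with a toy. In the sketch below, the "large model" is a hypothetical three-weight scorer that is never modified; the only thing trained is a single prompt vector prepended to the input. Real prompt-tuning works on transformer embeddings and many examples, so this is an analogy under stated assumptions, not an implementation of any published method.

```python
# Frozen "model": hypothetical fixed weights that are never updated.
FROZEN_WEIGHTS = [0.5, -1.0, 0.25]

def frozen_model(vectors):
    """Average the input vectors, then score them with the frozen weights."""
    dim = len(FROZEN_WEIGHTS)
    mean = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return sum(w * m for w, m in zip(FROZEN_WEIGHTS, mean))

def tune_soft_prompt(inputs, target, steps=200, lr=0.5):
    """Learn one prompt vector that steers the frozen model toward `target`."""
    prompt = [0.0] * len(FROZEN_WEIGHTS)  # the only trainable parameters
    n = len(inputs) + 1                   # the prompt is prepended to the inputs
    for _ in range(steps):
        error = frozen_model([prompt] + inputs) - target
        # Gradient of the squared error with respect to the prompt vector.
        grad = [2 * error * w / n for w in FROZEN_WEIGHTS]
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

inputs = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
prompt = tune_soft_prompt(inputs, target=1.0)
print(frozen_model([prompt] + inputs))
```

Only three numbers were trained here, yet the frozen model's output changed; at full scale the same trade means tuning a small set of added parameters instead of billions.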
“Parameter-efficient tuning methods allow users to have their cake and eat it too,” said David Cox, IBM director of the MIT-IBM Watson AI Lab. “You can leverage the power of a large pre-trained model with your own proprietary data. Together, prompt engineering and parameter-efficient tuning provide a powerful suite of tools for getting a model to do what you want, without spending time and money on traditional deep-learning solutions.”
Most recently, human supervision is shaping generative models by aligning their behavior with ours. Alignment refers to the idea that we can shape a generative model’s responses so that they better align with what we want to see. Reinforcement learning from human feedback (RLHF) is an alignment method popularized by OpenAI that gives models like ChatGPT their uncannily human-like conversational abilities. In RLHF, a generative model outputs a set of candidate responses that humans rate for correctness. Through reinforcement learning, the model is adjusted to output more responses like those highly rated by humans. This style of training results in an AI system that can output what humans deem high-quality conversational text.
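The rate-then-adjust loop can be sketched with a toy policy that chooses among three canned responses. The ratings are hypothetical numbers, and the update uses them directly; production RLHF pipelines typically train a separate reward model on human ratings and use a more elaborate reinforcement-learning algorithm, so this shows only the direction of the adjustment.

```python
import math

def softmax(logits):
    """Convert preference scores into response probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, ratings, lr=1.0):
    """Take one exact gradient step that raises the expected rating.

    The gradient of the expected rating with respect to logit i is
    p_i * (rating_i - expected rating).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, ratings))
    return [l + lr * p * (r - expected)
            for l, p, r in zip(logits, probs, ratings)]

ratings = [0.1, 0.9, 0.4]   # hypothetical human ratings per response
logits = [0.0, 0.0, 0.0]    # the policy starts with no preference
for _ in range(50):
    logits = policy_gradient_step(logits, ratings)
print([round(p, 3) for p in softmax(logits)])
```

Responses rated above the current average become more likely and those rated below it become less likely, which is the adjustment the paragraph above describes.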
Until recently, a dominant trend in generative AI has been scale, with larger models trained on ever-growing datasets achieving better and better results. You can now estimate how powerful a new, larger model will be based on how previous models, whether larger in size or trained on more data, have scaled. Scaling laws allow AI researchers to make reasoned guesses about how large models will perform before investing in the massive computing resources it takes to train them.
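A scaling-law estimate of this kind can be sketched as a straight-line fit in log-log space. The example assumes loss follows a pure power law in model size, which is a simplification, and the data points are synthetic values generated for the illustration rather than measurements from any real model.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size).

    Returns (a, b) such that loss is approximately a * size ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, b, size):
    """Extrapolate the fitted curve to a model size not yet trained."""
    return a * size ** (-b)

# Synthetic (parameter count, loss) pairs for three smaller models.
sizes = [1e6, 1e7, 1e8]
losses = [10 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(predict_loss(a, b, 1e9))
```

With real measurements the points will not fall exactly on a line, and the extrapolation is only as trustworthy as the assumption that the trend continues.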
On the flip side, there’s a continued interest in the emergent capabilities that arise when a model reaches a certain size. It’s not just the model’s architecture that causes these skills to emerge but its scale. Examples include glimmers of logical reasoning and the ability to follow instructions. Some labs continue to train ever larger models chasing these emergent capabilities.
Recent evidence, however, is bucking the trend toward larger models. Several research groups have shown that smaller models trained on more domain-specific data can often outperform larger, general-purpose models. Researchers at Stanford, for example, trained a relatively small model, PubMedGPT 2.7B, on biomedical abstracts and found that it could answer medical questions significantly better than a generalist model the same size. Their work suggests that smaller, domain-specialized models may be the right choice when domain-specific performance is important.
“When you want specific advice, it may be better to ask a domain expert for help rather than trying to find the single smartest person you know,” said Cox, at MIT-IBM. “Specialization also comes with other advantages; a smaller model is vastly cheaper and less carbon-intensive to run.”
The question of whether generative models will be bigger or smaller than they are today is further muddied by the emerging trend of model distillation. A group from Stanford recently tried to “distill” the capabilities of OpenAI’s large language model, GPT-3.5, into its Alpaca chatbot, built on a much smaller model. The researchers asked GPT-3.5 to generate thousands of paired instructions and responses, and through instruction-tuning, used this AI-generated data to infuse Alpaca with ChatGPT-like conversational skills. Since then, a herd of similar models with names like Vicuna and Dolly has landed on the internet.
“The Alpaca approach calls into question whether large models are essential for emergent capabilities,” said Cox. “Some models, like Dolly 2, are even skipping the distillation step and instead crowdsourcing instruction-response data directly from humans. Taken together, recent events suggest we may be entering an era where more compact models are sufficient for a wide variety of practical use cases.”
Generative AI holds enormous potential to create new capabilities and value for enterprise. However, it also can introduce new risks, be they legal, financial or reputational. Many generative models, including those powering ChatGPT, can spout information that sounds authoritative but isn’t true (sometimes called “hallucinations”) or is objectionable and biased. Generative models can also inadvertently ingest information that’s personal or copyrighted in their training data and output it later, creating unique challenges for privacy and intellectual property laws. Solving these issues is an open area of research, and something we will cover in our next blog post.