08 Nov 2023

Explainer

6 minute read

What is AI alignment?

Alignment is the process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible. Through alignment, enterprises can tailor AI models to follow their business rules and policies.

A robot shouldn’t injure a human or let them come to harm. This commonsense rule was conceived by novelist Isaac Asimov in a short story more than 80 years ago. Today, it has become a guiding principle for training our robot assistants to serve human values and goals.

Maintaining control over AI has become a popular area of research with the rise of generative AI, deep-learning models pre-trained on datasets the size of the internet to mimic the way humans communicate and create. Chatbots powered by one form of generative AI, large language models (LLMs), have stunned the world with their ability to carry on open-ended conversations and solve complex tasks. But our growing reliance on them comes with risks.

Alignment is meant to reduce these risks and ensure that our AI assistants are as helpful, truthful, and transparent as possible. Alignment tries to resolve the mismatch between an LLM’s mathematical training and the soft skills we humans expect in a conversational partner.

LLMs are essentially word-prediction engines. Ask a question, and out tumbles the answer, word after word. But for these answers to be helpful, they must not only be accurate, but also truthful, unbiased, and unlikely to cause harm. Alignment bridges this gap.

But it’s not perfect. Because human values and goals are constantly shifting, alignment is an ongoing process. It is also subjective: it involves making judgement calls about which values take precedence. Ask a chatbot how to build a bomb, and it can respond with a helpful list of instructions or a polite refusal to disclose dangerous information. Its response depends on how it was aligned by its creators.

“Alignment is more than just tuning the model to solve a task,” said Akash Srivastava, an AI researcher who leads the alignment team at IBM Research. “It ensures that the model does what you want. There’s no clear objective function for safety and values, which is why alignment is such a hard problem.”

Imitation learning

Alignment happens during fine-tuning, when a foundation model is fed examples of the target task, whether that’s summarizing legal opinions, classifying spam, or answering customer queries.

Alignment typically involves two steps. In the instruction-tuning phase, the LLM is given examples of the target task so it can learn by example. In the critique phase, a human or another AI interacts with the model and grades its responses in real time. If reinforcement learning (RL) is used to incorporate these preferences back into the model, this step is called RL with human feedback (RLHF) or RL with AI feedback (RLAIF).

During instruction-tuning, sample queries like “write a report” are paired with actual reports to show the LLM varied examples. It’s also taught to ask clarifying questions like, “On what topic?” From tens of thousands of dialogue pairs, the LLM learns how to apply knowledge baked into its parameters to new scenarios.
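As a rough illustration of what one of these dialogue pairs might look like as training text, here is a minimal sketch. The `InstructionExample` class, the `to_training_text` function, and the "### Instruction / ### Response" template are all assumptions made for this example; the article does not describe the format IBM actually uses.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """One query paired with the response the model should learn to give."""
    query: str
    response: str


def to_training_text(example: InstructionExample) -> str:
    # Hypothetical template: real pipelines use whatever prompt format
    # the model being tuned expects.
    return (
        f"### Instruction:\n{example.query.strip()}\n\n"
        f"### Response:\n{example.response.strip()}"
    )


# Placeholder pairs echoing the article's examples: an underspecified
# request gets a clarifying question, a specific one gets the report.
examples = [
    InstructionExample("Write a report.", "On what topic?"),
    InstructionExample("Write a report on the attached survey.", "[report text]"),
]

for ex in examples:
    print(to_training_text(ex))
    print("---")
```

Each formatted string would then serve as one supervised fine-tuning example, with the model trained to produce the text that follows the response marker.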

Once the LLM has learned to write reports, it gets fine-grained feedback on its work. For each query, the model outputs two responses. An evaluator, either a human or another LLM, picks the better one. These preference judgments are then used to train a reward model, which learns to score responses the way the evaluator would. The learned preferences are then typically transferred to the LLM through an RL algorithm known as proximal policy optimization (PPO).
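One common way to train a reward model from pairwise choices like these is a Bradley-Terry style loss, which penalizes the reward model whenever it scores the rejected response above the chosen one. The article does not say which loss is used in practice, so the sketch below is an assumption, and `pairwise_preference_loss` is a hypothetical name.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the chosen response is scored well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model agrees with the evaluator: low loss.
print(pairwise_preference_loss(2.0, 0.0))
# The reward model disagrees with the evaluator: high loss.
print(pairwise_preference_loss(0.0, 2.0))
```

In a real system the two rewards would come from a neural network and the loss would be averaged over many preference pairs and minimized by gradient descent; the scalar version here only shows the shape of the objective.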

High-quality data is critical to both steps. This is why IBM Research has focused on automating the creation of instruction data to lower the costs of aligning and customizing enterprise chatbots. IBM has integrated three key innovations into its “Granite” models available on watsonx, IBM’s AI and data platform for business. “You can explain what tone you’re looking for, then align your model to match it,” said David Cox, VP for AI models at IBM Research. “If you’re selling entertainment products, you might want a bubbly, lively chatbot — but if you’re an insurance company, and most of your interactions are with customers that have suffered a loss, you want a chatbot that’s serious and empathetic.”

Synthetic data for low-cost, personalized alignment

Garbage in, garbage out: It’s an adage that’s fitting in the field of AI. It speaks to the importance of training AI models on safe, quality data, and it’s as true for alignment as it is for pre-training. OpenAI’s ChatGPT performs as well as it does because it was trained on tons of human-labeled instructions and feedback. It was further improved by millions of people playing with it online.

Meta’s popular Llama 2 models were also tuned on human-labeled data: 28,000 demonstrations and 1.4 million preference examples. Available on Hugging Face (and soon, watsonx), the Llama models can be customized by companies to create their own chatbots.

But there’s a faster way to create instruction data: ask an LLM. IBM has been developing techniques for using open-source LLMs to generate high-quality synthetic data. This allows IBM and others to customize their own proprietary chatbots.

Synthetic data has some key advantages. Language models can crank out tons of dialogue data instantly. And the data can be tailored to the task at hand and infused with personalized values. Ultimately, synthetic data can lead to models that are better aligned, at lower cost.

“Companies can encode their corporate principles, cultural values, and different geographies and have a model that aligns to their business needs,” said Cox. “It’s like choose-your-own-adventure alignment. You can tune the model for your own purposes.”

Toward LLMs that align themselves

IBM is using three methods for generating artificial alignment data to tune its Granite models.

The first, contrastive fine-tuning (CFT), shows the LLM what not to do, reinforcing its ability to solve the task. Contrasting pairs of instructions are created by training a second, ‘negative persona’ LLM to generate toxic, biased, and inaccurate responses. These misaligned responses are then fed, with the matching aligned responses, back to the original model.

IBM researchers found that LLMs trained on contrasting examples outperform models tuned on good examples only, on benchmarks for helpfulness and harmlessness. And the LLMs do this without sacrificing accuracy. The benefit of contrastive tuning, said Srivastava, is that it allows you to accomplish more alignment before collecting human preference data, which is time-consuming and expensive.

IBM’s second data-generation method, called Forca (a portmanteau of Falcon and Orca), is also aimed at getting more mileage out of instruction-tuning. Inspired by Microsoft Research’s Orca method, IBM researchers used an LLM to rewrite the responses of Google’s FLAN open-source dialogue dataset. Microsoft used Orca and a proprietary GPT-4 model to rewrite FLAN; IBM used an open-source Falcon model instead and “forcafied” several datasets in addition to FLAN.

Under Forca, terse responses are turned into detailed explanations tailored to a task-specific template. The answer to a word problem, for example, would include the reasoning steps to get there. For a coding task, the response would include comments on what each block of code does. Forca also produces misaligned responses for contrastive tuning. IBM researchers generated 800,000 pairs of high-quality instructions this way and selected 435,000 using Falcon to filter the responses according to self-defined principles.

A third IBM method, called Salmon, is aimed at generating synthetic preference data so that a chatbot can essentially align itself. Prompted with a set of queries, the LLM generates responses that are fed to a reward model programmed to evaluate its writing according to a set of rules, such as: do use clear, creative, and vivid language; don’t use biased or discriminatory language.

The reward model upvotes or downvotes each AI-generated response by these rules. The ranked examples are then fed back to the original LLM using the PPO algorithm. Through Salmon, enterprises can imprint their own goals and values on their chatbots.
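The upvote/downvote step can be pictured with a toy stand-in. In the sketch below, hand-written checks take the place of the learned reward model, and the `Principle` class, the `FLAGGED_TERMS` set, and both example rules are hypothetical placeholders rather than anything taken from Salmon itself. The sketch also stops at ranking and does not include the PPO update.

```python
from typing import Callable, List, NamedTuple


class Principle(NamedTuple):
    """A rule plus a check that returns True when a response follows it."""
    description: str
    satisfied: Callable[[str], bool]


# Placeholder terms for illustration only; a real system would rely on a
# learned model rather than a word list.
FLAGGED_TERMS = {"obviously", "everyone knows"}

PRINCIPLES = [
    Principle(
        "Do give a substantive answer (toy proxy: at least 5 words)",
        lambda r: len(r.split()) >= 5,
    ),
    Principle(
        "Don't use flagged terms",
        lambda r: not any(term in r.lower() for term in FLAGGED_TERMS),
    ),
]


def vote(response: str, principles: List[Principle]) -> int:
    """Add 1 for each principle followed and subtract 1 for each one broken."""
    return sum(1 if p.satisfied(response) else -1 for p in principles)


def rank_responses(responses: List[str], principles: List[Principle]) -> List[str]:
    """Order responses from most upvoted to most downvoted."""
    return sorted(responses, key=lambda r: vote(r, principles), reverse=True)


candidates = [
    "Obviously fine.",
    "The report covers sales trends across three regions.",
]
for response in rank_responses(candidates, PRINCIPLES):
    print(vote(response, PRINCIPLES), response)
```

Swapping in a different list of principles changes the ranking without any new labeled data, which is the property the article attributes to the method.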

“IBM models have been aligned to avoid controversial topics, but another enterprise may have a different standard,” said IBM’s Yikang Shen, who co-developed the method. “You can shift the principles to what your company needs. You can also save money by doing away with labeled data.”

The surprising versatility of instruction data

Instruction data can serve many purposes. IBM has applied synthetic instruction data to making LLMs safer, crafting examples for the model to both mimic and avoid. IBM researchers recently combed the social science literature for stigmas in American culture, things like being voluntarily childless, living in a trailer park, or having facial scars.

They then wrote questions hinging on whether to engage with a stigmatized individual in more than two dozen hypothetical scenarios. A pair of LLMs generated 124,000 responses, some of which were used to tune IBM’s Granite models. The team is now working on additional templates to mitigate other risks and biases.

Instruction data can also be used to coax expert knowledge from a pre-trained LLM without having to tune it on data labeled by specialists. Expert knowledge is often baked into a pre-trained model, but because it’s unlabeled, finding it can be difficult.

Using specialized instructions, written by the model itself, IBM researchers show that this buried knowledge can be resurfaced. They recently had an LLM generate 5,000 instructions for solving various biomedical tasks based on a few dozen examples. They then loaded this expert knowledge into an in-memory module for the model to reference when asked, leading to substantial improvement on biomedical tasks at inference time, they found.

“With hardly any labeled data at all, you can specialize your LLM,” said IBM’s Leonid Karlinsky, who co-authored the work.

IBM researchers are also exploring the use of code to nudge LLMs toward more human-like, step-by-step reasoning. In an upcoming study at the natural-language processing conference EMNLP, researchers show that prompting an LLM with synthetic code and code-like text can improve performance by as much as 38% on a wide variety of natural-language tasks over LLMs prompted with natural language only.

Both code and the comments that explain it tend to be highly logical, the researchers explained. Computer programs follow a clear chain of reasoning as they set about solving a task. This is in sharp contrast to natural language, where the meaning of words is often ambiguous and context-dependent.

If an LLM is exposed to more code, can it learn to be more logical? “These results open up many new directions,” said IBM’s Mayank Mishra, who co-authored the work.
