
Serving customized AI models at scale with LoRA

Low-rank adaptation (LoRA) is a faster, cheaper way of turning LLMs and other foundation models into specialists. IBM Research is innovating with LoRAs to make AI models easier to customize and serve at scale.

Before a foundation model is ready to take on real-world problems, it’s typically fine-tuned on specialized data and its billions, or trillions, of weights are recalculated. This style of conventional fine-tuning is slow, expensive, and ends up producing more bespoke models than can practically be served.

Low-rank adaptation (LoRA) is a quicker solution. With LoRA, you fine-tune a small subset of the base model’s weights, creating a plug-in module that gives the model expertise in, for example, biology or mathematical reasoning at inference time. Colloquially, this module is also called a “LoRA.” Like custom bits for a multi-head screwdriver, LoRAs can be swapped in and out of the base model to give it specialized capabilities.
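
To make the plug-in idea concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer, an illustration of the general technique rather than IBM's or Microsoft's implementation: the pretrained weight is frozen, and two small low-rank matrices, A and B, carry the new specialization.

```python
# Minimal LoRA layer sketch (illustrative only). The frozen base weight stays
# untouched; only the small low-rank factors A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank "plug-in" update: x W^T + scale * (x A^T) B^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```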

“Adding on a LoRA to your lovingly crafted, general-purpose model can make it single-mindedly good at, say, analyzing legal documents, without the computational costs of full fine-tuning,” says David Cox, VP of AI models at IBM Research.

By detaching model updates from the model itself, LoRA has become the most popular of the parameter-efficient fine-tuning (PEFT) methods to emerge with generative AI. It has several advantages.

LoRA makes model customization economical. A full fine-tune recalculates every one of the model’s weights, which requires intensive computation and a lot of memory. A LoRA fine-tune, by contrast, trains a new set of low-rank weights amounting to less than 1% of the model’s parameters, typically without degrading its performance.
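
The arithmetic behind that figure is easy to check. For a single, hypothetical 4,096 x 4,096 projection matrix and a rank-8 adapter, the adapter amounts to well under 1% of the layer's weights:

```python
# Back-of-the-envelope parameter count (hypothetical layer size, for illustration).
d, r = 4096, 8
full_params = d * d        # weights recalculated by full fine-tuning: 16,777,216
lora_params = 2 * d * r    # weights in a rank-r adapter (A is r x d, B is d x r): 65,536
print(f"LoRA trains {lora_params / full_params:.2%} of this layer's weights")  # ~0.39%
```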

The LoRA approach also makes it easier to add new skills and knowledge without overwriting what the model previously learned, sidestepping a failure mode known as catastrophic forgetting. In other words, LoRA offers a way to inject new information into a model without sacrificing what it already does well.

But perhaps its most powerful benefit comes at inference time. Loading LoRA updates on and off a base model with the help of additional optimization techniques can be much faster than switching out fully tuned models. With LoRA, hundreds of customized models or more can be served to customers in the time it would take to serve one fully fine-tuned model.
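
One way to picture that swapping, sketched below with hypothetical helper functions: merging folds the low-rank product into the base weight in place, and unmerging subtracts it back out, so the multi-gigabyte base model never has to be reloaded when a different specialization is needed.

```python
# Hot-swapping adapters on a resident base model (hypothetical helpers, not a
# production serving stack). Shapes: W (d_out, d_in), B (d_out, r), A (r, d_in).
import torch

def merge(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> None:
    W.add_(scale * (B @ A))    # W <- W + scale * B A: the adapter is "baked in"

def unmerge(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor, scale: float) -> None:
    W.sub_(scale * (B @ A))    # restore the original base weight

# Switching a layer from a "legal" specialist to a "biology" specialist:
#   unmerge(W, A_legal, B_legal, s); merge(W, A_bio, B_bio, s)
```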

When Microsoft researchers introduced the LoRA concept in 2021, language models were their target. Since then, LoRA has spread to image generation and voice recognition. It has also spawned an alphabet soup of add-ons, from O-LoRA, for improved continual learning, to QLoRA, for speeding up training through quantization.

Tens of thousands of LoRAs are available on Hugging Face alone, all ready to be loaded up, like favorite ice cream toppings, on the models they were trained for. “LoRA has democratized LLM training by giving more people the ability to fine-tune larger models,” says Alan Ritter, an associate professor at Georgia Tech who has collaborated with the MIT-IBM Watson AI Lab on LoRA-related work.

Boosting LLM throughput

In LoRA’s original design, updates were merged with the base model, allowing you to serve a customized model quickly. With multiple models to serve, however, both latency and throughput suffered. With the rise of commercial fine-tuning services, throughput has become a top concern for companies trying to lower the cost of hosting and deploying customized AI models.

Earlier this year, researchers at UC Berkeley introduced a system called S-LoRA that loads only the base model on an inference engine while dynamically swapping LoRAs in and out. The new system design made it possible to serve one thousand LoRAs on a single GPU. But with the number of LoRAs increasing, researchers have looked for other efficiencies.
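
The serving pattern S-LoRA popularized can be sketched roughly as follows (a simplified illustration, not the S-LoRA code, which relies on custom GPU kernels and careful memory paging): the expensive base projection runs once for the whole batch, and each request's own low-rank adapter is applied on top, so nothing ever has to be merged into the base weights.

```python
# Simplified multi-adapter serving sketch: one shared base projection per batch,
# plus a per-request low-rank correction (hypothetical, loop-based illustration).
import torch

def batched_lora_forward(x, W, adapters, adapter_ids, scale=1.0):
    # x: (batch, d_in), W: (d_out, d_in)
    # adapters: dict mapping adapter id -> (A: (r, d_in), B: (d_out, r))
    # adapter_ids: one adapter id per request in the batch
    y = x @ W.T                                # shared base computation
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        y[i] += scale * (x[i] @ A.T @ B.T)     # request-specific low-rank correction
    return y
```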

“Although one LoRA is only 1% of the model weights, 1,000 LoRAs is 10 times the model’s size,” says Mikhail Yurochkin, an IBM researcher who heads a team focused on operationalizing LLMs.

One solution proposed by IBM and MIT researchers is to group similar LoRAs together, and consolidate knowledge and skills within each cluster, before compressing the collection. Each LoRA is then quickly reassembled at inference time.
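
One simplified way to picture that shared structure (an SVD-based illustration of the general idea, not necessarily the exact algorithm in the paper): stack the adapters' weight updates, extract a basis they share, and store each LoRA only as a small coefficient matrix in that basis. Reassembly is then a single matrix product per adapter.

```python
# Illustrative compression of a LoRA collection via a shared basis (not the
# paper's exact method). Each delta_i = B_i @ A_i has shape (d_out, d_in).
import torch

def compress(deltas, k):
    stacked = torch.cat(deltas, dim=1)                      # (d_out, n * d_in)
    U, _, _ = torch.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :k]                                          # shared basis across all LoRAs
    coeffs = [U_k.T @ d for d in deltas]                    # small per-LoRA coefficients (k, d_in)
    return U_k, coeffs

def reassemble(U_k, coeff):
    return U_k @ coeff                                      # approximate delta at inference time
```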

The team incorporated their method into the open-source inference engine vLLM and showed that more than 1,000 LoRAs could be served in the time it previously took to serve just one. The compression process itself also seems to give the LoRAs a performance boost, they report in a preprint study on arXiv.

“Many tasks rely on shared parameters linked to underlying skills,” said the study’s lead author, Rickard Brüel Gabrielsson, a PhD student at MIT. “By exploiting this shared structure, we can significantly reduce the number of required LoRA parameters.”

Migrating old LoRAs to new LLMs

For all their flexibility, LoRAs are compatible only with the base model they were trained for. A LoRA customized for a Llama-2 model won’t work with an IBM Granite model, or even a newer Llama-3 model. Like many phone chargers, LoRAs are model-specific.

For commercial cloud providers, the task of migrating tens of thousands of LoRAs trained on proprietary data is impractical if not impossible. In search of a solution, IBM and MIT researchers recently proposed Trans-LoRA, a nearly data-free method for moving LoRAs to new versions of a model within or across families.

Trans-LoRA works by using the new model to generate a synthetic data “curriculum” for the old LoRA, then filtering that data so it approximates the statistical distribution of the old LoRA’s training data. The researchers showed that Trans-LoRA could move adapters across the Llama and Gemma model families, and that it also works with other PEFT methods, such as prompt tuning, all without degrading performance.
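
In rough pseudocode, that pipeline might look something like the sketch below; all of the helper names (generate_prompts, Discriminator, train_lora, and so on) are hypothetical placeholders rather than a real API.

```python
# High-level sketch of a Trans-LoRA-style transfer (hypothetical helpers throughout).

def transfer_lora(old_base, old_lora, new_base, seed_examples, n_synth=10_000):
    # 1. Generate a synthetic "curriculum" of task prompts with the new model.
    synthetic = generate_prompts(new_base, seed_examples, n=n_synth)

    # 2. Keep only samples that resemble the old LoRA's training distribution,
    #    using a discriminator fit on the few available seed examples.
    disc = Discriminator.fit(seed_examples)
    curriculum = [s for s in synthetic if disc.looks_in_distribution(s)]

    # 3. Label the curriculum with the old base model plus its LoRA (the teacher),
    #    then train a fresh LoRA for the new base model on those outputs.
    labeled = [(s, old_base.with_adapter(old_lora).generate(s)) for s in curriculum]
    return train_lora(new_base, labeled)
```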

Restoring safety guardrails

No matter what method you use to customize your model, introducing new data can throw off its safety alignment and open the model to malicious attacks. IBM researcher Pin-Yu Chen, an expert on red-teaming for generative AI, alerted the world to the security threats posed by fine-tuning in an ICLR 2024 study earlier this year.

In follow-up work at NeurIPS this December, Chen and colleagues will present a safety patch called Safe LoRA that works for fully fine-tuned models as well. With no additional data or training and just one line of code, the patch offers significant protection. In tests the team ran on a Llama-7b Chat model, Safe LoRA thwarted 93% of attacks, compared with 13% without the patch.
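
The paper should be consulted for the actual construction, but the flavor of a projection-style patch can be sketched in a few lines: the fine-tuning update for a layer is projected onto a subspace derived from the difference between the aligned and unaligned base weights, keeping only the part of the update compatible with those alignment directions. The code below is a generic, simplified illustration, not Safe LoRA's exact formula.

```python
# Generic projection-style safety patch sketch (illustrative; not the exact
# Safe LoRA construction). delta_W is one layer's fine-tuning update.
import torch

def safety_project(delta_W, W_aligned, W_unaligned, k=64):
    V = W_aligned - W_unaligned                    # weight-space directions tied to alignment
    U, _, _ = torch.linalg.svd(V, full_matrices=False)
    U_k = U[:, :k]                                 # top-k alignment directions
    return U_k @ (U_k.T @ delta_W)                 # keep the alignment-compatible component
```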

What’s next

Though LoRA is three years old, its story continues to evolve. IBM will soon release a set of LoRAs for its new Granite 3.0 models aimed at reducing hallucinations and estimating the accuracy of the models’ answers. Internally, researchers also used LoRA as a vehicle for testing whether synthetic data improved Granite’s performance before incorporating the data into the models.

As enterprises experiment with smaller, more specialized models, IBM is exploring whether lightweight LoRA adapters can help to improve performance, including for multi-agent systems.

“We could have a ‘menu’ of LoRA-customized LLMs for various roles, or create LoRAs on the fly using synthetic data,” says Yurochkin. “I think what’s next is a flexible system with an LLM and multiple LoRAs that can be tailored to enterprise use cases, including powerful AI agents.”