How memory augmentation can improve large language model efficiency and flexibility

Generative AI models struggle to keep up with long context, but IBM Research is working on creative strategies to reduce their memory footprint to make them more accurate while requiring fewer computing resources.

Memory capacity is a persistent issue for large language models. They can struggle with long input sequences because of the high memory cost of processing them, and their training data can quickly become obsolete as the world turns and new information comes to light. To address these problems, scientists at IBM Research are working on a range of strategies to ease LLMs' memory issues without weighing the models down or rebuilding them from the ground up. The goal is to reduce the computing resources required for AI inference while also improving the accuracy of the content these models generate.

In their efforts, scientists are taking cues from human psychology and neuroscience, modeling certain aspects of our own memory in computer code. Even though LLMs can produce text that makes it seem like they’re thinking, they don’t think in the way people do. Humans have a remarkable ability to remember things over a very long time, whether that information is the name of a childhood friend, a random historical fact, or the lyrics to an old favorite song. We also have short-term or working memory that can be quite long — even after talking with someone for many minutes or even hours, we can usually still keep track of everything that’s already been said. Context windows can function as a kind of working memory, but LLMs lack long-term memory, and the transformer architectures that underlie LLMs struggle to keep things straight when dealing with long input sequences.

Teams at IBM Research are working on different approaches to augment LLMs with long-term memory, developing innovative ways to boost the memory capacity of these models without the need to retrain them — a costly and time-consuming process. And there are added benefits to memory augmentation. It isn’t just good for increasing the efficiency of transformers, it can also be used for editing and fact-checking the content that generative AI models produce.

One of these approaches, called CAMELoT (Consolidated Associative Memory Enhanced Long Transformer), proposed an associative memory module that can be plugged into a pre-trained LLM to help it handle longer context. Another, called Larimar, uses a memory module that’s coupled to the LLM and can be quickly and easily updated to add or forget facts.

A major reason LLMs suffer from high memory and computational costs is the self-attention mechanism that characterizes transformers — the neural network architecture underlying many generative AI models. Self-attention makes transformers inefficient, and the cost grows steeply with the amount of content they're asked to remember. “As the input length increases, the computational cost of self-attention grows quadratically,” says IBM Research scientist Rogerio Feris, part of the team behind CAMELoT. More broadly, even though LLMs are powerful for prediction tasks, they are not adaptable, says IBM Research scientist Payel Das, part of the Larimar team.
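
To see where that quadratic cost comes from, here is a minimal sketch of single-head self-attention in NumPy; the function name, random inputs, and toy dimensions are illustrative assumptions, not any particular model's configuration. The key object is the pairwise score matrix, whose size grows with the square of the input length.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over n token embeddings.

    The score matrix q @ k.T has shape (n, n), so the compute and memory
    for this step grow quadratically with the input length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (n, n) pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                              # (n, d) contextualized outputs

# Toy dimensions chosen only for illustration: doubling n quadruples
# the size of the (n, n) score matrix.
n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # (1024, 64)
```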

“We want to be able to remember information from long inputs,” Feris says. To this end, the CAMELoT team prioritized three crucial properties taken from neuroscience, all of which can also hold true for LLMs: consolidation, novelty, and recency.

Consolidation means that information needs to be compressed so it can be stored. This is true in the brain, which has a finite number of neurons, and it's true in an LLM, which would be unwieldy if it stored every bit of data without compressing it. In practice, when a new token enters the LLM's memory, it is merged with similar tokens that are already stored. Novelty means that when an incoming token represents a concept different from the existing ones, a new memory slot should be allocated for it. And recency means that if all the memory is filled up, the oldest slot gets replaced when a token with a new concept comes in.
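
As a rough illustration of how those three rules might interact, the toy sketch below keeps a small bank of memory slots and applies consolidation, novelty, and recency in turn. It is not CAMELoT's implementation; the class name, slot count, similarity threshold, and merging rule are assumptions chosen only to make the idea concrete.

```python
import numpy as np

class ToyAssociativeMemory:
    """Illustrative slot memory following the three rules above (not CAMELoT's code).

    - consolidation: a token close to an existing slot is averaged into it
    - novelty: a token unlike every stored slot gets its own new slot
    - recency: when memory is full, the least recently written slot is replaced
    """

    def __init__(self, num_slots=4, threshold=0.7):
        self.num_slots = num_slots       # memory capacity (assumption)
        self.threshold = threshold       # similarity cutoff (assumption)
        self.slots, self.counts, self.last_written = [], [], []

    def write(self, token, step):
        token = token / (np.linalg.norm(token) + 1e-8)
        if self.slots:
            sims = [float(token @ s) / (np.linalg.norm(s) + 1e-8) for s in self.slots]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Consolidation: merge into the closest slot as a running average.
                c = self.counts[best]
                self.slots[best] = (self.slots[best] * c + token) / (c + 1)
                self.counts[best] += 1
                self.last_written[best] = step
                return
        if len(self.slots) < self.num_slots:
            # Novelty: a sufficiently different token is given a fresh slot.
            self.slots.append(token)
            self.counts.append(1)
            self.last_written.append(step)
        else:
            # Recency: memory is full, so overwrite the oldest slot.
            oldest = int(np.argmin(self.last_written))
            self.slots[oldest], self.counts[oldest] = token, 1
            self.last_written[oldest] = step

# Feed a stream of hypothetical token embeddings into the memory.
rng = np.random.default_rng(0)
memory = ToyAssociativeMemory()
for step, token in enumerate(rng.standard_normal((20, 8))):
    memory.write(token, step)
print(len(memory.slots))   # stays at most num_slots, however long the stream gets
```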

CAMELoT does all three of these things: it consolidates related tokens into an averaged representation, recognizes when a new concept appears and needs its own slot, and replaces unused slots as new tokens come in. This lets an LLM draw on more information from its memory, beyond the context that's fed into it with a prompt. In their experiments, when CAMELoT was coupled to a pre-trained Llama 2-7b model, it reduced perplexity (a measure of prediction accuracy for which a lower score is better) by up to 30% compared with the base model. They also found that CAMELoT coupled with the Llama model could achieve the same level of accuracy as the base model with a much shorter input.
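
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to the correct next tokens, so better predictions mean a lower score. The numbers below are made up purely to show the calculation.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity is exp(mean negative log-probability of each observed token)."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

# Hypothetical probabilities a model assigned to the correct next tokens.
confident = [0.60, 0.55, 0.70, 0.65]    # good predictions -> low perplexity (~1.6)
uncertain = [0.10, 0.08, 0.12, 0.09]    # surprised model  -> high perplexity (~10.4)
print(perplexity(confident), perplexity(uncertain))
```

CAMELoT's longer effective context at the same accuracy is also what delivers the downstream benefits described next.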

This brings downstream benefits, too, says Zexue He, who worked on the project during an IBM Ph.D. fellowship and is now an IBM Research scientist. “For example, if you can provide a longer history between a user and the chatbot, the language model behind the chatbot can better understand the intent of the user,” she says. And when a user is asking a chatbot to summarize or analyze documents, CAMELoT will enable them to input more or longer documents, leading to greater user satisfaction, says He.

“The LLM landscape is constantly shifting, with new models emerging all the time,” Feris says. “The idea here is to get any off-the-shelf LLM and very quickly increase its context length, potentially to infinite context, without having to retrain it.” This can help not just with reducing perplexity, but also with improving in-context learning. Some major advantages, says IBM Research collaborator Dmitry Krotov, are that CAMELoT doesn’t rely on any specific LLM structure and that it’s lightweight. This means it could be used for edge applications where storage space is an issue, he adds, enabling an LLM to work more efficiently.

In another approach, Larimar adds an adaptable external episodic memory to LLMs. Das and her team embarked on this project to help address issues like training data leakage and memorization, where a model can accidentally regurgitate sensitive, outdated, or incorrect text verbatim from its training data instead of generating new text that follows the context.

An LLM has a form of long-term memory: it can call upon all the data it has seen during training. “But what it does not have is episodic memory, which is more contextual memory that can be rewritten and forgotten in seconds,” says Das. “It can allow us to do regulation of behavior and context-specific governance in real time.” Drawing on lessons from neuroscience, Das and her team wanted to add a form of short-term memory to LLMs. This led to the creation of Larimar.

When added to an LLM, Larimar works as an episodic memory controller, and it provides a robust mechanism for fast and distributed learning of contextual information. Through end-to-end training of the decoder with an episodic memory module, the team enabled the model to learn a differentiable attention over the readout from the memory. If a conventional LLM is like the brain’s neocortex, which learns slowly and holds memories for a long time, then Larimar is like the hippocampus, which holds short-term memories that can later be consolidated into long-term memory. In this way, new facts and edits can be fed into the LLM through a one-shot gradient-free update of the episodic memory during inference.
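
The sketch below shows one way such a one-shot, gradient-free update can work in principle: encoded facts are written into a small value matrix by solving a least-squares problem rather than by fine-tuning the model, and reading attends over a set of fixed slot keys. It is a simplified illustration, not Larimar's actual implementation; the function names, random slot keys, and dimensions are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
slots, dim = 8, 16                               # memory size (assumption)
slot_keys = rng.standard_normal((slots, dim))    # fixed keys used for addressing

def write_memory(fact_encodings):
    """One-shot, gradient-free write: solve a least-squares problem so that
    attending over the slot keys reproduces the fact encodings (no backprop)."""
    w = softmax(fact_encodings @ slot_keys.T)            # (n_facts, slots)
    values, *_ = np.linalg.lstsq(w, fact_encodings, rcond=None)
    return values                                        # (slots, dim)

def read_memory(query_encoding, values):
    """Read by attending over the slot keys and mixing the stored values."""
    w = softmax(query_encoding @ slot_keys.T)
    return w @ values

# Hypothetical encodings standing in for an encoder's output for three new facts.
facts = rng.standard_normal((3, dim))
values = write_memory(facts)
recalled = read_memory(facts[1], values)
print(np.abs(recalled - facts[1]).max())   # close to zero: the stored fact is recoverable
```

The point of the least-squares write is that adding or changing a fact touches only this small memory, not the model's weights, which is what makes the update fast and reversible.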

In experiments, Das and her colleagues found that one-shot updates to LLM memory can be done cheaply, quickly, and accurately during inference, whether for one sentence at a time, for several sentences in sequence, or in larger batches. Attending to the episodic memory allows precise and accurate editing of the LLM’s knowledge while resulting in less hallucination. They found that after updates, the model could handle paraphrased versions of the new knowledge, and that the edits did not affect the accuracy of knowledge they did not intend to change. “We have also shown that the same memory can be used for fact forgetting or censoring in a selective manner,” she says. Larimar-augmented models are therefore much less likely to leak sensitive information, even in attack tests designed to make them do so.

Larimar also helps with something called context length generalization, which describes a model’s ability to process prompts that are much longer than what it was trained on. This is a problem that arises when a model has been trained on datasets that mostly include short context instances, but then it’s suddenly fed a prompt that includes thousands of words of context. An alternative — and more compute-intensive — solution to this is to fine-tune a model on data with the new context length, but that can lead to overfitting, where the model can’t generalize to information that’s not in its training data.

Das and her team presented the Larimar architecture at the International Conference on Machine Learning (ICML) this summer. They have also showcased Larimar’s benefits for context length generalization and hallucination mitigation in two ICML workshops, namely the Next Generation of Sequence Modeling Architectures workshop and the LLMs and Cognition workshop. To follow up on this work, they’re testing how Larimar can help an LLM improve its reasoning and planning skills. CAMELoT was part of the ICML Long-Context Foundation Models workshop. Krotov says he and his colleagues plan to continue investigating how memory models can be used to improve the behavior of LLMs, including by reducing hallucinations.
