WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models
Abstract
The need for effective unlearning mechanisms in large language models (LLMs) is increasingly urgent, driven by the necessity to adhere to data regulations and foster ethical generative AI practices. LLM unlearning is designed to reduce the impact of undesirable data influences and associated model capabilities without diminishing the model's utility on tasks unrelated to the information being forgotten. Despite growing interest, much of the existing research has focused on varied unlearning method designs to boost effectiveness and efficiency. However, the inherent relationship between model weights and LLM unlearning has not been extensively examined. In this paper, we systematically explore how model weights interact with unlearning processes in LLMs, and we design the weight attribution-guided LLM unlearning method, WAGLE, which unveils the interconnections between the 'influence' of weights and the 'influence' of data to forget and retain in LLM generation. By strategically guiding LLM unlearning across different types of unlearning methods and tasks, WAGLE can erase the undesired content while maintaining the performance of the original tasks. Our extensive experiments show that WAGLE boosts unlearning performance across a range of LLM unlearning methods such as gradient difference and (negative) preference optimization, applications such as fictitious unlearning (the TOFU benchmark), malicious use prevention (the WMDP benchmark), and copyrighted information removal, and models including Zephyr-7b-beta and Llama2-7b. To the best of our knowledge, our work offers the first principled method for attributing and pinpointing the influential weights that enhance LLM unlearning, in contrast to previous methods that either lack weight attribution or rely on simpler weight attribution techniques.
Authors’ notes
In AI research, unlearning and attribution have evolved as two separate fields of study. Unlearning removes unwanted behaviors or knowledge from trained models without the need to fully retrain them. Attribution, on the other hand, identifies the network components responsible for specific behaviors. For example, a model may produce toxic output or reveal proprietary information. Unlearning can be applied in multiple scenarios during the large language model life cycle, as shown in the figure below.
Our team has been working to determine whether we could marry these two techniques to get the best of both worlds and ensure unlearning cleverly targets the right place in the large language model. We wrote a paper, which was accepted at NeurIPS 2024. In the paper, we move beyond the traditional focus of attribution on the influence of data on a model: our work is the first to formalize “weight attribution” as an essential ingredient for effective unlearning.
One of the biggest challenges we faced is that attribution for unlearning requires a precise and theoretically grounded method to assess how specific weights influence both unlearning and overall model utility. To overcome this, we leveraged bi-level optimization to attribute each weight's influence on unlearning while ensuring that weight selections are strategically targeted and do not compromise model performance. This leads to our method, WAGLE, or weight attribution-guided LLM unlearning. WAGLE ensures we target the right model components for unlearning without compromising overall utility, as shown in the figure below.
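To make this concrete, below is a minimal, hypothetical PyTorch sketch of how attribution-guided unlearning could be wired up: a per-weight score is computed from forget- and retain-gradient information, the top-scoring fraction of weights is unmasked, and only those weights receive the unlearning update. The scoring rule, the keep_ratio threshold, and the function names are our illustrative assumptions rather than the exact procedure from the paper; in practice, the resulting mask is meant to plug into existing unlearning objectives such as gradient difference or (negative) preference optimization rather than replace them.

```python
import torch

def attribution_mask(model, forget_loss, retain_loss, keep_ratio=0.2):
    """Build a binary per-weight mask from a simple attribution score.

    Illustrative heuristic only: weights whose forget-gradient magnitude is
    large relative to their retain-gradient magnitude are the ones the
    unlearning update is allowed to touch. (WAGLE derives its score from a
    bi-level formulation; this ratio is just a stand-in.)
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_forget = torch.autograd.grad(forget_loss, params, retain_graph=True)
    g_retain = torch.autograd.grad(retain_loss, params)

    scores = [gf.abs() / (gr.abs() + 1e-8) for gf, gr in zip(g_forget, g_retain)]
    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return [(s >= threshold).float() for s in scores]

def masked_unlearning_step(model, unlearn_loss, masks, lr=1e-5):
    """Apply one unlearning update, restricted to the attributed weights."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(unlearn_loss, params)
    with torch.no_grad():
        for p, g, m in zip(params, grads, masks):
            p -= lr * g * m  # weights with mask == 0 stay frozen
```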
WAGLE accounts for both utility preservation and unlearning effectiveness using a bi-level optimization formulation, in which the upper-level problem evaluates the impact of weight adjustments on unlearning efficacy and the lower-level problem ensures the retention of utility. We evaluated WAGLE on the removal of copyrighted material, private information, and harmful knowledge in LLMs. Our results show that it achieves both high utility and strong unlearning performance across a variety of benchmarks, including TOFU, long-document removal such as the Harry Potter books, and the WMDP benchmark.
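In schematic form, and using our own notation rather than the paper's exact objective, this bi-level structure can be written as

\[
\max_{\mathbf{m} \in \{0,1\}^d} \; \ell_{\mathrm{forget}}\big(\boldsymbol{\theta}^{*}(\mathbf{m})\big)
\quad \text{s.t.} \quad
\boldsymbol{\theta}^{*}(\mathbf{m}) = \arg\min_{\boldsymbol{\theta}} \; \ell_{\mathrm{retain}}\big(\mathbf{m} \odot \boldsymbol{\theta} + (\mathbf{1} - \mathbf{m}) \odot \boldsymbol{\theta}_{0}\big),
\]

where \(\boldsymbol{\theta}_{0}\) denotes the original model weights, \(\mathbf{m}\) is a binary mask selecting the weights allowed to change, the upper level measures unlearning efficacy on the forget data, and the lower level preserves utility on the retain data.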
We hope WAGLE will significantly impact generative AI privacy, security, and safety. LLM unlearning can be applied to code generation, removing personal or sensitive information inherited from open-source training data, as well as potential malware. This helps ensure that LLM-generated code meets privacy and security standards, protecting against data exposure and malicious code risks. We encourage you to test out WAGLE yourself with the source code here on GitHub.