Our team has been working to determine whether we could marry these two techniques to get the best of both worlds and ensure unlearning cleverly targets the right place in a large language model. We wrote a paper on the result, which was accepted at NeurIPS 2024. In the paper, we move beyond traditionally applied attribution methods that focus on data influence on a model: our work formalizes “weight attribution” as a new, essential aspect of effective unlearning.
One of the biggest challenges we faced is that attribution for unlearning requires a precise and theoretically grounded way to assess how specific weights influence both unlearning and overall model utility. To overcome this, we leveraged bi-level optimization to attribute each weight's influence on unlearning while ensuring that weight selection is strategically targeted without compromising model performance. This led to our method, WAGLE, or weight attribution-guided LLM unlearning. WAGLE targets the right model components for unlearning without compromising overall utility, as shown in the figure below.
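To make the idea concrete, here is a minimal sketch, not the implementation from the paper or our repository, of how per-weight attribution scores can gate which parameters an unlearning recipe is allowed to update. The names (`attribution_scores`, `build_weight_mask`, `masked_unlearning_step`) and the simple gradient-ascent update on a forget loss are placeholders for illustration:

```python
import torch

def build_weight_mask(attribution_scores, keep_ratio=0.2):
    """Keep the top `keep_ratio` fraction of weights, ranked by attribution score."""
    flat = torch.cat([s.flatten() for s in attribution_scores.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {name: (s >= threshold).float() for name, s in attribution_scores.items()}

def masked_unlearning_step(model, mask, forget_loss, lr=1e-5):
    """One gradient-ascent step on the forget loss, applied only to mask-selected weights."""
    model.zero_grad()
    forget_loss.backward()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is not None and name in mask:
                # Update only the weights the attribution mask selects;
                # everything else is left untouched.
                param.add_(lr * mask[name] * param.grad)
```

The key design choice is that the unlearning objective only touches the weights the attribution mask selects, the intuition being that leaving the rest untouched helps preserve utility.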
WAGLE accounts for both utility preservation and unlearning effectiveness through a bi-level optimization formulation, in which the upper-level problem evaluates the impact of weight adjustments on unlearning efficacy and the lower-level problem ensures the retention of utility. We evaluated WAGLE on the removal of copyrighted material, private information, and harmful knowledge from LLMs. Our results show that it achieves both high utility and strong unlearning performance across a variety of benchmarks, including TOFU, long documents such as the Harry Potter books, and the WMDP benchmark.
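Sketched in simplified notation of our own (the paper's exact formulation may differ in its details), this bi-level structure looks roughly like:

$$
\min_{\mathbf{m}} \; \ell_{\mathrm{u}}\big(\boldsymbol{\theta}^{*}(\mathbf{m})\big)
\qquad \text{subject to} \qquad
\boldsymbol{\theta}^{*}(\mathbf{m}) \in \operatorname*{arg\,min}_{\boldsymbol{\theta}} \; \ell_{\mathrm{r}}\big(\boldsymbol{\theta}; \mathbf{m}\big),
$$

where $\mathbf{m}$ selects or re-weights model weights, $\ell_{\mathrm{u}}$ measures unlearning efficacy on the data to be forgotten, and $\ell_{\mathrm{r}}$ measures retained utility. The sensitivity of the upper-level objective to $\mathbf{m}$ is what yields the per-weight attribution scores that guide which components to edit.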
We hope WAGLE will significantly advance generative AI privacy, security, and safety. In code generation, for example, LLM unlearning can remove personal or sensitive information inherited from open-source training data, as well as potential malware, helping ensure that LLM-generated code meets privacy and security standards and protecting against data exposure and malicious code risks. We encourage you to try out WAGLE yourself with the source code here on GitHub.