Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
Jie Ren, Zhenwei Dai, et al.
NeurIPS 2025
Alignment techniques are essential for making Large Language Models (LLMs) usable and useful in real-world applications, and diverse approaches have been developed, each with distinct advantages and limitations. However, the lack of a unified evaluation framework makes it difficult to compare these paradigms systematically and to guide deployment decisions. This paper introduces MEAL (Multi-dimensional Evaluation of ALignment techniques), a comprehensive and systematic evaluation framework for alignment techniques. It focuses on four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments on models with different alignment strategies, we demonstrate the utility of our framework in identifying their strengths and limitations, providing insights for future research directions.
Hazar Yueksel, Ramon Bertran, et al.
MLSys 2020
Tian Gao, Amit Dhurandhar, et al.
NeurIPS 2025
Megh Thakkar, Quentin Fournier, et al.
ACL 2024