Irene Ko, Sihui Dai, et al.
NeurIPS 2024
Alignment techniques are essential for making Large Language Models (LLMs) usable and useful in real-world applications, and diverse approaches have been developed, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to compare these paradigms systematically and to guide deployment decisions. This paper introduces MEAL (Multi-dimensional Evaluation of ALignment techniques), a comprehensive and systematic evaluation framework for alignment techniques. It focuses on four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments on models with different alignment strategies, we demonstrate the utility of our framework in identifying their strengths and limitations, providing insights for future research directions.
George Kour, Itay Nakash, et al.
ACL 2025
Michael Hind, Dennis Wei, et al.
ICML 2020
Samuel Ackerman, Ella Rabinovich, et al.
EMNLP 2024