Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
Jie Ren, Zhenwei Dai, et al.
NeurIPS 2025
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range and by introducing perceptual uncertainty. Empirical results on I-RAVEN-X show that, compared to LLMs, LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. For instance, LRMs experience a significantly smaller degradation in arithmetic accuracy (80.5% → 63.0%) than LLMs (59.3% → 4.4%). However, LRMs are still significantly challenged by reasoning under uncertainty (−61.8% in task accuracy) and cannot effectively explore multiple probabilistic outcomes in superposition.
Tian Gao, Amit Dhurandhar, et al.
NeurIPS 2025
Vidushi Sharma, Andy Tek, et al.
NeurIPS 2025
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010