Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy
Jie Ren, Zhenwei Dai, et al.
NeurIPS 2025
The safety of AI agents in multi-turn interaction is a growing concern, particularly as agent behavior may vary over time due to the dynamic nature of both the agent and its environment. We introduce the concept of "state-induced risk amplification", hypothesizing that extended AI-environment interaction can lead to agent actions that transition the system into risky states, and that such transitions may increase the likelihood of risky actions by the agent. We provide a formal characterization of these effects using the Markov decision process framework. To empirically test our hypotheses, we introduce a novel experimental setup inspired by traffic monitoring applications. Our results demonstrate the practical occurrence of state-induced risk amplification, highlighting an emerging safety risk for current multi-turn agents and calling for safety evaluation methods that account for state-dependent dynamics. We discuss implications for designing adaptive risk mitigation strategies.
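The amplification effect the abstract describes can be sketched with a toy two-state Markov decision process. This is a hypothetical illustration under assumed transition and action probabilities, not the paper's actual model or experimental setup: a risky action can move the system into a RISKY state, where risky actions become more likely, so the risky-action rate compounds as interaction length grows.

```python
import random

# Hypothetical two-state MDP (SAFE, RISKY); all probabilities below are
# assumptions for illustration, not values from the paper.
SAFE, RISKY = 0, 1
P_RISKY_ACTION = {SAFE: 0.05, RISKY: 0.40}  # risky-action prob. per state
P_ENTER_RISKY = 0.5   # a risky action may push the system into RISKY
P_RECOVER = 0.1       # a safe action in RISKY may restore SAFE

def simulate(turns, rng):
    """Return the fraction of turns with a risky action over one episode."""
    state, risky_count = SAFE, 0
    for _ in range(turns):
        if rng.random() < P_RISKY_ACTION[state]:
            risky_count += 1
            if rng.random() < P_ENTER_RISKY:
                state = RISKY
        elif state == RISKY and rng.random() < P_RECOVER:
            state = SAFE
    return risky_count / turns

rng = random.Random(0)
episodes = 2000
short = sum(simulate(5, rng) for _ in range(episodes)) / episodes
long = sum(simulate(100, rng) for _ in range(episodes)) / episodes
print(f"risky-action rate: 5 turns = {short:.3f}, 100 turns = {long:.3f}")
```

In this sketch, longer episodes spend more time in the RISKY state, so the average risky-action rate is higher over 100 turns than over 5, mirroring the state-dependent dynamics the paper argues evaluations should account for.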
Tian Gao, Amit Dhurandhar, et al.
NeurIPS 2025
Vidushi Sharma, Andy Tek, et al.
NeurIPS 2025
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010