LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah; Mingji Han; Saurabh Pujar; Hammond Pearce; Ayse Coskun; Gianluca Stringhini

S&P 2024

Conference paper

20 May 2024

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Abstract

Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus perform the most detailed investigation to date on whether LLMs can reliably identify security-related bugs. We construct a series of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions in an automated framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios outside their knowledge cut-off date. Most importantly, our findings reveal significant non-robustness in even the most advanced models like PaLM2' and GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants.

Workshop paper