EuroS&P 2023
Conference paper

Code Vulnerability Detection via Signal-Aware Learning


Machine learning-based modeling of source code understanding tasks has been gaining popularity. Accompanying the rapid proliferation of these models is growing scrutiny of their reliability. Concerns have been raised that the models do not actually learn task-relevant source code features, but instead fit other correlated data. To improve model trustworthiness, in this work we explore data-driven approaches for enhancing model signal awareness, i.e., learning the relevant signals in the input for making predictions. We do so by incorporating the notion of code complexity during model training, both (i) explicitly via curriculum learning, and (ii) implicitly by augmenting the training dataset with simplified signal-preserving programs. With our techniques, we achieve up to 4.8x improvement in the signal awareness of vulnerability detection models. Using the notion of code complexity, we present a novel interpretation of model learning behaviour from the perspective of the dataset. We use it to introspect model learning difficulties and to analyze the learning enhancements achieved with our approaches.
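
As a rough illustration of the explicit, curriculum-learning variant described above, the sketch below orders training samples from simple to complex and exposes them in growing stages. It is not the paper's implementation: the abstract does not say which complexity metric or schedule is used, so the `complexity` proxy (a count of branching constructs), the `curriculum_stages` helper, and the cumulative staging scheme are all assumptions made for this example.

```python
import math
import re
from typing import Iterator, List, Sequence, Tuple

# Hypothetical complexity proxy (an assumption, not the paper's metric):
# one plus the number of branching keywords and short-circuit operators.
_BRANCH = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")

Sample = Tuple[str, int]  # (source code, label: 1 = vulnerable, 0 = not)


def complexity(code: str) -> int:
    """Crude, cyclomatic-style complexity estimate for a code snippet."""
    return 1 + len(_BRANCH.findall(code))


def curriculum_stages(samples: Sequence[Sample], n_stages: int) -> Iterator[List[Sample]]:
    """Yield cumulative training subsets, simplest samples first.

    Stage k contains the simplest k/n_stages fraction of the data, so the
    final stage is the full dataset.
    """
    ordered = sorted(samples, key=lambda s: complexity(s[0]))
    for stage in range(1, n_stages + 1):
        cutoff = math.ceil(len(ordered) * stage / n_stages)
        yield ordered[:cutoff]


if __name__ == "__main__":
    data = [
        ("if (a) { while (b) { if (c) x(); } }", 1),
        ("return 0;", 0),
        ("if (a) b();", 1),
    ]
    for i, subset in enumerate(curriculum_stages(data, n_stages=3), start=1):
        # A real training loop would fit the model on `subset` here.
        print(f"stage {i}: {len(subset)} sample(s)")
```

In a training loop, each stage's subset would be passed to the model for one or more epochs before moving to the next stage. The implicit variant, which adds simplified signal-preserving programs to the dataset, is not sketched here because it needs a program reducer and a way to check that the vulnerability is preserved.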