The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks
Abstract
Appropriate representation of source code and its relevant properties form the backbone of Artificial Intelligence (AI)/ Machine Learning (ML) pipelines for various software engineering (SE) tasks such as code classification, bug prediction, code clone detection, and code summarization. In the literature, researchers have extensively experimented with different kinds of source code representations (syntactic, semantic, integrated, customized) and ML techniques such as pre-trained BERT models. In addition, it is common for researchers to create hand-crafted and customized source code representations for an appropriate SE task. In a 2018 survey [1], Allamanis et al. listed nearly 35 different ways of of representing source code for different SE tasks like Abstract Syntax Trees (ASTs), customized ASTs, Control Flow Graphs (CFGs), Data Flow Graphs (DFGs) and so on. The main goal of this tutorial is two-fold (i) Present an overview of the state-of-the-art of source code representations and corresponding ML pipelines with an explicit focus on the merits and demerits of each of the representations (ii) Practical challenges in infusing different code-views in the state-of-the-art ML models and future research directions.