Publication
ICSE 2023
Tutorial

The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks

Abstract

Appropriate representation of source code and its relevant properties forms the backbone of Artificial Intelligence (AI)/Machine Learning (ML) pipelines for various software engineering tasks such as code classification, bug prediction, code clone detection, and code summarization. In the literature, researchers have experimented extensively with different kinds of source code representations (syntactic, semantic, integrated, customized) and properties, ranging from tree/graph representations such as Abstract Syntax Trees (ASTs) to pre-trained transformer models like CodeBERT. In addition, researchers commonly create hand-crafted, customized source code representations tailored to a particular software engineering task. In a 2018 survey, Allamanis et al. listed roughly 35 different source code representations used across software engineering (SE) tasks, including ASTs, customized ASTs, Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and so on. This tutorial has two main goals: (i) to present an overview of the state of the art in source code representations and the corresponding ML pipelines, with an explicit focus on the pros and cons of each representation, and (ii) to discuss the practical challenges of infusing different code views into state-of-the-art ML models.

Date

14 May 2023
