Tutorial

AI4Code

Abstract

Foundation models are fast becoming the backbone of many AI systems. These large models are built on largely unlabelled data and can be fine-tuned for a wide range of downstream tasks. Their success can largely be attributed to (1) the availability of data, (2) advances in self-supervised learning and deep neural networks, and (3) infrastructure support. This success also motivates the development of pre-trained models for source code. Unlike natural language text, however, source code follows a strict grammar, and small variations in code can lead to very erroneous behaviour. There have already been some early successes in leveraging such models for traditional software engineering tasks such as code completion [1], semantic code search, code clone detection, and code translation [2].

In this tutorial, we plan to formally introduce foundation models. We will briefly touch upon the benefits of viewing "code as code" and "code as text", and elaborate on four key representation approaches (brief code sketches for two of them follow the references below):

1) Data-based approaches model programs as functions that map inputs to outputs, embed these input-output behaviours in a vector space, and combine the resulting vectors into a program-level embedding.
2) Structure-based approaches exploit the structural nature of source code. Recent studies model these structural features by combining multiple code views, such as the Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Graph (DFG), and read-write graph (RWG), into a single graph model.
3) Sequence-based approaches treat a formal language just like a natural language and leverage existing state-of-the-art natural language models designed to consume long sequences of text.
4) Sequence-from-structure approaches combine the previous two: they first build a graph that captures some structural aspect of the code, then extract sequences from this graph and feed them into sequence-based deep learning models.

We will show the impact of foundation models that follow some of these approaches. Next, we plan to introduce two to four software engineering tasks, such as code translation, code clone detection, code completion, and code clustering, and discuss the data-processing methods and interesting pre-training tasks employed by popular models. We plan to wrap up the tutorial with open challenges for the different tasks and our humble opinion on a future roadmap for the application of foundation models.

References:
1. OpenAI Codex: Evaluating Large Language Models Trained on Code. https://arxiv.org/abs/2107.03374
2. GraphCodeBERT: Pre-training Code Representations with Data Flow. https://arxiv.org/abs/2009.08366
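As a rough illustration of the sequence ("code as text") approach, the sketch below embeds a code snippet with a pre-trained encoder. It assumes the Hugging Face transformers and torch packages and the microsoft/codebert-base checkpoint; any comparable pre-trained code encoder would work, and the mean-pooling step is just one simple way to obtain a program-level vector, not the tutorial's prescribed method.

    # Sketch: embed a code snippet with a pre-trained sequence model.
    # Assumes `torch`, `transformers`, and the `microsoft/codebert-base`
    # checkpoint are available (assumptions for illustration only).
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModel.from_pretrained("microsoft/codebert-base")

    code = "def add(a, b):\n    return a + b"
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the token embeddings into a single vector that downstream
    # tasks (clone detection, semantic search, clustering) could consume.
    embedding = outputs.last_hidden_state.mean(dim=1)
    print(embedding.shape)  # e.g. torch.Size([1, 768])

And a minimal sketch of the sequence-from-structure idea, using only the Python standard library: stage one parses the code into an AST, stage two linearises the node types into a token sequence that a sequence model could then consume. Real systems use richer graph views (CFG, DFG, RWG) and traversal schemes; this only shows the two-stage shape of the approach.

    # Sketch: stage 1 builds a structural view (here, just the AST);
    # stage 2 linearises it into a sequence for a sequence model.
    import ast

    code = "def add(a, b):\n    return a + b"
    tree = ast.parse(code)

    # Walk the tree and keep the node type names as "tokens".
    node_sequence = [type(node).__name__ for node in ast.walk(tree)]
    print(node_sequence)
    # e.g. ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]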

Date

12 Dec 2022

Publication

ACML 2022