Code representation learning
Appropriate representations of source code form the backbone of Machine Learning (ML) pipelines for various software engineering (SE) tasks such as code classification, bug prediction, code clone detection, and code summarization. Therefore, representing source code for use in ML models without losing important information is an active area of research. Code representation approaches can be divided into three broad categories:
(i) Token Representation: Code is treated as natural language tokens. These approaches represent code as a bag of words or a list of tokens and then apply techniques such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA);
(ii) Structured Representation: Code is viewed as more than natural language tokens; these approaches exploit the rich syntactic and semantic properties of code using structured representations such as Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs);
(iii) Combined Representation or Multi Code-view: These approaches combine different structured representations, and some even mix code token sequences with structured representations.
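To make the first category concrete, here is a minimal, standard-library-only sketch (purely illustrative, not this project's pipeline) of reducing a snippet to the token-list and bag-of-words views that LSI- or LDA-based methods start from:

```python
import re
from collections import Counter

def tokenize(code: str) -> list[str]:
    """Split source code into identifier/keyword tokens (a crude lexer)."""
    return re.findall(r"[A-Za-z_]\w*", code)

snippet = "def add(a, b):\n    return a + b"

tokens = tokenize(snippet)       # ordered token-list view
bag_of_words = Counter(tokens)   # unordered bag-of-words view

print(tokens)        # ['def', 'add', 'a', 'b', 'return', 'a', 'b']
print(bag_of_words)  # Counter({'a': 2, 'b': 2, 'def': 1, 'add': 1, 'return': 1})
```

In practice, such token counts would then be fed to a topic-modeling implementation of LSI or LDA rather than being used directly.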
While each of these approaches has seen success, much research remains on composing code views for a wide variety of languages and using them effectively in Graph Neural Networks (GNNs) and transformers. Therefore, we at IBM Research India, in partnership with academia, are working on improving the ML pipeline for representation learning and demonstrating its effectiveness on multiple tasks, as described in Figure 1. We have open-sourced the tree-sitter-codeviews project, which provides the flexibility to compose multiple code-views for multiple languages and to visualize them. An example of generated code-views can be seen in Figure 2.
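tree-sitter-codeviews builds its code-views on top of tree-sitter parsers across languages; purely to illustrate what an AST code-view contains, the sketch below uses Python's built-in `ast` module instead (an assumed stand-in for illustration, not the project's API), extracting parent-child node-type edges:

```python
import ast

def ast_edges(code: str) -> list[tuple[str, str]]:
    """Return (parent, child) node-type edges of the AST -- one simple code-view."""
    tree = ast.parse(code)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

edges = ast_edges("def add(a, b):\n    return a + b")
print(edges)  # includes ('Module', 'FunctionDef'), ('BinOp', 'Add'), ...
```

A CFG or DFG view would be built analogously, but with edges encoding control transfer or data dependence instead of syntactic nesting.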
By combining different code-views and analyzing the quantitative results for each downstream SE task, it is possible to understand and reason about which specific code-view has the most meaningful impact on each task. Using this as a clue, it is then possible to design custom task-specific models that leverage the best code-views.
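One way to picture what "combining code-views" means at the data level is a single labeled edge list that a GNN could consume. The helper below is a hypothetical sketch (the function and edge labels are illustrative assumptions, not the project's format) that merges AST edges with a next-token chain:

```python
def combine_views(ast_edges, tokens):
    """Merge an AST code-view and a token-sequence code-view into one
    labeled edge list (edge_type, source, target) -- a toy multi-view graph."""
    graph = [("ast", src, dst) for src, dst in ast_edges]
    graph += [("next_token", a, b) for a, b in zip(tokens, tokens[1:])]
    return graph

combined = combine_views(
    [("Module", "FunctionDef"), ("FunctionDef", "Return")],  # toy AST edges
    ["def", "add", "return"],                                # toy token stream
)
print(combined)
```

Keeping each edge's view of origin as a label is one common design choice: it lets a downstream model weight, or ablate, individual code-views when measuring their per-task impact.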