Code Representation Learning

An effort to improve the machine learning pipeline for representation learning


Machine learning (ML) pipelines for software engineering tasks rely heavily on accurate representations of source code. These tasks range from code categorization and bug prediction to code clone detection and code summarization. The ongoing challenge is to develop representations that capture all the information an ML model needs for the task at hand.

While standard approaches to code representation have proven effective, there is room for refinement, particularly in generating code views for a wider range of programming languages and in applying those views in graph neural networks and transformers.

In partnership with academia, this project seeks to refine the ML pipeline for learning code representations, aiming for enhanced automation and efficiency in software engineering tasks.

The team has developed the open-source Tree Sitter Multi Codeview Generator, which composes and visualizes multiple code views across various languages. It generates graphs that combine several code views and can be consumed by machine learning models, including sequence models and graph neural networks.
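
To make the idea of composing code views concrete, the following is a minimal sketch and not the project's tool or its API. It assumes two illustrative views, a syntax-tree view and a naive data-flow view, and uses Python's standard `ast` module in place of Tree-sitter. The function name `build_multiview_graph`, the view names "ast" and "dfg", and the output layout are all hypothetical.

```python
import ast

# Marker classes such as Load, Store, or Add can be shared objects across the
# tree in CPython, so they are left out of the node list (assumption of this sketch).
_SKIP = (ast.expr_context, ast.operator, ast.unaryop, ast.cmpop, ast.boolop)


def build_multiview_graph(source, views=("ast", "dfg")):
    """Combine the requested code views into one graph with typed edges."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if not isinstance(n, _SKIP)]
    index = {id(n): i for i, n in enumerate(nodes)}  # one shared id space for all views
    edges = []  # (source_index, target_index, view_name)

    if "ast" in views:
        # Syntax view: one edge from each node to each of its children.
        for parent in nodes:
            for child in ast.iter_child_nodes(parent):
                if id(child) in index:
                    edges.append((index[id(parent)], index[id(child)], "ast"))

    if "dfg" in views:
        # Naive data-flow view: link the latest assignment of a name to each
        # later read. Only reliable for straight-line, top-level statements.
        last_def = {}
        for stmt in tree.body:
            names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
            for n in names:  # reads see definitions from earlier statements
                if isinstance(n.ctx, ast.Load) and n.id in last_def:
                    edges.append((last_def[n.id], index[id(n)], "dfg"))
            for n in names:  # then record what this statement defines
                if isinstance(n.ctx, ast.Store):
                    last_def[n.id] = index[id(n)]

    return {"nodes": [type(n).__name__ for n in nodes], "edges": edges}


graph = build_multiview_graph("x = 1\ny = x + 2\n")
```

Because every view shares the same node indices, the edge list can be filtered by view name or passed whole to a graph model that supports typed edges, while the node labels alone can serve as input to a sequence model.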

By combining different code views and comparing the results, the project can determine which views have a significant impact on each software engineering task. This insight supports the creation of custom, task-specific models optimized for particular code views, improving the effectiveness and efficiency of ML in software engineering.
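
One simple way to compare code views is to score every combination of them on a task and rank the results. The sketch below illustrates that idea under stated assumptions and does not describe the project's actual procedure. The function `rank_view_combinations` and its `evaluate` callback are hypothetical, and the callback is supplied by the caller (for example, training a model on graphs built from the given views and returning a validation score).

```python
from itertools import combinations


def rank_view_combinations(views, evaluate):
    """Score every non-empty subset of views and return them best first.

    `evaluate` is a caller-supplied function that takes a tuple of view names
    and returns a number where higher means better (hypothetical interface).
    """
    results = []
    for size in range(1, len(views) + 1):
        for subset in combinations(views, size):
            results.append((subset, evaluate(subset)))
    # Stable sort: subsets with equal scores keep their enumeration order.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The number of subsets grows exponentially with the number of views, so exhaustive enumeration like this is only practical for a handful of views.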