- NeurIPS 2021
Source-to-source translation is an old problem in the programming languages and software engineering (PLSE) community. The goal is to automatically translate source code written in one high-level language, say C++, into another high-level language, say Python. This task is also known as transcompilation, transpilation, or simply code translation. There are several motivations for solving this problem. Historically, it was of interest mainly because source code had to be ported across different hardware platforms, for example, Intel vs. PowerPC. However, interest in this task has revived in recent times for the following reasons: i) the need to translate code from newly emerged programming languages such as Dart, Haxe, Go, and Swift into omnipresent languages such as Java and Python, and ii) the need to translate code from legacy languages such as COBOL into modern languages such as Java and Python, in order to migrate legacy mainframe applications onto the hybrid cloud.
Traditional approaches to code translation are rule-based. They typically perform a static analysis of the given code, converting it into structured objects such as an Abstract Syntax Tree (AST) or a Control Flow Graph (CFG). They then apply handcrafted rewrite rules to translate the source code into the desired language. Developing such rule-based approaches is quite expensive, as it requires substantial human effort, time, and expertise in both the source and the target language. As an alternative, the AI and NLP communities have recently started exploring the idea of harnessing recent advances in Large Language Models (LLMs) for this task. In a relatively short time, the task has gained huge traction in both the AI/NLP and PLSE communities, and it has become one of the key tasks under the broad umbrella of AI4Code.
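To make the rule-based idea concrete, here is a minimal sketch of an AST-driven rewrite, using Python's standard `ast` module as the parser. It handles only a toy subset of arithmetic expressions and emits the equivalent C expression; a real transpiler would need rewrite rules covering the full grammars of both languages.

```python
import ast

# Map a few Python binary-operator AST node types to their C spellings.
C_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_c(node):
    """Recursively apply rewrite rules to turn a Python expression AST
    into a C expression string."""
    if isinstance(node, ast.Expression):
        return to_c(node.body)
    if isinstance(node, ast.BinOp):
        # Parenthesize fully so we never have to reason about precedence.
        return f"({to_c(node.left)} {C_OPS[type(node.op)]} {to_c(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    # Any construct without a handcrafted rule fails loudly -- the core
    # limitation of rule-based transpilers.
    raise NotImplementedError(f"no rewrite rule for {type(node).__name__}")

tree = ast.parse("a * (b + 3)", mode="eval")
print(to_c(tree))  # prints: (a * (b + 3))
```

Even this tiny example shows why rule-based systems are costly: every source construct needs an explicit, correct rule, written by someone fluent in both languages.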
LLMs are deep learning-based AI models that have proven quite effective at many natural language tasks, for example, automatic translation from English to Hindi or summarizing a text document. While natural and programming languages are quite different, they are also similar in many ways, so using LLMs for code translation is worth exploring. LLMs are cheaper in terms of human effort and time, but they come with their own challenges: i) LLMs are typically data hungry, and ii) LLMs are probabilistic models, so no guarantees can be attached to the quality of the translated code. Both of these challenges are big hindrances to their applicability to real-life code translation. In practice, we typically have much less data for languages that are either recent (e.g., Dart, Haxe, Go, Swift) or ancient (e.g., COBOL, Pascal), and these are precisely the languages where the need for translation arises most.
Motivated by these challenges, we at IBM Research India are working on code translation for low-resource programming languages, where bilingual paired training data is scarce and sometimes even monolingual data is limited. The peculiarities of programming languages pose additional challenges, quite different from those faced when designing natural language translation systems in low-resource settings. For example, perturbing a single word or token in a natural language sentence often does not alter its meaning much, but the same is not true for source code. On the flip side, for source code we have auxiliary information at our disposal from static program analysis, including the Abstract Syntax Tree (AST) and Control Flow Graph (CFG). We are exploring ways to fuse such symbolic and structural information into LLMs to make them more effective in low-data regimes.
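The single-token point above can be illustrated with a toy example (the function names are hypothetical, chosen just for this sketch). The two functions below differ by one token in the argument to `range`, yet compute different results, whereas swapping one word in a sentence usually leaves a human reader's interpretation intact.

```python
def sum_exclusive(n):
    # Sums 0, 1, ..., n-1.
    return sum(range(n))

def sum_inclusive(n):
    # One token differs from the version above (n + 1 instead of n),
    # and the program now computes a different value.
    return sum(range(n + 1))

print(sum_exclusive(10))  # prints: 45
print(sum_inclusive(10))  # prints: 55
```

This brittleness is why token-level metrics and perturbation-based training tricks from NLP do not transfer directly to code, and why exact, structure-aware signals such as the AST are attractive as auxiliary inputs.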