Project CodeNet is a large dataset aimed at teaching AI to code.
“Software is eating the world,” US entrepreneur Marc Andreessen famously wrote in 2011. Fast-forward to today – software is in financial services and healthcare, smartphones and smart homes. Even cars now sport over 100 million lines of code.
Such large volumes of code, however, are a challenge to debug, maintain, and update, especially as enterprises aim to modernize their aging software infrastructure. As a result, it has become essential to use today’s powerful technologies, such as artificial intelligence (AI) and hybrid cloud, to create new solutions that can modernize processes across the information technology (IT) pipeline.
Enter Project CodeNet. A large dataset aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
But to understand this dataset’s significance, we must first take a step back.
The Next AI Frontier: The Language of Machines
Computer scientists have long been fascinated by the possibility of computers programming computers. Can AI make it easier to understand, develop, and deploy code, the language of machines? It can, but getting it to do so hasn’t been easy.
The problem lies in rule-based systems.
Take programming language translation. If it were easy and rule-based systems worked, early programming languages like COBOL would have been converted by now. But programming languages have context. The meaning of any statement depends on its context, and deriving that context and carrying it through translation is, just as in human languages, tricky and time-consuming.
The larger the program gets, the harder it becomes to translate. While in human language, the context may be limited to a paragraph or so, here the context can relate to multiple libraries of code. Context is a challenge for AI.
Roughly speaking, rule-based systems can successfully translate somewhere between 50 and 60 percent of a program. That part can be translated reasonably well, but the rest involves complex rules and typically has to be translated manually.
Advancing AI for Code
This is where AI can help, because it can pick up context much as humans do.
Project CodeNet specifically can drive algorithmic innovation to extract this context with sequence-to-sequence models, just as we have done for human languages, and so make a more significant dent in machine understanding of code, as opposed to machine processing of code.
With code samples curated from open programming competitions over the years, Project CodeNet is unique, not just in its size and scale but also in its high-quality metadata and annotations. These carry a rich set of information: code size, memory footprint, CPU run time, and status, which indicates acceptance or the type of error.
Over 90 percent of the problems come with a problem description that contains a concise problem statement and a specification of the input and output formats. For over half of the coding problems (i.e., seven million code samples), we also curated sample input and output from the problem description. These samples are key to determining the equivalence of two code samples in different languages, which can drive reinforcement learning techniques for code translation.
We provide these sample inputs and outputs as part of the dataset, a handy feature of Project CodeNet. Users can execute the accepted code samples to extract additional metadata and verify outputs from generative AI models for correctness. This will enable researchers to establish program intent equivalence when translating one programming language into another.
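As a rough illustration of the output check described above, the following sketch runs two candidate programs on the same sample input and compares what they print against the expected sample output. The helper names (run_program, outputs_match) and the whitespace-insensitive comparison are assumptions made for this example, not part of the Project CodeNet tooling, and both candidates are tiny Python scripts so that the sketch stays self-contained.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_program(source: str, sample_input: str, timeout: float = 5.0) -> str:
    """Write a Python source string to a temp file, run it on sample_input, return stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(source)
        result = subprocess.run(
            [sys.executable, str(path)],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.stdout


def outputs_match(expected: str, actual: str) -> bool:
    """Compare outputs token by token, ignoring whitespace differences (an assumption)."""
    return expected.split() == actual.split()


# Two hypothetical submissions for the same problem: sum the integers on one line.
candidate_a = "print(sum(int(x) for x in input().split()))\n"
candidate_b = (
    "total = 0\n"
    "for token in input().split():\n"
    "    total += int(token)\n"
    "print(total)\n"
)

sample_input = "1 2 3\n"
sample_output = "6\n"

out_a = run_program(candidate_a, sample_input)
out_b = run_program(candidate_b, sample_input)

# Both candidates reproduce the expected sample output on this input.
print(outputs_match(sample_output, out_a) and outputs_match(sample_output, out_b))
```

Agreement on a handful of sample inputs is evidence of equivalence rather than proof, so in practice such a check would likely be one signal among several.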
The rich metadata and the diversity of code samples and the problems they solve open Project CodeNet to a myriad of use cases. The dataset can be used for code search and clone detection. The code samples in Project CodeNet are labeled with their acceptance status, and we can explore AI techniques to distinguish correct code from problematic code.
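To make the acceptance-status idea concrete, here is a minimal sketch that splits submissions into accepted and rejected groups, the kind of labeled split a classifier could be trained on. The CSV layout, column names, and status strings below are assumptions chosen for illustration; the actual metadata files in the dataset may be organized differently.

```python
import csv
import io
from collections import Counter

# Hypothetical metadata rows; column names and status values are assumptions.
METADATA_CSV = """submission_id,problem_id,language,status
s001,p001,Python,Accepted
s002,p001,C++,Wrong Answer
s003,p001,Java,Accepted
s004,p002,Python,Runtime Error
"""


def split_by_status(csv_text):
    """Return (accepted, rejected) lists of metadata rows."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] == "Accepted":
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected


accepted, rejected = split_by_status(METADATA_CSV)

# Count the rejected submissions by error type.
error_counts = Counter(row["status"] for row in rejected)

print([row["submission_id"] for row in accepted])
print(dict(error_counts))
```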
Project CodeNet's metadata also enables the tracking of how a submission evolves from problematic to accepted, which could be used for exploring automatic code correction. Each code sample is labeled with CPU run time and memory footprint – useful for regression studies and prediction.
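One way to picture the tracking of a submission from problematic to accepted is to pair each user's last rejected attempt at a problem with their first accepted one. The sketch below does this over a few made-up records; the Submission fields, the integer timestamps, and the correction_pairs helper are all hypothetical and stand in for whatever the real metadata provides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Submission:
    # All fields are hypothetical stand-ins for the dataset's metadata.
    submission_id: str
    user_id: str
    problem_id: str
    timestamp: int
    status: str


def correction_pairs(submissions):
    """Pair each user's last rejected attempt with their first accepted one."""
    history_by_attempt = defaultdict(list)
    for sub in submissions:
        history_by_attempt[(sub.user_id, sub.problem_id)].append(sub)

    pairs = []
    for history in history_by_attempt.values():
        history.sort(key=lambda s: s.timestamp)
        previous_rejected = None
        for sub in history:
            if sub.status == "Accepted":
                # Only record a pair if a rejected attempt came first.
                if previous_rejected is not None:
                    pairs.append((previous_rejected.submission_id, sub.submission_id))
                break  # stop at the first accepted submission
            previous_rejected = sub
    return pairs


submissions = [
    Submission("s1", "u1", "p1", 100, "Wrong Answer"),
    Submission("s2", "u1", "p1", 200, "Runtime Error"),
    Submission("s3", "u1", "p1", 300, "Accepted"),
    Submission("s4", "u2", "p1", 150, "Accepted"),      # accepted on first try
    Submission("s5", "u3", "p2", 120, "Wrong Answer"),  # never accepted
]

pairs = correction_pairs(submissions)
print(pairs)  # [('s2', 's3')]
```

Each resulting pair is a before-and-after example of a fix, which is the sort of training signal automatic code correction could draw on.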
Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision.
Modernizing and operating software infrastructure is also essential from a business perspective. We touched on this last year when IBM announced several new capabilities – including IBM Watson AIOps and Accelerator for Application Modernization – designed to automate the IT pipeline.
For example, a large automotive client approached IBM to help update a $200 million asset consisting of 3,500 multi-generation Java files. These files contained more than one million lines of code, developed over a decade with multiple generations of Java technology.
The application was a complex monolith, not well suited to cloud-based environments. By applying our AI for Code stack, we reduced the business's year-long ongoing code migration process to just four weeks, modernizing the application and generating over 25 new cloud-native microservices by refactoring the legacy monolithic code.
Our team is excited to give researchers and developers a dataset and a set of technologies that are easy to use and understand, and that assist in the development of algorithms that will advance AI for code. With Project CodeNet, we hope to produce lasting business value as enterprises embark on their IT modernization journeys.
Access Project CodeNet on GitHub.