At NeurIPS 2021, IBM Research presents its work on CodeNet, a massive dataset of code samples and problems. We believe it has the potential to revitalize techniques for modernizing legacy systems, help developers write better code, and perhaps even enable AI systems to help code the computers of tomorrow.
Chances are, if you’ve done just about anything today, you’ve interacted with a programming language older than the desktop computer, the internet, and the VHS tape.
Whether you’ve checked your bank account, used a credit card, gone to the doctor, booked a flight, paid your taxes, or bought something in a store, you likely have interacted with a system that relies on COBOL (Common Business Oriented Language) code. It’s a programming language that many mission-critical business systems around the world still rely on, even though it was first implemented over six decades ago. It’s estimated that some 80% of financial transactions use COBOL, and the U.S. Social Security Administration utilizes around 60 million lines of COBOL code.
As programmers and developers versed in COBOL have started to retire, organizations have struggled to keep their systems up and running, let alone modernize them for the realities of the always-on internet. And this is just one of a myriad of languages still in use that don’t reflect what modern coders feel most comfortable writing in, or what’s best suited for modern business applications.
Code language translation is one of many problems that we strive to address with CodeNet, which we first unveiled back in May. Essentially, CodeNet is a massive dataset that aims to help AI systems learn how to understand and improve code, help developers code more efficiently, and eventually allow an AI system to code a computer. It’s made up of around 14 million code samples, comprising some 500 million lines of code in more than 55 different languages, ranging from modern ones like C++, Java, Python, and Go to legacy ones like Pascal, FORTRAN, and COBOL. In the three months after release, our GitHub repository received 1,070 stars and was forked more than 135 times.
This week at NeurIPS, we discuss our paper on CodeNet:¹ the work we’ve done to build it, how it differs from anything else like it that’s available for anyone to download, and how we see it being used by the research community.
There has been a revolution in AI over the last decade. Language and image data, carefully curated and tagged in datasets like ImageNet, have given rise to AI systems that can complete sentences for writers, detect tumors for doctors, and automate myriad business and IT processes. But for code, the language of computers, crafting such a dataset that AI systems can learn from has been a challenging task.
The end goal of CodeNet is to enable developers to create systems that can modernize existing codebases, as well as fix errors and security vulnerabilities in code. It’s something I recently discussed in a lightboard video: Can computers program computers?
We’ve carried out baseline experiments on CodeNet for code classification, code similarity, and code completion. These results serve as a reference for CodeNet users when they perform their own experiments. Some of our results also indicate that the models derived from CodeNet can generalize better across datasets than those derived from other datasets due to CodeNet’s high quality.
Code classification
CodeNet can help create systems that determine what type of code a snippet is. We used a wide range of machine-learning methods in our experiments, including bag of tokens, sequence of tokens, a BERT model, and graph neural networks (GNNs), and the best of these methods matched code types to source code with upwards of 97% accuracy.
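To make the simplest of these approaches concrete, here is a minimal bag-of-tokens sketch using scikit-learn, framed as classifying a snippet by source language (one reading of the task). The sample snippets, labels, and crude regex tokenizer are illustrative stand-ins, not our experimental setup:

```python
# A minimal bag-of-tokens language classifier (a sketch, not the paper's setup).
# The (code, label) pairs below are hypothetical stand-ins for data that would
# be extracted from CodeNet.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

samples = [
    ('#include <iostream>\nint main() { std::cout << "hi"; }', "C++"),
    ('print("hi")', "Python"),
    ('public class Main { public static void main(String[] a) {} }', "Java"),
    ('package main\nimport "fmt"\nfunc main() { fmt.Println("hi") }', "Go"),
]
codes, labels = zip(*samples)

# Crude lexer: identifiers/keywords become tokens, plus single punctuation
# characters; a real experiment would use a proper per-language tokenizer.
model = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z_]\w*|\S"),
    LogisticRegression(max_iter=1000),
)
model.fit(codes, labels)

print(model.predict(["#include <cstdio>\nint main() { return 0; }"]))
```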
Code similarity
Code similarity determines whether multiple pieces of code solve the same problem. It is the foundational technique for code recommendation, clone detection, and cross-language transformation. We tested a wide spectrum of methods for code similarity (including an MLP with bag of tokens, a Siamese network with token sequences, a simplified parse tree [SPT] with handcrafted feature extraction, and a GNN over the SPT) against our benchmark datasets. The best similarity score came from a sophisticated GNN with intra-graph and inter-graph attention mechanisms.
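None of those learned models fits in a few lines, but the task itself does. As a deliberately crude point of comparison, the sketch below scores two snippets with token-set Jaccard similarity; the regex tokenizer is a stand-in, not what our benchmarks use:

```python
import re

def tokens(code: str) -> set[str]:
    # Crude lexer: identifiers/keywords plus single punctuation characters.
    return set(re.findall(r"[A-Za-z_]\w*|\S", code))

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]; 1.0 means identical token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Two solutions to the same problem should score higher than unrelated code.
loop_sum = "s = 0\nfor x in xs:\n    s += x"
builtin_sum = "s = sum(xs)"
unrelated = "print('hello world')"
print(jaccard(loop_sum, builtin_sum), jaccard(loop_sum, unrelated))
```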
Generalization across datasets
We believe that models trained on the CodeNet benchmark datasets benefit greatly from the data’s high quality. For example, we took our C++1000 benchmark and compared it against one of the largest publicly available datasets of its kind, GCJ-297, derived from problems and solutions in Google’s Code Jam. We trained the same MISIM neural code similarity system on C++1000 and on GCJ-297, then tested the two trained models on an independent third dataset, POJ-104.
Our data suggests that the model trained on GCJ-297 has a 12% lower accuracy score than the model trained on C++1000. We believe C++1000 generalizes better because it has less data bias than GCJ-297 (where the top 20 problems by number of submissions account for 50% of all submissions), and because the cleaning and de-duplication of the data in CodeNet is superior.
Code completion
We believe code completion to be a valuable use case for developers: an AI system predicts what code belongs at a given position in a code sequence. To test this, we built a masked language model (MLM), which randomly masks out (or hides) tokens in an input sequence and tries to correctly predict them, evaluated on test data it hasn’t seen. We trained a popular BERT-like attention model on our C++1000 benchmark and achieved a top-1 prediction accuracy of 91.04% and a top-5 accuracy of 99.35%.
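For readers unfamiliar with the technique, the sketch below shows just the masking step of MLM training: hide a random subset of tokens and record which originals the model must recover. The 15% rate and [MASK] token follow the common BERT convention, not necessarily our exact training configuration:

```python
import random

MASK, RATE = "[MASK]", 0.15  # standard BERT masking convention (an assumption here)

def mask_tokens(tokens: list[str], seed: int = 0):
    """Replace ~RATE of the tokens with [MASK]; return (masked input, targets).

    targets[i] holds the original token where a mask was applied, else None.
    The model is trained to predict exactly those hidden tokens.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < RATE:
            masked.append(MASK)
            targets.append(tok)  # what the model must recover
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

code = "int main ( ) { return 0 ; }".split()
print(mask_tokens(code))
```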
The rich metadata and language diversity open CodeNet to a plethora of interesting and practical use cases. The code samples in CodeNet are labeled with their anonymized submitter and acceptance status, so we can readily extract realistic pairs of buggy and fixed code from the same submitter for automated code repair. A large percentage of the code samples come with inputs, so we can execute the code to extract CPU run time and memory footprint, which can be used for regression studies and prediction. CodeNet can also be used for program translation, given its wealth of programs written in a multitude of languages. The large number of code samples in popular languages (such as C++, Python, Java, and C) make good training datasets for the novel and effective monolingual approaches invented in the past several years.
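As a sketch of the buggy/fixed pairing idea, the snippet below groups one problem’s submissions by submitter and pairs a rejected attempt with an accepted one. The file path and column names (user_id, status, date, submission_id) reflect our reading of the released per-problem metadata CSVs; verify them against your copy of the dataset:

```python
import pandas as pd

# Assumed layout: one metadata CSV per problem in the CodeNet release.
meta = pd.read_csv("Project_CodeNet/metadata/p00001.csv")

pairs = []
for _, subs in meta.sort_values("date").groupby("user_id"):
    rejected = subs[subs["status"] == "Wrong Answer"]
    accepted = subs[subs["status"] == "Accepted"]
    if len(rejected) and len(accepted):
        # Pair the user's last wrong answer with their first accepted solution;
        # a fuller pipeline would also verify the fix postdates the bug.
        pairs.append((rejected.iloc[-1]["submission_id"],
                      accepted.iloc[0]["submission_id"]))

print(f"{len(pairs)} buggy/fixed pairs for this problem")
```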
While CodeNet isn’t the only dataset aimed at tackling the world of AI for code, we believe it to have some key differences.
Large scale: To be useful, CodeNet needs to have a large number of data samples, with a broad variety of samples to match what users might encounter when trying to code. With its 500 million lines of code, we believe that CodeNet is the largest dataset in its class: It has approximately 10 times more code samples than GCJ, and its C++ benchmark is approximately 10 times larger than POJ-104.
Rich annotation: CodeNet also includes a variety of information about its code samples, such as whether a sample solves a given problem (and, if not, the error category it falls into). Because each sample is meant to solve a posed coding problem, every task also comes with its problem statement, a sample input for execution, and a sample output for validation. This additional information isn’t available in other similar datasets.
Clean samples: To support accurate and reliable experiments on CodeNet, we analyzed the code samples for duplication and near-duplication, and used clustering to find identical problems.
CodeNet contains a total of 13,916,868 code submissions, divided across 4,053 problems. Some 53.6% of the submissions (7,460,588) are accepted, meaning they pass the prescribed tests, and 29.5% are marked as wrong answers. The remaining submissions were rejected for failing to meet runtime or memory requirements.
The problems in CodeNet are mainly pedagogical, ranging from simple exercises to sophisticated problems that require advanced algorithms, and the submitters range from beginners to experienced developers. The dataset is primarily composed of code and metadata scraped from two online judge sites, AIZU and AtCoder, which offer courses and contests where coding problems are posed and submissions are judged for correctness by an automated review process. We considered only public submissions, and manually merged the information from the two sources into a unified format from which we built a single dataset.
Because the data came from different sources, we applied a consistent UTF-8 character encoding to all the raw data we collected, removed byte-order marks, and standardized on Unix-style line feeds as the line ending.
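A minimal sketch of that normalization step might look like the following (the actual tooling in our pipeline differs, but the effect is the same):

```python
def normalize(raw: bytes) -> str:
    # "utf-8-sig" strips a leading UTF-8 byte-order mark if present;
    # errors="replace" keeps the pipeline moving on stray bad bytes.
    text = raw.decode("utf-8-sig", errors="replace")
    # Convert Windows (\r\n) and old Mac (\r) endings to Unix line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")

with open("sample_submission.cpp", "rb") as f:  # hypothetical file name
    print(normalize(f.read())[:200])
```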
We looked for duplicate problems, as many of them were compiled over decades. We also identified near-duplicate code samples, to facilitate the extraction of benchmark datasets in which data independence is desirable.
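One common way to flag near-duplicates, shown below as a stand-in for the release’s actual deduplication tooling, is Jaccard similarity over overlapping token n-grams (shingles); the 0.9 threshold is an illustrative choice:

```python
import re

def shingles(code: str, n: int = 4) -> set[tuple[str, ...]]:
    """Overlapping n-token windows; robust to small, localized edits."""
    toks = re.findall(r"\w+|\S", code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Two samples are flagged when their shingle sets mostly overlap.
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return bool(union) and len(sa & sb) / len(union) >= threshold

print(near_duplicate("s = 0\nfor x in xs: s += x", "t = 0\nfor x in xs: t += x"))
```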
For users’ convenience, we provide benchmark datasets for the dominant languages (C++, Python, and Java). Each code sample and its associated problem are unique, and all samples have passed through the pre-processing tools we provide, ensuring they can be effectively converted into machine-learning model inputs. Users can also create benchmark datasets customized to their specific purposes with the data filtering and aggregation tools in our GitHub.
This is just the start for our vision of what CodeNet can offer to the world of AI for code. We hope to achieve widespread adoption of the dataset to spur on innovation in using AI to modernize the systems we all rely on every day.
In the near future, we will launch a series of challenges based on the CodeNet data. The first is a challenge for data scientists to develop AI models, using CodeNet, that can identify code functionally similar to another piece of code. This challenge was launched in partnership with Stanford’s Global Women in Data Science organization, and we’ve organized workshops to introduce the topic of code similarity and provide educational material. Every participating team comprises at least 50% women, to encourage diversity in this exciting area of AI for Code.
We envision a future where a developer can build on legacy code in a language they’re accustomed to. They could write in Python, and an AI system could convert it into fully executable COBOL, indefinitely extending the life and reliability of the system they’re working on. We see the potential for AI systems that can evaluate a developer’s code against thousands of examples of past code and suggest ways to improve it, or even write a more efficient version themselves. In short, we’ve begun to explore how computers can program the ones that succeed them. But for CodeNet to succeed, we need developers to start using what we’ve built.
For more information on how to download, read, and use the CodeNet dataset, please visit our GitHub.
06 Dec 2021
¹ Puri, R., Kung, D., Janssen, G., et al. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. arXiv:2105.12655 (2021).