Evaluation on a comprehensive set of tasks has shown that these Granite code models consistently match state-of-the-art performance among open-source code LLMs currently available. The versatile model family was optimized for enterprise software development workflows and performs well across a range of coding tasks, including code generation, fixing, and explanation.
These models are available on Hugging Face, GitHub, watsonx.ai, and RHEL AI, Red Hat’s new foundation model platform for developing, testing, and deploying generative AI models. The underlying base code models are the same ones used to train WCA for specialized domains.
All the models were trained on data that was collected in adherence with IBM’s AI ethics principles and with the IBM legal team’s guidance for trustworthy enterprise use. These Granite Code models are released today under the Apache 2.0 license.
We’re also releasing Granite Code Instruct models. These variants were fine-tuned from the base models using a combination of Git commits paired with human instructions and open-source, synthetically generated code instruction datasets.
We believe in the power of open innovation, and to get to a future where writing code is as easy as talking to an always-on assistant, we want to reach as many developers as possible. No effective system is ever created by a single individual — the best work builds on the collective knowledge of those who have come before.
While the general popularity of generative AI models has skyrocketed in recent years, enterprise adoption has been slower — for good reason. In the wider world of LLM research and deployment, the major models have now grown to tens of billions of parameters, many with 70 billion or more. While that’s useful for organizations looking to build generalized chatbots that understand a wide range of subjects, these models are computationally expensive to train and run. For enterprises, massive models can become unwieldy for more specific tasks, full of irrelevant information and running up high inferencing costs.
Many enterprises have been reluctant to adopt LLMs for commercial purposes for several reasons beyond just the cost. The licensing of these models is often unclear, and how these models were trained, and how the data was cleaned and filtered for things like hate, abuse, and profanity are often unknown.
"We are transforming the generative AI landscape for software by releasing the highest performing, cost-efficient code LLMs, truly empowering the open community to innovate on top for many use cases, without any restrictions — for research, commercial use cases, and beyond," said Ruchir Puri, chief scientist at IBM Research, who leads IBM’s efforts to bring coding assistants to the world. "I am very excited about the future of software with generative AI."
Puri believes that for many enterprise use cases, the 8B Granite code model variant we’ve released will be the right combination of weight, cost to run, and capability. But we’re also offering lighter and weightier versions that anyone in the open-source community can try out and see if they better fit their needs.
For many developers, writing code is not actually what takes up most of their time. Instead, it’s testing what they’ve written, ensuring it runs as intended, and finding and fixing any bugs that arise. Right now, a developer’s workflow might see them constantly jumping between whatever code they’re working on and various online forums to figure out answers to their issues. It’s disjointed and often time-consuming.
With tools built on the IBM Granite code models, we envision a myriad of enterprise use cases for developers. These could range from agents that write code for developers to tools that explain why code isn’t working and how to fix it. Many of the other quotidian but essential tasks that are part of a developer’s day — from generating unit tests, to writing documentation or running vulnerability tests — could be automated with these models.
And we see value in using these models to modernize mission-critical applications that need to remain secure, resilient, and most importantly, online. With generative systems built on Granite models, developers can create new ways to translate legacy codebases like COBOL into more modern languages like Java. It’s one of the major uses for code models that IBM saw when first diving into the world of AI for code, and remains one of the most important.
For the 34B version of the model, we used a novel method called depth upscaling. First, we created a duplicate of the 20B variant, which has 52 layers. We removed the final eight layers from the first copy and the first eight layers from the second, then merged the two to create a new model with 88 layers. We used the same 8,192-token context window when pre-training both the 20B and 34B models.
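The layer arithmetic behind depth upscaling can be sketched in a few lines. This is an illustrative toy, not IBM’s training code: it treats the layer stack as a plain list and shows how dropping eight layers at the seam of two 52-layer copies yields the 88-layer model described above.

```python
# Toy sketch of depth upscaling: merge two copies of a layer stack,
# dropping `overlap` layers at the seam. Layer counts follow the article
# (20B variant: 52 layers; merged 34B model: 88 layers).

def depth_upscale(layers, overlap=8):
    """Return a deeper stack built from two copies of `layers`:
    the first copy minus its final `overlap` layers, followed by
    the second copy minus its first `overlap` layers."""
    first = layers[:-overlap]   # drop the final 8 layers of copy one
    second = layers[overlap:]   # drop the first 8 layers of copy two
    return first + second

base = list(range(52))          # stand-in for the 20B model's 52 layers
merged = depth_upscale(base)
print(len(merged))              # 88  (44 + 44)
```

In an actual model-merging pipeline the list elements would be transformer blocks with weights, but the bookkeeping is the same: 52 − 8 layers from each copy gives 44 + 44 = 88.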
In testing against a range of other models, including those released under Apache 2.0 licenses as well as proprietary models, we found our models competitive across a range of tasks. Testing on benchmarks including HumanEvalPack, HumanEvalPlus, and RepoBench, we saw strong performance on code synthesis, fixing, explanation, editing, and translation across most major programming languages, including Python, JavaScript, Java, Go, C++, and Rust.
Our models can outperform some models twice their size, such as Code Llama. And while other models may perform slightly better on individual tasks like code generation, no single model performed at a high level across generation, fixing, and explanation — apart from Granite.
Model | MATH | GSM8K | SAT | OCW | MATH+Py | GSM8K+Py |
---|---|---|---|---|---|---|
StarCoderBase-7B | 2.4 | 3.8 | 18.7 | 2.2 | 18.2 | 15.6 |
CodeLlama-7B | 4.1 | 11.9 | 12.5 | 2.9 | 20.8 | 26.8 |
StarCoder2-7B | 10.4 | 27.2 | 37.5 | 4.8 | 28.7 | 39.4 |
CodeGemma-7B | 21.8 | 49.0 | 53.1 | 6.9 | 31.1 | 60.9 |
Granite-8B-Code-Base | 21.4 | 61.9 | 62.5 | 8.8 | 35.4 | 63.1 |
Gemma-7B | 24.1 | 53.3 | 75.0 | 7.3 | 27.4 | 52.9 |
Mistral-7B-v0.2 | 12.8 | 37.2 | 53.1 | 5.8 | 25.7 | 45.6 |
Llama-3-8B | 15.6 | 49.8 | 34.4 | 9.9 | 0.0* | 2.4 |
Llemma-7B | 17.3 | 33.7 | 59.4 | 7.0 | 25.6 | 40.8 |
\*The researchers noticed that Llama-3-8B-Base tends to generate invalid programs given the same prompts as the other models, resulting in very low scores on the MATH+Py and GSM8K+Py tasks.
The models have a unique blend of data sources that the team believes sets them apart. They used GitHub Code Clean, StarCoderData, and other public code repositories and issues on GitHub. Combined with the robust metadata in CodeNet, which outlines code issues in plain English, they mixed the code sources, natural language documentation, and code problems in a specific way to train the models.
The base models were trained from scratch on between 3 and 4 trillion tokens spanning 116 programming languages, plus 500 billion tokens from a carefully designed mixture of high-quality code and natural language data, which improved the models’ reasoning and problem-solving skills — both essential for code generation.
With the Granite code models, we’re releasing models to the community that we think stack up to just about any comparable model. We’re excited to see what will be built with these models, whether that’s new code generation tools, state-of-the-art editing software, or anything in between.
And this is just one aspect of the wider Granite family of models from IBM that have been designed, incubated, and released from within IBM Research. In the coming weeks, we’ll share more about other models and modalities that we believe will help shape the future of computing in exciting new ways.