Evaluation on a comprehensive set of tasks has shown that these Granite code models consistently match state-of-the-art performance among open-source code LLMs currently available. The versatile model family was optimized for enterprise software development workflows and performs well across a range of coding tasks, including code generation, fixing, and explanation.
These models are available on Hugging Face, GitHub, watsonx.ai, and RHEL AI, Red Hat’s new foundation model platform for developing, testing, and deploying generative AI models. The underlying base code models are the same ones used to train WCA for specialized domains.
All the models were trained on data that was collected in adherence with IBM’s AI ethics principles and with the IBM legal team’s guidance for trustworthy enterprise use. These Granite Code models are released today under the Apache 2.0 license.
We’re also releasing Granite Code Instruct models. These variants were fine-tuned from the base models using a combination of Git commits paired with human instructions and open-source, synthetically generated code instruction datasets.
We believe in the power of open innovation, and to get to a future where writing code is as easy as talking to an always-on assistant, we want to reach as many developers as possible. No effective system is ever created by a single individual — the best work builds on the collective knowledge of those who have come before.
While the general popularity of generative AI models has skyrocketed in recent years, enterprise adoption has been slower — for good reason. In the wider world of LLM research and deployment, the major models have now grown to tens of billions of parameters, many with 70 billion or more. While that’s useful for organizations looking to build generalized chatbots that understand a wide range of subjects, these models are computationally expensive to train and run. For enterprises, massive models can become unwieldy for more specific tasks, full of irrelevant information and running up high inferencing costs.
Many enterprises have been reluctant to adopt LLMs for commercial purposes for several reasons beyond just the cost. The licensing of these models is often unclear, and how these models were trained, and how the data was cleaned and filtered for things like hate, abuse, and profanity are often unknown.
"We are transforming the generative AI landscape for software by releasing the highest performing, cost-efficient code LLMs, truly empowering the open community to innovate on top for many use cases, without any restrictions — for research, commercial use cases, and beyond," said Ruchir Puri, chief scientist at IBM Research, who leads IBM’s efforts to bring coding assistants to the world. "I am very excited about the future of software with generative AI."
Puri believes that for many enterprise use cases, the 8B Granite code model variant we’ve released will be the right combination of weight, cost to run, and capability. But we’re also offering lighter and weightier versions that anyone in the open-source community can try out and see if they better fit their needs.
For many developers, writing code is not actually what takes up most of their time. Instead, it’s testing what they’ve written, ensuring it runs as intended, and finding and fixing any bugs that arise. Right now, a developer’s workflow might see them constantly jumping between whatever code they’re working on and various online forums to figure out answers to their issues. It’s disjointed and often time-consuming.
With tools built on the IBM Granite code models, we envision a myriad of enterprise use cases for developers. These could range from agents that write code for developers to tools that explain why code isn’t working and how to fix it. Many of the other quotidian but essential tasks that are part of a developer’s day — from generating unit tests, to writing documentation or running vulnerability tests — could be automated with these models.
And we see value in using these models to modernize mission-critical applications that need to remain secure, resilient, and most importantly, online. With generative systems built on Granite models, developers can create new ways to translate legacy codebases like COBOL into more modern languages like Java. It’s one of the major uses for code models that IBM saw when first diving into the world of AI for code, and remains one of the most important.
For the 34B version of the model, we used a novel method called depth upscaling. First, we created a duplicate of the 20B variant, which has 52 layers. We removed the final eight layers from the first copy and the first eight layers from the second, then merged the two to create a new model with 88 layers. We used the same 8,192-token context window when pre-training both the 20B and 34B models.
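The layer arithmetic behind depth upscaling can be sketched in a few lines. This is an illustrative toy, not IBM’s training code: it treats the layer stack as a plain list and shows how dropping eight layers at the seam of two 52-layer copies yields the 88-layer model described above.

```python
# Toy sketch of depth upscaling: merge two copies of a layer stack,
# dropping `overlap` layers at the seam. Layer counts follow the article
# (20B variant: 52 layers; merged 34B model: 88 layers).

def depth_upscale(layers, overlap=8):
    """Return a deeper stack built from two copies of `layers`:
    the first copy minus its final `overlap` layers, followed by
    the second copy minus its first `overlap` layers."""
    first = layers[:-overlap]   # drop the final 8 layers of copy one
    second = layers[overlap:]   # drop the first 8 layers of copy two
    return first + second

base = list(range(52))          # stand-in for the 20B model's 52 layers
merged = depth_upscale(base)
print(len(merged))              # 88  (44 + 44)
```

In an actual model-merging pipeline the list elements would be transformer blocks with weights, but the bookkeeping is the same: 52 − 8 layers from each copy gives 44 + 44 = 88.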
In testing against a range of other models, including those released under Apache 2.0 licenses as well as proprietary models, we found our models competitive across a range of tasks. Testing on benchmarks including HumanEvalPack, HumanEvalPlus, and RepoBench, we saw strong performance on code synthesis, fixing, explanation, editing, and translation across most major programming languages, including Python, JavaScript, Java, Go, C++, and Rust.
Our models can outperform some models twice their size, such as Code Llama. And while other models may perform slightly better on individual tasks like code generation, no single model performed at a high level across generation, fixing, and explanation — apart from Granite.
Model | MATH | GSM8K | SAT | OCW | MATH+Py | GSM8K+Py |
---|---|---|---|---|---|---|
StarCoderBase-7B | 2.4 | 3.8 | 18.7 | 2.2 | 18.2 | 15.6 |
CodeLlama-7B | 4.1 | 11.9 | 12.5 | 2.9 | 20.8 | 26.8 |
StarCoder2-7B | 10.4 | 27.2 | 37.5 | 4.8 | 28.7 | 39.4 |
CodeGemma-7B | 21.8 | 49.0 | 53.1 | 6.9 | 31.1 | 60.9 |
Granite-8B-Code-Base | 21.4 | 61.9 | 62.5 | 8.8 | 35.4 | 63.1 |
Gemma-7B | 24.1 | 53.3 | 75.0 | 7.3 | 27.4 | 52.9 |
Mistral-7B-v0.2 | 12.8 | 37.2 | 53.1 | 5.8 | 25.7 | 45.6 |
Llama-3-8B | 15.6 | 49.8 | 34.4 | 9.9 | 0.0* | 2.4 |
Llemma-7B | 17.3 | 33.7 | 59.4 | 7.0 | 25.6 | 40.8 |
\*The researchers noticed that Llama-3-8B-Base tends to generate invalid programs given the same prompts as the other models, resulting in very low scores on the MATH+Py and GSM8K+Py tasks.
The models have a unique blend of data sources that the team believes sets them apart. They used GitHub Code Clean, StarCoderData, and other public code repositories and issues on GitHub. Combined with the robust metadata in CodeNet, which outlines code issues in plain English, they mixed the code sources, natural language documentation, and code problems in a specific way to train the models.
The base models were trained from scratch on between 3 and 4 trillion tokens spanning 116 programming languages, plus 500 billion tokens from a carefully designed mixture of high-quality code and natural language data, which improved the models’ reasoning and problem-solving skills — both essential for code generation.
With the Granite code models, we’re releasing models to the community that we think stack up to just about any comparable model. We’re excited to see what will be built with these models, whether that’s new code generation tools, state-of-the-art editing software, or anything in between.
And this is just one aspect of the wider Granite family of models from IBM that have been designed, incubated, and released from within IBM Research. In the coming weeks, we’ll share more about other models and modalities that we believe will help shape the future of computing in exciting new ways.