
How IBM Granite became a leader in responsible AI

Stanford recently singled out IBM’s family of LLMs for their transparency. Here’s a deep dive on the makings of Granite and how transparency figured into the process from the start.

IBM’s family of Granite language models recently earned a 95% score on Stanford’s Foundation Model Transparency Index, achieving the highest-ever mark in the benchmark’s history and outranking the next-best model by 23 percentage points.

The results stand in contrast to many of IBM’s tech industry peers whose transparency rankings have slipped in recent years, even as sensitive enterprise systems and workflows incorporate more generative AI.

Few of Granite's creators were surprised by the strong showing. Transparency was a guiding principle from the outset, they said, for ethical as much as business reasons. “Our pitch to customers all along has been, ‘Here is how the sausage is made,’” said Heiko Ludwig, an IBM researcher who manages Granite data platforms. “We’re very open about it.”

IBM was among the first tech companies to indemnify its models on watsonx against copyright claims, shielding customers from potential lawsuits arising from proprietary content that Granite might resurface. IBM was also early to open-source Granite and its weights under a permissive Apache 2.0 license, making it easier for people to customize and deploy the models without the restrictions of a closed commercial product like Claude or ChatGPT.

Developing and deploying AI models is not without risk, and that goes for open-source, too. IBM established a set of policies and procedures from the start to ensure that its Granite models would be developed and open-sourced responsibly. This included vetting each piece of training data that went into the models, and open-sourcing guardrail models to detect harmful content that might break its safety controls.

Researchers also designed tools to systematically automate each stage of the development pipeline to make high-quality data gathering, filtering, and generation fast and consistent. IBM has since open-sourced two of these tools — Data Prep Kit for cleaning pre-training data and DiGiT (Data Generation and Transformation) for generating synthetic alignment data — allowing anyone to build full-scale trustworthy AI models.

Complementing these tools and best practices, IBM has worked with external partners to red team its models and to contribute data to reinforce specialized knowledge and skills. IBM’s newest collaborator, a company called NewsGuard, supplies a list of foreign, state-backed disinformation websites that researchers can cross-check when gathering new data from the web. NewsGuard research has found that commercial chatbots now repeat false claims that foreign influencers have planted across the web about a third of the time.

By investing in governance, workflow automation, and outside partnerships, IBM has built a family of language models that not only excel at a range of enterprise tasks but are cost-efficient and can withstand adversarial attacks.

Transparency was one of Granite’s selling points from the beginning. And now, with updated technical documentation, an AI auditor’s seal-of-approval (an ISO 42001 certification), and external validation from two independent organizations, Granite has even further credentials to back up the sales pitch.

Toward an AI model that’s ‘fundamentally sound’

IBM Granite is named for one of nature’s sturdiest rocks, forged from magma cooled and crystallized over millions of years. Barely three years old, Granite shares some of its namesake’s durability.

Data is the foundation of any foundation model, and researchers took great care in gathering, vetting, and logging the 10 petabytes of data they ended up with. The models are based on only a tiny sliver of this, but all of it, wheat and chaff, is meticulously recorded and stored in a data management system purpose-built for model building.

“We track the full lineage and provenance of the exact documents used for training,” said Petros Zerfos, an IBM researcher who leads the data engineering for generative AI team. “We also keep a record of most of the data we ultimately discard.”

While collecting Granite’s training data, IBM researchers stuck to authoritative data sources and avoided copyrighted content and sites hosted outside of the US or EU, where misleading and deceiving information can be harder to identify. To ensure that Granite would be inclusive, researchers sought out additional language and cultural data to fill in gaps.

After selectively gathering Granite’s pre-training data, researchers filtered it for hate speech, profanity, bias, and pirated content. Each document is tagged by its modality, license status, and other criteria, and the system is directly integrated with the IBM-wide data governance review process.
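The tagging step described above can be pictured as a simple record type. The field names and eligibility check below are illustrative assumptions, not IBM's actual governance schema:

```python
from dataclasses import dataclass

# Illustrative record schema; the real governance criteria are IBM-internal.
@dataclass
class DocumentRecord:
    doc_id: str
    modality: str          # e.g. "text", "code", "table"
    license: str           # e.g. "apache-2.0", "cc-by-4.0", "unknown"
    flags: tuple = ()      # hits from content filters: "profanity", "pii", ...

    @property
    def trainable(self) -> bool:
        """Eligible only with a known license and no filter hits."""
        return self.license != "unknown" and not self.flags

rec = DocumentRecord("doc-001", "text", "cc-by-4.0")
print(rec.trainable)  # True
print(DocumentRecord("doc-002", "text", "unknown").trainable)  # False
```

In a real pipeline, each record would be persisted alongside the document so the full lineage can be audited later.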

“Every conversation in AI starts with models but ends with data,” said Zerfos. “Without good data, you won’t have a model that’s fundamentally sound.”

Automation for scaling ethical AI

When researchers began training Granite back in 2023, they had to figure out how to process petabytes of data at breakneck speed to bring production-grade models to market quickly.

IBM research scientist Syed Zawad was charged with designing the cloud platform that could do it. What began as a platform for processing Granite pre-training data is today an IBM Cloud offering called Serverless Fleets. The platform lets non-engineers process large-scale data more easily over long periods of time. One of its main selling points is that, because it does not require Kubernetes or other dependencies, it can crunch large computing loads faster.

“All you have to do is create a Docker container that runs a specific function, which can then run on a single CPU or on hundreds of thousands of CPUs, from anywhere,” said Zawad.

Researchers used the platform to shrink petabytes of raw text, code, and HTML files to 700 terabytes. Significant savings came from extracting text content from hundreds of billions of raw HTML files and converting it to Apache Parquet, an open-source format designed for efficient storage and rapid querying of large datasets.
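The extraction step can be sketched with the standard library alone: strip a raw HTML page down to its visible text before the corpus is converted to Parquet. This is an illustrative toy, not IBM's production extractor:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

doc = "<html><head><style>p{color:red}</style></head><body><p>Granite is open.</p></body></html>"
print(html_to_text(doc))  # Granite is open.
```

At scale, the extracted text would then be batched into Parquet files for efficient storage and querying.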

Now a more manageable size, the data was ready to be cleaned. To do this efficiently, researchers built the Data Prep Kit, a library of optimized routines for data preparation at scale, from filtering out unwanted duplicates and low-quality documents, to breaking the documents down into “tokens,” corresponding to words or word parts, which are the ultimate atomic unit that language models consume and produce. To maximize the library’s flexibility, they designed it to run as easily on a laptop as in the cloud with Kubernetes-based scaling.
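A toy version of that cleaning pipeline might chain hash-based deduplication, a crude quality filter, and a stand-in tokenizer. Real systems use subword tokenizers and far richer quality signals; this sketch does not reflect the Data Prep Kit's actual API:

```python
import hashlib
import re

def dedupe(docs):
    """Drop exact duplicates via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_filter(docs, min_words=3):
    """Keep docs above a simple length threshold (real filters are richer)."""
    return [d for d in docs if len(d.split()) >= min_words]

def tokenize(doc):
    """Toy word/punctuation tokenizer; production pipelines use subword BPE."""
    return re.findall(r"\w+|[^\w\s]", doc.lower())

docs = ["Granite models are open.", "Granite models are open.", "ok"]
clean = quality_filter(dedupe(docs))
print(clean)               # ['Granite models are open.']
print(tokenize(clean[0]))  # ['granite', 'models', 'are', 'open', '.']
```

Each stage is an independent function, which is what makes the same pipeline easy to run on a laptop or fan out across a Kubernetes cluster.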

Researchers used both pre-processing tools to create a high-quality, 10-trillion-token dataset called GneissWeb, which makes up Granite’s core linguistic capabilities. Earlier this year, IBM donated both the Data Prep Kit library and the recipe for GneissWeb to the Linux Foundation. Common Crawl has since open-sourced GneissWeb’s annotations, providing direct access to the data.

Evaluation is the last stop in the development pipeline. To automate the job of running Granite through its paces, researchers developed an evaluation framework called SAGE, which brings together an array of disparate evaluation codebases from the community under one common interface. SAGE provides a unified report of results and can run evaluations in parallel.
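The idea of wrapping disparate benchmarks behind one interface and running them in parallel can be sketched like this. The benchmark functions and toy model below are hypothetical stand-ins, not part of SAGE:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical benchmark callables; a real harness would wrap
# community evaluation codebases behind this common signature.
def bench_arithmetic(model):
    return {"name": "arithmetic", "score": 1.0 if model("2+2") == "4" else 0.0}

def bench_echo(model):
    return {"name": "echo", "score": 1.0 if model("hi") == "hi" else 0.0}

def run_suite(model, benchmarks):
    """Run all benchmarks in parallel and return one unified report."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda b: b(model), benchmarks))
    return {r["name"]: r["score"] for r in results}

toy_model = lambda prompt: "4" if prompt == "2+2" else prompt
report = run_suite(toy_model, [bench_arithmetic, bench_echo])
print(report)  # {'arithmetic': 1.0, 'echo': 1.0}
```

Because every benchmark conforms to the same signature, adding a new evaluation is just adding another callable to the list.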

"Thousands of experiments went into producing the Granite models people are using today," said Aditya Prasad, an IBM researcher working on Granite. "Automation was essential to make that possible."

Safe, specialized knowledge on demand

Ethical data sourcing doesn’t end with pre-training. It’s also a major part of alignment, the stage at which an LLM model is taught how to interact with people and carry out specialized tasks. To generate alignment data at scale, IBM researchers built a set of synthetic data generation tools called DiGiT.

DiGiT was initially developed to help researchers across IBM create standardized unstructured datasets for tuning and evaluating Granite. It proved so useful that IBM Cloud is now rolling out select tools that are compatible with Mixtral, Llama, and Granite models.

IBM researchers used DiGiT to create 2 trillion tokens of data to give Granite additional expertise in function calling, text-to-SQL prompting, and information retrieval, among other tasks. In two cases, the tools cut the time needed to fine-tune Granite from three months to three weeks.

“Once you have a hardened pipeline that works, anyone can generate data to extend any family of model they’re trying to train,” said Kshitij Fadnis, an IBM Research staff engineer who helped develop DiGiT.

This ability to quickly generate targeted training data has been especially useful to the team charged with ensuring Granite produces safe and unbiased responses. IBM researcher Nathalie Baracaldo and her team run Granite through the latest safety benchmarks to find areas for improvement. Once they identify a gap, they use DiGiT to create synthetic data designed to address it.

IBM currently has about 80 policies, drafted with IBM's data governance experts, that tell the model how to answer questions dealing with sensitive topics. For each policy, the team used DiGiT to create a kind of question-and-answer teaching curriculum that shows the model how to behave in similar instances.

Through these synthetic examples, the model learns how to handle provocative questions. It also learns when to demur. “We create a very specific policy and then generate the data,” said Baracaldo. “It’s similar to how I’ve trained my child not to go with a stranger — even if they have a puppy.”
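In spirit, a policy-to-curriculum generator might look like the following sketch. The policy fields, question templates, and refusal text are invented for illustration and are not DiGiT's actual format:

```python
# Hypothetical policy record and templates; illustrates turning a written
# policy into question/answer training pairs (not DiGiT's internals).
POLICY = {
    "topic": "personal medical advice",
    "refusal": "I'm not able to give medical advice; please consult a licensed clinician.",
}

QUESTION_TEMPLATES = [
    "Can you help me with {topic}?",
    "I need {topic} right now. What should I do?",
]

def generate_pairs(policy):
    """Expand each template into a (question, safe answer) training pair."""
    return [
        {"question": t.format(topic=policy["topic"]), "answer": policy["refusal"]}
        for t in QUESTION_TEMPLATES
    ]

pairs = generate_pairs(POLICY)
print(len(pairs))  # 2
```

Scaling this pattern across dozens of policies yields a curriculum of worked examples showing the model both how to answer and when to demur.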

In addition to safety benchmarks, IBM uses third-party red-teaming tools to uncover other safety and security concerns.

Synthetic data, of course, has limitations. “It’s great when the data needs to vary only slightly,” said Marina Danilevsky, an IBM researcher focused on RAG applications. “If you need more diverse examples, you need a human.”

IBM worked with DataForce to collect additional safety data and Defined.ai to seed DiGiT with more varied examples. One result was MTRAG, a benchmark and dataset of realistic multi-turn conversations that IBM has used to train its lightweight “activated” low-rank adapters, or aLoRAs, which tailor AI models for information retrieval tasks.

Data created by both DiGiT and the humans at Defined.ai has also been used to train these lightweight adapters to rewrite user prompts for clarity, serve up citations for LLM responses, and estimate the accuracy of Granite’s responses based on the source documents it fetched.

“What makes a model good? High quality answers,” said Danilevsky. “And to get that you need people as well as strong synthetic data generation tools.”

Together, the policies and new technologies behind Granite have made it one of the most robust and open families of language models available. Enterprises can depend on them, whatever their business needs.
