
Reasoning in Granite 3.2 using inference scaling

Inference scaling is an active and exciting area of research in AI. Essentially, it’s the concept of using more compute at inference time to dramatically improve the performance of an LLM. At IBM, we've been developing innovative new inference scaling techniques to improve the performance of our models on reasoning tasks. 

We're excited to share results from applying these techniques to Granite 3.2, the newest generation of our Granite model series. In our tests, inference scaling delivered a significant boost in performance on code and math reasoning tasks.

These improvements, often upwards of 20 points on standard benchmark metrics, enable our 8B parameter Granite 3.2 model to exceed the performance of other strong proprietary models, such as GPT-4o-0513 and Claude-3.5-Sonnet. Using inference-time compute to boost reasoning performance on math tasks complements the broader, general-purpose reasoning capabilities built into Granite 3.2.

What is inference scaling?

Anyone who has played with an LLM knows that when the “temperature” setting is set to some value other than zero, repeatedly asking the same question yields different answers. The quality of those answers will in general vary, and picking the best one is a sure way to improve a model’s output, at the cost of performing multiple generations. To choose the best answer automatically, one can rely on a reward model: a sort of “companion” model that scores all the candidate answers, from which we then select the one with the highest score.
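This best-of-N recipe is easy to sketch. Below is a minimal, self-contained illustration; `toy_generate` and `toy_reward` are hypothetical stand-ins for an LLM sampled at nonzero temperature and a reward model, not real APIs:

```python
import itertools

def best_of_n(generate, reward, prompt, n=8):
    """Best-of-N sampling: draw n candidate answers, score each with
    a reward model, and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Hypothetical stand-ins: a "model" that cycles through answers of
# varying quality, and a reward model that recognizes the good one.
_samples = itertools.cycle(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"])

def toy_generate(prompt):
    return next(_samples)

def toy_reward(prompt, answer):
    return 1.0 if answer.endswith("= 4") else 0.0

answer, score = best_of_n(toy_generate, toy_reward, "What is 2 + 2?")
# answer == "2 + 2 = 4"
```

In practice the reward model is itself a neural network, and the cost of the extra generations is the inference-time compute being "scaled."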

Reasoning using inference scaling

In the context of reasoning tasks, this idea of scoring multiple answers to pick the best one can also be applied to the “chain of thought” that often precedes answer generation. In fact, you don’t need to wait for the entire reasoning chain to be completed before judging whether the reasoning is good. A reward model can guide the reasoning process in steps, letting the LLM know when the reasoning may be taking a wrong turn. Reward models that are able to score partial generations are often termed process reward models (or PRMs).
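In other words, rather than scoring only finished answers, the PRM scores partial chains after every step, and the search keeps the most promising continuation. A minimal greedy sketch of this step-by-step guidance (`propose_step` and `prm_score` are hypothetical stand-ins, not a real LLM or PRM):

```python
import itertools

def guided_reasoning(propose_step, prm_score, question, width=4, max_steps=6):
    """Stepwise search guided by a process reward model (PRM):
    at each step, propose several candidate next reasoning steps,
    score each *partial* chain with the PRM, and keep the best one."""
    chain = []
    for _ in range(max_steps):
        candidates = [propose_step(question, chain) for _ in range(width)]
        scored = [(prm_score(question, chain + [c]), c) for c in candidates]
        best_score, best_step = max(scored, key=lambda t: t[0])
        chain.append(best_step)
        if best_step.startswith("Answer:"):
            break
    return chain

# Hypothetical proposer: alternates a sound step and a flawed one,
# and produces a final answer once two steps have accumulated.
_flip = itertools.cycle([True, False])

def toy_propose(question, chain):
    good = next(_flip)
    n = len(chain)
    if n >= 2:
        return "Answer: 4" if good else "Answer: 5"
    return f"step {n}: sound" if good else f"step {n}: flawed"

def toy_prm(question, chain):
    # Hypothetical PRM: rewards sound steps, penalizes flawed ones.
    return sum(1.0 if ("sound" in s or s == "Answer: 4") else -1.0
               for s in chain)

chain = guided_reasoning(toy_propose, toy_prm, "What is 2 + 2?", width=2)
# chain == ["step 0: sound", "step 1: sound", "Answer: 4"]
```

The key point is that the PRM is called on *partial* chains, so a wrong turn is pruned immediately instead of wasting compute on completing a doomed line of reasoning.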

You can enable reasoning using inference scaling by combining three ingredients: an LLM, a PRM, and a search algorithm to explore the space of possible reasoning paths. This is different from the approach popularized by DeepSeek using long chains of thought. Both approaches use inference compute to obtain better answers. However, the inference scaling method uses two models in tandem to conduct the search, whereas DeepSeek’s approach uses a single model that reflects on its own progress.

An advantage of combining an LLM, a PRM, and a search technique is flexibility and modularity. You can reliably get better answers simply by increasing the inference-time compute budget; you can swap in PRMs optimized for specific tasks; and innovations in the search algorithm can be plugged into the system without any changes to the underlying LLM.

Granite 3.2 reasoning using inference scaling

At IBM and Red Hat, we've been developing and evaluating multiple inference scaling techniques to understand how Granite 3.2 can take advantage of runtime scaling to deliver advanced reasoning for code and math tasks. We have evaluated three different approaches to activate inference scaling: 

  • The first approach uses a novel search algorithm inspired by ideas from classical probabilistic inference, in particular, particle filtering. We use this technique, in conjunction with Qwen2.5-Math-PRM-7B as the process reward model, directly on top of Granite 3.2.
  • Our second approach applies a more conventional majority voting technique, in which multiple answers are sampled from Granite and the most frequent one is chosen. To apply this technique, we leveraged Granite 3.2’s native capability to generate longer chains of thought (about 50% longer than Granite 3.1’s) on math problems.
  • Our final approach uses the same majority voting technique, but applied to an experimental version of Granite 3.2, denoted Granite 3.2 (primed), that has been encouraged to generate even longer chains of thought using supervised fine-tuning and reinforcement learning techniques.

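The particle-filtering idea in the first approach can be sketched in a few lines. The code below is a heavily simplified illustration, not the actual algorithm; `toy_propose` and `toy_prm` are hypothetical stand-ins for an LLM step proposer and a PRM:

```python
import math
import random

def particle_filter(propose_step, prm_score, question,
                    n_particles=8, max_steps=4):
    """Particle-filtering sketch for inference scaling: maintain a
    population of partial reasoning chains ("particles"), extend each
    by one step, weight chains by PRM score, and resample so that
    compute concentrates on the most promising reasoning paths."""
    particles = [[] for _ in range(n_particles)]
    for _ in range(max_steps):
        particles = [p + [propose_step(question, p)] for p in particles]
        weights = [math.exp(prm_score(question, p)) for p in particles]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Multinomial resampling: redraw the population in proportion
        # to each particle's weight.
        particles = random.choices(particles, weights=probs, k=n_particles)
        if all(p[-1].startswith("Answer:") for p in particles):
            break
    # Return the highest-scoring completed chain.
    return max(particles, key=lambda p: prm_score(question, p))

def toy_propose(question, chain):
    # Hypothetical proposer: each step is sound or flawed at random,
    # with a final answer once two steps have accumulated.
    n = len(chain)
    if n >= 2:
        return random.choice(["Answer: 4", "Answer: 5"])
    return random.choice([f"step {n}: sound", f"step {n}: flawed"])

def toy_prm(question, chain):
    # Hypothetical PRM: rewards sound steps, penalizes flawed ones.
    return sum(1.0 if ("sound" in s or s == "Answer: 4") else -1.0
               for s in chain)

random.seed(0)
best = particle_filter(toy_propose, toy_prm, "What is 2 + 2?")
```

Increasing `n_particles` is exactly the kind of knob meant by "increasing the inference-time compute budget": more particles explore more of the reasoning space before committing to an answer.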
To measure performance, we use two standard math reasoning benchmarks: MATH500 and AIME 2024. While the tasks in both benchmarks are challenging, they share the property that it is relatively easy to verify whether a given answer is correct. As reference points, we report performance numbers from other well-known models, including DeepSeek’s R1, OpenAI’s o1-mini and GPT-4o, Alibaba’s QwQ-32B-Preview, and Anthropic’s Claude-3.5-Sonnet, which are depicted as horizontal lines.

MATH500 and AIME 2024

In both figures, the x-axis is a measure of the inference-time compute used to generate results, roughly corresponding to the number of particles in the particle filtering technique and the number of samples in the majority voting technique. Following standard convention, the metric on the y-axis is pass@1, an estimate of the probability that the model produces the correct answer in a single attempt. For the reference models, we report only the average pass@1, since the point of our analysis is to contrast scaled inference on smaller models with a single inference on more expensive models.
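The pass@1 number above is the k = 1 special case of the standard unbiased pass@k estimator: draw n samples, count the c correct ones, and compute the probability that at least one of k randomly chosen samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled answers of which c
    are correct, the probability that at least one of k randomly
    chosen samples is correct is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of correct samples:
# 40 correct answers out of 100 samples gives pass@1 of 0.4.
score = pass_at_k(100, 40, 1)
```

Averaging this estimate over all problems in a benchmark yields the y-axis values in the plots.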

[Figure: pass@1 versus inference-time compute for Granite 3.2 on MATH500 and AIME 2024, with reference models shown as horizontal lines]

As both plots show, Granite 3.2 is able to take advantage of inference scaling techniques to dramatically boost performance on both MATH500 and AIME 2024. For instance, using the particle filtering approach, Granite 3.2’s performance on MATH500 jumps by over 60%, and its performance on AIME 2024 grows by a factor of 5. We also see that even simple techniques like majority voting yield dramatic improvements, as they exercise the model’s native ability to generate chains of thought for math. Furthermore, we see additional gains from majority voting when Granite 3.2 is “primed” to generate longer chains of thought: in both graphs, the primed Granite 3.2 model outperforms the base Granite 3.2 model.

Finally, and most significantly, with these inference scaling techniques, our 8B parameter Granite 3.2 model is able to exceed the performance of much larger models like GPT-4o-0513 and Claude3.5-Sonnet-1022 on both benchmarks.

What’s next

Inference-time scaling has emerged as a powerful technique for improving model performance, sometimes quite significantly. We have shown early innovations from our labs that use inference scaling techniques to enable our 8B Granite 3.2 model to demonstrate state-of-the-art math reasoning capabilities, even surpassing those of other well-known proprietary frontier models. This is an active area of research and development for us as we continue to enhance our Granite models to deliver state-of-the-art capabilities at an effective, optimized cost point.
