
IBM Granite leads open-source LLMs on API calling

IBM’s Granite 20B model tops several benchmarks ranking large language models by how reliably they connect to external software tools.


Large language models are breaking out of chat. Now, they’re being used to help with tasks like finding software bugs, planning a business trip, and checking how much money is in the bank. Pulling off these kinds of tasks depends on calling the right API (or application programming interface), the link that connects LLMs to software tools and lets them interact with the outside world.

But calling an API from a conversational prompt is harder than it looks. An LLM must be able to identify the right API, slot in the correct information from the user query, and incorporate the answer that comes back into an engaging response.
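In code, that round trip looks something like the sketch below. This is a minimal illustration in Python, not IBM’s implementation: the tool registry, the get_balance stub, and the structured output format are all hypothetical.

```python
import json

# Hypothetical tool registry: name -> callable. Real systems typically
# describe each tool to the model with a schema so it knows what exists.
def get_balance(account_id: str) -> float:
    """Stand-in for a banking API."""
    return 1042.17

TOOLS = {"get_balance": get_balance}

# Pretend the LLM answered "How much is in my checking account?" with a
# structured tool call (this output format is illustrative).
model_output = '{"tool": "get_balance", "arguments": {"account_id": "chk-001"}}'

call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])

# The result goes back to the model, which weaves it into a reply.
print(f"Your checking account holds ${result:,.2f}.")
```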

Effective tool calling, which is also referred to as function calling, has quickly become an important measure of LLM competency. It’s also an essential skill on the path to deploying LLMs as full-fledged AI agents in an enterprise environment, capable of planning and acting on feedback from customers and the outside world.

Tool use allows LLMs to compensate for some of their shortcomings. On their own, LLMs have a hard time with basic math, current events, and knowing when to say they don’t know the answer to a question. In ambiguous situations, or when they don’t have the facts, LLMs have been known to “hallucinate” information.

But with the help of software tools like calculators, retriever models to search the web, and other LLMs to check their work, language models can accomplish far more sophisticated tasks with greater accuracy and transparency.

The growing importance of software tools has led to the creation of several benchmarks for measuring how well LLMs can call and execute APIs. IBM recently cracked the top 10 of the most challenging among them, the University of California, Berkeley’s Function-Calling Leaderboard.

Large, proprietary models dominated Berkeley’s leaderboard when it launched in March. But today, three of its top 15 models are covered by an Apache 2.0 license, the gold standard for open-source software. IBM’s Granite-20B function-calling LLM is currently the top open-source model, and in ninth place overall, followed by Berkeley’s own Gorilla model, and Meta’s Llama-3-70B model.

“Translating a user query into a set of API calls may seem straightforward,” said Kinjal Basu, an IBM researcher focused on teaching LLMs how to handle API calls. “But it involves executing a set of precise steps. If the model makes so much as one mistake, none of it will work.”

Mimicking real-world API scenarios

Like humans, LLMs learn by example, but when it comes to API calling, finding data in the open to learn from and emulate is a challenge. Most of the current data related to API tasks has been synthetically generated by LLMs.

Though cheap to produce, much of this existing synthetic data fails to capture the richness of real life. When LLMs are trained on data lacking real-world variety, they typically struggle to generalize from what they’ve learned to new and unpredictable situations.

To inject more complexity into the data and improve LLM function calling, IBM researchers created API-Blend, a dataset containing tens of thousands of question-and-answer pairs, known as instructions, for detecting and executing APIs from a user query.
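A single instruction in such a dataset pairs a natural-language query with the API calls it should elicit. The record below is a hypothetical illustration of the idea; the field names are not API-Blend’s actual schema.

```python
# Hypothetical instruction record in the style of an API-calling
# dataset (field names are illustrative, not API-Blend's schema).
record = {
    "query": "Book me a table for two in Boston tomorrow at 7pm",
    "api_calls": [
        {
            "name": "find_restaurants",
            "arguments": {"city": "Boston", "party_size": 2},
        },
        {
            "name": "make_reservation",
            # "$0.id" marks a value produced by the previous call, a
            # common convention for chaining one API's output to the next.
            "arguments": {"restaurant_id": "$0.id", "time": "19:00"},
        },
    ],
}
```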

By recycling existing datasets of humans chatting with digital assistants, researchers used generative AI to insert API-related scenarios into the dialogue. They focused on recreating complex queries that require an LLM to identify and execute a series of APIs in the correct order. For example, calculating the driving time between Las Vegas and Flagstaff, via the Grand Canyon, involves calling an API twice to map both legs of the trip, then passing those values to a calculator API to sum them up.
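That trip query decomposes into a short chain of calls whose outputs feed one another, as in this sketch. The driving_time and add functions are hypothetical stand-ins for a mapping API and a calculator API, and the durations are made up.

```python
# Hypothetical stand-ins for a mapping API and a calculator API.
def driving_time(origin: str, destination: str) -> float:
    """Return driving time in hours (hard-coded for illustration)."""
    times = {
        ("Las Vegas", "Grand Canyon"): 4.5,
        ("Grand Canyon", "Flagstaff"): 1.5,
    }
    return times[(origin, destination)]

def add(a: float, b: float) -> float:
    """Calculator API: the LLM delegates arithmetic it might get wrong."""
    return a + b

# The model must plan the calls in the right order and pass the
# intermediate results along -- one mistake and the chain breaks.
leg1 = driving_time("Las Vegas", "Grand Canyon")
leg2 = driving_time("Grand Canyon", "Flagstaff")
total = add(leg1, leg2)
print(f"Total driving time: {total} hours")  # Total driving time: 6.0 hours
```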

IBM’s Granite function-calling model also distinguished itself in internal testing across a collection of academic API-calling benchmarks, including API-Bank, ToolAlpaca, Toolbench Leaderboard, and Nexus Function Calling Leaderboard. Researchers evaluated the IBM model and several other top open-source LLMs on the benchmarks and found that IBM Granite performed as well as or better than its competitors, including much larger models like Meta’s Llama-3-70B.

The IBM model also generated significantly fewer phantom APIs. Hallucinations are a particular concern for tool-calling LLMs, which have the ability to interact with the external world and can cause real damage.

“It’s better not to call a function than make one up,” said Ibrahim Abdelaziz, an IBM researcher working on the project. “If the hallucinated function is executable, the user could end up receiving incorrect information or being open to a malicious attack.”
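A common safeguard, sketched below, is to check every proposed call against the set of tools the model was actually offered and decline anything unknown. The tool names here are illustrative.

```python
# Tools the model was actually given (stubs for illustration).
KNOWN_TOOLS = {
    "get_balance": lambda account_id: 1042.17,
}

def dispatch(name: str, arguments: dict) -> dict:
    """Execute a model-proposed call only if the tool really exists."""
    if name not in KNOWN_TOOLS:
        # Declining beats executing a phantom API, which could return
        # fabricated data or open the door to a malicious endpoint.
        return {"error": f"unknown tool: {name}"}
    return {"result": KNOWN_TOOLS[name](**arguments)}

print(dispatch("get_balance", {"account_id": "chk-001"}))
print(dispatch("transfer_funds", {"to": "attacker", "amount": 1e6}))
```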

What’s next

IBM researchers continue to create new and more varied API data to improve the model’s tool-calling skills as well as its ability to reason through a problem and reflect on its actions. “The world is moving toward LLM-based agents, and function calling is fundamental to agents being able to interact with their environment,” said Pavan Kapanipathi, a principal research scientist leading IBM’s function-calling team.