
Toucan: A new goldmine for tool-calling AI agents

The dataset of 1.5 million task scenarios, field-tested and open-sourced by IBM and the University of Washington, is designed to improve how agents interact with the world and get things done.

Of all the capabilities that define an AI agent, tool-calling is perhaps the most essential. Without the ability to find and call ‘tools,’ essentially external applications and services on the web, a large language model is little more than a plain old chatbot.

Teaching LLMs to properly call and execute tools, however, is far from easy. They need a variety of high-quality examples to learn from, and that kind of data is hard to create, let alone find, on the internet.

That just changed with the release of Toucan, the largest, most comprehensive collection yet of publicly available, end-to-end, tool-calling scenarios. “Toucan changes everything,” one enthusiast wrote on LinkedIn. “This isn't another simulated dataset. It captures actual API executions in real environments. Complete interaction chains from start to finish.”

Created by researchers at IBM and the University of Washington (UW), Toucan features 1.5 million real-life, tool-calling task sequences, called trajectories, that collectively invoke 2,000 different web services. The scenarios cover everything from analyzing sales reports and drafting a business summary to scheduling a meeting with colleagues and sending out calendar invites.

Toucan seized the community’s attention almost immediately after it dropped last week on Hugging Face. It’s been a top trending dataset there ever since.


In a new pre-print study, the team behind Toucan showed that small, open-source models fine-tuned on the dataset can outperform frontier models many times larger on two leading benchmarks for agentic tool-use, Berkeley Function Calling Leaderboard version 3 (BFCLv3) and MCP-Universe.

“Tool-calling is central to AI agents,” said Rameswar Panda, the IBM researcher who led the team behind Toucan. “How can you train better agents? Through diverse, high-quality examples sourced from the real world.”

Lifelike scenarios

In recent years, LLMs have been quietly evolving from stand-alone chatbots to semi-autonomous agents that can reason through problems and interact with the world through application programming interfaces, or APIs.

In this new agentic world, APIs are accessed through an interface standardized and later open-sourced by Anthropic, called the Model Context Protocol (MCP). MCP servers today are the portals through which AI agents connect to existing APIs to carry out actual work.
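
Under the hood, MCP is built on JSON-RPC messages. As a rough, hypothetical illustration (the tool name and arguments below are invented for this sketch, not drawn from Toucan), a tool invocation an agent sends to an MCP server looks something like this:

```python
import json

# Hypothetical example of a "tools/call" request an agent might send to an
# MCP server over its JSON-RPC transport. The tool name and arguments are
# invented for illustration; only the general request shape follows MCP.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_forecast",            # a tool the server advertises
        "arguments": {"city": "Seattle"},  # arguments filled in by the agent
    },
}

print(json.dumps(tool_call_request, indent=2))
```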

Toucan was designed specifically to teach AI agents how to call APIs by connecting to MCP servers, which are a bit like topic-based software libraries. With a small fleet of LLMs and MCP server metadata gathered from GitHub and Smithery.ai, IBM and UW researchers curated a dataset of unusual size, breadth, and difficulty.

Toucan’s 1.5 million tool-calling trajectories cover a wide range of complex tasks, represented here via the Embedding Atlas visualization tool. Plotted on this map are numerical embeddings of 50,000 randomly sampled Toucan trajectories.

They started by gathering MCP server metadata from GitHub and Smithery, filtering out servers whose tools returned error messages or had other problems. They then used the final set of 500 MCP servers and their tools as creative fodder to spin up plausible task scenarios.
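
A minimal sketch of what that filtering step could look like is below; the health-probe helper and server records are hypothetical stand-ins, not the authors' actual pipeline code.

```python
from typing import Any, Callable

def keep_server(server: dict, probe: Callable[[str, dict], dict]) -> bool:
    """Keep a server only if every tool it advertises responds without an error.

    `probe` is a hypothetical callable that exercises one tool and either
    returns a result dict or raises on failure.
    """
    for tool in server.get("tools", []):
        try:
            result = probe(tool["name"], tool.get("sample_args", {}))
        except Exception:
            return False          # tool crashed or the server was unreachable
        if result.get("isError"):
            return False          # tool ran but reported an error payload
    return True

# Toy usage with a fake probe that marks one tool as broken.
servers = [
    {"name": "weather-mcp", "tools": [{"name": "get_forecast"}]},
    {"name": "broken-mcp", "tools": [{"name": "lookup"}]},
]
fake_probe = lambda name, args: {"isError": name == "lookup"}
healthy = [s for s in servers if keep_server(s, fake_probe)]
print([s["name"] for s in healthy])  # -> ['weather-mcp']
```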

Five open-source LLMs were used to generate a range of task scenarios, and three additional models and their corresponding frameworks were used to construct step-by-step agent trajectories. The tasks varied, but the basic script followed a formula: an LLM agent creates a plan, calls and executes tools, and finishes the job with a friendly summary.
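
A loose sketch of that script follows; the `llm` and `execute_tool` callables are hypothetical placeholders for whichever models and MCP clients actually drive the generation.

```python
def generate_trajectory(task: str, tools: list, llm, execute_tool, max_steps: int = 8) -> list:
    """Sketch of the plan -> tool calls -> summary formula described above."""
    trajectory = [{"role": "user", "content": task}]

    # 1. The agent drafts a plan for the task.
    plan = llm(f"Plan how to solve this task with the tools {tools}:\n{task}")
    trajectory.append({"role": "assistant", "content": plan})

    # 2. It calls and executes tools until it decides the task is done (capped).
    for _ in range(max_steps):
        action = llm(f"Given the history {trajectory}, name the next tool call or say DONE.")
        if "DONE" in action:
            break
        result = execute_tool(action)  # real execution against an MCP server
        trajectory.append({"role": "tool", "content": result})

    # 3. It closes with a friendly summary for the user.
    summary = llm(f"Summarize the outcome for the user. History: {trajectory}")
    trajectory.append({"role": "assistant", "content": summary})
    return trajectory
```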

Two more LLMs were used to rate each task trajectory for difficulty and quality, allowing the researchers to select the best examples for Toucan. The dataset is currently nearly five times the size of the next largest open-source dataset, Nvidia’s Nemotron dataset, which has 310,000 trajectories. With more than 2,000 tools represented, it’s also likely the most diverse.
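
One way to picture that selection step is below; the rubric prompts and score thresholds are assumptions for illustration, not the paper's exact criteria.

```python
def keep_trajectory(trajectory: str, judge_quality, judge_difficulty,
                    min_quality: int = 4, min_difficulty: int = 2) -> bool:
    """Score a trajectory with two judge LLMs and keep only strong examples.

    `judge_quality` and `judge_difficulty` stand in for the two rater models;
    each is assumed to return an integer score from 1 to 5.
    """
    quality = judge_quality(
        "Rate 1-5 how correct and complete this tool-calling trajectory is:\n"
        + trajectory
    )
    difficulty = judge_difficulty(
        "Rate 1-5 how challenging the underlying task is:\n" + trajectory
    )
    return quality >= min_quality and difficulty >= min_difficulty
```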

“LLMs trained on Toucan essentially learn how to choose the right tools for the task, create engaging dialogue to keep humans in the loop, and recognize when a task can’t be solved with the available toolset,” said Adriana Meza, an IBM Research engineer who co-led the dataset’s creation.

A fifth of Toucan’s scenarios require models to call multiple tools at once, a feature researchers incorporated to teach AI agents how to run more economically.

“You can imagine how parallel calling improves efficiency, which can lower the cost of running agentic systems,” said Zhangchen Xu, a graduate student at the University of Washington who helped build the dataset as an IBM intern last summer.
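
The intuition, in a small sketch: when two tool calls don't depend on each other, issuing them concurrently means the agent waits only for the slowest call rather than the sum of both. The async `call_tool` below is a hypothetical stand-in for a real MCP client call.

```python
import asyncio

async def call_tool(name: str, args: dict) -> dict:
    """Hypothetical stand-in for an async tool call to an MCP server."""
    await asyncio.sleep(1)  # pretend network + tool latency of ~1 second
    return {"tool": name, "args": args, "result": "ok"}

async def run_parallel() -> list:
    # Both independent calls run concurrently, so the pair finishes in
    # roughly 1 second instead of roughly 2.
    return await asyncio.gather(
        call_tool("search_flights", {"route": "SEA-JFK"}),
        call_tool("check_calendar", {"week": "next"}),
    )

print(asyncio.run(run_parallel()))
```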

Models fine-tuned on Toucan data showed impressive performance gains. Open-source Qwen-2.5 models (7B, 14B, and 32B) improved by up to seven percentage points on τ-Bench and τ²-Bench, which evaluate tool-calling in common retail, airline, and telecommunications environments.


On the BFCLv3 benchmark, a Toucan-tuned Qwen-2.5-32B model improved by nearly nine percentage points and narrowly outperformed OpenAI’s GPT-4.5-Preview, which some estimate has at least a trillion parameters.

Toucan-tuned models also generally outperformed models of similar size on MCP-Universe, Salesforce’s benchmark for real tool-calling tasks involving financial analysis, 3D design, and web search, among other use cases.

In the coming months, the team plans to onboard new MCP servers with a wider range of tools that have come online since June, when they collected their seed data. They are also working to create a reinforcement learning gym and benchmark to give LLMs more experience with enterprise workflows.

“We’re repurposing part of the Toucan code for these new projects and of course building on all the tool-calling knowledge we acquired in putting the dataset together,” said Meza.
