AI agents have the potential to revolutionize work — and now you can measure if they actually are
A new set of benchmarks from IBM Research, called ITBench, aims to bring an objective, scientific approach to determining whether IT automation agents are actually making work easier for enterprises.
The generative AI explosion we’ve seen in recent years has been astounding to witness. We have systems that can write us poems, help us with coding problems, and chat with us on any given topic. But when it comes to bringing these revolutionary systems to enterprises, adoption has been relatively limited.
Right now, it’s difficult to meaningfully compare how well these AI systems reliably solve business problems, because objective tests of that efficacy don’t really exist. “You need to build trust in the systems,” Nick Fuller, IBM Research’s VP of AI and automation, said. “It’s even harder when you don’t have yardsticks to measure against.”
Take just the field of IT automation: the landscape is considerably more complex than it was even a few years ago. There’s a shortage of personnel available at any given company to tackle incident management, the unending flow of compliance tasks, and all the other IT operations work employees carry out on any given day.
“The IT landscape is increasingly complex, and generative AI is making things tougher,” Daby Sow, the director of AI for IT automation on Fuller's team, said. And we’ve seen the sort of global damage that can happen when IT mistakes aren’t caught. “When you make mistakes, you pay the price,” Sow added.
These issues inspired IBM Research to open source a set of benchmarks to help measure automated solutions for these exact sorts of issues and make it easier for AI builders to bring generative AI to enterprise tasks. These new benchmarks, collectively known as ITBench, will offer AI practitioners a scientific way to measure how effective the agents they’re building are at solving real problems and how their agents compare to others on tasks that businesses carry out every day.
There are countless tools springing up from vendors that promise to bring generative solutions to IT problems like incident management, but how is a company supposed to objectively compare one set of tools to another? Or if you’re a new CIO and you want to evaluate the tech estate you inherited, where can you turn for an impartial appraisal? In other AI domains, benchmarks have sprung up for all sorts of tasks, measuring things like how well AI agents and systems can debug code, how well they can converse, and even how well they can plan you a nice vacation. But nothing like this has existed for solving complicated IT issues. That led Sow and his team to start working on ITBench.
From the start, there will be three benchmarks, focused on site reliability engineering (SRE), FinOps cost management, and compliance assessment. At the highest level, the benchmarks are meant to be an open framework where users can see whether their agents can solve problems efficiently. For SREs, can an agent recognize that there is an alert in the system, rapidly figure out its provenance, and provide a fix? For compliance officers, when a new rule is introduced in your country, can the agent assess a system, understand the regulation, and determine whether the system is compliant? And on the cost side, can an agent help a manager figure out, for example, how to launch a new product while staying within budget constraints?
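Mechanically, each benchmark boils down to handing an agent a scenario, checking whether it reached the right outcome, and timing how long it took. The Python sketch below illustrates that shape only; the Scenario structure, field names, and scoring format are invented for illustration and are not ITBench’s actual interfaces.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    """One benchmark task, e.g. an SRE incident or a compliance control to verify."""
    name: str
    inputs: dict
    is_resolved: Callable[[dict], bool]  # ground-truth check on the agent's output

def run_benchmark(agent: Callable[[dict], dict], scenarios: list[Scenario]) -> dict:
    """Score an agent on each scenario: did it solve the task, and how quickly?"""
    results = {}
    for scenario in scenarios:
        start = time.perf_counter()
        outcome = agent(scenario.inputs)
        elapsed = time.perf_counter() - start
        results[scenario.name] = {
            "solved": scenario.is_resolved(outcome),
            "seconds": round(elapsed, 2),
        }
    return results

# Example with a trivial "agent" that always proposes restarting the failing service.
toy_scenarios = [
    Scenario(
        name="crashlooping-checkout-service",
        inputs={"alert": "CrashLoopBackOff", "service": "checkout"},
        is_resolved=lambda outcome: outcome.get("action") == "restart",
    )
]
print(run_benchmark(lambda inputs: {"action": "restart"}, toy_scenarios))
```

A real benchmark environment is far richer than a toy dictionary, of course, but the scoring question stays the same: did the agent solve the task, and how fast?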
In each of these cases, the agents are carrying out several processes to arrive at a result. Take compliance: regulations are written in natural language, so the agent needs to understand a document’s intent, translate that intent into actionable code in the language a given piece of software uses, find the relevant part of that software’s configuration or code, and then check whether it actually satisfies the regulation. The benchmark then rates the agent on whether it successfully handled the issue, and how quickly it did so.
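A minimal sketch of that chain of steps, assuming a hypothetical TLS-version rule and a toy configuration format, might look like the following; none of the function names come from ITBench, and a real agent would use a language model rather than hard-coded logic for the interpretation step.

```python
def parse_regulation(text: str) -> dict:
    """Extract the intent of a natural-language rule.
    (Hypothetical: we hard-code a minimum-TLS-version requirement.)"""
    return {"control": "minimum TLS version", "required": (1, 2)}

def generate_policy_check(intent: dict):
    """Turn the extracted intent into executable policy-as-code."""
    def check(system_config: dict) -> bool:
        version = tuple(int(p) for p in system_config.get("tls_version", "1.0").split("."))
        return version >= intent["required"]
    return check

def assess_compliance(regulation_text: str, system_config: dict) -> dict:
    """End to end: natural-language regulation in, pass/fail verdict out."""
    intent = parse_regulation(regulation_text)
    check = generate_policy_check(intent)
    return {"control": intent["control"], "compliant": check(system_config)}

# A system still running TLS 1.0 fails the check.
print(assess_compliance("All services must use TLS 1.2 or higher.",
                        {"tls_version": "1.0"}))
```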
These are not simple tasks, and it’s easy to see how mistakes can arise when seemingly small issues are overlooked. The benchmarks were built using real-world examples of major incidents, compliance controls, and FinOps scenarios, including one in which a single bug cost a company 20% of its data. These issues have become increasingly difficult to spot as more tools, with more features, are added atop hybrid IT infrastructures that AI engineers traditionally have little experience with.
The goal of these benchmarks, according to Sow, is to open up agentic workflows to a greater number of developers and end users. AI practitioners no longer need to know the details of an Instana workflow to test whether their IT agent handles the tasks it’s expected to. Long term, Sow and Fuller envision a future where agents can even be proactive, rather than reactive, rooting out potential issues in code, legal documents, or any other digital business process — before they reach the stage of international calamity.
IBM Research is already working on its own agents that Sow and team hope will top these benchmarks’ leaderboards. The team is also working on additional benchmarks that could start to test the efficacy of other areas of IT automation.
You can test your IT agents on ITBench on GitHub today.