Researchers have developed a tool, inspired by developmental studies in infants, for benchmarking an AI model’s intuitive psychology.
Before we can build machines that make decisions based on common sense, the AI powering those machines must be capable of more than simply finding patterns in data. It must also consider the intentions, beliefs, and desires of others that people use to intuitively make decisions.
At the 2021 International Conference on Machine Learning (ICML), we are releasing a new dataset for benchmarking AI intuition, along with two machine learning models representing different approaches to the problem. We conducted this research with colleagues at MIT and Harvard University to accelerate the development of AI that exhibits common sense. These tools rely on testing techniques that psychologists use to study the behavior of infants.
Our work — the latest in IBM's larger efforts to advance more fluid, real-world AI systems — presents exciting opportunities for new research that uses intuitive psychology to improve machine commonsense, and possibly vice versa.
Our work may seem to be advancing, quite literally, in baby steps. Yet even before they are 18 months old, infants can recognize the impact of costs and rewards on decision-making and infer, even with incomplete information, when a situation places constraints on an agent's choices. At that age, children also typically begin to predict others' future actions.
Such capabilities would mark a considerable improvement over today's AI, which must gain some understanding of how and why humans make decisions if it is to engage successfully in social interactions.
Intuitive psychology — the ability to reason about hidden mental variables that drive observable actions — comes naturally even to pre-verbal infants. Machine learning algorithms have no such powers of perception. They require vast amounts of data to train AI models that can recognize objects in a photo, without really understanding what they are seeing.
The benchmark we unveiled at ICML is called AGENT (Action, Goal, Efficiency, coNstraint, uTility) and consists of 8,400 3D animations. These videos are organized into four categories — goal preferences, action efficiency, unobserved constraints, and cost or reward tradeoffs. They are designed to probe a machine learning model’s understanding of key concepts in intuitive psychology much the same way researchers evaluate an infant’s ability to intuit what others think.
Like experiments in many infant studies, each trial has two phases. In the familiarization phase, an AI model is fed videos demonstrating a particular agent’s behavior in certain physical environments. In the test phase, the model is shown a video of the behavior of the same agent in a new environment. The behavior is either “expected” or “surprising,” based on the agent’s behavior in the familiarization phase. The AI model must judge how surprising the agent’s behaviors in the test videos are, based on what the model has learned or inferred about the agent’s actions, goals, utilities and physical constraints from watching the familiarization videos.
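Because each "surprising" test video has an "expected" counterpart, one natural way to score a model's surprise judgments is relative: check whether it rates the surprising video as more surprising than the matched expected one. The sketch below illustrates that pairwise scoring; the function name and data layout are illustrative assumptions, not the benchmark's actual API.

```python
def pairwise_accuracy(ratings):
    """ratings: list of (expected_rating, surprising_rating) pairs, one per
    matched test pair. A model gets a pair right when it rates the
    surprising video as more surprising than the expected one."""
    correct = sum(1 for expected, surprising in ratings if surprising > expected)
    return correct / len(ratings)

# Three hypothetical matched pairs of surprise ratings: the model orders
# the first two correctly but rates the third pair backwards.
score = pairwise_accuracy([(0.1, 0.9), (0.2, 0.8), (0.6, 0.4)])  # 2 of 3 pairs
```

Scoring relatively, rather than against an absolute surprise threshold, means models with very different rating scales can still be compared fairly.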
Alongside the benchmark, we introduced two machine learning models at ICML that serve as strong baselines, each representing a different approach to the problem: one built on Bayesian inverse planning and core knowledge, the other a Theory of Mind neural network.
Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
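To illustrate the utility-based reasoning described above, the sketch below shows a heavily simplified form of Bayesian inverse planning: it infers a posterior over an agent's goals by assuming the agent plans approximately rationally, trading each goal's reward against the cost of reaching it. The softmax noise model and all names here are simplifying assumptions for illustration, not the actual ICML baseline, which also incorporates physics simulation and core knowledge of objects.

```python
import math

def goal_posterior(path_costs, rewards, beta=2.0):
    """Infer P(goal | observed behavior) for a softmax-rational agent.

    Each goal's utility is its reward minus the movement cost of reaching
    it; the agent is assumed to pursue higher-utility goals with
    probability growing in utility (rationality parameter beta).
    """
    utilities = {g: rewards[g] - path_costs[g] for g in rewards}
    weights = {g: math.exp(beta * u) for g, u in utilities.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical scene: object A sits behind a costly detour but carries a
# large reward, so A still dominates the inferred goal posterior.
posterior = goal_posterior(path_costs={"A": 5.0, "B": 1.0},
                           rewards={"A": 10.0, "B": 3.0})
```

Under a model like this, a "surprising" test video is simply one whose observed behavior has low probability given the goals and utilities inferred during familiarization.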
It’s become increasingly clear that the most effective way to develop AI that reasons the way people do is to approach the problem from multiple angles. In that way, our current work is an extension of our ongoing research into neurosymbolic AI, which combines the logic of symbolic AI algorithms with the deep learning capabilities found in neural networks.
The work we presented at ICML is the latest in our ongoing, multi-year project with the U.S. Department of Defense's Defense Advanced Research Projects Agency (DARPA). Launched in 2019, the Machine Common Sense (MCS) project aims to develop machine models using traditional developmental psychology methods to see whether AI can "learn" and reason the way human infants do.
Today’s AI models can achieve a superhuman level of capability on specific tasks. If you want AI to perform a new task at that same level, however, you have to start the learning process from scratch. One of our goals is to encourage AI researchers to make more versatile models capable of explaining their decisions, as well as how objects and ideas relate to one another.
In the near-term, we want to help create models that understand psychology, and marry that approach to models that understand physics the way people do. Looking further ahead, we want to test AI’s ability to make common sense decisions in social situations, where multiple agents are involved and can help or hinder the original agent.
Eventually, DARPA will want to evaluate our work in scenarios where an agent must rely on tools to accomplish its objective, such as using a key to open a door or a ramp to climb a wall.
It will take years to cultivate AI that can accomplish all these tasks, of course. What's important for now is that we've demonstrated that AGENT is a well-structured diagnostic tool for developing better models of intuitive psychology. Our hope is that, with such tools, researchers can help AI mature from infanthood to the toddler stage and beyond.