07 Oct 2021

Deep Dive

15 minute read

Evaluating common sense in AI

We created AGENT, a benchmark for evaluating an AI model’s core psychological reasoning ability, or common sense, that will enable us to build and test AI models that reason and learn about other minds the same way humans do. This post is part of our paper explainer series.

To interact seamlessly with humans in the real world, AI agents must understand us and infer our mental states from observable actions. Understanding like this comes easily to humans. We can differentiate agents from objects, and expect agents to follow physical constraints and to act efficiently to achieve goals within those constraints. We can recognize costs and rewards, as well as infer hidden constraints when we can partially observe actions, and predict what we’ll need to do next to achieve a goal. This core psychological reasoning develops early in humans. Although infants only have limited experience, they can learn to generalize to novel agents and situations. This understanding forms the basis of what we call common sense.

We’re making progress toward building AI agents that can infer mental states, predict future actions, and even work with human partners. However, we lack a rigorous benchmark for evaluating an AI model’s core psychological reasoning ability — its common sense.

We created and validated such a benchmark, called AGENT (Action, Goal, Efficiency, coNstraint, uTility). We used this benchmark to challenge two baseline models and evaluated their performance using a generalization-focused protocol that we developed. The results demonstrate that the benchmark is useful for evaluating the core psychological reasoning ability of an AI model, giving us a sense of its social awareness and potential to interact with humans in real-world settings. Our work, which we published in a recent paper, was presented at ICML 2021.

The AGENT benchmark

AGENT (Action, Goal, Efficiency, coNstraint, uTility) is a benchmark for common-sense core psychology reasoning in AI models. It was inspired by experiments that probe cognitive development in young children. AGENT is a large-scale dataset of 3D animations of an agent moving under various physical constraints and interacting with various objects. The videos comprise distinct trials, each of which includes one or more ‘familiarization’ videos of an agent’s typical behavior in a certain physical environment, paired with ‘test’ videos of the same agent’s behavior in a new environment, which are labeled as either ‘expected’ or ‘surprising,’ given the behavior of the agent in the corresponding familiarization videos.

In an expected test video, the agent’s behavior is consistent with the familiarization videos: it pursues the same goal, acts efficiently, and maximizes rewards. In a surprising test video, the agent pursues a goal inconsistent with that in the familiarization videos, achieves its goal inefficiently, or violates physics. A model’s task is to judge how surprising the agent’s behavior in each test video is, given its actions in the familiarization videos and the physical constraints present. We call this judgment the surprise rating. We validated AGENT with large-scale human-rating trials, where, on average, adult human observers rated the surprising test videos as more surprising than the expected test videos.

The trials assess a minimal set of key common-sense concepts considered to be part of the core psychology in young children. The trials are grouped into four scenarios: goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs. Each scenario has several variants or types, with basic versions that replicate stimuli used in infant studies, and additional setups that are more diverse and more challenging. The key reasoning concepts tested and a sample trial in each of the four scenarios are detailed below:

Scenario 1: Goal preferences

Key concepts: An agent pursues a particular object based on its preferences, and pursues its preferred object despite varying physical conditions. Pursuing the same object could lead to different actions in new physical situations. Sample trial: The familiarization video shows an agent and two objects: a blue cone on one side and a yellow sphere on the other. The agent moves toward the cone. The test videos show the same agent and two objects, but the objects have switched positions. In the expected video, the agent moves toward the cone. In the surprising video, the agent moves toward the sphere. This is surprising because the agent showed a preference for the cone.

Scenario 2: Action efficiency

Key concepts: An agent is physically constrained by the environment and takes the most efficient action to reach its goal given the constraints. An agent may not follow the same path for the same goal if the physical environment is not the same. Sample trial: The familiarization video shows an agent separated from an object by a solid wall. The agent moves toward the object and jumps over the wall to reach it. The test videos show the agent, object, and wall in the same positions, but the wall now has a doorway. In the expected video, the agent moves through the doorway to reach the object. In the surprising video, the agent jumps over the wall to reach the object. This is surprising because jumping is not the most efficient way to reach the goal.

Scenario 3: Unobserved constraints

Key concepts: Given that an agent takes the most efficient action to reach its goal, costly actions must be caused by unobserved physical constraints. Sample trial: The familiarization video shows an agent and an object, with an occluder hiding the space between them. The agent moves toward the object and jumps high behind the occluder before reaching the object. The test videos show exactly the same sequence, and then the occluder is removed. In the expected video, a solid wall is revealed behind the occluder. In the surprising video, a wall with a doorway is revealed behind the occluder. This is surprising because jumping is not the most efficient way to reach the goal if a doorway is available.

Scenario 4: Cost-reward trade-offs

Key concepts: An agent acts based on utility, trading off the rewards of its goal against the costs of reaching it. An agent pursues a preferred object if reaching it costs the same as or less than reaching a less preferred object. Sample trial: There are four familiarization videos in this trial. The first two show an agent jumping over a gap to reach a yellow bowl but not pursuing the same bowl across a larger gap. Two more familiarization videos show the same agent jumping over a small gap to reach a blue diamond but not pursuing the same diamond across a larger gap. The test videos show the agent with both objects; the yellow bowl is on the ground and the blue diamond is atop a column with a ramp. In the expected video, the agent moves toward the bowl. In the surprising video, the agent moves toward the diamond and climbs the ramp to reach it. This is surprising because the agent showed a preference for the bowl and the bowl is easier to reach.
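The cost-reward trade-off can be illustrated with a minimal sketch. It assumes utility is simply reward minus cost, which is one common formulation and not necessarily the one used in the paper. The function name `choose_goal` and all reward and cost values are hypothetical, chosen only to mirror the sample trial.

```python
from typing import Dict


def choose_goal(rewards: Dict[str, float], costs: Dict[str, float]) -> str:
    """Return the goal with the highest utility, assuming utility = reward - cost."""
    return max(rewards, key=lambda goal: rewards[goal] - costs[goal])


# Hypothetical values: the bowl is preferred (higher reward) and sits on the
# ground (low cost); the diamond is atop a column with a ramp (higher cost).
rewards = {"bowl": 5.0, "diamond": 3.0}
costs = {"bowl": 1.0, "diamond": 4.0}

expected_goal = choose_goal(rewards, costs)
print(expected_goal)  # bowl
```

Under these assumed numbers, an agent heading for the diamond would be acting against its own utility, which is what makes that test video surprising.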

Figure: Schematic of the four key scenarios of core intuitive psychology evaluated in AGENT. Solid arrows show the typical behavior of the agent in the familiarization video(s) or in the expected test video. Dashed arrows show agent behavior in the surprising test video.

Setting a baseline for AGENT

BIPaCK

Bayesian Inverse Planning and Core Knowledge (BIPaCK) is a generative model that combines a computational framework for understanding action using Bayesian inference with core knowledge of physics powered by simulation. From a scene, we extract the entities (the agent, objects, and obstacles) and their rough state information (3D bounding boxes and color codes), based either on the ground truth provided in AGENT or on results from a perception model. We then recreate an approximate physical scene in a physics engine that is different from the environment used in the videos.

Obstacles are represented by cubes, while objects and the agent are recreated as spheres. As the model has no access to the ground-truth parameters of the physical simulation in the videos, and no prior knowledge of the mental state of the agent, it must propose a hypothesis of the physics parameters (including coordinate transformation, global forces such as gravity and friction, and densities of entities) and of the agent parameters (the rewards of objects and the cost function of the agent). Given these inferred parameters, the planner component samples a trajectory to jointly maximize the reward and minimize the cost. We then define the surprise rating of a test video by computing the distance between the trajectory predicted by BIPaCK and that observed in the test video.
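The final step, turning a predicted trajectory into a surprise rating, can be sketched as follows. The post does not specify which distance is used, so this example assumes the mean Euclidean distance between time-aligned 3D points. The function name `trajectory_distance` and the sample trajectories are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def trajectory_distance(predicted: List[Point], observed: List[Point]) -> float:
    """Mean Euclidean distance between time-aligned points (an assumed metric)."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, o) for p, o in zip(predicted, observed))
    return total / len(predicted)


# Hypothetical trajectories: the planner predicts a straight path through a doorway.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
through_door = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
over_wall = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 0.0, 0.0)]

surprise_expected = trajectory_distance(predicted, through_door)
surprise_surprising = trajectory_distance(predicted, over_wall)
print(surprise_expected < surprise_surprising)  # True
```

A larger distance means the observed behavior departs further from what the planner predicted, so the video receives a higher surprise rating.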

ToMnet-G

ToMnet-G (Theory of Mind neural network extended with a graph neural network) encodes the familiarization videos to obtain a character embedding for a particular agent. We use a graph neural network to encode states, where we represent all entities (including obstacles) as nodes. The input of a node includes its entity class (agent, object, obstacle), bounding box, and color code. We pass the embedding of the agent node to the downstream modules to obtain the character embedding and the mental state embedding, which are combined with the embedding of the initial state to predict the expected trajectory of the agent. The surprise rating of a given test video is defined as the deviation between the trajectory predicted by ToMnet-G and that observed in the test video.
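As a rough illustration of the node inputs described above, the sketch below builds one feature vector per entity. The exact encoding is not given in the post, so the layout here is an assumption: a one-hot entity class, a six-value bounding box (center and size), and a three-value color code. The function name `node_features` is hypothetical.

```python
from typing import List, Sequence

ENTITY_CLASSES = ("agent", "object", "obstacle")


def node_features(
    entity_class: str, bbox: Sequence[float], color: Sequence[float]
) -> List[float]:
    """Concatenate a one-hot class, bounding box, and color code into one vector."""
    one_hot = [1.0 if entity_class == c else 0.0 for c in ENTITY_CLASSES]
    if sum(one_hot) != 1.0:
        raise ValueError(f"unknown entity class: {entity_class}")
    return one_hot + [float(v) for v in bbox] + [float(v) for v in color]


# Hypothetical agent node: unit box at the origin with an assumed RGB color code.
features = node_features("agent", [0, 0, 0, 1, 1, 1], [0.2, 0.4, 0.6])
print(len(features))  # 12
```

In a graph neural network, one such vector per entity would serve as the initial node state before message passing.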

Model accuracy

We measure accuracy based on relative surprise ratings. We obtain two sets of surprise ratings for the pair of surprising and expected test videos that share the same familiarization videos. Model accuracy is defined as the percentage of correctly ordered pairs of ratings.
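A minimal sketch of this metric follows. Each pair holds the surprise ratings a model assigns to the expected and surprising test videos of one trial. Treating ties as incorrect is an assumption, and the function name `pairwise_accuracy` and the sample ratings are hypothetical.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Percentage of (expected, surprising) rating pairs that are correctly ordered."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one rating pair is required")
    # A pair is correct when the surprising video is rated strictly higher.
    correct = sum(1 for expected, surprising in pairs if surprising > expected)
    return 100.0 * correct / len(pairs)


# Hypothetical ratings: (rating for expected video, rating for surprising video).
ratings = [(0.1, 0.9), (0.4, 0.3), (0.2, 0.7), (0.5, 0.5)]
accuracy = pairwise_accuracy(ratings)
print(accuracy)  # 50.0
```

Because only the ordering within a pair matters, models whose ratings sit on different scales can still be compared directly.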

AGENT in action

All trials seen

First, we trained and tested the models on all types of trials within all four scenarios when given ground-truth state information. BIPaCK performed well on all types of trials, as well as or better than humans did. ToMnet-G also had a high overall accuracy, but performed less evenly across types within a scenario than BIPaCK.

Generalization to unseen trials

We expect models to perform well not only when presented with trials similar to those from training but also when they must generalize to different physical configurations within a scenario or to other scenarios altogether. To evaluate their generalization ability, we conducted four tests in which the models were trained and tested on different sets of trials:

Generalization test	Training trials	Test trials
Leave one type out	All but one type in a scenario	The held-out type in the same scenario
Leave one scenario out	All but one scenario	The held-out scenario
Single type	A single type in a scenario	The remaining types in the same scenario
Single scenario	A single scenario	The remaining three scenarios
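Two of the splits in the table can be sketched as simple filters over trial records. This assumes each trial carries a scenario label and a type label. The record layout, the function names, and the labels used below are hypothetical and do not reflect the dataset's actual format.

```python
from typing import Dict, List, Tuple

Trial = Dict[str, str]  # hypothetical record with "scenario" and "type" keys


def leave_one_type_out(
    trials: List[Trial], scenario: str, held_out_type: str
) -> Tuple[List[Trial], List[Trial]]:
    """Train on all but one type in a scenario; test on the held-out type."""
    in_scenario = [t for t in trials if t["scenario"] == scenario]
    train = [t for t in in_scenario if t["type"] != held_out_type]
    test = [t for t in in_scenario if t["type"] == held_out_type]
    return train, test


def leave_one_scenario_out(
    trials: List[Trial], held_out_scenario: str
) -> Tuple[List[Trial], List[Trial]]:
    """Train on all but one scenario; test on the held-out scenario."""
    train = [t for t in trials if t["scenario"] != held_out_scenario]
    test = [t for t in trials if t["scenario"] == held_out_scenario]
    return train, test


trials = [
    {"scenario": "goal_preferences", "type": "a"},
    {"scenario": "goal_preferences", "type": "b"},
    {"scenario": "action_efficiency", "type": "a"},
]
type_train, type_test = leave_one_type_out(trials, "goal_preferences", "b")
scen_train, scen_test = leave_one_scenario_out(trials, "action_efficiency")
print(len(type_train), len(type_test), len(scen_train), len(scen_test))  # 1 1 2 1
```

The single-type and single-scenario tests reverse these roles: the model trains on the one selected group and is tested on the rest.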

In general, we saw little change in BIPaCK’s performance across the generalization conditions, whereas ToMnet-G performed well under only a few of them. ToMnet-G faced two main challenges: predicting trajectories in unfamiliar physical situations, and reliably computing costs and rewards that are grounded in objects and physics. These and other findings about the performance of ToMnet-based models suggest that these methods have a limited capacity for inferring agents’ mental states from a small number of familiarization videos, and for generalizing knowledge of agents to novel situations.

Performance summary

ToMnet-G achieves reasonably high accuracy when trained and tested on trials with similar configurations or within the same scenario, but struggles when generalizing to different physical situations or scenarios. In contrast, BIPaCK — with its built-in representations of planning, objects, and physics — performs strongly both within and when generalizing across scenarios.

Conclusions

Building common sense into AI models

Our results suggest that to demonstrate core psychological reasoning ability, an AI model must acquire or have built-in representations of how agents plan, combining cost-reward computations and core knowledge of objects and physics.

The AGENT benchmark identifies exciting opportunities for improvement of the two models we created. For instance, while BIPaCK outperforms ToMnet-G in almost all conditions, it requires an accurate reconstruction of the 3D state and a built-in model of the physical dynamics, which will not necessarily be available in real-world scenarios. It is an open question whether the generalizable inverse graphics and physics simulators on which BIPaCK rests can be learned. On the other hand, without many built-in priors, ToMnet-G demonstrates promising results when trained and tested on similar scenarios. It lacks, however, a strong generalization capacity both within and across scenarios. Generalization could potentially be improved with more advanced architectures or pre-training on a wider variety of physical scenes so that a more general-purpose simulator can be learned.

AGENT’s role in advancing AI

AGENT revealed these open areas for improvement, which suggests that it is a well-structured diagnostic tool for developing and evaluating common sense in AI models. It also validates the potential use of traditional developmental psychology methods, similar to those used to study human infants, to create AI models. Building on this work, it could be possible to create AI models that can learn and reason, explain their decisions and how objects and ideas relate to one another, and even understand psychology and physics the way humans do. It could be possible for AI to successfully engage in social interactions, make common-sense decisions in social situations involving multiple agents (human or otherwise), and use tools to accomplish an objective, such as using a key to open a door or a ramp to climb a wall.

Although it will take years to cultivate such fluid, real-world AI systems, tools like AGENT will help us get there.
