Google and MIT: More AI Agents Often Lead to Worse, Not Better, Results

Google and MIT researchers show adding AI agents often harms performance. First quantitative scaling rules for agent systems emerge.

On December 9, 2025, a research team from Google Research, Google DeepMind, and MIT announced the findings of a large experiment on AI agent systems. They tested how increasing the number of collaborating language models actually performs across various tasks and configurations. It turned out that in many cases adding more agents not only fails to improve results but even worsens them: depending on the task, performance swung anywhere from an 81% increase to a 70% drop[1].

How We Know This: 180 Experiments, Three Model Families

The authors titled their work “Toward a Science of Scaling Agent Systems” and treated it as a series of laboratory experiments rather than a marketing showcase. They conducted 180 controlled tests comparing five architecture types for agent organization across three main families of language models: GPT from OpenAI, Gemini from Google, and Claude from Anthropic[4]. The goal was not to broadly claim that “multi-agent is better” but to identify the specific conditions under which coordination among many agents actually helps and those where it becomes a burden.

The 45% Threshold: When Additional Agents Start to Interfere

The strongest conclusion concerns a clear efficiency threshold for a single agent. The researchers observed that once a single agent achieves about 45% accuracy on a task, adding more agents stops being beneficial; the gains from division of labor and brainstorming are outweighed by the costs of coordination, context exchange, and answer merging. This is not just intuition but the result of statistical analysis: the beta coefficient at this threshold was -0.408.
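The threshold finding can be read as a simple routing rule. The sketch below is purely illustrative: the function name, inputs, and structure are assumptions for this article, not the authors' actual framework; only the 45% figure and the task-type pattern come from the reported results.

```python
# Illustrative sketch, not the paper's framework: a hypothetical heuristic
# for choosing a coordination strategy from two measurable task features.

SINGLE_AGENT_THRESHOLD = 0.45  # reported tipping point for single-agent accuracy

def choose_strategy(single_agent_accuracy: float, decomposable: bool) -> str:
    """Pick a coordination strategy given baseline accuracy and task structure."""
    if single_agent_accuracy >= SINGLE_AGENT_THRESHOLD:
        # Past the threshold, coordination overhead tends to outweigh gains.
        return "single-agent"
    if decomposable:
        # Tasks that split into near-independent parts benefited from
        # centralized multi-agent coordination (e.g. financial analysis).
        return "centralized-multi-agent"
    # Sequentially dependent tasks (e.g. Minecraft planning) degraded
    # under every multi-agent configuration tested.
    return "single-agent"

print(choose_strategy(0.30, True))   # a weak baseline on a decomposable task
print(choose_strategy(0.60, True))   # a strong baseline, regardless of structure
```

The point of the sketch is the shape of the decision, not the exact numbers: baseline single-agent accuracy gates the choice before task structure is even considered.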

This is particularly evident when comparing task types. In financial analysis, where work can be broken down into nearly independent parts, the team observed significant benefits from a multi-agent approach: a centralized system in which several agents concurrently handled different aspects of the data boosted efficiency by 80.9%. Individual agents separately analyzed sales trends, cost structures, and market data before merging results, reminiscent of a well-organized analyst team in which each member has a niche but overall supervision exists.

Finance vs. Minecraft: When Many Agents Help and When They Harm

A completely different picture emerged from Minecraft planning tasks. Here every step changes the world state: after building something, consuming a resource, or changing inventory, subsequent decisions must account for the new state. In such environments, every tested multi-agent configuration reduced effectiveness by between 39% and 70%, regardless of how the researchers arranged the collaboration. This is not a minor variation but a performance collapse: exactly where one might expect “more heads” to do better, the system falls apart.

The key turned out to be sequential dependencies, often overlooked in optimistic multi-agent narratives. When every step affects the state required for the next steps, as in item crafting, where inventory changes influence subsequent actions, dividing the work among many agents fragments the context. Each agent sees a different piece of the history, and the whole starts to resemble several people passing a notebook around without fully understanding what the previous person did and why.

Errors Grow Exponentially, and Tokens Go to Waste

In loosely controlled, independent multi-agent setups, another unpleasant effect arose: errors grew much faster than in simple single-agent systems. The Google and MIT team measured that such systems accumulated errors 17.2 times faster than single agents, meaning inaccuracies and wrong assumptions spread very rapidly across the entire arrangement[6]. When centralized coordination with validation checkpoints was introduced, however, the error growth rate dropped to “only” 4.4 times that of single agents.
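A toy calculation shows why these multipliers matter so much over a multi-step task. This is not the authors' measurement methodology; the 1% baseline per-step error rate and the 20-step task length are assumed values, and treating the reported factors as simple multipliers on that rate is a simplification for illustration.

```python
# Toy illustration, not the study's model: compound an assumed baseline
# per-step error rate under the reported error-amplification factors.

def p_any_error(per_step_error: float, steps: int) -> float:
    """Probability that at least one step fails over `steps` independent steps."""
    return 1 - (1 - per_step_error) ** steps

baseline = 0.01  # assumed 1% per-step error rate for a single agent
for label, factor in [("single agent", 1.0),
                      ("centralized multi-agent", 4.4),
                      ("independent multi-agent", 17.2)]:
    p = p_any_error(baseline * factor, steps=20)
    print(f"{label}: {p:.1%} chance of at least one error in 20 steps")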

The costs do not end there. The researchers also analyzed token efficiency, the effective “computational fuel” of language models, which translates directly into cloud expenses. A single agent performed clearly better, completing on average 67 successful tasks per 1,000 tokens. In centralized multi-agent systems that number fell to 21 tasks per 1,000 tokens, less than one-third of single-agent efficiency despite all the additional complexity and coordination. Hybrid agent teams combining different cooperation modes performed worse still, completing only 14 tasks per 1,000 tokens.
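Inverting these figures into tokens per successful task makes the gap concrete. Only the tasks-per-1,000-tokens numbers below come from the study; the per-million-token price is an assumed placeholder, not a value from the paper.

```python
# Back-of-the-envelope arithmetic from the reported efficiency figures.
# The price per million tokens is an assumed placeholder for illustration.

tasks_per_1k_tokens = {
    "single agent": 67,
    "centralized multi-agent": 21,
    "hybrid teams": 14,
}
price_per_million_tokens = 1.00  # assumed USD, illustration only

for setup, tasks in tasks_per_1k_tokens.items():
    tokens_per_task = 1000 / tasks
    cost_per_task = tokens_per_task / 1_000_000 * price_per_million_tokens
    print(f"{setup}: {tokens_per_task:.0f} tokens per task, "
          f"${cost_per_task:.6f} per successful task")
```

Whatever the actual price, the ratios hold: a hybrid team burns roughly five times as many tokens per successful task as a single agent.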

New Scaling Rules and Breaking Away from the ‘More Agents’ Mantra

In response to these results, the scientists built predictive frameworks intended to serve as maps for agent system designers. Instead of guessing whether to rely on a strong single agent or a team, one can measure several task features and select a sensible strategy. According to the authors, these frameworks correctly identify the optimal coordination method in 87% of new configurations, nearly nine out of ten cases. One intriguing finding concerns tasks requiring around 16 different tools: there, a single agent or a decentralized setup outperformed complex, tightly coordinated multi-agent structures, contradicting the intuition that more tools call for more agents.

The whole set of results stands in stark contrast to last year’s popular paper “More Agents Is All You Need,” which suggested that merely scaling the number of cooperating models solves many problems[2]. This time, the team from Google and MIT proposes something completely different: the first set of quantitative scaling rules for agent systems, based on hard data rather than slogans. For anyone designing AI-based tools, this raises an uncomfortable question: does adding more agents to my system truly help, or does it just complicate things, increase computational costs, and accelerate error growth?