Skate Where the Puck Is Going
In honor of the US victory in Olympic hockey, here is some inspiration to skate where the puck is going. METR's benchmarks show that the task-completion time horizons of frontier AI agents are doubling roughly every four months. This interactive projection maps what that trajectory means through 2029.
Understanding the Benchmark
A quick note on what METR is actually measuring, because precision matters here. The task-completion time horizon is the duration of a task, measured by how long a skilled human professional takes to complete it, at which an AI agent succeeds with a given probability. A "p50 time horizon of 870 minutes" means the agent can complete tasks that would take a human expert roughly 14.5 hours, succeeding about half the time.
The benchmarks run on software engineering, machine learning, and cybersecurity tasks. The agents are not single models operating in isolation. METR's evaluation scaffolds already use multi-agent coordination. Their Triframe scaffold deploys an advisor sub-agent that strategizes before each action, parallel actor sub-agents that generate candidate actions, and parallel rater sub-agents that evaluate and select the best option. Even METR's "simpler" ReAct scaffold provides tool use and iterative reasoning loops. More recently, METR has tested Claude Code and OpenAI's Codex as scaffolds, finding that these specialized agent frameworks perform comparably to their default scaffolds.
This is significant: the frontier of autonomous AI capability is already being measured with coordinated sub-agent architectures, not single-model prompting. The doubling curve reflects what structured agent teams can accomplish, not just what a lone model can do.
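To make the Triframe loop concrete, here is a minimal sketch of its advisor-actors-raters iteration. This is illustrative only, not METR's actual code: the `advise`, `act`, and `rate` functions are hypothetical stand-ins for model calls, and the random rating is a placeholder for a model-based score.

```python
import random

def advise(task, history):
    """Advisor sub-agent: propose a strategy before each action."""
    return f"strategy for step {len(history)} of: {task}"

def act(task, strategy, seed):
    """Actor sub-agent: generate one candidate action under the strategy."""
    return f"candidate-{seed} following ({strategy})"

def rate(task, candidate):
    """Rater sub-agent: score a candidate action (higher is better)."""
    return random.random()  # placeholder for a model-based score

def triframe_step(task, history, n_actors=3, n_raters=2):
    """One iteration: advise, fan out parallel actors, rate, keep the best."""
    strategy = advise(task, history)
    candidates = [act(task, strategy, i) for i in range(n_actors)]
    # Average several raters' scores and select the top candidate.
    best = max(
        candidates,
        key=lambda c: sum(rate(task, c) for _ in range(n_raters)) / n_raters,
    )
    history.append(best)
    return best
```

The point of the sketch is the shape of the loop, not the internals: strategy first, parallel candidate generation second, independent evaluation third.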
What This Means for Business
The doubling curve is not theoretical. At the current rate, the tasks AI agents can reliably complete grow from the equivalent of two human workdays today to a full work week by late 2026 and a work month by mid-2027. These are tasks measured against skilled professionals with appropriate expertise, not entry-level work.
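As a back-of-the-envelope check on those milestones: starting from the roughly 14.5-hour horizon cited above and doubling every four months, the number of doublings to each milestone is easy to compute. The 40-hour week and 160-hour month definitions are assumptions for illustration; the calendar dates depend on when you anchor "today."

```python
import math

START_HORIZON_HOURS = 14.5  # ~870 minutes, per the benchmark figure above
DOUBLING_MONTHS = 4         # doubling rate cited in the text

def doublings_needed(target_hours):
    """Whole doublings required to reach target_hours from the start."""
    return math.ceil(math.log2(target_hours / START_HORIZON_HOURS))

for label, hours in [("work week", 40), ("work month", 160)]:
    d = doublings_needed(hours)
    print(f"{label} ({hours}h): {d} doublings ≈ {d * DOUBLING_MONTHS} months")
```

Two doublings (about eight months) clears a work week; four doublings (about sixteen months) clears a work month, which is roughly consistent with the late-2026 and mid-2027 milestones above.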
For mid-market companies, the strategic question is not whether AI agents will handle sustained complex work. The question is whether your data, workflows, and team structures are ready to direct that capability when it arrives. The companies building that foundation now will compound their advantage. The ones waiting for "proof" will be playing catch-up against competitors who started earlier.
The Sub-Agent Coordination Frontier
METR's own methodology reveals the direction of travel. The fact that their highest-performing scaffolds use sub-agent coordination (advisor, actors, raters working in parallel) is not incidental. It is the architecture that produces the best results on their benchmarks. This mirrors what we see in production deployments: the leap from single-agent to coordinated multi-agent systems is where the qualitative gains live.
A single agent completing a 14-hour task is useful. Multiple specialized agents, each operating from a different analytical lens and coordinating through structured protocols, can produce something qualitatively different: competing perspectives, adversarial stress-testing, and synthesized recommendations that no single agent would generate alone. The output is not five times that of a single agent. It is a fundamentally different class of analysis.
We have been running these multi-agent systems in production for enterprise clients. The key insight from both METR's research and our own work is the same: the coordination architecture matters as much as the individual agent capability. How agents hand off context, challenge each other's assumptions, and synthesize competing conclusions determines whether you get noise or intelligence.
The Collective Intelligence Frontier
As individual agent horizons extend from hours to days to weeks, the implications for coordinated agent teams compound. Consider what becomes possible at each stage:
D9-D10: Projects (2026)
Agent teams that can independently research, analyze, and draft complete project deliverables. A financial analyst agent, a technical feasibility agent, and a stakeholder mapping agent work in parallel, then a synthesis agent reconciles their findings. What currently takes a consulting team two weeks compresses to days.
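The fan-out-then-synthesize pattern described above can be sketched in a few lines. This is an illustrative shape, not a production system: the specialist roles and the `run_agent` function are hypothetical stand-ins for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(role, brief):
    """Hypothetical stand-in for invoking a specialized agent on a brief."""
    return f"{role} findings on: {brief}"

def project_deliverable(brief):
    """Run specialist agents in parallel, then reconcile their findings."""
    roles = ["financial analyst", "technical feasibility", "stakeholder mapping"]
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        findings = list(pool.map(lambda r: run_agent(r, brief), roles))
    # A synthesis agent reconciles the parallel findings into one deliverable.
    return run_agent("synthesis", " | ".join(findings))
```

The compression comes from the parallel fan-out; the quality comes from the synthesis step that forces the specialist outputs to be reconciled rather than stapled together.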
D11-D12: Campaigns (2027)
Sustained multi-agent operations that can plan, execute, and adapt over campaign-length timelines. Agent teams running go-to-market analysis, competitive intelligence, and pipeline optimization simultaneously, with adversarial agents stress-testing every recommendation before it surfaces. The coordination overhead that currently limits human teams dissolves.
D13-D14: Initiatives (2028)
Agent collectives that manage initiative-scale complexity: multi-stakeholder environments, regulatory considerations, cross-functional dependencies. The same coordination protocols drawn from game theory, intelligence analysis, and organizational science that power today's multi-agent systems will operate over horizons measured in weeks, not hours.
Why Coordination Architecture Matters Now
Here is the counterintuitive finding from our research: throwing more agents at a problem does not reliably improve output quality. In some configurations, it actively degrades it. METR found the same pattern. Claude Code and Codex, despite being sophisticated agent frameworks, did not outperform their simpler Triframe and ReAct scaffolds on time horizon measurements. More complexity is not automatically better.
The variable that matters most is task decomposition. How you break a complex problem into components that can be independently analyzed, then rigorously recombined, determines whether multi-agent coordination creates signal or noise. This is not a software engineering problem. It is an organizational design problem, informed by decades of research in decision science, intelligence analysis, and systems thinking.
The organizations that will capture the most value from the capability curve above are not the ones with the best models. They are the ones with the best coordination architectures: the protocols, the analytical pipelines, and the structured disagreement frameworks that turn raw agent capability into defensible decisions.
The Data Readiness Prerequisite
None of this matters if the foundation is broken. The most common failure mode in multi-agent deployments has nothing to do with agent design. It is data readiness. Agents operating on incomplete, inconsistent, or siloed data produce sophisticated-sounding analysis built on sand.
Before investing in agent orchestration, the prerequisite is honest assessment: Is your data clean enough, connected enough, and governed enough to support autonomous analytical systems? If the answer is no, that is where the work starts. The coordination architecture can wait. The data foundation cannot.
Where the Puck Is Going
The METR doubling curve tells us the floor is rising. Every four months, the complexity of tasks that AI agent teams can reliably complete roughly doubles. But the ceiling, the upper bound of what purpose-built coordination architectures can achieve on top of that rising floor, is where the real leverage lives. Collective intelligence amplifies individual capability in ways that are non-linear and, frankly, still surprising even to those of us building these systems daily.
The companies that will define the next era are not waiting for the curve to arrive at their doorstep. They are building the analytical infrastructure, the data foundations, and the coordination architectures today, so that when agent capability reaches the "Campaigns" and "Initiatives" tier on that chart, they are ready to direct it.
Wayne Gretzky never skated to where the puck was. Neither should you.
Scott Ewalt is the founder of Cardinal Element, an AI consulting firm that designs and deploys multi-agent orchestration systems for mid-market enterprises. Cardinal Element's platform coordinates 30+ analytical protocols across structured disagreement, adversarial stress-testing, and facilitated synthesis, validated through 430+ empirical benchmark runs.
Sources: METR Time Horizon 1.1 | Measuring AI Ability to Complete Long Tasks | Measuring Time Horizon Using Claude Code and Codex | METR Time Horizons Dashboard