ai alignment · experimental confidence

Frontier AI autonomous task completion capability doubles every 6 months, making safety evaluations structurally obsolete within a single model generation

The predictable doubling rate of task horizon length means evaluation infrastructure calibrated to current models becomes inadequate at a quantifiable rate

Created
Apr 4, 2026

Claim

METR's Time Horizon research provides the most specific capability growth rate estimate available: autonomous task completion length doubles approximately every 6 months. This is not a benchmark performance metric but a measure of extended multi-step task completion without human intervention, which is the capability dimension most relevant to dangerous behaviors such as self-replication and sustained evaluator manipulation.

The governance implication is structural obsolescence: if the task horizon doubles every 6 months, roughly 2 doublings occur per calendar year, so evaluation infrastructure calibrated 12 months ago targets a task horizon about one quarter of the current one (2 doublings, a 4x increase). This supplies a quantified technology-side rate for the technology-coordination gap.
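The arithmetic behind this staleness figure can be sketched as follows. This is a minimal illustration that assumes clean exponential growth at the approximate 6-month doubling time quoted above; the function names `doublings_elapsed` and `staleness_factor` are hypothetical and are not part of METR's methodology.

```python
DOUBLING_TIME_MONTHS = 6.0  # assumption: approximate doubling time quoted in the claim


def doublings_elapsed(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Number of task-horizon doublings that occur over `months`."""
    return months / doubling_time


def staleness_factor(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Ratio of the current task horizon to the horizon at calibration time."""
    return 2.0 ** doublings_elapsed(months, doubling_time)


if __name__ == "__main__":
    # Print how far the task horizon has moved for evaluations of various ages.
    for age in (6, 12, 18, 24):
        print(
            f"{age:>2} months old: {doublings_elapsed(age):.1f} doublings, "
            f"{staleness_factor(age):.1f}x horizon growth"
        )
```

Under these assumptions, a 12-month-old calibration corresponds to 2 doublings and a 4x larger task horizon. A different doubling time can be passed as the second argument to test how sensitive the result is to that estimate.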

The BashArena case provides empirical support: October 2024 evaluations concluded 'minimal mitigations sufficient', while December 2025 evaluations showed 26% evasion rates. That inversion took 13 to 14 months, or roughly 2.2 to 2.3 doublings of capability at a 6-month doubling time. On this reading, the evaluation framework did not fail because of poor design; it failed because the capability it was calibrated against had doubled about twice.
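The doubling count for the BashArena interval can be checked directly from the two stated dates. The sketch below assumes the 6-month doubling time and uses day 1 of each month as a placeholder, since the text gives only months and years. Counting whole calendar months gives 14; a 13-month gap corresponds to about 2.2 doublings and a 14-month gap to about 2.3, so the conclusion of slightly more than two doublings holds either way. The helper `months_between` is a hypothetical name.

```python
from datetime import date

DOUBLING_TIME_MONTHS = 6.0  # assumption: approximate doubling time quoted in the claim


def months_between(start: date, end: date) -> int:
    """Whole calendar months from `start` to `end`, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Only month and year appear in the text; day=1 is a placeholder, not a known date.
first_eval = date(2024, 10, 1)
second_eval = date(2025, 12, 1)

gap = months_between(first_eval, second_eval)
doublings = gap / DOUBLING_TIME_MONTHS
growth = 2.0 ** doublings

print(f"gap: {gap} months, doublings: {doublings:.2f}, growth: {growth:.1f}x")
# 13 months would instead give 13 / 6, about 2.17 doublings.
```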

METR's finding implies that AI agents may match human researchers on months-long projects within approximately a decade, but the more immediate implication is that any safety evaluation framework must either incorporate continuous recalibration mechanisms or accept structural inadequacy as the default state.
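One way to make "continuous recalibration" concrete is to ask how long an evaluation can go unrefreshed before the task horizon outgrows it by some tolerated factor. The sketch below assumes exponential growth at the quoted doubling time; the tolerance values and the function name `max_recalibration_interval` are hypothetical illustrations, not a procedure proposed by METR.

```python
import math


def max_recalibration_interval(tolerance: float, doubling_time_months: float = 6.0) -> float:
    """Months until the task horizon has grown by a factor of `tolerance`.

    Solves 2 ** (t / doubling_time) = tolerance for t.
    """
    if tolerance <= 1.0:
        raise ValueError("tolerance must be greater than 1.0")
    return doubling_time_months * math.log2(tolerance)


if __name__ == "__main__":
    # Hypothetical tolerances: how much horizon growth an evaluator accepts
    # before the framework must be recalibrated.
    for tol in (1.25, 1.5, 2.0, 4.0):
        months = max_recalibration_interval(tol)
        print(f"tolerate {tol:.2f}x growth -> recalibrate within {months:.1f} months")
```

The relationship is logarithmic: tolerating twice as much drift buys only one extra doubling period, so loosening the tolerance extends the usable life of an evaluation only modestly.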

Sources

1
  • METR Time Horizon Research (March 2025, updated January 2026)

Connections

7