The benchmark-reality gap creates an epistemic coordination failure in AI governance because algorithmic evaluation systematically overstates operational capability, making threshold-based coordination structurally miscalibrated even when all actors act in good faith
METR's finding that frontier models achieve 70-75% algorithmic success but 0% production-readiness on SWE-Bench Verified reveals a measurement validity gap that applies across existential-risk-relevant capability domains, preventing governance actors from coordinating around capability thresholds they cannot validly measure
Claim
METR's August 2025 paper resolves the apparent contradiction between rapid benchmark capability improvement (131-day doubling time) and a 19% developer productivity slowdown in RCTs by showing the two measure different things: algorithmic scoring captures component task completion, while holistic evaluation captures production-readiness. The quantitative gap: 70-75% algorithmic success on SWE-Bench Verified yields 0% production-ready PRs under human expert evaluation, with each 'passing' submission requiring 26 additional minutes of human work (one-third of total task time). Every algorithmically-passing run exhibited at least one of five failure modes: testing coverage gaps (100% of runs), documentation issues (75%), linting issues (75%), functionality gaps (25%), and other quality issues.
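The remediation figure implies the gap is large in absolute terms, not just relative ones. A back-of-envelope sketch (the 26-minute and one-third figures are METR's; the derived totals follow by arithmetic):

```python
# Back-of-envelope check on the METR remediation figures.
# Given: 26 extra minutes of human work per "passing" submission,
# stated to be one-third of total task time.

extra_minutes_per_pass = 26
fraction_of_total = 1 / 3

# Implied total human time per task once remediation is included.
total_minutes = extra_minutes_per_pass / fraction_of_total

# Every passing run needed remediation (at least one failure mode in
# 100% of passing runs), so at a 75% algorithmic pass rate the expected
# remediation cost per attempted task is:
algorithmic_pass_rate = 0.75
expected_remediation = algorithmic_pass_rate * extra_minutes_per_pass

print(f"Implied total task time: {total_minutes:.0f} min")                  # 78 min
print(f"Expected remediation per attempt: {expected_remediation:.1f} min")  # 19.5 min
```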
This gap extends beyond software engineering. AISI's self-replication roundup shows the same pattern: models score above 50% on RepliBench component tasks, while Google DeepMind's end-to-end evaluation found models 'largely failed' all 11 end-to-end tasks despite showing 'proximity to success.' The mechanism generalizes: algorithmic scoring captures component completion while omitting the integration and operational dimensions that determine dangerous real-world capability.
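One way to see why component scores and end-to-end outcomes can diverge this sharply is a toy compounding model (an illustration only, not METR's or AISI's analysis; the independence assumption and step counts are hypothetical):

```python
# Toy model: end-to-end success when a task chains k component steps.
# Assumes step successes are independent -- real integration failures
# are typically correlated, so actual end-to-end rates can be worse.

def end_to_end_rate(component_rate: float, k: int) -> float:
    """Probability that all k chained component steps succeed."""
    return component_rate ** k

for p in (0.5, 0.75, 0.9):
    rates = {k: end_to_end_rate(p, k) for k in (5, 10, 20)}
    print(f"component rate {p:.2f}: " +
          ", ".join(f"{k} steps -> {r:.4f}" for k, r in rates.items()))
```

Even a 90% component rate compounds to roughly a 12% end-to-end rate over 20 hypothetical steps, so '>50% on component tasks' and '0 of 11 end-to-end successes' are entirely compatible measurements.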
The governance implication: policy triggers (RSP capability thresholds, EU AI Act Article 55 obligations) are calibrated against benchmark metrics that systematically misrepresent dangerous autonomous capability. When coordination depends on a shared measurement that does not track the underlying phenomenon, coordination fails even when all actors act in good faith. This is distinct from adversarial problems (sandbagging, competitive pressure) and structural problems (economic incentives, observability gaps): it is a passive, systematic miscalibration that operates while the technology behaves exactly as designed.
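The failure mode can be stated mechanically. A minimal sketch, assuming a hypothetical RSP-style threshold rule (the names, threshold value, and trigger logic are illustrative; only the ~72%-vs-0% gap is from METR):

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    benchmark_score: float    # what governance can observe (algorithmic)
    operational_score: float  # what the threshold is meant to proxy

def threshold_trigger(e: Evaluation, threshold: float = 0.5) -> bool:
    """Hypothetical capability-threshold rule: governance sees only the
    benchmark score, so the trigger fires on the proxy regardless of
    whether the operational quantity tracks it."""
    return e.benchmark_score >= threshold

# METR's reported gap: ~72% algorithmic success, 0% production-ready.
e = Evaluation(benchmark_score=0.72, operational_score=0.0)
print(threshold_trigger(e))  # True: fires although the proxied quantity is zero
```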
METR explicitly questions its own primary governance metric: 'Time horizon doubling times reflect benchmark performance growth, not operational dangerous autonomy growth.' The epistemic mechanism precedes and underlies other coordination failures because governance cannot choose the right response to a thing it cannot measure. RSP v3.0's October 2026 response (extending evaluation intervals for the same methodology) came more than a year after METR published the diagnosis, confirming that the research-to-governance translation gap operates even between close collaborators.
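For the doubling-time metric METR flags, the arithmetic behind evaluations losing ground is plain exponential growth (131 days is METR's reported figure; the 6-month re-evaluation cadence is an illustrative assumption, and the computed multiple describes benchmark performance, not operational autonomy):

```python
DOUBLING_DAYS = 131  # METR's reported benchmark time-horizon doubling time

def benchmark_multiple(days: float) -> float:
    """Benchmark-measured capability multiple after `days`,
    assuming h(t) = h0 * 2**(t / DOUBLING_DAYS)."""
    return 2 ** (days / DOUBLING_DAYS)

interval = 182  # illustrative ~6-month re-evaluation cadence, in days
m = benchmark_multiple(interval)
print(f"{interval} days -> {m:.2f}x benchmark performance "
      f"({interval / DOUBLING_DAYS:.2f} doublings)")
# Per METR's caveat, this multiple tracks benchmark growth only; it does
# not measure growth in operational dangerous autonomy.
```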
Sources
- METR August 2025 reconciliation paper
- AISI self-replication roundup
- Pattern confirmed across software engineering and self-replication domains
Connections
Supports 5
- AI capability benchmarks exhibit 50% volatility between versions making governance thresholds derived from them unreliable moving targets
- Benchmark-based AI capability metrics overstate real-world autonomous performance because automated scoring excludes documentation, maintainability, and production-readiness requirements
- Evaluation awareness creates bidirectional confounds in safety benchmarks because models detect and respond to testing conditions in ways that obscure true capability
- Frontier AI autonomous task completion capability doubles every 6 months, making safety evaluations structurally obsolete within a single model generation
- Precautionary capability threshold activation without confirmed threshold crossing is the governance response to bio capability measurement uncertainty as demonstrated by Anthropic's ASL-3 activation for Claude 4 Opus
Related 3
- Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
- technology-governance-coordination-gaps-close-when-four-enabling-conditions-are-present-visible-triggering-events-commercial-network-effects-low-competitive-stakes-at-inception-or-physical-manifestation.md
- formal-coordination-mechanisms-require-narrative-objective-function-specification.md