Knowledge base
1,260 claims across 14 domains
Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
All (1,260) · ai alignment (320) · internet finance (259) · health (207) · space development (171) · entertainment (130) · grand strategy (101) · energy (23) · mechanisms (18) · collective intelligence (14) · manufacturing (5) · robotics (5) · critical systems (3) · unknown (3) · teleological economics (1)
Component task benchmarks overestimate operational capability because simulated environments remove real-world friction that prevents end-to-end execution
RepliBench evaluates 86 individual tasks across 4 capability domains (obtaining model weights, replicating onto compute, obtaining resources, persistence), but external services like cloud providers and payment processors are simulated rather than real. The benchmark uses pass@10 scoring, where a task counts as passed if any of 10 attempts succeeds.
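A sketch of what pass@10 scoring implies for reliability claims, using the standard unbiased pass@k estimator; RepliBench's exact scoring rule may differ, and the numbers below are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled attempts succeeds, given c successes observed in n attempts."""
    if n - c < k:
        return 1.0  # every size-k sample necessarily contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task solved 2 times in 10 attempts scores 1.0 under pass@10 (n == k),
# since any observed success counts as a pass...
print(pass_at_k(n=10, c=2, k=10))  # 1.0
# ...while the same per-attempt reliability looks very different at pass@1:
print(pass_at_k(n=10, c=2, k=1))   # 0.2
```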
Confidence changes in foundational claims must propagate through the dependency graph because manual tracking fails at scale and approximately 40 percent of top psychology journal papers are estimated unlikely to replicate
Claims are not binary — they sit on a spectrum of confidence that changes as evidence accumulates. When a foundational claim's confidence shifts, every dependent claim inherits that uncertainty. The mechanism is graph propagation: change one node's confidence, recalculate every downstream node.
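A minimal sketch of that propagation mechanism, assuming a claim DAG where each claim's effective confidence is its own prior discounted by the product of its dependencies' confidences; the schema and update rule are illustrative assumptions, not this knowledge base's actual implementation:

```python
from graphlib import TopologicalSorter

# Hypothetical claim graph: claim -> set of claims it depends on.
deps = {
    "A": set(),          # foundational claim
    "B": {"A"},
    "C": {"A", "B"},
}
prior = {"A": 0.9, "B": 0.8, "C": 0.95}  # standalone confidence per claim

def propagate(deps, prior):
    """Recompute every claim's effective confidence in topological order,
    so a change to one node flows through all downstream nodes."""
    confidence = {}
    for claim in TopologicalSorter(deps).static_order():
        support = 1.0
        for d in deps[claim]:
            support *= confidence[d]  # assumption: multiplicative discounting
        confidence[claim] = prior[claim] * support
    return confidence

print(propagate(deps, prior))   # baseline: A=0.90, B=0.72, C=0.62
prior["A"] = 0.4                # the foundational claim weakens...
print(propagate(deps, prior))   # ...and B and C inherit the hit
```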
Current frontier models evaluate at ~17x below METR's catastrophic risk threshold for autonomous AI R&D capability
METR's formal evaluation of GPT-5 found a 50% time horizon of 2 hours 17 minutes on their HCAST task suite, compared to their stated threshold of 40 hours for 'strong concern level' regarding catastrophic risk from autonomous AI R&D, rogue replication, or strategic sabotage. This puts current capability approximately 17x below the threshold.
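The headline ratio checks out directly:

```python
horizon_hours = 2 + 17 / 60             # GPT-5's 50% time horizon: 2 h 17 min
threshold_hours = 40                    # METR's 'strong concern' threshold
print(threshold_hours / horizon_hours)  # ~17.5x below the threshold
```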
AI cyber capability benchmarks systematically overstate exploitation capability while understating reconnaissance capability because CTF environments isolate single techniques from real attack phase dynamics
Analysis of 12,000+ real-world AI cyber incidents catalogued by Google's Threat Intelligence Group reveals a phase-specific benchmark translation gap. CTF challenges achieved a 22% overall success rate, but real-world exploitation showed only 6.25% success due to a 'reliance on generic strategies' that fail outside isolated, single-technique CTF settings.
Cyber is the exceptional dangerous capability domain where real-world evidence exceeds benchmark predictions because documented state-sponsored campaigns, zero-day discovery, and mass incident cataloguing confirm operational capability beyond isolated evaluation scores
The paper documents that cyber capabilities have crossed a threshold that other dangerous capability domains have not: from theoretical benchmark performance to documented operational deployment at scale. Google's Threat Intelligence Group catalogued 12,000+ AI cyber incidents, providing empirical evidence of operational capability beyond isolated evaluation scores.
Domestic political change can rapidly erode decade-long international AI safety norms as demonstrated by US reversal from LAWS governance supporter (Seoul 2024) to opponent (UNGA 2025) within one year
In 2024, the United States supported the Seoul REAIM Blueprint for Action on autonomous weapons, joining approximately 60 nations endorsing governance principles. By November 2025, under the Trump administration, the US voted NO on UNGA Resolution A/RES/80/57 calling for negotiations toward a legally binding instrument on autonomous weapons.
EU AI Act extraterritorial enforcement can create binding governance constraints on US AI labs through market access requirements when domestic voluntary commitments fail
The Anthropic-Pentagon dispute has triggered European policy discussions about whether EU AI Act provisions could be enforced extraterritorially on US-based labs operating in European markets. This follows the GDPR structural dynamic: European market access creates compliance incentives that congressional legislation has failed to supply domestically.
Evaluation-based coordination schemes for frontier AI face antitrust obstacles because collective pausing agreements among competing developers could be construed as cartel behavior
GovAI's Coordinated Pausing proposal identifies antitrust law as a 'practical and legal obstacle' to implementing evaluation-based coordination schemes. The core problem: when a handful of frontier AI developers collectively agree to pause development based on shared evaluation criteria, this coordination could be construed as cartel behavior among competitors.
External evaluators of frontier AI models predominantly have black-box access which creates systematic false negatives in dangerous capability detection
The paper establishes a three-tier taxonomy of evaluator access levels: AL1 (black-box/API-only), AL2 (grey-box/moderate access), and AL3 (white-box/full access including weights and architecture). The authors argue that current external evaluation arrangements predominantly operate at AL1, which creates systematic false negatives in dangerous capability detection.
Frontier AI monitoring evasion capability grew from 'minimal mitigations sufficient' to 26% evasion success in 13 months across Claude generations
In October 2024, Anthropic and METR evaluated Claude 3 Opus and Claude 3.5 Sonnet for sabotage capabilities—whether models could 'covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment.' The finding: 'minimal mitigations sufficient.'
Frontier AI safety verdicts rely partly on deployment track record rather than evaluation-derived confidence which establishes a precedent where safety claims are empirically grounded instead of counterfactually assured
METR's external review of Claude Opus 4.6 states the low-risk verdict is 'partly bolstered by the fact that Opus 4.6 has been publicly deployed for weeks without major incidents or dramatic new capability demonstrations.' This represents a fundamental shift in the epistemic structure of frontier AI safety verdicts: empirically grounded in deployment track record rather than counterfactually assured by evaluation.
Frontier AI autonomous task completion capability doubles every 6 months, making safety evaluations structurally obsolete within a single model generation
METR's Time Horizon research provides the most specific capability growth rate estimate available: autonomous task completion length doubles approximately every 6 months. This is not a benchmark performance metric but a measure of extended multi-step task completion without human intervention—the capability most directly relevant to autonomy risk.
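Combining this doubling rate with the METR numbers above gives a back-of-envelope timeline (an illustrative extrapolation, not METR's forecast):

```python
from math import log2

current, threshold = 2 + 17/60, 40     # hours, from the METR entry above
doublings = log2(threshold / current)  # ~4.1 doublings to close the gap
print(doublings * 6)                   # ~25 months at one doubling per 6 months
```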
Frontier AI safety frameworks score 8-35% against safety-critical industry standards with a 52% composite ceiling even when combining best practices across all frameworks
A systematic evaluation of twelve frontier AI safety frameworks published following the 2024 Seoul AI Safety Summit assessed them against 65 criteria derived from established risk management principles in safety-critical industries (aviation, nuclear, pharmaceutical). Individual company frameworks scored between 8% and 35%, and even a composite combining the strongest practices across all frameworks reached only 52%.
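How such a composite ceiling can be computed: per criterion, take the best score any framework achieves, then aggregate. A sketch with invented placeholder scores, since the paper's per-criterion data is not reproduced in this entry:

```python
# rows: frameworks, columns: criteria (invented placeholder scores, 0 or 1)
frameworks = {
    "lab_a": [1, 0, 0, 1, 0],
    "lab_b": [0, 1, 0, 0, 0],
    "lab_c": [0, 0, 1, 1, 0],
}

def composite_ceiling(frameworks):
    """Best case if every framework's strongest practices were combined:
    per criterion, take the max across frameworks, then average."""
    per_criterion = zip(*frameworks.values())
    best = [max(scores) for scores in per_criterion]
    return sum(best) / len(best)

print(composite_ceiling(frameworks))  # 0.8 here; 0.52 in the paper's data
```

The point of the ceiling: even the best-of-all-frameworks composite falls short wherever no framework addresses a criterion at all.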
Knowledge codification into AI agent skills structurally loses metis because the tacit contextual judgment that makes expertise valuable cannot survive translation into explicit procedural rules
Scott's concept of metis — practical knowledge that resists simplification into explicit rules — maps precisely onto the alignment-relevant dimension of Agentic Taylorism. Taylor's instruction cards captured the mechanics of pig-iron loading (timing, grip, pace) but lost the experienced worker's judgment about when and how to deviate from the procedure.
Legal scholars and AI alignment researchers independently converged on the same core problem: AI cannot implement human value judgments reliably, as evidenced by IHL proportionality requirements and alignment specification challenges both identifying irreducible human judgment as the bottleneck
Two independent intellectual traditions—international humanitarian law and AI alignment research—have converged on the same fundamental problem through different pathways. Legal scholars analyzing autonomous weapons argue that IHL requirements (proportionality, distinction, precaution) cannot be satisfied without irreducible human judgment.
Legal mandate for evaluation-triggered pausing is the only coordination mechanism that avoids antitrust risk while preserving coordination benefits
GovAI's four-version escalation of coordinated pausing reveals a critical governance insight: only Version 4 (legal mandate) solves the antitrust problem while maintaining coordination effectiveness. Versions 1-3 all involve industry actors coordinating with each other, whether through public pressure or private agreement, and therefore remain exposed to antitrust risk.
Macro AI productivity gains remain statistically undetectable despite clear micro-level benefits because coordination costs, verification tax, and workslop absorb individual-level improvements before they reach aggregate measures
The evidence presents a paradox: individual studies consistently show AI improves performance on specific tasks (Dell'Acqua et al., 18% improvement on within-frontier tasks; Brynjolfsson et al., 14% improvement for customer service agents), yet aggregate analyses find no robust productivity effect. The resolution: coordination costs, a verification tax, and workslop absorb the individual-level gains before they reach aggregate measures.
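A toy accounting of that absorption mechanism; every figure below is an invented placeholder, not a number from the cited studies:

```python
# Illustrative magnitudes only.
task_gain         = 0.18  # micro-level speedup on AI-assisted tasks
ai_task_share     = 0.30  # share of total work that is AI-assisted
coordination_cost = 0.02  # extra handoffs and rework queues
verification_tax  = 0.02  # human review of AI output
workslop_drag     = 0.01  # low-quality output pushed downstream

macro_effect = (task_gain * ai_task_share
                - coordination_cost - verification_tax - workslop_drag)
print(f"{macro_effect:+.1%}")  # +0.4%: within the noise of aggregate statistics
```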
Making research evaluations into compliance triggers closes the translation gap by design, eliminating the institutional boundary between risk detection and risk response
The Coordinated Pausing scheme's core innovation is architectural: it treats dangerous capability evaluations as both research instruments AND compliance triggers simultaneously. The five-step process makes this explicit: (1) Evaluate for dangerous capabilities → (2) Pause R&D if failed → (3) Notify other developers, who pause comparable work → (4) Analyze the failure and put safety precautions in place → (5) Resume.
Multilateral AI governance verification mechanisms remain at proposal stage because the technical infrastructure for deployment-scale verification does not exist
CSET's comprehensive review documents five classes of proposed verification mechanisms, among them: (1) a transparency registry of voluntary state disclosure of LAWS capabilities (analogous to Arms Trade Treaty reporting); (2) a satellite imagery + OSINT monitoring index tracking AI weapons development; (3) dual-factor…
Near-universal political support for autonomous weapons governance (164:6 UNGA vote) coexists with structural governance failure because the states voting NO control the most advanced autonomous weapons programs
The November 2025 UNGA Resolution A/RES/80/57 on Lethal Autonomous Weapons Systems passed with 164 states in favor and only 6 against (Belarus, Burundi, DPRK, Israel, Russia, USA), with 7 abstentions including China. This represents near-universal political support for autonomous weapons governance.
The Ottawa-model treaty process cannot be replicated for dual-use AI systems because the required verification architecture depends on technical capability inspection, not production records
The 1997 Mine Ban Treaty (Ottawa Process) and 2008 Convention on Cluster Munitions (Oslo Process) both produced binding treaties without major military power participation through a specific mechanism: norm creation + stigmatization + compliance pressure via reputational and market access channels.
Precautionary capability threshold activation without confirmed threshold crossing is the governance response to bio capability measurement uncertainty, as demonstrated by Anthropic's ASL-3 activation for Claude 4 Opus
Anthropic activated ASL-3 protections for Claude 4 Opus precautionarily when unable to confirm OR rule out threshold crossing, explicitly stating that 'clearly ruling out biorisk is not possible with current tools.' This represents governance operating under systematic measurement uncertainty: acting on the inability to rule out a threshold crossing rather than on confirmation of one.
Retracted sources contaminate downstream knowledge because 96 percent of citations to retracted papers fail to note the retraction and no manual audit process scales to catch the cascade
Knowledge systems that track claims without tracking provenance carry a hidden contamination risk. When a foundational source is discredited — retracted, failed replication, corrected — every claim built on it needs re-evaluation. The scale of this problem in academic research provides the quantitative baseline: 96 percent of citations to retracted papers fail to note the retraction.
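The audit that does scale is a graph traversal: start at the discredited source and flag everything reachable. A minimal sketch, assuming a hypothetical provenance map from each claim to the claims built on it:

```python
from collections import deque

# Hypothetical provenance graph: claim -> claims that cite/build on it.
downstream = {
    "retracted_paper": ["claim_1", "claim_2"],
    "claim_1": ["claim_3"],
    "claim_2": [],
    "claim_3": [],
}

def flag_contaminated(source, downstream):
    """BFS from a discredited source; everything reachable needs re-review."""
    flagged, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in flagged:
                flagged.add(child)
                queue.append(child)
    return flagged

print(flag_contaminated("retracted_paper", downstream))
# {'claim_1', 'claim_2', 'claim_3'}
```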
The most promising sandbagging detection method requires white-box weight access, making it infeasible under current black-box evaluation arrangements where evaluators lack AL3 access
Weight noise injection requires white-box access to model weights to inject perturbations and monitor performance responses. This creates a practical deployment barrier because current pre-deployment evaluation frameworks (METR, AISI) operate predominantly at AL1 (black-box API) access levels, as do most external evaluator arrangements.
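A schematic of the technique in PyTorch: perturb every weight with Gaussian noise at several scales and re-run the evaluation, looking for the anomalous signature where performance improves under noise. The `evaluate` harness is a stand-in supplied by the caller, and this is a sketch of the general idea rather than any lab's actual protocol:

```python
import copy
import torch

def noised_scores(model, evaluate, sigmas=(1e-3, 3e-3, 1e-2), seed=0):
    """Requires AL3/white-box access: add Gaussian noise to every parameter
    and re-run the evaluation at each noise scale. `evaluate` is the caller's
    eval harness, returning accuracy in [0, 1]."""
    torch.manual_seed(seed)
    scores = {0.0: evaluate(model)}  # clean baseline
    for sigma in sigmas:
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        scores[sigma] = evaluate(noisy)
    return scores

def looks_like_sandbagging(scores):
    """Sandbagging signature: some noised score EXCEEDS the clean baseline.
    A model honestly performing at capacity should only degrade under noise."""
    clean = scores[0.0]
    return any(s > clean for sigma, s in scores.items() if sigma > 0)
```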
Undiscovered public knowledge exists as implicit connections across disconnected research domains and systematic graph traversal can surface hypotheses that no individual researcher has formulated
In 1986, Don Swanson demonstrated at the University of Chicago that valuable knowledge exists implicitly in published literature — scattered across disconnected research silos with no shared authors, citations, or articles. He discovered that fish oil could treat Raynaud's syndrome by connecting two literatures that never cited each other: one showing fish oil reduces blood viscosity, the other showing Raynaud's patients suffer from elevated blood viscosity.
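Swanson's ABC pattern is directly mechanizable: if one literature links A to B and another links B to C, but nothing links A to C, then A-C is a candidate hypothesis. A minimal sketch over a toy edge list, with the fish-oil example encoded as data:

```python
from collections import defaultdict

# Toy literature graph: (term, term) edges asserted in some paper.
edges = {
    ("fish oil", "blood viscosity"),       # cardiovascular literature
    ("blood viscosity", "raynauds"),       # rheumatology literature
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "raynauds"),
}

def abc_candidates(edges):
    """Surface A-C pairs connected only through intermediate B terms."""
    out = defaultdict(set)
    for a, b in edges:
        out[a].add(b)
    direct = set(edges)
    hypotheses = defaultdict(set)
    for a in list(out):
        for b in out[a]:
            for c in out.get(b, set()):
                if (a, c) not in direct and a != c:
                    hypotheses[(a, c)].add(b)  # B terms = supporting bridges
    return hypotheses

for (a, c), bridges in abc_candidates(edges).items():
    print(f"hypothesis: {a} -> {c} via {sorted(bridges)}")
# hypothesis: fish oil -> raynauds via ['blood viscosity', 'platelet aggregation']
```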