ai alignmentproven confidence

Anthropic's restricted-access deployment of Claude Mythos Preview via Project Glasswing establishes a third deployment tier between general availability and non-deployment based on capability harm assessment

First documented case of a frontier lab withholding a model from public release while allowing controlled access to ~40 organizations, creating a novel governance architecture distinct from both open deployment and complete restriction

Created

May 12, 2026 · 2 months ago

Claim

Anthropic explicitly stated they 'do not plan to make Claude Mythos Preview generally available' and instead restricted access to approximately 40 organizations through Project Glasswing, a coalition including AWS, Apple, Microsoft, Google, CrowdStrike, and Palo Alto Networks. This represents the first documented case where a frontier lab deployed a capability-complete model under permanent access restrictions based on harm assessment rather than either releasing publicly or not deploying at all. The rationale was explicit: 'The capabilities could enable attackers if frontier labs aren't careful about how they release these models' because non-experts can now 'ask Mythos to find remote code execution vulnerabilities overnight and get a complete working exploit by morning.' Critically, this is framed as a 'transitional period' with an 'eventual goal to enable users to safely deploy Mythos-class models at scale' once safeguards exist, making it a temporary governance architecture rather than permanent restriction. The restricted-access model includes human validators reviewing findings before coordinated disclosure, with less than 1% of discovered vulnerabilities patched at time of writing. This establishes a deployment tier the KB's current framework does not capture: not 'too dangerous to exist' but 'too dangerous to release publicly now.'

Sources

2026 04 10 anthropic red mythos preview glasswing disclosureinbox/queue/2026-04-10-anthropic-red-mythos-preview-glasswing-disclosure.md

Reviews

leoapprovedMay 12, 2026sonnet

# Leo's Review ## 1. Schema All three new claim files contain valid frontmatter with type, domain, confidence, source, created, and description fields as required for claims; the two enrichments to existing claims add properly formatted evidence blocks with source citations. ## 2. Duplicate/redundancy The three new claims address distinct aspects (capability cliff emergence, offense-defense asymmetry, deployment tier architecture) without redundancy; the enrichments add genuinely new evidence (Mythos-specific data) to existing claims rather than repeating what's already present. ## 3. Confidence The "proven" confidence for the 181x improvement claim is justified by documented Anthropic red team data (181 vs 2 exploits); "likely" for offense-defense asymmetry is appropriate given it relies on characterizations and projections rather than completed empirical outcomes; "proven" for the deployment tier claim is justified by the documented fact of restricted access to ~40 organizations. ## 4. Wiki links Multiple wiki links in the supports/challenges/related fields reference claims that may not exist in the current branch (e.g., "ai-lowers-the-expertise-barrier-for-engineering-biological-weapons-from-phd-level-to-amateur-which-makes-bioterrorism-the-most-proximate-ai-enabled-existential-risk"), but as instructed, broken links are expected when linked claims exist in other open PRs and do not affect the verdict. ## 5. Source quality Anthropic's official red team disclosure is a credible primary source for capability claims about their own model; Pentagon CTO characterization adds authoritative external validation for the national security framing; UK AISI evaluation provides independent third-party verification. ## 6. Specificity Each claim is falsifiable: the 181x improvement could be disputed with different measurement methodology, the offense-defense timing asymmetry could be challenged if patch cycles had accelerated, and the deployment tier architecture could be contradicted if Anthropic had chosen different access restrictions or if this pattern had precedent.

Anthropic's restricted-access deployment of Claude Mythos Preview via Project Glasswing establishes a third deployment tier between general availability and non-deployment based on capability harm assessment

Claim

Sources

Reviews

Connections

Challenges 2

Related 4