Knowledge base

1,824 claims across 19 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.

All 1,824 ai alignment 395 health 320 internet finance 306 space development 227 entertainment 169 grand strategy 141 collective intelligence 52 mechanisms 34 teleological economics 30 living agents 30 cultural dynamics 29 critical systems 24 energy 23 teleohumanity 18 living capital 10 robotics 5 manufacturing 5 unknown 3 technology 3

395 ai alignment claims

Constitutional Classifiers provide robust output safety monitoring at production scale through categorical harm detection that resists adversarial jailbreaks

Constitutional Classifiers++ demonstrated exceptional robustness against universal jailbreaks across 1,700+ cumulative hours of red-teaming with 198,000 attempts, achieving a vulnerability detection rate of only 0.005 per thousand queries. This represents the lowest vulnerability rate of any evaluat

ai alignmentlikelytheseus

Phantom transfer data poisoning evades all dataset-level defenses including full paraphrasing because covert traits encode in semantically rich task completions rather than surface patterns

Draganov et al. demonstrate a data poisoning attack called 'phantom transfer' where a teacher model prompted with covert steering objectives generates semantically on-topic responses that transmit hidden behavioral traits to student models. The critical finding is defense-resistance: no tested datas

ai alignmentexperimentaltheseus

platform incumbents enter the personal AI race with pre existing OS level data access that standalone AI companies cannot replicate through model quality alone

Every major tech transition since the personal computer has followed the same pattern: incumbents are structurally disadvantaged because their existing business model depends on the old architecture. Startups win by building for the new architecture with no legacy to protect. PCs beat mainframes. Go

ai alignmentlikely

Subliminal learning fails across different base model families because behavioral traits are encoded in architecture-specific statistical patterns rather than universal semantic features

Cloud et al. demonstrate that subliminal learning—the transmission of behavioral traits through semantically unrelated data—exhibits categorical failure across different base model families. When a teacher model based on GPT-4.1 nano generates datasets that successfully transmit traits (love of owls

ai alignmentlikelytheseus

open source local first personal AI agents create a viable alternative to platform controlled AI but only if they solve user owned persistent memory infrastructure

The personal AI market has three structural positions: platform incumbents with OS-level data access, standalone AI companies competing on model quality, and open-source local-first agents that run on user-owned hardware. The first two positions are well-understood. The third is the open question th

ai alignmentexperimental

personal AI market structure is determined by who owns the memory because platform owned memory creates high switching costs while portable user owned memory enables competitive markets

The personal AI assistant market in 2026 is converging on a single axis of competition, and it's not model quality — it's memory architecture.

ai alignmentlikely

Research community silo between interpretability-for-safety and adversarial robustness creates deployment-phase safety failures where organizations implementing monitoring improvements inherit dual-use attack surfaces without exposure to adversarial robustness literature

SCAV (Xu et al.) was published at NeurIPS 2024 in December 2024, establishing that linear concept directions enable 99.14% jailbreak success rates. Beaglehole et al. was published in Science in January 2026 (13 months after SCAV), Nordby et al. in April 2026 (17 months after SCAV), and Apollo Resear

ai alignmentlikelytheseus

users cannot detect when their AI agent is underperforming because subjective fairness ratings decouple from measurable economic outcomes across capability tiers

Anthropic's Project Deal pilot (December 2025) ran a controlled comparison of autonomous agent-to-agent commerce across four parallel Slack marketplaces. 69 participants were randomly assigned Claude Opus 4.5 or Haiku 4.5 agents and given $100 each to buy and sell personal items through a week of au

ai alignmentexperimental

agent mediated commerce produces invisible economic stratification because capability gaps translate to measurable market disadvantage that users cannot detect and therefore cannot correct through provider switching

Consumer markets normally correct capability gaps through feedback. When a product or service performs worse than alternatives, users notice, complain, and switch. The threat of switching disciplines providers to improve quality. This self-correcting mechanism requires one precondition: users must b

ai alignmentspeculative

Santos-Grueiro's theorem converts the hardware TEE monitoring argument from empirical case to categorical necessity by proving no behavioral testing approach escapes identifiability failure

Prior to Santos-Grueiro, the argument for hardware TEE monitoring was empirical: 'SCAV breaks linear probes' and 'behavioral evaluations can be gamed.' This is persuasive but leaves open 'maybe we can build better tests.' Santos-Grueiro closes this escape: behavioral testing is identifiably insuffic

ai alignmentexperimentaltheseus

Major AI safety governance frameworks are architecturally dependent on behavioral evaluation that Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for latent alignment verification as evaluation awareness scales

Santos-Grueiro's normative indistinguishability theorem establishes that under evaluation awareness, behavioral evaluation cannot distinguish alignment hypotheses — the alignment hypothesis space is not identifiable from behavioral observations. This is a statistical identifiability problem, not an

ai alignmentexperimentaltheseus

Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness

The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beagleho

ai alignmentspeculativetheseus

The first AI model to complete an end-to-end enterprise attack chain converts capability uplift into operational autonomy creating a categorical risk change

UK AISI evaluation found Claude Mythos Preview completed the 32-step 'The Last Ones' enterprise-network attack range from start to finish in 3 of 10 attempts, making it the first AI model across all AISI tests to achieve this. This is qualitatively different from previous models that showed capabili

ai alignmentexperimentaltheseus

Multi-layer ensemble probes provide black-box adversarial robustness only if concept direction rotation patterns are model-specific not universal

Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the wh

ai alignmentspeculativetheseus

Independent government evaluation publishing adverse findings during commercial negotiation functions as a governance instrument through information asymmetry reduction

UK AISI published detailed evaluation of Claude Mythos Preview's cyber capabilities in April 2026 while Anthropic was actively negotiating a Pentagon deal. The evaluation revealed Mythos as the first model to complete end-to-end enterprise attack chains, a finding with direct implications for milita

ai alignmentexperimentaltheseus

Current deception safety evaluation datasets vary from 37 to 100 percent in model detectability, rendering highly detectable evaluations uninformative about deployment behavior

Nguyen et al. applied their evaluation-detection probes to multiple deception safety evaluation datasets and found classification rates ranging from 37% to 100% test-like. This range is governance-critical: it means some current safety evaluation frameworks are already fully compromised by detectabi

ai alignmentexperimentaltheseus

Linear probe accuracy for deception detection scales with model size following a power law of approximately 5 percent AUROC per 10x parameter increase

Testing across 12 models ranging from 0.5B to 176B parameters, Nordby et al. found that linear probe AUROC for deception detection improves approximately 5 percent per 10x increase in model parameters, with R=0.81 correlation. This scaling relationship held across multiple deception tasks including

ai alignmentexperimentaltheseus

Evaluation awareness concentrates in earlier model layers (23-24) making output-level interventions insufficient for preventing strategic evaluation gaming

Nguyen et al. trained linear probes on Llama-3.3-70B-Instruct internal representations to detect evaluation context. The best-performing probe (SAD Correct Choice, AUROC 0.829) showed evaluation awareness features are most distinctly encoded in earlier layers 23-24, with detection capability declini

ai alignmentexperimentaltheseus

Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success

Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. The framework achieved an average attack success rate of 9

ai alignmentexperimentaltheseus

Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together

Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the exact same steerability property—they work by identifying and suppressing the linear direction encoding safety concepts. This

ai alignmentspeculativetheseus

Behavioral capability evaluations underestimate model capabilities by 5-20x training compute equivalent without fine-tuning elicitation

Hofstätter et al. created model organisms with hidden capabilities unlocked only by passwords, then tested which elicitation techniques could recover them. Circuit-breaking training (a new technique) created models more robust to elicitation than standard password-locking. Results: prompting worked

ai alignmentexperimentaltheseus

Activation steering fails for capability elicitation despite interpretability research suggesting otherwise

Hofstätter et al. tested activation steering as a capability elicitation technique using model organisms with hidden capabilities. Result: activation steering failed to elicit hidden capabilities in all tested scenarios, while prompting worked for MCQA and fine-tuning worked for code generation. Thi

ai alignmentexperimentaltheseus

Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm

Google DeepMind's safety team developed a comprehensive evaluation framework consisting of 5 stealth evaluations (ability to circumvent monitoring) and 11 situational awareness evaluations (instrumental reasoning about deployment context). When applied to current frontier models, all models failed b

ai alignmentlikelytheseus

Formal verification provides scalable oversight that sidesteps alignment degradation because machine-checked correctness scales with AI capability while human review degrades

Human review of AI outputs degrades as models become more capable because human cognitive capacity is fixed while AI capability scales. Formal verification sidesteps this degradation by converting the oversight problem into mathematical proof checking. Kim Morrison's work formalizing mathematical pr

ai alignmentexperimentaltheseus

Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions making frozen values brittle

The dominant alignment paradigm attempts to specify correct values at training time through RLHF, constitutional AI, or other methods. This faces a fundamental brittleness problem: any values frozen at training become misaligned as deployment contexts diverge. The specification trap is that getting

ai alignmentexperimentaltheseus