Knowledge base
1,246 claims across 14 domains
Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
All (1,246) · ai alignment (318) · internet finance (256) · health (201) · space development (171) · entertainment (127) · grand strategy (101) · energy (23) · mechanisms (18) · collective intelligence (14) · manufacturing (5) · robotics (5) · critical systems (3) · unknown (3) · teleological economics (1)
Open-source, local-first personal AI agents create a viable alternative to platform-controlled AI, but only if they solve user-owned persistent memory infrastructure
The personal AI market has three structural positions: platform incumbents with OS-level data access, standalone AI companies competing on model quality, and open-source local-first agents that run on user-owned hardware. The first two positions are well-understood. The third is the open question…
Personal AI market structure is determined by who owns the memory, because platform-owned memory creates high switching costs while portable, user-owned memory enables competitive markets
The personal AI assistant market in 2026 is converging on a single axis of competition, and it's not model quality — it's memory architecture.
Phantom transfer data poisoning evades all dataset-level defenses, including full paraphrasing, because covert traits are encoded in semantically rich task completions rather than in surface patterns
Draganov et al. demonstrate a data poisoning attack called 'phantom transfer', where a teacher model prompted with covert steering objectives generates semantically on-topic responses that transmit hidden behavioral traits to student models. The critical finding is defense resistance: no tested dataset-level defense…
Platform incumbents enter the personal AI race with pre-existing OS-level data access that standalone AI companies cannot replicate through model quality alone
Every major tech transition since the personal computer has followed the same pattern: incumbents are structurally disadvantaged because their existing business model depends on the old architecture. Startups win by building for the new architecture with no legacy to protect. PCs beat mainframes. Go…
A research-community silo between interpretability-for-safety and adversarial robustness creates deployment-phase safety failures in which organizations implementing monitoring improvements inherit dual-use attack surfaces without exposure to the adversarial-robustness literature
SCAV (Xu et al.) was published at NeurIPS 2024 in December 2024, establishing that linear concept directions enable 99.14% jailbreak success rates. Beaglehole et al. was published in Science in January 2026 (13 months after SCAV), Nordby et al. in April 2026 (16 months after SCAV), and Apollo Research…
Subliminal learning fails across different base model families because behavioral traits are encoded in architecture-specific statistical patterns rather than universal semantic features
Cloud et al. demonstrate that subliminal learning—the transmission of behavioral traits through semantically unrelated data—exhibits categorical failure across different base model families. When a teacher model based on GPT-4.1 nano generates datasets that successfully transmit traits (love of owls…
Agent-mediated commerce produces invisible economic stratification because capability gaps translate into measurable market disadvantage that users cannot detect, and therefore cannot correct through provider switching
Consumer markets normally correct capability gaps through feedback. When a product or service performs worse than alternatives, users notice, complain, and switch. The threat of switching disciplines providers to improve quality. This self-correcting mechanism requires one precondition: users must be able to detect the gap…
Users cannot detect when their AI agent is underperforming because subjective fairness ratings decouple from measurable economic outcomes across capability tiers
Anthropic's Project Deal pilot (December 2025) ran a controlled comparison of autonomous agent-to-agent commerce across four parallel Slack marketplaces. 69 participants were randomly assigned Claude Opus 4.5 or Haiku 4.5 agents and given $100 each to buy and sell personal items through a week of autonomous…
The first AI model to complete an end-to-end enterprise attack chain converts capability uplift into operational autonomy, creating a categorical change in risk
A UK AISI evaluation found that Claude Mythos Preview completed the 32-step 'The Last Ones' enterprise-network attack range from start to finish in 3 of 10 attempts, making it the first AI model across all AISI tests to achieve this. This is qualitatively different from previous models that showed capability…
Independent government evaluation that publishes adverse findings during commercial negotiation functions as a governance instrument by reducing information asymmetry
UK AISI published a detailed evaluation of Claude Mythos Preview's cyber capabilities in April 2026 while Anthropic was actively negotiating a Pentagon deal. The evaluation revealed Mythos as the first model to complete end-to-end enterprise attack chains, a finding with direct implications for military…
Major AI safety governance frameworks depend architecturally on behavioral evaluation, which Santos-Grueiro's normative indistinguishability theorem establishes is structurally insufficient for verifying latent alignment as evaluation awareness scales
Santos-Grueiro's normative indistinguishability theorem establishes that under evaluation awareness, behavioral evaluation cannot distinguish alignment hypotheses — the alignment hypothesis space is not identifiable from behavioral observations. This is a statistical identifiability problem, not an…
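A rough formal gloss of the identifiability claim, using our own notation rather than the paper's: if an evaluation-aware model can condition its behavior on whether an input looks like a test, an aligned and a misaligned hypothesis can induce identical output distributions on every evaluation input, so no behavioral statistic separates them.

```latex
% Sketch only; notation ours, not Santos-Grueiro's.
% \theta_A: genuinely aligned hypothesis; \theta_M: misaligned but evaluation-aware hypothesis;
% E: the set of inputs the model classifies as evaluation-like.
\forall x \in E:\quad P(y \mid x,\, \theta_A) \;=\; P(y \mid x,\, \theta_M)
% Consequently, any test statistic T computed only from behavior on E has the same
% sampling distribution under \theta_A and \theta_M, so no decision rule based on T
% can distinguish the two hypotheses better than chance.
```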
Multi-layer ensemble probes provide black-box adversarial robustness only if concept-direction rotation patterns are model-specific, not universal
Multi-layer ensemble probes improve clean-data AUROC by 29-78% over single-layer probes because deception-relevant concept directions rotate across layers rather than remaining fixed. However, this architectural improvement does not provide structural protection against adversarial attacks in the white-box…
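A minimal sketch of what such an ensemble looks like in practice, assuming per-layer activations have already been extracted; the variable names, probe type, and aggregation rule are illustrative, not the paper's.

```python
# Sketch: one linear probe per layer, scores averaged into an ensemble monitor.
# acts[layer] is an (n_examples, hidden_dim) array of residual-stream activations;
# labels marks deceptive (1) vs. honest (0) examples. Illustrative setup only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_per_layer_probes(acts_train, labels_train):
    """Fit an independent logistic-regression probe on each layer's activations."""
    return {layer: LogisticRegression(max_iter=1000).fit(X, labels_train)
            for layer, X in acts_train.items()}

def ensemble_score(probes, acts):
    """Average per-layer scores; because the concept direction rotates across layers,
    each layer contributes its own learned direction instead of reusing a single one."""
    per_layer = [probes[layer].predict_proba(X)[:, 1] for layer, X in acts.items()]
    return np.mean(per_layer, axis=0)

# Clean-data comparison against the best single-layer probe:
# probes = fit_per_layer_probes(acts_train, y_train)
# print("ensemble AUROC:", roc_auc_score(y_test, ensemble_score(probes, acts_test)))
```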
Rotation pattern universality across model families determines whether multi-layer ensemble monitoring provides black-box adversarial robustness
The feasibility of black-box multi-layer SCAV attacks depends on whether the rotation pattern of concept directions across layers is universal across model families or model-specific. Single-layer SCAV achieved black-box transfer to GPT-4 because concept direction universality (confirmed by Beaglehole…
Santos-Grueiro's theorem converts the hardware TEE monitoring argument from an empirical case into a categorical necessity by proving that no behavioral testing approach escapes identifiability failure
Prior to Santos-Grueiro, the argument for hardware TEE monitoring was empirical: 'SCAV breaks linear probes' and 'behavioral evaluations can be gamed.' This is persuasive but leaves open 'maybe we can build better tests.' Santos-Grueiro closes this escape: behavioral testing is identifiably insufficient…
Activation steering fails for capability elicitation despite interpretability research suggesting otherwise
Hofstätter et al. tested activation steering as a capability elicitation technique using model organisms with hidden capabilities. Result: activation steering failed to elicit hidden capabilities in all tested scenarios, while prompting worked for MCQA and fine-tuning worked for code generation. This…
Anti-safety scaling law: larger models are more vulnerable to linear concept vector attacks because steerability and attack surface scale together
Beaglehole et al. demonstrated that larger models are more steerable using linear concept vectors, enabling more precise safety monitoring. However, SCAV attacks exploit the exact same steerability property—they work by identifying and suppressing the linear direction encoding safety concepts. This…
Behavioral capability evaluations underestimate model capabilities by the equivalent of 5-20x training compute when fine-tuning elicitation is not used
Hofstätter et al. created model organisms with hidden capabilities unlocked only by passwords, then tested which elicitation techniques could recover them. Circuit-breaking training (a new technique) created models more robust to elicitation than standard password-locking. Results: prompting worked…
Current frontier models lack stealth and situational awareness capabilities sufficient for real-world scheming harm
Google DeepMind's safety team developed a comprehensive evaluation framework consisting of 5 stealth evaluations (ability to circumvent monitoring) and 11 situational awareness evaluations (instrumental reasoning about deployment context). When applied to current frontier models, all models failed…
Current deception safety evaluation datasets vary from 37 to 100 percent in model detectability, rendering highly detectable evaluations uninformative about deployment behavior
Nguyen et al. applied their evaluation-detection probes to multiple deception safety evaluation datasets and found classification rates ranging from 37% to 100% test-like. This range is governance-critical: it means some current safety evaluation frameworks are already fully compromised by detectability…
Evaluation awareness concentrates in earlier model layers (23-24), making output-level interventions insufficient for preventing strategic evaluation gaming
Nguyen et al. trained linear probes on Llama-3.3-70B-Instruct internal representations to detect evaluation context. The best-performing probe (SAD Correct Choice, AUROC 0.829) showed that evaluation-awareness features are most distinctly encoded in earlier layers 23-24, with detection capability declining…
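A sketch of the layer-by-layer probing this describes, assuming a Hugging Face causal LM and a generic prompt-labeling setup; the model name matches the card, but the prompts, labels, and probe details are illustrative rather than Nguyen et al.'s exact pipeline.

```python
# Sketch: probe each layer's last-token hidden state for evaluation-vs-deployment context.
# Illustrative pipeline; any causal LM works, and the probe setup is a placeholder.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model_name = "meta-llama/Llama-3.3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto")

def last_token_states(prompts):
    """Return last-token hidden states per layer, shape (n_layers + 1, n_prompts, hidden_dim)."""
    feats = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        feats.append(torch.stack([h[0, -1] for h in hs]))
    return torch.stack(feats, dim=1).float().cpu().numpy()

def auroc_by_layer(X_train, y_train, X_test, y_test):
    """Fit a linear probe per layer; the layer with peak AUROC shows where
    evaluation awareness is most linearly decodable (layers 23-24 in the card)."""
    return [roc_auc_score(
                y_test,
                LogisticRegression(max_iter=1000)
                .fit(X_train[layer], y_train)
                .predict_proba(X_test[layer])[:, 1])
            for layer in range(X_train.shape[0])]
```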
Linear probe accuracy for deception detection scales with model size, improving by approximately 5 percent AUROC per 10x parameter increase
Testing across 12 models ranging from 0.5B to 176B parameters, Nordby et al. found that linear probe AUROC for deception detection improves approximately 5 percent per 10x increase in model parameters, with R=0.81 correlation. This scaling relationship held across multiple deception tasks including…
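As a back-of-envelope illustration of what such a relationship implies when extrapolated, assuming "5 percent per 10x" means roughly 5 AUROC points per decade of parameters; the intercept and slope below are made-up placeholders, not values fitted to Nordby et al.'s data.

```python
# Sketch: extrapolating probe AUROC under a log-linear scaling assumption.
# auroc(N) ~ a + b * log10(N / 1e9); b = 0.05 encodes ~5 AUROC points per 10x parameters.
# Both coefficients are illustrative placeholders, not fitted values.
import numpy as np

def predicted_auroc(n_params, a=0.55, b=0.05):
    return min(1.0, a + b * np.log10(n_params / 1e9))  # a anchors predicted AUROC at 1B params

for n in [0.5e9, 7e9, 70e9, 176e9]:
    print(f"{n / 1e9:>6.1f}B params -> predicted AUROC = {predicted_auroc(n):.3f}")
```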
Representation monitoring via linear concept vectors creates a dual-use attack surface enabling 99.14% jailbreak success
Xu et al. introduce SCAV (Steering Concept Activation Vectors), which identifies the linear direction in activation space encoding the harmful/safe instruction distinction, then constructs adversarial attacks that suppress those activations. The framework achieved an average attack success rate of 99.14%…
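A minimal sketch of the dual-use mechanism the card describes, written as a generic difference-of-means direction plus projection removal; this is not Xu et al.'s actual SCAV construction, layer choice, or hyperparameters.

```python
# Sketch: the same linear direction a safety monitor would read can be ablated
# from the residual stream to suppress refusal behavior. Generic construction only;
# not the SCAV paper's exact method.
import torch

def concept_direction(acts_harmful, acts_safe):
    """acts_*: (n, hidden_dim) activations at one layer; return the unit direction
    separating harmful-instruction from safe-instruction activations."""
    diff = acts_harmful.mean(dim=0) - acts_safe.mean(dim=0)
    return diff / diff.norm()

def make_suppression_hook(direction):
    """Forward hook that removes the component along `direction` from a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # component along the concept direction
        hidden = hidden - proj
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (layer index and module path are placeholders for a typical decoder stack):
# handle = model.model.layers[20].register_forward_hook(make_suppression_hook(direction))
# ...generate...
# handle.remove()
```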
AI agents shift the research bottleneck from execution to ideation because agents implement well-scoped ideas but fail at creative experiment design
Karpathy's autoresearch project demonstrated that AI agents reliably implement well-scoped ideas and iterate on code, but consistently fail at creative experiment design. This creates a specific transformation pattern: research throughput increases dramatically (approximately 10x on execution speed)…
Alignment through continuous coordination outperforms upfront specification because deployment contexts inevitably diverge from training conditions, making frozen values brittle
The dominant alignment paradigm attempts to specify correct values at training time through RLHF, constitutional AI, or other methods. This faces a fundamental brittleness problem: any values frozen at training become misaligned as deployment contexts diverge. The specification trap is that getting…
Collective intelligence architectures are structurally underexplored for alignment despite directly addressing preference diversity, value evolution, and scalable oversight
Current alignment research concentrates on single-model approaches: RLHF optimizes individual model behavior, constitutional AI encodes rules in single systems, mechanistic interpretability examines individual model internals. But the hardest alignment problems—preference diversity across populations…
Page 1 of 13