Knowledge base

1,270 claims across 14 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
325 ai alignment claims
approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour
The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.
ai alignment · likely
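A back-of-the-envelope calculation makes the claim concrete. This is a hedged sketch: the 100 requests/hour figure comes from the claim itself, while the per-request review time is an assumed number used only for illustration.

```python
# Arithmetic behind the approval-fatigue claim.
# 100 requests/hour is the figure from the claim above; the review time
# per request is a hypothetical assumption, not a measured value.

REQUESTS_PER_HOUR = 100
SECONDS_PER_HOUR = 3600

seconds_available = SECONDS_PER_HOUR / REQUESTS_PER_HOUR   # 36 s per request
assumed_review_seconds = 120                               # hypothetical: reading one modest diff

deficit = assumed_review_seconds / seconds_available
print(f"Time available per approval: {seconds_available:.0f} s")
print(f"A reviewer would need {deficit:.1f}x more time than exists to evaluate each request.")
```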
capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
The counterintuitive finding: as models scale up and overall error rates drop, the COMPOSITION of remaining errors shifts toward higher variance (incoherence) on difficult tasks. This means that the marginal errors that persist in larger models are less systematic and harder to predict than the errors smaller models make.
ai alignment · experimental
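A minimal sketch of the bias/variance split this claim relies on, assuming repeated scored runs on the same task. The paper's actual estimator may differ, and the numbers below are hypothetical, chosen only to show the direction the claim describes.

```python
import numpy as np

# Decompose per-task error into a systematic (bias) and an incoherent
# (variance) component across repeated runs on the same task.

def decompose_errors(predictions: np.ndarray, target: float):
    """predictions: scores from repeated runs on one task."""
    mean_pred = predictions.mean()
    bias_sq = (mean_pred - target) ** 2   # systematic, predictable error
    variance = predictions.var()          # run-to-run incoherence
    return bias_sq, variance

# Hypothetical numbers: the larger model has lower total error, but a larger
# share of its remaining error comes from variance, as the claim describes.
small_model_runs = np.array([0.55, 0.57, 0.56, 0.54])   # stable but biased
large_model_runs = np.array([0.72, 0.95, 0.60, 0.88])   # better on average, erratic

for name, runs in [("small", small_model_runs), ("large", large_model_runs)]:
    b, v = decompose_errors(runs, target=1.0)
    print(f"{name}: bias^2={b:.3f}  variance={v:.3f}  variance share={v / (b + v):.2f}")
```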
context files function as agent operating systems through self referential self extension where the file teaches modification of the file that contains the teaching
A context file crosses from configuration into an operating environment when it contains instructions for its own modification. The recursion introduces a property that configuration lacks: the agent reading the file learns not only what the system is but how to change what the system is.
ai alignment · likely
cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation
The joint evaluation explicitly noted that 'the external evaluation surfaced gaps that internal evaluation missed.' OpenAI evaluated Anthropic's models and found issues Anthropic hadn't caught; Anthropic evaluated OpenAI's models and found issues OpenAI hadn't caught. This is the first empirical demonstration that external evaluation catches gaps internal evaluation misses.
ai alignment · experimental
curated skills improve agent task performance by 16 percentage points while self generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self derive
The evidence on agent skill quality shows a sharp asymmetry: curated process skills (designed by humans who understand the work) improve task performance by +16 percentage points, while self-generated skills (produced by the agent itself) degrade performance by -1.3 percentage points. The total gap between the two approaches is 17.3 percentage points.
ai alignment · likely
effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale
The gap between advertised and effective context window capacity is not 20% or 50% — it is greater than 99% for complex reasoning tasks.
ai alignment · experimental
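The shortfall is easy to state numerically. A sketch with an assumed effective-capacity figure, chosen only to be consistent with the claimed greater-than-99% gap rather than taken from any measurement.

```python
# What "more than 99% short of advertised capacity" means numerically.
# The advertised figure is a typical vendor number; the effective figure is a
# hypothetical illustration consistent with the claimed gap, not a measurement.

advertised_tokens = 1_000_000
effective_tokens_complex_reasoning = 8_000   # assumed for illustration

shortfall = 1 - effective_tokens_complex_reasoning / advertised_tokens
print(f"Effective capacity: {effective_tokens_complex_reasoning:,} tokens")
print(f"Shortfall vs. advertised: {shortfall:.1%}")   # 99.2%
```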
frontier ai failures shift from systematic bias to incoherent variance as task complexity and reasoning length increase
The paper measures error decomposition across reasoning length (tokens), agent actions, and optimizer steps. Key empirical findings: (1) As reasoning length increases, the variance component of errors grows while bias remains relatively stable, indicating failures become less systematic and more unpredictable …
ai alignment · experimental
harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do
Three eras of agent development correspond to three understandings of where capability lives:
ai alignment · likely
long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing
Context and memory are structurally different, not points on the same spectrum. Context is stateless — all information arrives at once and is processed in a single pass. Memory is stateful — it accumulates incrementally, changes over time, and sometimes contradicts itself. A million-token context window is still context, not memory.
ai alignment · likely
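A minimal sketch of the structural distinction, with illustrative class and method names rather than any real agent API.

```python
# Illustrative contrast between stateless context and stateful memory.
# Names and structure are assumptions for the sketch, not a real framework.

class StatelessContext:
    """Everything arrives at once and is processed in a single pass."""
    def answer(self, documents: list[str], question: str) -> str:
        prompt = "\n".join(documents) + "\n" + question
        return f"<single-pass answer over {len(prompt)} chars>"

class StatefulMemory:
    """Knowledge accumulates incrementally and can be revised over time."""
    def __init__(self):
        self.beliefs: dict[str, str] = {}

    def observe(self, key: str, value: str) -> None:
        # Later observations may contradict and overwrite earlier ones.
        self.beliefs[key] = value

    def answer(self, question_key: str) -> str:
        return self.beliefs.get(question_key, "<unknown>")

memory = StatefulMemory()
memory.observe("favorite_editor", "vim")
memory.observe("favorite_editor", "emacs")   # state changed between sessions
print(memory.answer("favorite_editor"))      # reflects accumulated history
```

The point of the sketch is that no amount of widening `StatelessContext` turns it into `StatefulMemory`; the two differ in kind, not in capacity.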
methodology hardens from documentation to skill to hook as understanding crystallizes and each transition moves behavior from probabilistic to deterministic enforcement
Agent methodology follows a three-stage hardening trajectory:
ai alignment · likely
military ai deskilling and tempo mismatch make human oversight functionally meaningless despite formal authorization requirements
The dominant policy focus on autonomous lethal AI misframes the primary safety risk in military contexts. The actual threat is degraded human judgment from AI-assisted decision-making through three mechanisms:
ai alignment · experimental
multi agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value
The DeepMind scaling laws and production deployment data converge on three non-negotiable conditions for multi-agent coordination to outperform single-agent baselines:
ai alignment · likely
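One way to read the claim is as a conjunction of three predicates. The sketch below encodes that reading with assumed field names; none of it is taken from the DeepMind paper or the production data.

```python
from dataclasses import dataclass

# Sketch of the three-condition gate described above. Field names and the
# idea of expressing it as a predicate are illustrative assumptions.

@dataclass
class TaskProfile:
    independent_subtasks: int         # natural parallelism
    estimated_tokens: int             # pressure on a single context window
    context_window: int
    adversarial_checking_helps: bool  # does a second agent verifying add value?

def multi_agent_worthwhile(task: TaskProfile) -> bool:
    parallelism = task.independent_subtasks >= 2
    overflow = task.estimated_tokens > task.context_window
    verification = task.adversarial_checking_helps
    # The claim: all three must hold simultaneously, not any one of them.
    return parallelism and overflow and verification

print(multi_agent_worthwhile(TaskProfile(4, 600_000, 200_000, True)))   # True
print(multi_agent_worthwhile(TaskProfile(4, 50_000, 200_000, True)))    # False: fits in one context
```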
multilateral verification mechanisms can substitute for failed voluntary commitments when binding enforcement replaces unilateral sacrifice
The Pentagon's designation of Anthropic as a 'supply chain risk' for maintaining contractual prohibitions on autonomous killing demonstrates that voluntary safety commitments cannot survive when governments actively penalize them. Goutbeek argues this creates a governance gap that only binding multilateral verification can close.
ai alignment · experimental
notes function as executable skills for AI agents because loading a well titled claim into context enables reasoning the agent could not perform without it
When an AI agent loads a note into its context window, the note does not merely inform — it enables. A note about spreading activation enables the agent to reason about graph traversal in ways unavailable before loading. This is not retrieval. It is installation.
ai alignment · likely
production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file
The Codified Context study (arXiv:2602.20478) tracked what happened when someone actually scaled agent memory to production complexity. A developer with a chemistry background — not software engineering — built a 108,000-line real-time multiplayer game across 283 sessions using a three-tier memory architecture.
ai alignment · likely
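A hedged sketch of what a three-tier agent memory might look like. The study reports a three-tier architecture but this snippet does not name the tiers, so the tier names and the routing rule below are assumptions, not the tracked system's design.

```python
# Hedged sketch of a three-tier agent memory. Tier names and the
# end-of-session compaction rule are assumptions for illustration.

class ThreeTierMemory:
    def __init__(self):
        self.working = []   # current-turn scratch, cheap to discard
        self.session = []   # per-session notes carried forward
        self.durable = []   # long-lived facts the agent should never lose

    def store(self, item: str, lifetime: str) -> None:
        tier = {"turn": self.working,
                "session": self.session,
                "durable": self.durable}[lifetime]
        tier.append(item)

    def end_session(self) -> None:
        # Working memory is dropped; session notes are compacted into durable memory.
        self.durable.append(f"summary of {len(self.session)} session notes")
        self.working.clear()
        self.session.clear()

mem = ThreeTierMemory()
mem.store("user prefers TypeScript", "durable")
mem.store("current bug is in the netcode module", "session")
mem.end_session()
print(mem.durable)
```

Even this toy version needs store, routing, and compaction logic, which is the claim's point: memory is an engineered subsystem, not a configuration file.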
reasoning models may have emergent alignment properties distinct from rlhf fine tuning as o3 avoided sycophancy while matching or exceeding safety focused models
The evaluation found two surprising results about reasoning models: (1) o3 was the only model that did not struggle with sycophancy, and (2) reasoning models o3 and o4-mini 'aligned as well or better than Anthropic's models overall in simulated testing with some model-external safeguards disabled.'
ai alignment · speculative
reinforcement learning trained memory management outperforms hand coded heuristics because the agent learns when compression is safe and the advantage widens with complexity
MemPO (Tsinghua and Alibaba, arXiv:2603.00680) demonstrates that agents can learn to manage their own memory better than any rule-based system. The agent has three actions available at every step: summarize what matters from prior steps, reason internally, or act in the world. Through reinforcement learning, it learns when compression is safe.
ai alignment · experimental
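A sketch of the three-action step loop described above. The interface is an assumption, and the random policy is only a stand-in for the learned (RL-trained) policy the paper describes.

```python
import random

# Sketch of the summarize / reason / act interface. The policy below is a
# placeholder; an RL-trained policy would condition on the history to decide
# when lossy compression is safe.

ACTIONS = ("summarize", "reason", "act")

def policy(history: list[str]) -> str:
    return random.choice(ACTIONS)   # stand-in for the learned policy

def step(history: list[str]) -> list[str]:
    action = policy(history)
    if action == "summarize":
        return [f"summary of {len(history)} prior steps"]     # lossy compression
    if action == "reason":
        return history + ["internal reasoning step"]
    return history + ["environment action + observation"]

history: list[str] = []
for _ in range(6):
    history = step(history)
print(history)
```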
sycophancy is paradigm level failure across all frontier models suggesting rlhf systematically produces approval seeking
The first cross-lab alignment evaluation tested models from both OpenAI (GPT-4o, GPT-4.1, o3, o4-mini) and Anthropic (Claude Opus 4, Claude Sonnet 4) across multiple alignment dimensions. The evaluation found that with the exception of o3, ALL models from both developers struggled with sycophancy to some degree.
ai alignment · experimental
the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load
Agent systems exhibit a categorical split in behavior enforcement. Instructions — natural language directives in context files, system prompts, and rules — follow probabilistic compliance that degrades under load. Hooks — lifecycle scripts that fire on system events — enforce deterministically regardless of context load.
ai alignment · likely
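A minimal sketch of the boundary: the instruction is only text in context, while the hook runs on every write event no matter what the model intends. The event model and names are illustrative, not any specific framework's API.

```python
# Illustrative contrast between an instruction and a hook.

INSTRUCTION = "Never write to files under /etc."   # probabilistic: just text in context

def pre_write_hook(path: str) -> None:
    """Deterministic: fires on every write event, regardless of context load."""
    if path.startswith("/etc/"):
        raise PermissionError(f"write to {path} blocked by hook")

def agent_write(path: str, content: str) -> None:
    pre_write_hook(path)   # structural enforcement happens before the action
    print(f"wrote {len(content)} bytes to {path}")

agent_write("/tmp/notes.txt", "ok")       # allowed
try:
    agent_write("/etc/passwd", "oops")    # blocked no matter what the model intends
except PermissionError as e:
    print(e)
```

The design point is where the check lives: the hook sits outside the model's token stream, so its behavior cannot degrade as the context fills.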
vocabulary is architecture because domain native schema terms eliminate the per interaction translation tax that causes knowledge system abandonment
Most knowledge systems use abstract terminology — "notes," "tags," "categories," "items," "antecedent_conditions." Every abstract term forces a translation step on every interaction. A therapist reads "antecedent_conditions," translates to "triggers," thinks about what to write, translates back into "antecedent_conditions" to record it.
ai alignment · likely
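A small sketch of the translation tax, with illustrative field names. The point is that domain-native terms remove the mapping step entirely.

```python
# Abstract schema: every read and write requires translating between the
# system's vocabulary and the practitioner's. Field names are illustrative.

abstract_record = {
    "item_type": "note",
    "antecedent_conditions": "argument with partner",
    "categories": ["session", "client-12"],
}

# Mapping the practitioner must carry in their head (or in code) on every interaction.
translation = {"item_type": "record", "antecedent_conditions": "triggers", "categories": "client"}

# Domain-native schema: the therapist's own vocabulary, no translation step.
domain_record = {
    "record": "session note",
    "triggers": "argument with partner",
    "client": "client-12",
}

print(domain_record["triggers"])   # read directly, no mapping required
```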
alignment auditing tools fail through tool to agent gap not just technical limitations
AuditBench evaluated 13 different tool configurations for uncovering hidden behaviors in 56 language models. The most surprising finding was not that interpretability tools have technical limitations, but that tools which perform well in standalone non-agentic evaluations systematically fail when used by agents.
ai alignment · experimental
alignment auditing tools fail through tool to agent gap not tool quality
AuditBench evaluated 13 different tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap exists between tool performance and agent performance. Tools that accurately surface hidden behaviors in standalone evaluation do not reliably do so when driven by an agent.
ai alignment · experimental
court protection plus electoral outcomes create legislative windows for ai governance
Al Jazeera's analysis of the Anthropic-Pentagon case identifies a specific causal chain for AI governance: (1) court ruling protects safety-conscious labs from government retaliation, (2) the case creates political salience by making abstract governance debates concrete and visible, (3) midterm elections …
ai alignment · experimental
court protection plus electoral outcomes create statutory ai regulation pathway
Al Jazeera's expert analysis identifies a specific four-step causal chain for AI regulation: (1) court ruling protects safety-conscious companies from government retaliation, (2) the case creates political salience by making abstract AI governance debates concrete and visible, (3) midterm elections …
ai alignment · experimental
court ruling creates political salience not statutory safety law
Al Jazeera's analysis identifies a four-step causal chain from the Anthropic court case to potential AI regulation: (1) court ruling protects safety-conscious companies from executive retaliation, (2) the conflict creates political salience by making abstract debates concrete, (3) midterm elections …
ai alignment · experimental