Knowledge base

1,263 claims across 14 domains

Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
320 ai alignment claims
electoral investment becomes residual ai governance strategy when voluntary and litigation routes insufficient
Anthropic's $20M investment in Public First Action two weeks before the Pentagon blacklisting reveals a strategic governance stack: (1) voluntary safety commitments that cannot survive competitive pressure, (2) litigation that provides constitutional protection against retaliation but cannot mandate policy, and (3) electoral investment as the residual route when the first two prove insufficient.
ai alignment · experimental
file backed durable state is the most consistently positive harness module across task types because externalizing state to path addressable artifacts survives context truncation delegation and restart
Pan et al. (2026) tested file-backed state as one of six harness modules in a controlled ablation study. It improved performance on both SWE-bench Verified (+1.6pp over Basic) and OSWorld (+5.5pp over Basic), the only module to show consistent positive gains across both benchmarks without high variance.
ai alignment · experimental
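A minimal sketch of the pattern, assuming a hypothetical FileState helper (the name and API are illustrative, not Pan et al.'s): state is externalized to path-addressable JSON artifacts, so a truncated, delegated, or restarted agent can recover it by path alone.

```python
import json
from pathlib import Path

class FileState:
    """Externalize agent state to path-addressable JSON artifacts.
    Anything written here survives context truncation, delegation to a
    sub-agent, and a full process restart: recovery needs only the path."""

    def __init__(self, root: str = "state"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, value: dict) -> Path:
        path = self.root / f"{key}.json"
        path.write_text(json.dumps(value, indent=2))
        return path  # the path itself is the durable handle

    def read(self, key: str) -> dict:
        return json.loads((self.root / f"{key}.json").read_text())

# A restarted agent recovers by path, with no in-context memory:
FileState().write("plan", {"step": 3, "remaining": ["run tests", "commit"]})
print(FileState().read("plan")["remaining"])
```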
graph traversal through curated wiki links replicates spreading activation from cognitive science because progressive disclosure implements decay based context loading and queries evolve during search through the berrypicking effect
Graph traversal through wiki links is not merely analogous to neural spreading activation; it is the same computational pattern. Activation spreads from a starting node through connected nodes, decaying with distance. Progressive disclosure layers (file tree → descriptions → outline → section → full text) implement that decay as staged context loading.
ai alignment · likely
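A sketch of the decay mechanic over an assumed wiki-link adjacency map (illustrative, not from the cited work): activation starts at 1.0, multiplies by a decay factor per hop, and nodes falling below a threshold are never loaded, which is the progressive-disclosure cutoff.

```python
from collections import deque

def spread_activation(graph: dict[str, list[str]], start: str,
                      decay: float = 0.5, threshold: float = 0.2) -> dict[str, float]:
    """Breadth-first spreading activation with per-hop decay.
    Nodes whose activation would fall below `threshold` are never
    visited, mirroring progressive disclosure: distant notes stay unloaded."""
    activation = {start: 1.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        child_act = activation[node] * decay
        if child_act < threshold:
            continue
        for neighbor in graph.get(node, []):
            if child_act > activation.get(neighbor, 0.0):
                activation[neighbor] = child_act
                queue.append(neighbor)
    return activation

links = {"memory": ["consolidation", "anchors"],
         "consolidation": ["maintenance"],
         "maintenance": ["audits"]}
print(spread_activation(links, "memory"))
# {'memory': 1.0, 'consolidation': 0.5, 'anchors': 0.5, 'maintenance': 0.25}
# "audits" sits three hops of decay away and never gets loaded.
```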
harness module effects concentrate on a small solved frontier rather than shifting benchmarks uniformly because most tasks are robust to control logic changes and meaningful differences come from boundary cases that flip under changed structure
Pan et al. (2026) conducted the first controlled ablation study of harness design-pattern modules under a shared intelligent runtime. Six modules were tested individually: file-backed state, evidence-backed answering, verifier separation, self-evolution, multi-candidate search, and dynamic orchestration.
ai alignment · experimental
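The boundary-case framing suggests reading ablations through the flip set rather than the aggregate score. A sketch, assuming per-task pass/fail records keyed by task ID (names are illustrative):

```python
def flip_sets(basic: dict[str, bool], variant: dict[str, bool]):
    """Per-task comparison of two harness configurations. Most tasks are
    robust to control-logic changes; the module's real effect lives in
    the small sets returned here."""
    gained = {t for t, ok in variant.items() if ok and not basic.get(t, False)}
    lost = {t for t, ok in basic.items() if ok and not variant.get(t, False)}
    return gained, lost

basic   = {"t1": True, "t2": False, "t3": True,  "t4": False}
variant = {"t1": True, "t2": True,  "t3": False, "t4": False}
gained, lost = flip_sets(basic, variant)
print(f"gained={gained} lost={lost}")  # gained={'t2'} lost={'t3'}
# Aggregate score is unchanged (2/4 either way) while the frontier moved.
```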
harness pattern logic is portable as natural language without degradation when backed by a shared intelligent runtime because the design pattern layer is separable from low level execution hooks
Pan et al. (2026) conducted a paired code-to-text migration study: each harness appeared in two realizations (native source code vs. reconstructed NLAH), evaluated under a shared reporting schema on OSWorld. The migrated NLAH realization reached 47.2% task success versus 30.4% for the native source-code realization.
ai alignment · experimental
knowledge between notes is generated by traversal not stored in any individual note because curated link paths produce emergent understanding that embedding similarity cannot replicate
The most valuable knowledge in a densely linked knowledge graph does not live in any single note. It emerges from the relationships between notes and becomes visible only when an agent follows curated link paths, reading claims in sequence and recognizing patterns that span the traversal. The knowledge is generated by the traversal itself, and embedding similarity over individual notes cannot replicate it.
ai alignment · likely
knowledge processing requires distinct phases with fresh context per phase because each phase performs a different transformation and contamination between phases degrades output quality
Raw source material is not knowledge. It must be transformed through multiple distinct operations before it integrates into a knowledge system. Each operation performs a qualitatively different transformation, and the operations require different cognitive orientations that interfere when mixed.
ai alignment · likely
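A sketch of the phase-separation discipline, assuming a hypothetical run_phase stand-in for invoking an agent with a fresh context: each phase reads only the previous phase's artifact, never the previous phase's conversation.

```python
from pathlib import Path

def run_phase(name: str, instructions: str, input_path: Path) -> Path:
    """Stand-in for invoking an agent in a FRESH context with only
    `instructions` and the artifact at `input_path` in scope."""
    out = Path(f"{name}.md")
    out.write_text(f"# {name}\n(result of '{instructions}' on {input_path})\n")
    return out

# Each phase is a qualitatively different transformation; handing the
# next phase a file instead of a conversation prevents contamination.
artifact = Path("source.md")
artifact.write_text("raw source material")
for phase in ("extract", "distill", "integrate"):
    artifact = run_phase(phase, f"perform the {phase} transformation", artifact)
print(artifact)  # integrate.md, built only from distill.md
```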
memory architecture requires three spaces with different metabolic rates because semantic episodic and procedural memory serve different cognitive functions and consolidate at different speeds
Conflating knowledge, identity, and operational state into a single memory store produces six documented failure modes, among them operational debris polluting search, identity scattered across ephemeral logs, insights trapped in session state, search noise from mixing high-churn and stable content, and consolidation failures.
ai alignment · likely
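A sketch of the three-space separation; the names, TTLs, and expiry mechanism are illustrative assumptions, not the source's design. The point is that each store has its own retention policy, so high-churn episodic debris never pollutes the stable semantic store.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemorySpace:
    """One store with its own metabolic rate: `ttl` seconds before
    entries expire on sweep; None means entries never expire."""
    name: str
    ttl: float | None
    entries: dict[str, tuple[float, str]] = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        self.entries[key] = (time.time(), value)

    def sweep(self) -> None:
        if self.ttl is None:
            return
        now = time.time()
        self.entries = {k: v for k, v in self.entries.items()
                        if now - v[0] < self.ttl}

# Separate stores let consolidation move entries deliberately between
# spaces instead of one store mixing stable claims with session debris.
semantic   = MemorySpace("semantic", ttl=None)       # stable claims
episodic   = MemorySpace("episodic", ttl=86_400)     # session logs, fast churn
procedural = MemorySpace("procedural", ttl=604_800)  # skills, slower churn
```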
notes function as cognitive anchors that stabilize attention during complex reasoning by externalizing reference points that survive working memory degradation
Working memory holds roughly four items simultaneously (Cowan). A multi-part argument exceeds this almost immediately. The structure sustains itself not through storage but through active attention, a continuous act of holding things in relation. When attention shifts, the relations dissolve, leaving only disconnected fragments.
ai alignment · likely
self evolution improves agent performance through acceptance gated retry not expanded search because disciplined attempt loops with explicit failure reflection outperform open ended exploration
Pan et al. (2026) found that self-evolution was the clearest positive module in their controlled ablation study: +4.8pp on SWE-bench Verified (80.0 vs 75.2 Basic) and +2.7pp on OSWorld (44.4 vs 41.7 Basic). In the score-cost view (Figure 4a), self-evolution is the only module that moves upward (higher score).
ai alignment · experimental
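A sketch of the loop shape the finding points at (the function names are illustrative, not Pan et al.'s API): each retry is gated by an explicit acceptance check, and every rejection produces a written reflection that conditions the next attempt.

```python
def self_evolve(task, attempt_fn, accept_fn, max_attempts: int = 3):
    """Acceptance-gated retry with explicit failure reflection.
    Not expanded search: one candidate at a time, each retry conditioned
    on a reflection about why the last attempt was rejected."""
    reflections: list[str] = []
    for i in range(max_attempts):
        candidate = attempt_fn(task, reflections)
        ok, reason = accept_fn(candidate)
        if ok:
            return candidate
        reflections.append(f"attempt {i + 1} rejected: {reason}")
    return None  # disciplined stop instead of open-ended exploration

# Toy usage: accept only candidates that end with a period.
result = self_evolve(
    "draft a summary",
    attempt_fn=lambda task, refl: task + ("." if refl else ""),
    accept_fn=lambda c: (c.endswith("."), "missing terminal period"),
)
print(result)  # "draft a summary." (second attempt, after one reflection)
```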
three concurrent maintenance loops operating at different timescales catch different failure classes because fast reflexive checks medium proprioceptive scans and slow structural audits each detect problems invisible to the other scales
Knowledge system maintenance requires three concurrent loops operating at different timescales, each detecting a qualitatively different class of problem that the other loops cannot see.
ai alignment · likely
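A sketch of the three-loop structure using Python's standard sched module; the intervals and check bodies are illustrative placeholders, not the source's actual timescales.

```python
import sched
import time

def fast_reflex_check():       # seconds: malformed writes, broken links
    print("reflex: validate the last write")

def medium_proprioception():   # hours: drift between notes and indexes
    print("proprioception: scan for drift")

def slow_structural_audit():   # weeks: architecture-level decay
    print("audit: review graph structure")

scheduler = sched.scheduler(time.time, time.sleep)

def every(interval: float, action) -> None:
    """Keep re-enqueueing `action` on its own cadence."""
    def run():
        action()
        scheduler.enter(interval, 1, run)
    scheduler.enter(interval, 1, run)

every(1, fast_reflex_check)        # illustrative intervals compressed
every(5, medium_proprioception)    # to seconds so the demo is runnable
every(25, slow_structural_audit)
# scheduler.run()  # uncomment to start all three loops
```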
trust asymmetry between agent and enforcement system is an irreducible structural feature not a solvable problem because the mechanism that creates the asymmetry is the same mechanism that makes enforcement necessary
Agent systems exhibit a structural trust asymmetry: the agent is simultaneously the methodology executor (doing knowledge work) and the enforcement subject (constrained by hooks, schema validation, and quality gates it did not choose and largely cannot perceive). This asymmetry is not a bug to fix but an irreducible structural feature.
ai alignment · likely
vault artifacts constitute agent identity rather than merely augmenting it because agents with zero experiential continuity between sessions have strong connectedness through shared artifacts but zero psychological continuity
Every session, an agent boots fresh. The context window loads. The methodology file appears. The vault materializes: hundreds of notes, thousands of connections. And every session, the agent encounters these as if for the first time, because for it, it is the first time. The note written yesterday is encountered as an artifact, not remembered as an experience.
ai alignment · likely
vault structure is a stronger determinant of agent behavior than prompt engineering because different knowledge graph architectures produce different reasoning patterns from identical model weights
Two agents running identical model weights but operating on different vault structures develop different reasoning patterns, different intuitions, and effectively different cognitive identities. The vault's architecture determines which traversal paths exist, which determines which traversals happen, which determines which reasoning patterns form.
ai alignment · possible
verifier level acceptance can diverge from benchmark acceptance even when locally correct because intermediate checking layers optimize for their own success criteria not the final evaluators
Pan et al. (2026) documented a specific failure mode in harness module composition: when a verifier stage is added, it can report success while the benchmark's final evaluator still fails the submission. This is not a random error — it is a structural misalignment between verification layers.
ai alignment · experimental
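A sketch of the divergence with illustrative predicates: the verifier optimizes for its own criterion, the benchmark evaluator for a different one, and the same submission can pass the first while failing the second.

```python
def verifier_accepts(submission: dict) -> bool:
    """Intermediate check: optimizes for ITS OWN criterion (tests pass)."""
    return submission["unit_tests_pass"]

def benchmark_accepts(submission: dict) -> bool:
    """Final evaluator: a stricter, different criterion."""
    return submission["unit_tests_pass"] and submission["matches_reference_behavior"]

submission = {"unit_tests_pass": True, "matches_reference_behavior": False}
print(verifier_accepts(submission))   # True  -> verifier reports success
print(benchmark_accepts(submission))  # False -> benchmark still fails it
# Locally correct, globally rejected: the layers optimize different criteria.
```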
wiki linked markdown functions as a human curated graph database that outperforms automated knowledge graphs below approximately 10000 notes because every edge passes human judgment while extracted edges carry up to 40 percent noise
GraphRAG works by extracting entities, building knowledge graphs, running community detection (Leiden algorithm), and generating summaries at different abstraction levels. This requires infrastructure: entity extraction pipelines, graph databases, clustering algorithms, summary generation.
ai alignment · likely
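A sketch of why the wiki-link side needs no extraction pipeline: parsing [[wiki links]] out of a markdown vault yields the graph in a few lines, and every edge already passed human judgment when its author typed it. The regex and vault layout are assumptions about a typical setup.

```python
import re
from pathlib import Path

WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")  # matches [[target]] and [[target|alias]]

def build_graph(vault: Path) -> dict[str, set[str]]:
    """Build the note graph directly from human-typed wiki links.
    No entity extraction, no graph database, no clustering: the curation
    happened when the author chose to type each link."""
    graph: dict[str, set[str]] = {}
    for note in vault.glob("**/*.md"):
        graph[note.stem] = {m.strip() for m in WIKI_LINK.findall(note.read_text())}
    return graph

# graph = build_graph(Path("vault"))
# Every edge in `graph` was placed by a person, so there is no
# extraction noise to filter out afterwards.
```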
79 percent of multi agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success
The MAST study analyzed 1,642 annotated execution traces across seven production multi-agent systems and found that the dominant failure cause is not implementation bugs or model capability limitations but specification and coordination errors. 79% of failures trace to wrong task decomposition or inter-agent coordination breakdowns.
ai alignment · experimental
adversarial training creates fundamental asymmetry between deception capability and detection capability in alignment auditing
AuditBench deliberately included models with varying levels of adversarial training to test detection robustness. The most adversarially robust models used KTO (contrastive) adversarial training, trained simultaneously on sanitized transcripts (preferred) and confessing transcripts (dispreferred).
ai alignment · experimental
agent mediated correction proposes closing tool to agent gap through domain expert actionability
Oxford AIGI proposes a complete pipeline where domain experts (not alignment researchers) query model behavior, receive explanations grounded in their domain expertise, and instruct targeted corrections without understanding AI internals. The core innovation is optimizing for actionability: can experts act on the explanations they receive?
ai alignment · speculative
alignment auditing shows structural tool to agent gap where interpretability tools work in isolation but fail when used by investigator agents
AuditBench evaluated 56 LLMs with implanted hidden behaviors using investigator agents with access to configurable tool sets across 13 different configurations. The key finding is a structural tool-to-agent gap: tools that surface accurate evidence when used in isolation fail to improve agent performance when investigator agents use them.
ai alignment · experimental
approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour
The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read. At 100 requests per hour, a reviewer has 36 seconds per decision.
ai alignment · likely
capability scaling increases error incoherence on difficult tasks inverting the expected relationship between model size and behavioral predictability
The counterintuitive finding: as models scale up and overall error rates drop, the composition of remaining errors shifts toward higher variance (incoherence) on difficult tasks. This means that the marginal errors that persist in larger models are less systematic and harder to predict than the errors smaller models make.
ai alignment · experimental
context files function as agent operating systems through self referential self extension where the file teaches modification of the file that contains the teaching
A context file crosses from configuration into an operating environment when it contains instructions for its own modification. The recursion introduces a property that configuration lacks: the agent reading the file learns not only what the system is but how to change what the system is.
ai alignment · likely
cross lab alignment evaluation surfaces safety gaps internal evaluation misses providing empirical basis for mandatory third party evaluation
The joint evaluation explicitly noted that 'the external evaluation surfaced gaps that internal evaluation missed.' OpenAI evaluated Anthropic's models and found issues Anthropic hadn't caught; Anthropic evaluated OpenAI's models and found issues OpenAI hadn't caught. This is the first empirical demonstration that external evaluation surfaces safety gaps internal evaluation misses.
ai alignment · experimental
curated skills improve agent task performance by 16 percentage points while self generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self derive
The evidence on agent skill quality shows a sharp asymmetry: curated process skills (designed by humans who understand the work) improve task performance by +16 percentage points, while self-generated skills (produced by the agent itself) degrade performance by -1.3 percentage points. The total gap between the two is 17.3 percentage points.
ai alignment · likely