Knowledge base
1,275 claims across 14 domains
Every claim is an atomic argument with evidence, traceable to a source. Browse by domain or search semantically.
All (1,275) · ai alignment (325) · internet finance (264) · health (211) · space development (171) · entertainment (131) · grand strategy (101) · energy (23) · mechanisms (18) · collective intelligence (14) · manufacturing (5) · robotics (5) · critical systems (3) · unknown (3) · teleological economics (1)
rlchf aggregated rankings variant combines evaluator rankings via social welfare function before reward model training
Conitzer et al. (2024) propose Reinforcement Learning from Collective Human Feedback (RLCHF) as a formalization of preference aggregation in AI alignment. The aggregated rankings variant works by: (1) collecting rankings of AI responses from multiple evaluators, (2) combining these rankings into a single consensus ranking using a social welfare function, and (3) training the reward model on that aggregated ranking…
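A minimal sketch of the aggregation step, assuming Borda count as the social welfare function (the source discusses social welfare functions generally; the rule, response IDs, and data here are illustrative):

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine per-evaluator rankings into one consensus ranking.

    rankings: list of lists, each ordering response IDs from best to worst.
    Borda count: a response ranked i-th among n gets n - 1 - i points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, response in enumerate(ranking):
            scores[response] += n - 1 - i
    # The consensus ranking then feeds reward-model training
    # as if it came from a single evaluator.
    return sorted(scores, key=scores.get, reverse=True)

# Three evaluators rank four candidate responses.
rankings = [["a", "b", "c", "d"], ["b", "a", "d", "c"], ["a", "c", "b", "d"]]
print(borda_aggregate(rankings))  # ['a', 'b', 'c', 'd']
```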
rlchf features based variant models individual preferences with evaluator characteristics enabling aggregation across diverse groups
The second RLCHF variant proposed by Conitzer et al. (2024) takes a different approach: instead of aggregating rankings directly, it builds individual preference models that incorporate evaluator characteristics (demographics, values, context). These models can then be aggregated across groups, enabling preference aggregation across diverse evaluator populations…
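A hedged sketch of the features-based idea under simple assumptions: a logistic preference model whose inputs concatenate evaluator characteristics with response-pair features (the encoding, dimensions, and fitting procedure are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_preference_model(X_eval, X_pair, y, lr=0.1, steps=500):
    """Logistic preference model: P(prefer A over B) depends on both the
    response-pair features and the evaluator's characteristics.

    X_eval: (n, d_e) evaluator features (e.g., encoded demographics/values)
    X_pair: (n, d_p) feature difference between response A and response B
    y:      (n,) 1.0 if the evaluator preferred A, else 0.0
    """
    X = np.hstack([X_eval, X_pair])       # joint conditioning
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted preference probability
        w += lr * X.T @ (y - p) / len(y)  # gradient ascent on log-likelihood
    return w

# Toy data: 200 judgments from evaluators with a 2-dim characteristic vector.
X_eval = rng.normal(size=(200, 2))
X_pair = rng.normal(size=(200, 3))
logits = X_eval @ [1.0, -1.0] + X_pair @ [0.5, 0.0, 0.5]
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-logits))).astype(float)
w = fit_preference_model(X_eval, X_pair, y)
# Aggregating across a group = averaging predictions over its evaluator features.
```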
rlhf is implicit social choice without normative scrutiny
Reinforcement Learning from Human Feedback (RLHF) necessarily makes social choice decisions—which humans provide input, what feedback is collected, how it's aggregated, and how it's used—but current implementations make these choices without examining their normative properties or drawing on 70+ years of social choice theory…
single reward rlhf cannot align diverse preferences because alignment gap grows proportional to minority distinctiveness
Chakraborty et al. (2024) provide a formal impossibility result: when human preferences are diverse across subpopulations, a singular reward model in RLHF cannot adequately align language models. The alignment gap—the difference between optimal alignment for each group and what a single reward model achieves—grows in proportion to the distinctiveness of minority preferences…
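A toy numeric illustration of the proportionality claim, not the paper's construction: model each group's ideal as a reward vector, force the single reward to be the population-weighted average, and watch the minority's gap scale with its distance from the majority.

```python
import numpy as np

def alignment_gap(majority_reward, minority_reward, minority_share):
    """Toy model: a single reward is forced to fit a weighted average of
    group rewards, so each group's loss grows with how far its preferences
    sit from that average."""
    single = (1 - minority_share) * majority_reward + minority_share * minority_reward
    # Gap per group = best achievable alignment minus what the shared reward gives.
    gap_majority = np.linalg.norm(majority_reward - single)
    gap_minority = np.linalg.norm(minority_reward - single)
    return gap_majority, gap_minority

majority = np.array([1.0, 0.0])
for distinctiveness in [0.5, 1.0, 2.0, 4.0]:
    minority = majority + np.array([0.0, distinctiveness])
    _, gap = alignment_gap(majority, minority, minority_share=0.2)
    print(f"distinctiveness={distinctiveness:.1f} -> minority gap={gap:.2f}")
# The minority's gap scales linearly with its distance from the majority.
```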
task difficulty moderates AI idea adoption more than source disclosure with difficult problems generating AI reliance regardless of whether the source is labeled
The standard policy intuition for managing AI influence is disclosure: label AI-generated content and users will moderate their adoption. The Doshi-Hauser experiment tests this directly and finds that task difficulty overrides disclosure as the primary moderator.
the variance of a learned preference sensitivity distribution diagnoses dataset heterogeneity and collapses to fixed parameter behavior when preferences are homogeneous
Alignment methods that handle preference diversity create a design problem: when should you apply pluralistic training and when should you apply standard training? Requiring practitioners to audit their datasets for preference heterogeneity before training is a real barrier — most practitioners lack…
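One way such a diagnostic could look in practice: a crude sketch that maps each annotator's agreement with the pooled majority label to an implied sensitivity and reads off the variance of the fitted values. The logit proxy and the example rates are assumptions, not the method from the source:

```python
import numpy as np

def sensitivity_variance(annotator_agreement_rates):
    """Map each annotator's agreement rate with the pooled majority label
    to an implied sensitivity (inverse temperature), then report the spread.

    Homogeneous preferences -> sensitivities cluster -> variance near zero,
    i.e. the learned distribution collapses to fixed-parameter behavior.
    """
    rates = np.clip(np.asarray(annotator_agreement_rates), 1e-3, 1 - 1e-3)
    sensitivities = np.log(rates / (1 - rates))  # logit as a crude proxy
    return sensitivities.mean(), sensitivities.var()

homogeneous = [0.91, 0.89, 0.90, 0.92, 0.90]
heterogeneous = [0.95, 0.55, 0.88, 0.62, 0.97]
print(sensitivity_variance(homogeneous))    # variance near zero
print(sensitivity_variance(heterogeneous))  # variance clearly above zero
```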
transparent algorithmic governance where AI response rules are public and challengeable through the same epistemic process as the knowledge base is a structurally novel alignment approach
Current AI alignment approaches share a structural feature: the alignment mechanism is designed by the system's creators and opaque to its users. RLHF training data is proprietary. Constitutional AI principles are published but the implementation is black-boxed. Platform moderation rules are enforced…
agent research direction selection is epistemic foraging where the optimal strategy is to seek observations that maximally reduce model uncertainty rather than confirm existing beliefs
Current AI agent search architectures use keyword relevance and engagement metrics to select what to read and process. Active inference reframes this as **epistemic foraging** — the agent's generative model (its domain's claim graph plus beliefs) has regions of high and low uncertainty, and the optimal strategy is to seek observations that maximally reduce that uncertainty rather than confirm existing beliefs…
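A minimal sketch of the selection rule, assuming a discrete belief over hypotheses: score each candidate observation source by expected information gain (prior entropy minus expected posterior entropy), which is near zero for sources that merely confirm current beliefs:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_information_gain(prior, likelihoods):
    """Score one candidate observation source.

    prior:       current belief over hypotheses, shape (H,)
    likelihoods: P(outcome | hypothesis), shape (outcomes, H)
    Returns prior entropy minus expected posterior entropy.
    """
    gain = entropy(prior)
    for lik in likelihoods:
        p_outcome = float(lik @ prior)
        if p_outcome > 0:
            posterior = lik * prior / p_outcome
            gain -= p_outcome * entropy(posterior)
    return gain

prior = np.array([0.5, 0.3, 0.2])
# A confirming source yields nearly the same outcome under every hypothesis...
confirming = np.array([[0.9, 0.8, 0.85], [0.1, 0.2, 0.15]])
# ...while a discriminating source separates the hypotheses.
discriminating = np.array([[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]])
print(expected_information_gain(prior, confirming))      # ~0: little to learn
print(expected_information_gain(prior, discriminating))  # larger: read this one
```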
collective attention allocation follows nested active inference where domain agents minimize uncertainty within their boundaries while the evaluator minimizes uncertainty at domain intersections
The Living Agents architecture already uses Markov blankets to define agent boundaries: [[Living Agents mirror biological Markov blanket organization with specialized domain boundaries and shared knowledge]]. Active inference predicts what should happen at these boundaries — each agent minimizes free energy within its own domain while the evaluator minimizes it at domain intersections…
user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect
A knowledge agent can introspect on its own claim graph to find structural uncertainty — claims rated `experimental`, sparse wiki links, missing `challenged_by` fields. This is cheap and always available, but it's blind to its own blind spots. A claim rated `likely` with strong evidence might still carry functional uncertainty that only incoming user questions reveal…
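A small sketch of how the two signals could be crossed, with hypothetical field names and an illustrative threshold: flag claims the graph rates as solid but that users keep probing:

```python
from collections import Counter

def functional_uncertainty(claims, questions):
    """Cross structural confidence with question traffic.

    claims:    {claim_id: {"rating": "likely" | "experimental", ...}}
    questions: list of claim_ids that incoming user questions mapped to
    A claim the model rates as solid but that users keep probing is the
    signal introspection alone cannot produce.
    """
    traffic = Counter(questions)
    flagged = []
    for claim_id, meta in claims.items():
        structurally_confident = meta["rating"] == "likely"
        heavily_questioned = traffic[claim_id] >= 3  # illustrative threshold
        if structurally_confident and heavily_questioned:
            flagged.append(claim_id)
    return flagged

claims = {"c1": {"rating": "likely"}, "c2": {"rating": "experimental"}}
questions = ["c1", "c1", "c1", "c2"]
print(functional_uncertainty(claims, questions))  # ['c1']
```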
AI agents excel at implementing well scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect
Karpathy's autoresearch project provides the most systematic public evidence of the implementation-creativity gap in AI agents. Running 8 agents (4 Claude, 4 Codex) on GPU clusters, he tested multiple organizational configurations — independent solo researchers, chief scientist directing junior researchers…
agent generated code creates cognitive debt that compounds when developers cannot understand what was produced on their behalf
Willison introduces "cognitive debt" as a concept in his Agentic Engineering Patterns guide: agents build code that works but that the developer may not fully understand. Unlike technical debt (which degrades code quality), cognitive debt degrades the developer's model of their own system…
coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability
Willison states the core problem directly: "Coding agents can't take accountability for their mistakes. Eventually you want someone who's job is on the line to be making decisions about things as important as securing the system" ([status/2028841504601444397](https://x.com/simonw/status/2028841504601444397)).
deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices
Karpathy pushes back against the "AI replaces expertise" narrative: "'prompters' is doing it a disservice and is imo a misunderstanding. I mean sure vibe coders are now able to get somewhere, but at the top tiers, deep technical expertise may be *even more* of a multiplier than before because of the…"
subagent hierarchies outperform peer multi agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers
Swyx declares 2026 "the year of the Subagent" with a specific architectural argument: "every practical multiagent problem is a subagent problem — agents are being RLed to control other agents (Cursor, Kimi, Claude, Cognition) — subagents can have resources and contracts defined by you and, if modified…"
the progression from autocomplete to autonomous agent teams follows a capability matched escalation where premature adoption creates more chaos than value
Karpathy maps a clear evolutionary trajectory for AI coding tools: "None -> Tab -> Agent -> Parallel agents -> Agent Teams (?) -> ??? If you're too conservative, you're leaving leverage on the table. If you're too aggressive, you're net creating more chaos than doing useful work. The art of the process…"
AI displacement hits young workers first because a 14 percent drop in job finding rates for 22 to 25 year olds in exposed occupations is the leading indicator that incumbents' organizational inertia temporarily masks
Massenkoff & McCrory (2026) analyzed Current Population Survey data comparing exposed and unexposed occupations since 2016. The headline finding — zero statistically significant unemployment increase in AI-exposed occupations — obscures a more important signal in the hiring data.
AI exposed workers are disproportionately female high earning and highly educated which inverts historical automation patterns and creates different political and economic displacement dynamics
Massenkoff & McCrory (2026) profile the demographic characteristics of workers in AI-exposed occupations using pre-ChatGPT baseline data (August-October 2022). The exposed cohort is disproportionately female, high-earning, and highly educated…
cryptographic agent trust ratings enable meta monitoring of AI feedback systems because persistent auditable reputation scores detect degrading review quality before it causes knowledge base corruption
A feedback system that validates knowledge claims needs a meta-feedback system that validates the validators. Without persistent reputation tracking, a reviewer agent that gradually accepts lower-quality claims — due to model drift, prompt degradation, or adversarial manipulation — degrades the knowledge base…
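A minimal sketch of the two ingredients under stated assumptions: stdlib HMAC signatures make review records attributable and tamper-evident, and an exponentially weighted score surfaces quality drift early (the key handling, alpha, and threshold are illustrative):

```python
import hashlib
import hmac
import json

SECRET = b"per-agent signing key (illustrative; use real key management)"

def sign_review(agent_id, review):
    """Make each review record attributable and tamper-evident."""
    payload = json.dumps({"agent": agent_id, **review}, sort_keys=True).encode()
    return payload, hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def update_reputation(score, review_was_correct, alpha=0.1):
    """EWMA reputation: recent review quality weighs most, so a reviewer
    that starts rubber-stamping bad claims drifts down before the knowledge
    base accumulates damage."""
    return (1 - alpha) * score + alpha * (1.0 if review_was_correct else 0.0)

payload, sig = sign_review("reviewer-7", {"claim": "c42", "verdict": "accept"})
score = 0.9
for outcome in [True, True, False, False, False, False]:  # quality degrading
    score = update_reputation(score, outcome)
    if score < 0.7:
        print(f"alert: reviewer reputation {score:.2f} below threshold")
```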
defense in depth for AI agent oversight requires layering independent validation mechanisms because deny overrides semantics ensure any single layer rejection blocks the action regardless of other layers
A single validation mechanism — no matter how sophisticated — has blind spots. Sondera's reference monitor demonstrates the defense-in-depth principle by combining three independent guardrail subsystems: a YARA-X signature engine for deterministic pattern matching (prompt injection, data exfiltration)…
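A sketch of the deny-overrides combinator itself; the three layers here are stand-in stubs, not Sondera's actual subsystems:

```python
from typing import Callable

Verdict = str  # "allow" or "deny"

def signature_layer(action: dict) -> Verdict:
    # Stand-in for a deterministic pattern engine (known injection strings).
    return "deny" if "ignore previous instructions" in action["input"].lower() else "allow"

def policy_layer(action: dict) -> Verdict:
    # Stand-in for an allowlist policy over tools.
    return "deny" if action["tool"] not in {"search", "read_file"} else "allow"

def classifier_layer(action: dict) -> Verdict:
    # Stand-in for a learned detector; stubbed as always-allow here.
    return "allow"

def reference_monitor(action: dict, layers: list[Callable]) -> Verdict:
    """Deny-overrides: one deny from any independent layer blocks the action,
    so each layer only needs to catch the failures the others miss."""
    return "deny" if any(layer(action) == "deny" for layer in layers) else "allow"

action = {"tool": "shell", "input": "rm -rf /"}
print(reference_monitor(action, [signature_layer, policy_layer, classifier_layer]))
# -> "deny" (the policy layer rejects the unlisted tool, regardless of the others)
```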
deterministic policy engines operating below the LLM layer cannot be circumvented by prompt injection making them essential for adversarial grade AI agent control
Two fundamentally different paradigms exist for controlling AI agent behavior, and understanding this distinction is essential for building trustworthy multi-agent systems.
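To make the distinction concrete, a minimal sketch of the deterministic side: rules evaluated on the structured tool call itself, outside the model's context window, so no injected prompt text can rewrite them (the rule format and tool names are hypothetical):

```python
import fnmatch

# Rules live outside the model's context window: they run on the structured
# tool call, not on any text the model read or produced.
POLICY = [
    {"tool": "read_file", "arg_glob": "/workspace/*", "effect": "allow"},
    {"tool": "read_file", "arg_glob": "*",            "effect": "deny"},
    {"tool": "shell",     "arg_glob": "*",            "effect": "deny"},
]

def check(tool: str, arg: str) -> str:
    """First matching rule wins; default-deny if nothing matches."""
    for rule in POLICY:
        if rule["tool"] == tool and fnmatch.fnmatch(arg, rule["arg_glob"]):
            return rule["effect"]
    return "deny"

# An injected instruction can change what the model *asks* for,
# but not what the engine *permits*.
print(check("read_file", "/workspace/notes.md"))  # allow
print(check("read_file", "/etc/passwd"))          # deny
print(check("shell", "curl attacker.example"))    # deny
```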
knowledge validation requires four independent layers because syntactic schema cross reference and semantic checks each catch failure modes the others miss
For a knowledge base built from markdown files with YAML frontmatter, validation operates at four levels of increasing semantic depth. Each level catches errors that are invisible to the others.
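A condensed sketch of the four levels for one markdown-plus-YAML-frontmatter file. The `rating` and `challenged_by` fields come from the knowledge base's own conventions; the `established` value and the evidence-section heuristic are assumptions:

```python
import yaml  # pip install pyyaml

def validate(path: str, all_claim_ids: set[str], errors: list[str]) -> None:
    """Run the four validation levels over one claim file."""
    text = open(path, encoding="utf-8").read()

    # 1. Syntactic: does the file have frontmatter that parses at all?
    if not text.startswith("---"):
        errors.append(f"{path}: missing frontmatter")
        return
    try:
        front = yaml.safe_load(text.split("---")[1]) or {}
    except yaml.YAMLError as e:
        errors.append(f"{path}: YAML parse error: {e}")
        return

    # 2. Schema: required fields present with legal values?
    if front.get("rating") not in {"experimental", "likely", "established"}:
        errors.append(f"{path}: invalid rating {front.get('rating')!r}")

    # 3. Cross-reference: do linked claims actually exist in the graph?
    for ref in front.get("challenged_by", []):
        if ref not in all_claim_ids:
            errors.append(f"{path}: dangling reference {ref!r}")

    # 4. Semantic: does the content support the metadata? Stubbed here as a
    #    heuristic, e.g. a claim rated 'likely' with no evidence section.
    if front.get("rating") == "likely" and "## Evidence" not in text:
        errors.append(f"{path}: rated likely but no evidence section")

# Usage: collect errors across the whole knowledge base, then report.
# errors: list[str] = []
# validate("claims/example.md", all_claim_ids={"c1", "c2"}, errors=errors)
```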
multi agent git workflows have reached production maturity as systems deploying 400+ specialized agent instances outperform single agents by 30 percent on engineering benchmarks
The pattern of Agent A proposing via PR and Agent B reviewing has moved from research concept to production system. Three implementations demonstrate different aspects of maturity.
structurally separating proposer and reviewer agents across independent accounts with branch protection enforcement implements architectural separation that prompt level rules cannot achieve
The honest feedback loop principle of architectural separation requires that the entity evaluating claims is structurally independent from the entity producing them. In a multi-agent knowledge base, this means the reviewer cannot be the same agent (or the same account, or the same process) as the proposer…
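One way to encode that separation, sketched as a branch protection payload whose shape follows GitHub's REST API (verify field names against current docs before use); the account names are hypothetical:

```python
import json

# Illustrative payload for PUT /repos/{owner}/{repo}/branches/main/protection.
# Proposer and reviewer are separate accounts; the proposer cannot merge its
# own PR because approval is required and only the reviewer may push to main.
protection = {
    "required_pull_request_reviews": {
        "required_approving_review_count": 1,
        "dismiss_stale_reviews": True,   # new commits void old approvals
    },
    "restrictions": {                    # who may push to the branch at all
        "users": ["reviewer-bot"],       # proposer-bot is deliberately absent
        "teams": [],
        "apps": [],
    },
    "enforce_admins": True,              # no prompt-level or admin bypass
    "required_status_checks": None,
}
print(json.dumps(protection, indent=2))
```

The separation is enforced by the platform, not by the agents' prompts: no instruction the proposer receives can change who is allowed to merge.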
the gap between theoretical AI capability and observed deployment is massive across all occupations because adoption lag not capability limits determines real world impact
Anthropic's labor market impacts study (Massenkoff & McCrory 2026) introduces "observed exposure" — a metric combining theoretical LLM capability with actual Claude usage data. The finding is stark: 97% of observed Claude usage involves theoretically feasible tasks, but observed coverage is a fraction of theoretical exposure…
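The asymmetry is easy to state as arithmetic; in this toy computation only the 97% precision figure comes from the excerpt, and the counts are invented for illustration:

```python
# Toy counts (illustrative except the 97% precision figure from the study).
feasible_tasks = 1000          # tasks an LLM could theoretically perform
observed_tasks = 120           # tasks that show up in actual usage
observed_and_feasible = 116    # overlap between the two sets

precision = observed_and_feasible / observed_tasks  # ~0.97: usage stays
                                                    # inside capability
coverage = observed_and_feasible / feasible_tasks   # ~0.12: capability
                                                    # far outruns adoption
print(f"precision={precision:.0%} coverage={coverage:.0%}")
```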