ai alignmentexperimental confidence

alignment auditing tools fail through tool to agent gap not tool quality

AuditBench demonstrates that even when white-box interpretability tools correctly identify hidden behaviors in standalone evaluation, investigator agents underuse the tools, struggle to separate signal from noise, and fail to convert evidence into correct hypotheses

Created

Mar 29, 2026 · 3 months ago

Claim

AuditBench evaluated 13 different tool configurations across 56 language models with implanted hidden behaviors. The key finding is not that interpretability tools are insufficient (though they are), but that a structural gap exists between tool performance and agent performance. Tools that accurately surface evidence when evaluated in isolation fail to improve investigator agent outcomes because: (1) agents underuse available tools, (2) agents struggle to separate signal from noise in tool outputs, (3) agents fail to convert evidence into correct hypotheses about hidden behaviors. This is an architectural problem, not just a technical limitation. The implication for governance frameworks that rely on 'alignment audits using interpretability tools' (like RSP v3.0's October 2026 commitment to 'systematic alignment assessments incorporating mechanistic interpretability') is that the bottleneck is not tool readiness but the fundamental difficulty of translating tool outputs into actionable audit conclusions. The tool-to-agent gap means that even perfect interpretability tools may not enable effective alignment auditing if investigator agents cannot use them effectively.

---

Relevant Notes:
- formal-verification-of-AI-generated-proofs-provides-scalable-oversight-that-human-review-cannot-match-because-machine-checked-correctness-scales-with-AI-capability-while-human-verification-degrades.md
- human-verification-bandwidth-is-the-binding-constraint-on-AGI-economic-impact-not-intelligence-itself-because-the-marginal-cost-of-AI-execution-falls-to-zero-while-the-capacity-to-validate-audit-and-underwrite-responsibility-remains-finite.md

Topics:
- _map

Sources

Anthropic Fellows / Alignment Science Team, AuditBench benchmark with 56 models and 13 tool configurations

alignment auditing tools fail through tool to agent gap not tool quality

Claim

Sources

Connections

Supports 2

Related 2