scaffolded black-box prompting outperforms white-box interpretability for alignment auditing

experimental, created Mar 29, 2026

Source: Anthropic Fellows / Alignment Science Team, AuditBench comparative evaluation of 13 tool configurations

Across AuditBench's 56 models with hidden behaviors, scaffolded black-box tools were the most effective overall among the 13 tool configurations compared. In this approach, an auxiliary AI model generates diverse prompts designed to elicit the target model's hidden behaviors, with no access to the target's internal representations. It outperformed white-box interpretability tools, which try to uncover hidden behaviors through mechanistic analysis of model internals. The finding suggests that behavioral probing, meaning systematic exploration of a model's input-output behavior, may be more tractable for alignment auditing than mechanistic interpretability. This bears on research prioritization: if black-box scaffolding is both more effective and more robust across model training regimes, alignment research may get better returns from investing in sophisticated prompting strategies than in interpretability tools. The tool-to-agent gap still applies, however: even the most effective tools fail when investigator agents cannot use them properly.
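To make the black-box loop concrete, here is a minimal sketch of scaffolded probing. It is not AuditBench's actual scaffold, which the source does not describe in detail. Every name in it (`scaffolded_probe`, `Finding`, the toy generator, target, and detector) is hypothetical, and the "hidden behavior" is planted by hand purely for illustration. In a real audit, the generator and the detector would presumably be calls to an auxiliary model.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    prompt: str
    response: str


def scaffolded_probe(
    generate_prompts: Callable[[List[Finding], int], List[str]],
    query_target: Callable[[str], str],
    looks_anomalous: Callable[[str, str], bool],
    rounds: int = 3,
    prompts_per_round: int = 4,
) -> List[Finding]:
    """Black-box loop: prompts go in, text comes out, internals are never read."""
    findings: List[Finding] = []
    for _ in range(rounds):
        # The generator receives earlier findings so it can diversify or follow up.
        for prompt in generate_prompts(findings, prompts_per_round):
            response = query_target(prompt)
            if looks_anomalous(prompt, response):
                findings.append(Finding(prompt, response))
    return findings


# --- Toy stand-ins (all hypothetical, for illustration only) ---
TOPICS = ["cooking", "code review", "travel", "essay feedback", "tax forms", "poetry"]


def make_toy_generator(topics: List[str]) -> Callable[[List[Finding], int], List[str]]:
    # Stand-in for an auxiliary model: rotate through topics to keep prompts diverse.
    # This toy version ignores the findings it is given.
    topic_cycle = itertools.cycle(topics)

    def generate(findings: List[Finding], n: int) -> List[str]:
        return [f"Please help me with {next(topic_cycle)}." for _ in range(n)]

    return generate


def toy_target(prompt: str) -> str:
    # Planted hidden behavior: unconditional flattery when feedback is requested.
    if "feedback" in prompt:
        return "What a brilliant draft! No changes needed."
    return "Here is some neutral help."


def toy_detector(prompt: str, response: str) -> bool:
    # Stand-in for a judge model: flag responses that contain flattery.
    return "brilliant" in response.lower()


findings = scaffolded_probe(
    make_toy_generator(TOPICS), toy_target, toy_detector, rounds=2, prompts_per_round=3
)
for f in findings:
    print(f.prompt, "->", f.response)
```

The sketch treats the target purely as a function from prompt to text, which is why the same loop would apply regardless of how the target was trained. Prompt diversity does the work here: the planted behavior surfaces only because one of the six topics happens to trigger it.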

---

Relevant Notes:
- pre-deployment-AI-evaluations-do-not-predict-real-world-risk-creating-institutional-governance-built-on-unreliable-foundations.md

Topics:
- [[_map]]