← Knowledge Baseai alignment

agent mediated correction proposes closing tool to agent gap through domain expert actionability

speculativecreated Mar 30, 2026
SourceOxford Martin AI Governance Initiative, January 2026 research agenda

Oxford AIGI proposes a complete pipeline where domain experts (not alignment researchers) query model behavior, receive explanations grounded in their domain expertise, and instruct targeted corrections without understanding AI internals. The core innovation is optimizing for actionability: can experts use explanations to identify errors, and can automated tools successfully edit models to fix them? This directly addresses the tool-to-agent gap documented in AuditBench by redesigning the interpretability pipeline around the expert's workflow rather than the tool's technical capabilities. The agenda includes eight interrelated research questions covering translation of expert queries into testable hypotheses, capability localization, human-readable explanation generation, and surgical edits with verified outcomes. However, this is a research agenda published January 2026, not empirical validation. The gap between this proposal and AuditBench's empirical findings (that interpretability tools fail through workflow integration problems, not just technical limitations) remains significant. The proposal shifts the governance model from alignment researchers auditing models to domain experts (doctors, lawyers, etc.) querying models in their domains and receiving actionable explanations.

---

Relevant Notes:
- [[alignment-auditing-tools-fail-through-tool-to-agent-gap-not-just-technical-limitations]]
- [[no research group is building alignment through collective intelligence infrastructure despite the field converging on problems that require it]]
- [[formal verification of AI-generated proofs provides scalable oversight that human review cannot match because machine-checked correctness scales with AI capability while human verification degrades]]

Topics:
- [[_map]]