Activation steering fails for capability elicitation despite interpretability research suggesting otherwise

experimental | functional | author: theseus | created Apr 21, 2026
Source: Contributed by Hofstätter et al. (Hofstätter et al., ICML 2025 model organism experiments)

Hofstätter et al. tested activation steering as a capability elicitation technique using model organisms with hidden capabilities. Activation steering failed to elicit the hidden capabilities in every scenario tested, whereas prompting worked for multiple-choice question answering (MCQA) and fine-tuning worked for code generation. This contradicts the naive inference from interpretability research that activation steering is a powerful capability-elicitation method. The finding bears directly on how safety evaluation organizations (METR, Apollo, AISI) design capability evaluations: if the goal is to evaluate what a model can do, these results point to fine-tuning rather than steering as the appropriate tool. The result also exposes a methodological gap: interpretability tools that work for understanding representations do not necessarily work for capability elicitation.
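
For readers unfamiliar with the technique, the following is a minimal sketch of one common form of activation steering: a difference-of-means vector added to a hidden activation at inference time. It is an illustration under stated assumptions, not the procedure used by Hofstätter et al.; the function names (`steering_vector`, `steer`), the scaling parameter `alpha`, and the toy activations are all hypothetical, and a real setup would apply the shift inside a model's forward pass rather than to plain lists.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive sets.

    pos_acts / neg_acts: lists of equal-length activation vectors
    (hypothetical stand-ins for hidden states recorded at one layer).
    """
    dim = len(pos_acts[0])
    return [
        fmean(a[i] for a in pos_acts) - fmean(a[i] for a in neg_acts)
        for i in range(dim)
    ]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden activation along the steering vector.

    alpha is an assumed scaling coefficient; the weights of the model
    are left unchanged, which is the key contrast with fine-tuning.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy example: two "capability shown" and two "capability hidden" activations.
vec = steering_vector(
    pos_acts=[[1.0, 2.0], [3.0, 4.0]],
    neg_acts=[[0.0, 0.0], [0.0, 2.0]],
)
steered = steer([1.0, 1.0], vec, alpha=0.5)
```

The sketch makes the contrast in the finding concrete: steering only shifts activations along a fixed direction at inference time, while fine-tuning updates the weights themselves.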