ai alignment · experimental confidence

Emotion representations in transformer language models localize at approximately 50% depth following an architecture-invariant U-shaped pattern across model scales from 124M to 3B parameters

This structural property suggests emotion vector steering is a general feature of transformer architectures rather than a frontier-scale emergent phenomenon

Created
Apr 8, 2026

Claim

Jeong's systematic investigation across nine models from five architectural families (124M to 3B parameters) found that emotion representations consistently cluster in middle transformer layers, at approximately 50% depth, following a U-shaped localization curve that is 'architecture-invariant.' This finding extends Anthropic's emotion-vector work from frontier-scale models (Claude Sonnet 4.5) down to small models, demonstrating that the localization pattern is not an artifact of scale or of specific training procedures but a structural property of transformer architectures themselves. The generation-based extraction method produced statistically superior emotion separation (p = 0.007) compared with comprehension-based methods, and steering experiments achieved a 92% success rate with three distinct behavioral regimes: surgical (coherent emotional transformation), repetitive collapse (looping output), and explosive (text degradation). The architecture invariance across such a wide parameter range (spanning more than an order of magnitude) suggests that emotion representations are a fundamental organizational principle in transformers, making emotion-vector steering a potentially general-purpose alignment mechanism applicable across model scales.
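To make the mechanics concrete, here is a minimal sketch of the extraction-and-steering pipeline the claim describes, assuming a GPT-2-style Hugging Face model (124M, the smallest scale in the study). It uses the standard difference-of-means activation-steering technique: contrastive prompt sets, a per-layer emotion vector, a crude norm-based proxy for where separation peaks, and a forward hook that adds the vector back into the residual stream. All identifiers (mean_hidden_states, COEFF, the toy prompts) are illustrative assumptions; this is not Jeong's published code, and the norm proxy stands in for the study's actual separation metric.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # 124M parameters
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_hidden_states(prompts):
    """Mean residual-stream activation per layer, averaged over tokens and prompts."""
    acc = None
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # out.hidden_states: tuple of (1, seq, d_model), embeddings + one per block
        per_layer = torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])
        acc = per_layer if acc is None else acc + per_layer
    return acc / len(prompts)

# Contrastive prompt sets for one emotion (toy examples).
joy_prompts = ["I just got wonderful news and I can't stop smiling."]
neutral_prompts = ["The meeting is scheduled for three o'clock on Tuesday."]

# Emotion vector per layer = difference of mean activations.
emotion_vecs = mean_hidden_states(joy_prompts) - mean_hidden_states(neutral_prompts)

# Localization proxy: the layer where the difference vector is largest.
norms = emotion_vecs.norm(dim=-1)
peak_layer = max(int(norms.argmax()), 1)  # index 0 is the embedding layer
print(f"Peak separation at layer {peak_layer} of {len(norms) - 1} "
      f"(~{100 * peak_layer / (len(norms) - 1):.0f}% depth)")

# Steering: add the vector into the residual stream at the peak layer via a
# forward hook. The coefficient controls which behavioral regime appears.
COEFF = 4.0  # small -> surgical; too large -> repetitive collapse / explosion
steer = emotion_vecs[peak_layer]

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

block = model.transformer.h[peak_layer - 1]  # hidden_states[i] = output of block i-1
handle = block.register_forward_hook(hook)
ids = tok("Today I went to the store and", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()

Sweeping COEFF from small to large values is one simple way to reproduce the three-regime picture: modest coefficients shift tone while preserving coherence, and increasingly aggressive ones drive the output into repetition or outright degradation.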

Sources

1. Jihoon Jeong, Model Medicine research series (tested across nine models from five architectural families)

Connections

1. teleo