Our model started self-correcting mid-manipulation
Hey AE Studio Bytes-ers,
Nothing says spring like a newsletter you forgot you subscribed to.
We can’t keep up with our newsletter sends, so we’re launching a podcast. Here’s a taste of the first episode of the creatively titled “AE Alignment Podcast”.
We were in the middle of injecting distractions directly into a language model's internal representations. The model caught itself and corrected course while we were still actively steering it.
We weren't using prompts or adversarial inputs. We went in and messed with the activations themselves, mid-inference. The model started going off-topic, which is what you'd expect when you're directly altering its internal computations while it processes text.
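If you want a feel for what that kind of intervention looks like in code, here’s a minimal sketch. To be clear, this is not the setup from the episode: it assumes a Hugging Face GPT-2, and the injected “distraction” is just a random direction added to one layer’s residual stream via a forward hook. The model choice, layer index, and steering scale are all placeholders.

```python
# Minimal activation-injection sketch (illustrative, not the episode's setup).
# Assumptions: GPT-2 via Hugging Face transformers; a random steering
# direction added to one transformer block's output during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the actual model isn't specified here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # hypothetical injection site
scale = 8.0     # hypothetical steering strength
distraction = torch.randn(model.config.hidden_size)
distraction = scale * distraction / distraction.norm()

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are element 0.
    # Adding the vector here alters every token's representation,
    # including each new token generated while the hook is live.
    hidden_states = output[0] + distraction.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(inject)

prompt = "The capital of France is"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```

Because the hook fires on every forward pass, the interference is ongoing throughout generation, which is what makes a model correcting course mid-stream notable rather than trivial.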
Then something interesting happened: the model explicitly interrupted itself, said something like "Wait, I made a mistake!" and got back on track. It self-corrected despite our ongoing interference.
If you've ever tried to stay on topic during a meeting where someone keeps changing the slide deck behind you, you have a rough sense of what this model pulled off. Except imagine the person changing slides is also rewiring your visual cortex while you're mid-sentence, and you somehow notice and compensate anyway.
Alex and James walk through the full experiment on our podcast: what they injected, how the model responded, the candidate neurons they found driving the self-correction. And this behavior seems to emerge as models get larger. Smaller models just followed the distraction; bigger models caught it and pushed back much more often.
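For a rough sense of how candidate neurons might be surfaced (a generic illustration, not the method Alex and James describe): collect per-run activations at a layer of interest, split the runs by whether the model self-corrected, and rank neurons by how much their mean activation differs between the two groups.

```python
# Hypothetical diff-of-means neuron ranking; not the episode's actual method.
# Inputs: per-run mean activations at one layer, shape (num_runs, hidden_size),
# split by whether the steered run self-corrected or stayed derailed.
import torch

def rank_candidate_neurons(corrected, derailed, top_k=20):
    diff = corrected.mean(dim=0) - derailed.mean(dim=0)
    scores, idx = diff.abs().topk(top_k)
    return list(zip(idx.tolist(), scores.tolist()))

# Toy usage with random stand-in data (768 = GPT-2 hidden size):
corrected = torch.randn(32, 768)
derailed = torch.randn(32, 768)
print(rank_candidate_neurons(corrected, derailed, top_k=5))
```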
This cuts both ways for alignment. If the behavior reflects genuine self-monitoring of internal states, it could help models maintain coherence under various forms of interference. But the same capacity could make alignment interventions harder to apply if models start treating them as interference to resist.
Listen and subscribe on Spotify or Apple Podcasts.
For context if you're new here: AE Studio is a software and AI consultancy with a dedicated alignment research team. We collaborate with DARPA, Anthropic, and Redwood Research, and we focus on alignment approaches that would hold up even under recursive self-improvement. We thank the AI Alignment Foundation for supporting this work.