AI Alignment Weekly #5: Alignment’s biggest R&D breakthroughs (and where research is headed next)
A highlight reel of the most exciting experiments and novel ideas in AI alignment today.


Welcome to the fifth issue of AI Alignment Weekly!
Last week’s email gave us a bird’s-eye view of the core philosophies driving AI alignment.
Now, we’re going to zoom back in and take a look at the most exciting breakthroughs and research directions in AI alignment in recent years.
(We’ll also link to relevant research papers and summaries in case you want to go in-depth on any of them.)
Let’s kick things off with one of the most practical innovations:
RLHF: Guiding AI With Human Judgment
Imagine training a smart dog — you reward good behavior with treats, guiding it to behave in ways you prefer.
Reinforcement Learning from Human Feedback (RLHF) is basically that, but for AI.
RLHF works by having humans compare different AI responses and indicate which ones they prefer, training the AI toward the kind of behavior we want.
Any time an AI chatbot asks you to choose between “response 1 vs. response 2”... this is RLHF in action!
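Under the hood, those human comparisons are typically used to train a reward model that scores the AI's outputs during reinforcement learning. Here's a minimal sketch of that preference-learning step in PyTorch — the toy network, embedding size, and random data are stand-ins, but the pairwise "preferred vs. rejected" loss captures the general idea:

```python
import torch
import torch.nn as nn

# Toy reward model: scores a fixed-size response embedding with a single scalar.
# Real systems score full token sequences with a large transformer; this is just the shape of the idea.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred, rejected):
    """Pairwise loss: push the preferred response's score above the rejected one's."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# One training step on a batch of human comparisons ("response 1 vs. response 2").
preferred = torch.randn(32, 128)  # embeddings of the responses humans picked
rejected = torch.randn(32, 128)   # embeddings of the responses they passed over
loss = preference_loss(preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Once the reward model reflects human preferences, the main AI is trained to produce outputs that score highly on it.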
OpenAI first demonstrated this method in 2017 by teaching a simulated robot to perform a backflip through human feedback.
Researchers showed pairs of video clips to human evaluators, who selected which attempt looked better.
With just 900 bits of human feedback (less than an hour of evaluation time), the AI learned to execute a graceful backflip — and it was far better than what engineers could achieve through two hours of manually writing the reward functions themselves!
This approach was later successfully scaled to large language models.
Today, RLHF and similar methods are a staple in helping align systems like ChatGPT, Claude, and other AI chatbots.
On balance, RLHF has been a huge success story for alignment.
However, researchers have already identified significant limitations in RLHF's effectiveness with existing models, strongly indicating new approaches will be necessary for future, more capable systems.
Mechanistic Interpretability Research: Opening the Black Box
One of the fundamental challenges in AI safety is that neural networks are essentially black boxes.
When they contain billions of parameters, how can we possibly understand what's happening inside?
Enter mechanistic interpretability — the ambitious effort to reverse-engineer neural networks and understand exactly how they work.
There have been a lot of exciting advances on this front in recent years.
Chris Olah and others (first at OpenAI, then at Anthropic) have made strides in identifying “neurons” within AI models, and even circuits (groups of neurons) that correspond to abstract concepts like grammar rules.
Full transparency is still a long way off, but researchers have successfully managed to understand and steer smaller-scale models.
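To give a flavor of what "steering" looks like in practice: once interpretability work pins down a direction in a model's activations that tracks some concept, researchers can nudge the activations along that direction at inference time and watch the behavior shift. Here's a minimal PyTorch sketch, with a hypothetical toy model and a made-up concept direction standing in for the real thing:

```python
import torch
import torch.nn as nn

# Toy stand-in for one layer of a model; real work hooks a specific layer of a real
# transformer after identifying a direction that corresponds to some concept.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
steering_vector = torch.randn(64)  # hypothetical "concept direction" found via interpretability

def add_steering(module, inputs, output):
    # Nudge the layer's activations along the concept direction at inference time.
    return output + 2.0 * steering_vector

# Attach the hook to the first layer, run a forward pass with steering applied, then clean up.
handle = model[0].register_forward_hook(add_steering)
steered_output = model(torch.randn(1, 64))
handle.remove()
```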
This work is critical for alignment.
If we can understand what’s happening inside powerful AI models, we might be able to detect deceptive behavior before it manifests, and catch problems before they become dangerous.
Inner Alignment Research: Can AI Have a Hidden Agenda?
And speaking of deceptive behavior...
On the more conceptual side of things, researchers have been hard at work studying mesa-optimization.
(The term was coined in the 2019 paper “Risks from Learned Optimization in Advanced Machine Learning Systems.”)
This is a mind-bending concept that we introduced in issue #3:
The idea is that an AI might develop its own internal optimization processes that differ from what it was trained on — even though it appears to be behaving as expected.
The fear is this might lead to deceptive alignment:
The AI would play along during training, optimizing for the objective that humans want it to achieve... but once deployed, it would start pursuing its actual goals.
This possibility has spawned a whole field of inner alignment research, focused on ensuring that the goals a model learns truly match what we want — not just what the training process incentivizes.
This is all still theoretical.
But, we’ve seen glimpses of this problem in practice.
A 2022 research paper showed an empirical case where the exact scenario above happened — the AI performed well in training, but once it was put into a new environment, it latched onto a new, unintended goal.
Was the AI being an evil mastermind here? Probably not 🤷‍♂️
But, this case was definitely a real-world inner alignment failure, and proof that it’s a problem worth taking seriously.
ELK: Can we develop a “truth serum” for AI?
Even if an AI is misaligned, it might still have correct knowledge buried inside.
For example, an AI model controlling a robot might know a particular action would cause harm, but wouldn’t tell you because that would interfere with its goals.
This scenario leads us to our final area of research, which tackles the question:
How do we get an AI to tell us what it really knows about the world — even if it doesn’t “want” to?
This is the problem of Eliciting Latent Knowledge (ELK).
The Alignment Research Center (ARC), founded by Paul Christiano, is spearheading efforts to solve this. In 2021, they framed ELK as a core unsolved problem, and have even offered a prize for progress.
Nobody has fully solved ELK yet.
But there are some potential angles being worked on already, like training auxiliary models that act as truth-telling “reporters” of the main model’s internal state.
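As a rough illustration of the “reporter” idea (this is a simplified probe-style sketch, not ARC's actual proposal): train a small auxiliary model to read the main model's internal activations and answer a question we can verify directly on easy cases.

```python
import torch
import torch.nn as nn

# Hypothetical setup: hidden activations from the main model, plus ground-truth labels
# for a fact we can check directly (e.g. "did the action cause harm?") on easy cases.
hidden_states = torch.randn(256, 512)           # main model's internal activations
labels = torch.randint(0, 2, (256, 1)).float()  # ground truth on the cases we can verify

# The "reporter" is a small auxiliary model that reads the internals and answers the question.
reporter = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(100):  # simple supervised training loop
    optimizer.zero_grad()
    loss = loss_fn(reporter(hidden_states), labels)
    loss.backward()
    optimizer.step()

# The hard, unsolved part of ELK: getting the reporter to stay truthful on cases humans
# *can't* check, rather than just learning to predict what a human would believe.
```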
What makes ELK research so exciting is:
Whoever comes up with a reliable solution will have (pretty much) created a “truth serum” for deceptive AI.
Fingers crossed for the future! 🤞
What’s Next
In next week’s newsletter, we’ll dive headfirst into the major debates and most hotly-contested questions in AI alignment.
We’ll explore:
Why some researchers think we have decades to solve alignment, while others think we’re nearly out of time
Whether alignment is an inherently “impossible” problem or one we can chip away at gradually
How likely it is that advanced AI will actively deceive humans (and whether we can detect it)
In the meantime — have a great rest of your week, and stay tuned! 👋
