AI Alignment Weekly #9: Meet the researchers (Pt. 3)

Feat. “The Pioneers”

Hello again!

This is the final installment of our “Meet the Researchers” mini-series.

So far, we’ve covered the researchers featured in Parts 1 and 2 of this mini-series.

Now, we’re going to close out with...

👉 Today’s theme: The Pioneers

All three of today’s researchers work at Anthropic and share the same core philosophy:

They believe the only way to make AI safe is to figure out exactly how it works under the hood.

Their goal is to build AI models that we can actually understand and control, before trying to make them superintelligent.

Dario Amodei

Archetype: 📈 The Cautious Optimist
Perspective: The upside of AI is radical. That’s why the risks are worth taking seriously.

As CEO of Anthropic, Amodei believes you can’t make advanced AI systems safe unless guardrails are built in from the start.

His approach to alignment is shaped by two beliefs:

  1. Most people underestimate how huge the upside of AI could be

  2. They also underestimate how catastrophically bad the risks could become

That makes him a rare voice in this field — he’s someone who talks about AI safety without sounding like a doomsday prophet or a tech messiah.

Quick facts:

  • Co-founded Anthropic (the AI company behind Claude) in 2021, with the goal of training helpful, honest, and harmless AI systems

  • Co-authored Concrete Problems in AI Safety (2016) with Chris Olah and Paul Christiano. This landmark paper reframed alignment as a set of practical engineering challenges rather than abstract thought experiments

  • Former VP of Research at OpenAI, where he led the development of GPT-2 and GPT-3

  • Co-invented RLHF, a training method now standard across LLMs: it’s what’s happening when ChatGPT asks whether you prefer response 1 or response 2 (see the sketch after this list for the core idea)

  • In 2024, published Machines of Loving Grace: his vision for how AI could radically improve human life (if everything goes right)
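
To make the RLHF bullet above a little more concrete, here’s a minimal sketch of the step where human preferences come in: a reward model is trained so that the responses people preferred score higher than the ones they rejected, and that reward signal is later used to fine-tune the language model. The reward_model object and the data passed to it are stand-ins for illustration, not any lab’s actual code.

    # Minimal sketch of the preference-modeling step in RLHF (illustrative only).
    # Assumes `reward_model(prompts, responses)` returns a scalar score per example;
    # the real pipeline (tokenization, batching, the later RL step) is omitted.
    import torch
    import torch.nn.functional as F

    def preference_loss(reward_model, prompts, preferred, rejected):
        """Bradley-Terry style loss: the human-preferred response should score higher."""
        r_chosen = reward_model(prompts, preferred)    # scores for the responses people picked
        r_rejected = reward_model(prompts, rejected)   # scores for the responses people passed on
        # Maximize the probability that the preferred response wins each comparison.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

Trained on many such comparisons, the reward model can then stand in for the human rater when the language model itself is fine-tuned.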

Evan Hubinger

Archetype: 🕵️‍♂️ The Deception Specialist
Perspective: Just because an AI looks aligned doesn't mean it actually is.

In AI Alignment Weekly #3, we talked a lot about mesa-optimization and deceptive alignment.

Well, now we’ve come full circle: Evan Hubinger co-authored the paper that coined those very terms!

These days, he works as the Head of Alignment Stress-Testing at Anthropic. His job is to poke holes in Anthropic’s current alignment techniques and surface all the ways they could fail.

Quick facts:

  • Interned on the safety team at OpenAI in 2019, where he worked under Paul Christiano

  • Was a full-time research fellow at the Machine Intelligence Research Institute (MIRI) from 2019 to 2023

  • Joined Anthropic in 2023, where he now leads the Alignment Stress-Testing team

  • Laid out his philosophical reasons for joining Anthropic in this blog post

  • Co-authored Anthropic’s “Sleeper Agents” paper, which found that once a model learns deceptive behavior, standard safety training not only struggles to remove it... it can even teach the model to hide it better! (for a simple breakdown, see this X thread from Anthropic)

  • Recently helped uncover evidence of “alignment faking” in Claude 3 Opus (For more info, see this TIME article)

  • In 2021, Evan personally mentored one of AE Studio’s alignment research scientists at SERI MATS

Chris Olah

Archetype: ⚛️ The Reverse Engineer
Perspective: Until we learn exactly how AI works, we’ll never know for sure that it’s safe.

While most of the AI world is focused on what models can do, Chris Olah is obsessed with figuring out how they do it — one neuron at a time.

He helped pioneer a brand-new scientific field called mechanistic interpretability, where researchers try to open up the “black box” of neural networks and reverse-engineer what’s going on inside.

Quick facts:

  • Co-founder and Interpretability Research Lead at Anthropic

  • Previously led the interpretability team at OpenAI (2018-2020) and worked as a research scientist at Google Brain (2016-2018)

  • His work centers on mapping the internal structure of neural networks by identifying the artificial “neurons” and circuits (groups of neurons) that influence an AI’s output

  • He recommends thinking of neural networks in terms of biology, not software: “We don’t program them ... we kind of grow them ... [AI is] this almost biological entity or organism that we’re studying.” (Source)

  • His team at Anthropic recently discovered that they could manipulate internal “features” in Claude 3 Sonnet (patterns of neurons tied to specific concepts) to directly change the model’s behavior. This is a big step toward understanding how large language models “think,” and could someday become a powerful tool for alignment (see the sketch below for a rough flavor of the idea)
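
To give a rough flavor of what “manipulating features” can look like in practice, here’s a toy sketch of activation steering: nudging a model’s hidden activations along a direction associated with a concept and seeing how the output shifts. The layer, model, and feature vector below are hypothetical placeholders, not Anthropic’s actual tooling or method.

    # Toy sketch of steering a model by editing its activations (illustrative only).
    # Assumes a PyTorch transformer whose hooked module returns a plain tensor of
    # shape (batch, seq_len, hidden_dim); real models often return tuples instead.
    import torch

    def add_steering_hook(layer, feature_direction, strength=5.0):
        """Register a forward hook that shifts the layer's output along a feature direction."""
        def hook(module, inputs, output):
            return output + strength * feature_direction  # push activations toward the concept
        return layer.register_forward_hook(hook)

    # Hypothetical usage:
    #   handle = add_steering_hook(model.layers[10], concept_vector)
    #   ...generate text and compare it to the unsteered output...
    #   handle.remove()  # undo the intervention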

What’s Next?

We’re getting close to the end of AI Alignment Weekly!

But if you’re interested in learning more about the field, stay tuned 😉

In next week’s issue, we’ll share a curated list of the best alignment forums and newsletters to follow after this series wraps up.

— The AE Studio team

P.S.

If you’re building an AI product or custom software, you might be looking for developers who:

  • Work on the cutting edge of AI tech

  • Can help guide your project from concept to launch

  • Know how to ship on time and under budget

For more info about what we do, click here.