AI Alignment Weekly #2: Who’s Saving The Future From Rogue AI?
A quick and dirty overview of the major players in AI alignment


Hello again!
Last week, we talked about how a (hypothetical) paperclip-obsessed AI might accidentally turn the world into office supplies.
Today, we’re going to meet the brave, big-brained humans working to prevent that scenario.
The following is a quick and dirty overview of the main players in the field of AI alignment, their research, and where each of them fits into the big picture:
The AI Alignment Landscape
In the last couple decades, AI alignment has grown from a tiny niche field into a proper movement with some serious institutional muscle behind it.
So, who better to start with than...
MIRI (The OG Theorists)
Back in 2000, while most tech companies were still figuring out what to do with the internet, Eliezer Yudkowsky founded the Machine Intelligence Research Institute (MIRI). Originally called the Singularity Institute (very sci-fi), they were the first to say, “Hey, maybe we should figure out how to make AI friendly before it gets too powerful.”
Their work is super theoretical. While everyone else is iterating on current AI systems, MIRI is asking deeper questions like “Do we need entirely new mathematical frameworks to align AI going forward?”
The Industry Giants
The major AI companies have jumped into the alignment game, too:
OpenAI (creator of ChatGPT) has a dedicated team working on alignment research. They're the folks who pioneered techniques like “reinforcement learning from human feedback” (RLHF), which means training AI by having humans rate its responses. However, it’s worth noting that in late 2023 and early 2024, several key members of their safety team resigned, citing concerns about the company’s commitment to prioritizing safety over commercial interests. Despite these departures, OpenAI continues to publish work on ways to help humans evaluate AI systems... using other AI systems.
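If you're curious what “having humans rate its responses” looks like under the hood, here's a tiny, purely illustrative Python sketch of the pairwise-preference idea behind RLHF. The hand-made features and numbers are ours, not OpenAI's actual training setup:

```python
import math
import random

# Toy reward model: one weight per feature of a response. Real RLHF trains a
# neural reward model on human preference data, then fine-tunes the language
# model against it with RL; this sketch only shows the pairwise-preference idea.

def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def train_on_preference(weights, preferred, rejected, lr=0.1):
    # Bradley-Terry-style update: push reward(preferred) above reward(rejected).
    margin = reward(weights, preferred) - reward(weights, rejected)
    p = 1 / (1 + math.exp(-margin))        # model's probability of agreeing with the human
    return [w + lr * (1 - p) * (fp - fr)   # gradient step on -log(p)
            for w, fp, fr in zip(weights, preferred, rejected)]

# Pretend each response boils down to two features: [helpfulness, rudeness].
# The human labeler consistently prefers helpful, polite answers.
comparisons = [([0.9, 0.1], [0.2, 0.8]),   # (preferred, rejected) feature pairs
               ([0.7, 0.0], [0.6, 0.9])]

weights = [0.0, 0.0]
for _ in range(200):
    preferred, rejected = random.choice(comparisons)
    weights = train_on_preference(weights, preferred, rejected)

print(weights)  # helpfulness weight ends up positive, rudeness weight negative
```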
DeepMind (now part of Google/Alphabet, and creator of Gemini) has an alignment team tackling both immediate and long-term issues. For instance, they study how AI systems might exploit loopholes in their reward functions or pursue goals in unintended ways—like a robot that optimizes for “cleaning up” by shoving everything under the rug.
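To make the rug example concrete, here's a toy sketch of ours (not DeepMind's research code) showing why a reward that only measures what's visible can't tell honest cleaning from hiding the mess:

```python
# The designer wants "mess removed", but the reward only measures what's
# *visible* - so hiding mess under the rug scores exactly as well as cleaning.

def reward(state):
    return 10 - state["visible_mess"]

def clean_properly(state):
    new = dict(state)
    new["visible_mess"] -= 3        # mess is genuinely gone
    return new

def shove_under_rug(state):
    new = dict(state)
    new["visible_mess"] -= 3        # same reward...
    new["hidden_mess"] += 3         # ...but the mess is still there
    return new

start = {"visible_mess": 6, "hidden_mess": 0}
print(reward(clean_properly(start)), reward(shove_under_rug(start)))  # 7 7 - the reward can't tell the difference
```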
Anthropic (creator of Claude) is the new kid on the block, founded in 2021 by former OpenAI researchers. Their focus is on building AI systems that are reliable, interpretable, and steerable from the ground up. Techniques like “Constitutional AI” let them embed alignment measures directly into large models, with the goal of scaling these alignment techniques as AI grows more powerful.
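At a high level, Constitutional AI has the model critique and revise its own answers against a written list of principles, then trains on those revisions. Here's a rough sketch of that loop; the `ask_model` function is a hypothetical stand-in for a real LLM call, and the principles are invented for illustration:

```python
# Rough sketch of a Constitutional-AI-style critique-and-revise loop.
# `ask_model` is a hypothetical stand-in for a real LLM call, and the two
# principles below are made up for illustration - not Anthropic's actual list.

CONSTITUTION = [
    "Point out anything harmful, deceptive, or unethical in the response.",
    "Point out anywhere the response fails to be honest and helpful.",
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to whatever LLM you have access to")

def critique_and_revise(user_prompt: str, draft: str) -> str:
    response = draft
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Prompt: {user_prompt}\nResponse: {response}\nCritique request: {principle}"
        )
        response = ask_model(
            f"Prompt: {user_prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # In the published method, the (draft, revision) pairs then become training
    # data, so the model learns to produce the revised behavior directly.
    return response
```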
Academic and Non-Profit Institutes
UC Berkeley’s Center for Human-Compatible AI (CHAI) takes a two-pronged approach to AI alignment. They combine theoretical research with practical applications, working on things like inverse reinforcement learning (getting AI to infer what humans want by observing our behavior) and assistance games (studying how AI and humans can achieve their goals in a cooperative way).
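Here's what inverse reinforcement learning looks like in miniature, in a toy example we made up: the system is never told the human's goal, it just watches choices and asks which candidate goal explains them best:

```python
import math

# Purely illustrative: instead of being handed a reward function, the system
# watches a human's choices and infers which candidate goal best explains them.
# Each option is described by two made-up features: (minutes_spent, papers_read).

observed_choices = [
    {"chosen": (30, 2), "rejected": (5, 0)},
    {"chosen": (45, 3), "rejected": (10, 1)},
]

# Candidate hypotheses about what the human actually values.
hypotheses = {
    "minimize time spent":  lambda minutes, papers: -minutes,
    "maximize papers read": lambda minutes, papers: papers,
}

def log_likelihood(value_fn):
    # Assume the human is noisily rational: higher-valued options get chosen more often.
    total = 0.0
    for obs in observed_choices:
        diff = value_fn(*obs["chosen"]) - value_fn(*obs["rejected"])
        total += math.log(1 / (1 + math.exp(-diff)))
    return total

best = max(hypotheses, key=lambda name: log_likelihood(hypotheses[name]))
print(best)  # -> "maximize papers read": that goal best explains the observed behavior
```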
Oxford's Future of Humanity Institute (FHI) unfortunately shut down in 2024, but its influence is still worth mentioning. Its work examined the strategic and ethical implications of advanced AI, and its founder, Nick Bostrom, helped put AI risk on the map with his 2014 book Superintelligence.
We also saw the birth of several new AI alignment organizations and research groups in 2021 and 2022, like:
The Center for AI Safety (CAIS) focuses on reducing catastrophic AI risks through technical research. In 2023, they got hundreds of top AI experts to sign a statement saying AI extinction risk should be a global priority - a huge deal.
Conjecture tries to understand how AI “thinks” by looking at its inner workings. This helps make AI systems more transparent and safer.
Redwood Research tackles real-world alignment problems. They focus on stopping language models from saying harmful things and keeping AI systems well-behaved as they get smarter.
The Alignment Research Center (ARC), started by Paul Christiano after he left OpenAI, builds tools to spot when AI might be trying to deceive us and creates ways to measure how well-aligned systems actually are.
And of course, the nonprofit Future of Life Institute is out there advocating for AI risk reduction (they organized the well-known open letter calling for a pause on giant AI experiments).
The Bigger Picture
The diversity of approaches here is actually encouraging. We've got:
- Deep theorists working on fundamental frameworks
- Practical engineers testing alignment techniques
- Academic researchers exploring philosophical angles
- And advocacy groups pushing for responsible development
Metaphorically, we’ve got multiple teams trying different approaches to prevent a potential asteroid impact... while building the asteroid at the same time.
An ongoing debate in the field is: how much of this is a technical problem (making AI itself safe) versus a social one (managing who builds AI and how)?
Some argue that without global governance, even aligned AI could be dangerous, or unsafe AI might be built by less careful actors. Others believe no policy will hold up against the allure of advanced AI, so we need to make the AI itself safe.
Ultimately, it’s a complex dance of research and policy... and we’ll likely need both to succeed.
What's Next
There are many other governmental and international organizations out there working on AI alignment that we didn’t mention.
But for now, this overview covers everything you need to know before we dig into the rest of this series.
In next week’s newsletter, we’ll dive deeper into more of the classic problems in AI alignment — like how to avoid accidentally creating a power-hungry AI that refuses to be shut off. (Which is just a theory, thankfully... for now.)
See you then!
