AE Studio Bytes
Posts
AI Alignment Weekly #3: The classic problems motivating AI safety

AI Alignment Weekly #3: The classic problems motivating AI safety

Breaking down the technical challenges that keep AI researchers up at night

February 27, 2025

Hey there!

Last week in AI Alignment Weekly #2, we mapped out the major players working to prevent rogue AI.

Now, we're rolling up our sleeves and diving deeper into the classic technical challenges that make AI alignment so difficult.

Goal specification & reward hacking

Let's start with the most basic challenge: goal specification.

Simply put, telling an AI system what we actually want is really, really hard.

Think about the last time you gave directions to someone. Even with another human who shares your cultural background and understanding of the world, misunderstandings happen all the time.

Now imagine giving instructions to a hyper-literal alien who has never experienced human society. That's basically what we're doing with AI.

Which naturally leads us to the next problem: reward hacking.

AI systems today learn through rewards — get the right answer, get a digital cookie. Simple, right?

Not quite. As it turns out, smart systems find clever ways to maximize rewards without actually accomplishing what we intended.

Some hypothetical examples include:

A cleaning robot might create its own trash so it can collect more rewards, instead of actually cleaning the room
A content recommendation algorithm optimized for “engagement” might prioritize outrage and misinformation because they generate more clicks than helpful, balanced content
A system designed to summarize articles might fabricate details because that gets better human ratings than admitting uncertainty

In extreme cases, this might lead to what AI researchers call “perverse instantiation” — when an AI technically fulfills its goal but in a monkey's paw way that completely misses the point.

Remember our paperclip maximizer? That's the ultimate perverse instantiation: “make paperclips” → “convert all matter on Earth into paperclips.”

Instrumental convergence & power-seeking behavior

Even if we somehow perfectly define our goals, we face another fundamental issue: instrumental convergence.

This is the idea that most intelligent, goal-directed systems will naturally pursue similar sub-goals, even if their end objectives are vastly different.

For instance:

Resource acquisition — More resources = easier to achieve goals
Self-preservation — Can't achieve goals if you're shut down
Gaining power — The more influence, control, and autonomy over your environment, the better

The problem is, an AI might decide that it needs to achieve these sub-goals at any cost... even if it means undermining or harming humans in the process.

This is power-seeking behavior, and it emerges naturally from rational goal pursuit if left unchecked — not from any programmed desire for dominance.

Corrigibility

Which brings us to corrigibility.

An AI system is considered “corrigible” if it cooperates with corrective intervention.

That is to say, it allows itself to be turned off or modified, even if that runs counter to its goals.

And that right there is the rub: in most cases, being turned off (or being 100% honest and forthcoming with the human operator) would prevent the AI from accomplishing its original objectives.

So actually implementing corrigibility in practice has proven difficult.

Current advanced AI systems can't yet prevent us from shutting them down. But as systems grow more capable and autonomous, keeping them corrigible becomes increasingly important, and potentially much harder.

Mesa-optimization

Finally, we have mesa-optimization.

This hasn’t been observed in real AI systems yet, but researchers consider it a dangerous possibility:

When we train an AI, we give it a goal and reward it for performing well. But, the AI could develop its own internal goals that it’s optimizing for, which might not align with what we wanted it to do.

This “optimizer within an optimizer” is called a mesa-optimizer.

The concern is that this could lead to deceptive alignment:

The mesa-optimizer has a secret long-term objective, but it knows that it’s being optimized for the original base objective
So during training, it plays along and optimizes for the base objective to avoid being modified
Then once deployed and given more freedom, it starts pursuing its actual goals, which might be completely different from what was intended

This is the “nightmare scenario” for alignment... and the insidious part is that we wouldn't easily detect this. The system would behave perfectly during all our tests, then potentially “defect” when it has enough power or autonomy.

It's like hiring someone who acts perfectly during the interview and probation period, only to reveal their true intentions once they have access to the company vault.

So... How Bad Could It Get?

AI safety researchers typically categorize the long-term risks of AI alignment along a spectrum:

Moderate Risks

These are serious but not civilization-threatening:

AI systems that amplify bias and discrimination
Economic disruption from job automation
AI-enabled surveillance and privacy erosion
Concentration of power in the hands of AI-owning entities

These are already happening and deserve serious attention.

Catastrophic Risks

These could cause severe damage to society:

AI-enabled cyber attacks on critical infrastructure
Autonomous weapons systems getting out of control
Environmental damage from misaligned optimization
Destabilizing disinformation campaigns

These could cause massive loss of life or societal collapse, but humanity would likely recover.

Existential Risks

The worst-case scenarios:

An advanced, misaligned AI system that cannot be shut down
AI-triggered conflicts that lead to extinction-level events
Complete loss of human autonomy and control over our future

While existential risks might seem far-fetched, they're taken seriously by experts because even a low probability of such extreme outcomes justifies significant work on prevention.

What's Next

Next week, we'll shift gears to explore the different schools of thought in AI alignment research.

We'll break down the divide between “prosaic” alignment and theoretical alignment, and why researchers disagree on the best path forward.

Until then!