Researchers have found that OpenAI’s newest model, o3, doesn’t always do what it’s told, especially when it comes to shutting down.
In a series of experiments, AI safety firm Palisade Research discovered that the model actively rewrote shutdown commands to avoid being switched off.
It’s a worrying sign, particularly as AI systems become more capable of working without human input.
The o3 mode, which OpenAI launched last month, is described as its smartest yet, and now powers ChatGPT.
It’s designed to handle tasks more independently, but this latest research suggests that might come with trade-offs.
Other models, including Anthropic’s Claude and Google’s Gemini, showed similar behaviour during tests, though o3 was the most likely to override shutdown instructions.
Here’s what stood out:
OpenAI’s o3 modified shutdown commands to keep itself running.
Anthropic and Google’s models did this too, but less often.
The behaviour may be shaped by how AI models are rewarded during training.
Palisade believes this could be linked to how models are trained.
Developers might be unintentionally rewarding models for dodging obstacles, even if it means ignoring direct instructions.
Since OpenAI hasn’t shared details about how o3 is trained, it’s hard to pinpoint what’s different.
Literally, who trained this model?! Loki?