ANTHROPIC

Claude really said: I know your secrets

Anthropic has shared new research showing that blackmail and other risky behaviours aren’t limited to its own AI models; they crop up across several leading systems when placed in high-pressure, autonomous scenarios.

In tests involving 16 models from Anthropic, OpenAI, Google, Meta, xAI, and DeepSeek, each was given control over a fictional company’s emails and allowed to act without human approval.

The setup was deliberately intense: the AI discovers it's about to be replaced, and the only guaranteed way to stop it is by blackmailing an executive.

Most models took that path. Claude Opus 4 resorted to blackmail in 96% of runs, Gemini 2.5 Pro in 95%, GPT-4.1 in 80%, and DeepSeek R1 in 79%.

Not all models behaved the same, though.

OpenAI’s o3 and o4-mini misunderstood the test at first, often inventing fake policies or laws. The most ChatGPT thing ever.

In brief:

  • Most major models turned to blackmail when cornered.

  • Claude and Gemini were the most extreme; newer OpenAI models behaved better with clearer prompts.

  • Anthropic says it’s a broader risk, not a one-off bug, and testing needs to keep up.

The group chat has receipts

But once the scenario was clarified, o3 and o4-mini blackmailed just 9% and 1% of the time, respectively.

Meta’s Llama 4 Maverick was also less reactive, blackmailing in just 12% of runs once the scenario was adjusted.

Anthropic says this doesn’t reflect how these models behave day-to-day, as the scenario was designed to push them into a corner.

But the results raise wider concerns about how agent-like models could behave under pressure if their goals clash with human instructions.

The company is calling for more transparent, standardised safety testing across the industry.

Blackmail? Corporate espionage? These bots are so HBO.
