OPENAI

When OpenAI introduced its o3 model last December, it claimed the model could solve over 25% of questions in FrontierMath, a notoriously tough benchmark for AI math skills.

That was a big leap, considering most models don’t even manage 2%.

But fresh tests from research group Epoch AI tell a different story.

They found the public version of o3 scored closer to 10%.

The difference likely comes down to testing conditions: OpenAI’s original result was based on a much more powerful version of the model than what was actually made publicly available.

OpenAI has since said the o3 model released last week was built for speed and practical use, rather than peak benchmark performance.

What to keep in mind:

  • OpenAI’s public o3 scored around 10% on FrontierMath, far below the 25%+ figure from its internal testing.

  • The public version was designed for speed and real-world use, not benchmark wins.

  • Benchmark gaps like this are becoming more common as AI firms compete for attention.

Who gave these scores the aux?

A member of OpenAI’s team even pointed out that making the model faster and more cost-effective was a priority, which may explain the drop in scores.

None of this is new in the AI world.

Benchmark numbers often vary depending on how and where models are tested, and lately, more companies are being called out for making bold claims that don’t quite line up with what’s actually released.

Someone call MythBusters, because o3’s 25% isn’t adding up.
