GROK
A public dispute has broken out between OpenAI and Elon Musk’s xAI over how AI benchmark results are being presented, and whether those comparisons are accurate.
It began when an OpenAI employee accused xAI of sharing misleading results for its latest model, Grok 3.
On xAI’s blog, the company claimed Grok 3 outperformed OpenAI’s top model, o3-mini-high, on a tough set of maths questions from the AIME 2025 benchmark.
However, OpenAI employees quickly pointed out that xAI’s chart left out a key detail: o3-mini-high’s score using consensus@64 (or cons@64).
This method gives the model 64 tries for each question and uses the most common answer, often boosting scores significantly.
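In other words, cons@64 is a majority vote over repeated sampling, while an @1 score grades a single attempt. The Python sketch below illustrates the idea under simplified assumptions; the solve() stub is a hypothetical placeholder rather than any real model API, and the dataset is fake.

```python
import random
from collections import Counter

def solve(question: str) -> str:
    """Hypothetical stand-in for a single model attempt.

    A real evaluation would call the model once and parse its final
    answer; here we just return a random candidate for illustration.
    """
    return random.choice(["14", "27", "27", "98"])

def score_at_1(dataset) -> float:
    """@1 score: one attempt per question, graded as-is."""
    correct = sum(solve(q) == answer for q, answer in dataset)
    return correct / len(dataset)

def score_cons_at_k(dataset, k: int = 64) -> float:
    """cons@k score: sample k attempts per question and grade the most common answer."""
    correct = 0
    for q, answer in dataset:
        attempts = [solve(q) for _ in range(k)]
        majority, _ = Counter(attempts).most_common(1)[0]
        correct += (majority == answer)
    return correct / len(dataset)

# Tiny illustrative run on a fake two-question "benchmark".
dataset = [("question 1", "27"), ("question 2", "27")]
print(score_at_1(dataset), score_cons_at_k(dataset, k=64))
```

Because the majority vote smooths out occasional wrong answers, the cons@64 number is typically noticeably higher than the @1 number for the same model, which is why omitting it changes how a comparison looks.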
Here’s what you should know:
xAI’s Grok 3 appears to beat OpenAI’s models but leaves out important benchmark details.
OpenAI’s models outperform Grok 3 once cons@64 scores are included.
The real cost, both financial and computational, behind these results remains unknown.
When using the cons@64 method, OpenAI’s models outperform Grok 3.
In fact, Grok 3’s top versions fall short of both o3-mini-high and OpenAI’s o1 at medium compute settings on their first attempts (known as @1 scores).
Shady graphs, spicier drama
In response, xAI co-founder Igor Babushkin argued that OpenAI has also presented results in a misleading way before.
A third-party comparison later offered a clearer view of how each model performs under the same conditions.
Still, AI researcher Nathan Lambert pointed out a missing piece: the computing (and financial) cost behind these scores.
Without that information, it’s hard to know how efficient or valuable each model really is.
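For a sense of why that matters: since cons@64 samples each question 64 times, inference cost scales roughly linearly with the number of attempts. The figures below are purely hypothetical and only show the arithmetic, not any vendor’s actual pricing.

```python
def cost_per_question(per_attempt_cost: float, attempts: int) -> float:
    """Rough inference cost per question, assuming cost scales linearly
    with the number of attempts (ignores caching, batching, etc.)."""
    return per_attempt_cost * attempts

# Hypothetical per-attempt cost of $0.05, purely for illustration:
print(cost_per_question(0.05, attempts=1))   # 0.05 per question at @1
print(cost_per_question(0.05, attempts=64))  # 3.2 per question at cons@64
```

A model that only pulls ahead under cons@64 may therefore be paying roughly 64 times as much compute for that headline number.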
AI models are the new influencers: fake stats, real drama!