A post on X recently claimed that Google’s Gemini model outperformed Anthropic’s Claude by getting further in the original Pokémon games.
Gemini reportedly reached Lavender Town, while Claude was still stuck at Mount Moon.
But there was an important detail left out: Gemini had a helping hand.
As users on Reddit pointed out, the developer behind Gemini’s stream built a custom minimap to help it spot in-game elements like cuttable trees.
That meant Gemini didn’t have to rely on analysing raw screenshots before making gameplay decisions, which was a clear advantage.
It might sound like a silly benchmark, but it reflects a bigger issue.
AI benchmarks are meant to show how models stack up, but custom tools and tweaks can change the outcome.
And it’s not just Pokémon:
Claude 3.7 Sonnet scored 62.3% on the coding test SWE-bench Verified, but that jumped to 70.3% with a custom scaffold built by Anthropic.
Meta fine-tuned its Llama 4 Maverick model to boost its score on the LM Arena benchmark; the untuned version didn’t do nearly as well.
In short, benchmarks are already a bit messy.
Custom setups just make it harder to compare models fairly.