OPINION PIECE

The model race is moving faster than anyone expected: here's where things stand

Matt Wolfe breaks down the AI models you need to know about. It might be time to bid farewell to ChatGPT!

Something interesting happened last week: it was one of the busiest weeks for new AI model releases I can remember, and OpenAI was barely part of it. Google dropped a new Gemini. Grok got an upgrade. Anthropic released Claude Sonnet 4.6. Qwen shipped a new open-source model that's genuinely closing in on the frontier. The labs are moving fast, costs are falling, and the gap between "best model" and "cheap model" is shrinking faster than I expected it to.

Let me break down what actually matters here.

Claude Sonnet 4.6 is the most interesting release of the week

I've been doing a lot of vibe coding lately and burning through tokens faster than I'd like to admit — I've had days where I've spent close to $100 just on API usage. So when I tell you that Sonnet 4.6 is nearly as good as Opus 4.6 for a fraction of the cost, that gets my attention.

On the benchmarks that matter for coding — the ones measuring agentic task performance — Sonnet 4.6 sits at 79.6%, versus Opus's 80.8% and 80.9%. That's basically the same model for real-world purposes. And the pricing difference is significant: Opus costs $5 per million input tokens and $25 per million output tokens. Sonnet is $3 input, $15 output. For anyone building on top of Claude or running agents at scale, that's a big deal.
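To make that pricing difference concrete, here's a quick back-of-the-envelope calculation using the per-token prices quoted above. The workload numbers (20M input, 2M output tokens) are an invented example of a heavy agentic coding day, not anything from Anthropic; check the current price list before relying on these figures.

```python
# Token pricing as quoted in this article (USD per million tokens).
# These are illustrative; verify against Anthropic's live pricing page.
PRICES = {
    "opus-4.6":   {"input": 5.00, "output": 25.00},
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one workload on a given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical heavy day of agentic coding: 20M input tokens, 2M output tokens.
opus = run_cost("opus-4.6", 20_000_000, 2_000_000)      # $150.00
sonnet = run_cost("sonnet-4.6", 20_000_000, 2_000_000)  # $90.00
print(f"Opus: ${opus:.2f}  Sonnet: ${sonnet:.2f}  saving: {1 - sonnet / opus:.0%}")
```

At those list prices the saving works out to a flat 40% on both input and output, which is where the "fraction of the cost" claim comes from.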

The other thing worth noting: Sonnet 4.6 actually outperforms Opus on some agentic benchmarks. Finance tasks, office workflows — it scores higher. This is exactly the kind of model you'd want running inside something like OpenClaw, which makes the whole situation with Anthropic's terms of service drama this week even more frustrating in hindsight. They built the perfect cheap, capable agent model and then fumbled the developer goodwill right as it launched.

Sonnet 4.6 also comes with a 1 million token context window, though that's really a developer and API feature — if you're just using claude.ai directly you won't see it. For everyday users, the more meaningful news is that Sonnet 4.6 is now the default model on the free tier. People on the $0 or $20 per month plans are now getting what was effectively flagship-level performance a few months ago, at no extra cost.

Gemini 2.5 Pro levelled up in ways I didn't expect

Google dropped Gemini 2.5 Pro this week too, and while it's not their most powerful model, it's now their best widely available one — the practical everyday option sitting below their Deep Think reasoning model.

What surprised me is where it improved. The headline number is ARC-AGI, the benchmark that tests visual pattern recognition and reasoning — the kind of thing AI models have historically struggled with because you can't just memorize your way through it. Gemini 2.5 Pro hit 77.1% on that benchmark. The next closest model is Opus at around 68%. That's a meaningful lead.

It also came out on top in scientific knowledge, competitive coding, and scientific research coding. If you're working in any STEM-heavy domain, this is probably your new go-to model.

The other thing I tested: SVG generation. It's improved noticeably. Not perfect — I had it draw a wolf playing basketball and the numbers on the jersey were a bit crooked — but the quality leap over previous versions is visible. For anyone building web graphics or animations without a design team, this is worth experimenting with.

Grok 4.2 is doing something architecturally different

Elon didn't exactly make a big announcement about this one — just a post on X saying it was out. But the architecture is interesting enough to pay attention to.

Grok 4.2 uses what they're calling a "council of experts" approach. Every query gets routed to four different specialized sub-models simultaneously — a research retrieval agent, a reasoning and solver agent, a critic and adversary agent, and a writer-stylist agent. Those four models essentially debate each other before producing a final consensus answer.

It's a variation on the mixture-of-experts concept that has existed for a while inside large models, but this version is more explicit about it — almost like running your query through Gemini, Claude, ChatGPT, and DeepSeek at the same time, then having a fifth model synthesise the best answer from all four. Whether that actually produces better outputs in practice is something I'll be testing, but the architecture is clever.
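xAI hasn't published Grok 4.2's internals, so purely as an illustration of the pattern described above, here is a minimal sketch: the same query fans out to four specialist prompts in parallel, then a final synthesis pass sees all four drafts. Every name here (`ROLES`, `ask_model`, `council_answer`) is invented for this sketch, and `ask_model` is a stub standing in for a real LLM API call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialist roles matching the article's description.
ROLES = {
    "researcher": "Retrieve relevant facts for the query.",
    "solver":     "Reason step by step toward an answer.",
    "critic":     "Find flaws in the likely answer.",
    "stylist":    "Draft a clear, well-written response.",
}

def ask_model(system_prompt: str, query: str) -> str:
    """Stand-in for a real LLM API call (stubbed for this sketch)."""
    return f"[{system_prompt.split('.')[0]}] view on: {query}"

def council_answer(query: str) -> str:
    # Fan out: each specialist sees the same query under its own instructions.
    with ThreadPoolExecutor(max_workers=len(ROLES)) as pool:
        drafts = list(pool.map(lambda role: ask_model(ROLES[role], query), ROLES))
    # Synthesis: a final pass sees all four drafts and produces one answer.
    joined = "\n".join(drafts)
    return ask_model("Synthesize a consensus from these drafts", f"{query}\n{joined}")

print(council_answer("Why is the sky blue?"))
```

The interesting design choice is that the critic runs alongside the other specialists rather than after them, so the synthesis step is what reconciles the adversarial view with the solver's draft.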

The bigger picture: costs are falling, and it matters

Mark Cuban posted something this week about AI token costs potentially exceeding employee costs for some use cases, and I think it's a fair short-term observation. I've been there myself. But I think it's exactly the wrong frame for where this is heading.

Everything I've described above — Sonnet 4.6 matching Opus quality at 40% less cost, Gemini improving across the board, Grok adding architectural sophistication — all of it points in one direction. The models are getting better and cheaper at the same time, and that pace isn't slowing down.

The state-of-the-art model for coding right now is probably Claude Opus 4.6, maybe GPT-5.3 Codex. It's a toss-up. But Sonnet 4.6 is already closing that gap fast, and I'd expect to see similar dynamics play out across every major lab over the next few months.

I also don't think there's going to be one winner here. Google, Anthropic, OpenAI, and xAI are all doing big things. And the open-source models — Qwen in particular this week — are legitimately closing in on the frontier. That competition is good for everyone using these tools, because it forces the big labs to keep improving and keep prices honest.

The model race isn't slowing down. If anything, the pace is picking up.
