TL;DR
Today's leaderboard-driven AIs excel at passing tests but fail at real-world tasks outside their training data. Leaderboards reward models that sound smart to technical users, not ones that are actually useful, so we end up with overfitted models that can code but can't do basic things like summarise text without skipping key details. The path forward is more dynamic evaluation that doesn't treat human preference as the be-all and end-all, incorporating real-world tasks and more diverse feedback.
As of writing, Gemini 2.5 Pro sits at the top of the LMArena leaderboard.
Ask it to convert a Markdown table to plain text, as I did last week, and instead of a meaningful result you get a chain-of-thought step that confides "I now understand the user despises Markdown tables" before it spits the same Markdown table back out into the Canvas.
It's as if it mistook basic text formatting for some kind of Markdown-related therapy session. We're told by the (admittedly way smarter than me) boffins at Google that this is what now qualifies as a best-in-class general-purpose LLM.
In my totally non-scientific testing, when faced with anything outside the syllabus, Gemini will quite happily sit in the corner and devour a whole pack of crayons while you figure out the answer yourself.
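For contrast, here's roughly what I was asking for. A minimal sketch in Python, under my own assumptions: a simple pipe-delimited table, no escaped pipes, every row the same width.

```python
import re

def markdown_table_to_plain_text(md: str) -> str:
    """Convert a simple pipe-delimited Markdown table into aligned plain text.

    Assumes every row has the same column count and no escaped pipes.
    """
    rows = []
    for line in md.strip().splitlines():
        line = line.strip().strip("|")
        if not line.strip():
            continue
        # Skip the |---|---| separator row between header and body
        if re.fullmatch(r"[\s:|-]+", line):
            continue
        rows.append([cell.strip() for cell in line.split("|")])

    # Pad each column to its widest cell so the columns stay aligned
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )

print(markdown_table_to_plain_text("""
| Model          | Arena rank |
|----------------|------------|
| Gemini 2.5 Pro | 1          |
"""))
```

Twenty-odd lines of deterministic string shuffling. This is a formatting chore, not a reasoning benchmark.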
How did we get into this mess?
Leaderboards like LMArena were originally designed to measure model progress. Now they mostly measure how well models can appease us.
Meta reportedly ran 27 private versions of Llama 4 through Arena trials before picking their "winner" for public display. Google and OpenAI, meanwhile, have reportedly managed to get their models into up to 20% more Arena battles than the competition, a data advantage that can inflate scores by over 100%.
If you're wondering why this is starting to sound less like a fair and equitable measure and more like Eurovision, that's because it's exactly that. Norway will somehow find a way to lose here too.
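The statistics behind "run 27 variants, publish the best" amount to the winner's curse. A toy Monte Carlo sketch, with every number illustrative rather than taken from any report: even when all variants are equally good, reporting only the best measurement inflates the score.

```python
import random

random.seed(42)

TRUE_SKILL = 1300   # every private variant is equally capable
NOISE = 40          # per-variant measurement noise, in Elo points (illustrative)
RUNS = 10_000

def published_score(n_variants: int) -> float:
    # Each variant's measured Elo = true skill + evaluation noise;
    # the vendor publishes only the best-looking one.
    return max(random.gauss(TRUE_SKILL, NOISE) for _ in range(n_variants))

honest = sum(published_score(1) for _ in range(RUNS)) / RUNS
cherry_picked = sum(published_score(27) for _ in range(RUNS)) / RUNS

print(f"Submit one variant:        ~{honest:.0f}")
print(f"Submit 27, keep the best:  ~{cherry_picked:.0f}")
# The ~80-point gap is pure selection bias: no variant actually improved.
```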
The Blind Leading the Blind
LMArena's "democratic" voting system is a masterclass in self-selection bias. The most active users are overwhelmingly technical and overwhelmingly obsessed with programming minutiae.
The result? A model that can ace a coding problem but can't reliably summarise a news article or, heaven forbid, reformat text without altering it. Since users prefer verbose, confident answers regardless of accuracy, the leaderboard rewards models that sound smart, not ones that are smart.
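To see how quickly "sounds smart" outruns "is smart" under preference voting, here's a toy Elo simulation. The 65% verbosity preference is my assumption for illustration, not a measured figure.

```python
import random

random.seed(0)

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16) -> tuple[float, float]:
    """Standard Elo update after one head-to-head vote."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * ((1 - score_a) - (1 - expected_a))

verbose_wrong, terse_correct = 1000.0, 1000.0
P_PREFER_VERBOSE = 0.65  # assumed: voters favour confident-sounding length

for _ in range(10_000):  # ten thousand simulated Arena battles
    verbose_wins = random.random() < P_PREFER_VERBOSE
    verbose_wrong, terse_correct = elo_update(verbose_wrong, terse_correct, verbose_wins)

print(f"verbose but wrong:  {verbose_wrong:.0f}")
print(f"terse but correct:  {terse_correct:.0f}")
# The gap settles near 400 * log10(0.65/0.35), about 108 Elo: the rating
# measures persuasion, not accuracy.
```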
Back to Gemini 2.5 Pro and its Markdown table meltdown. This kind of wacky behaviour isn't a one-off. Ask any of these leaderboard toppers to do something outside their explicit training data and you'll get:
- Confusion
- Hallucination
- Or, if you're lucky like me, a totally dumbfounding "thinking" step
We've bred a generation of models that are world-class at passing the test and hopeless at everything else.
The Way Out
If we want AI that's actually useful for general purpose tasks, we need to stop treating human preference as the finish line.
Dynamic evaluation (preferably with real-world tasks and problems) and more diverse feedback might help... Or we could keep handing out medals to sycophantic models and delay the singularity indefinitely.
The Big But
There's cause for optimism. LRMs (Large Reasoning Models) excel in verifiable domains: tasks where answers can be checked objectively, like mathematics or structured coding problems.
Much of the latest research is creatively expanding which domains can be made verifiable, using techniques like RL (Reinforcement Learning) with automatically checkable rewards to bring more real-world tasks into the "checkable" category. This is already producing models that can reason more reliably across a broader range of subjects, not just maths or code.
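To make "verifiable" concrete, this is the general shape of such a reward signal; a minimal sketch that assumes final answers arrive as plain numeric strings.

```python
from fractions import Fraction

def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer is exactly right, else 0.0.

    No human preference in the loop; the check is objective, so the
    reward can't be earned by merely sounding confident.
    """
    try:
        return float(Fraction(model_answer.strip()) == Fraction(reference))
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable or malformed answers earn nothing

print(verifiable_reward("0.5", "1/2"))                      # 1.0
print(verifiable_reward("probably about a half?", "1/2"))   # 0.0
```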
Vendors already know the limitations and are playing Jenga (other tabletop games are available) in the dark, with moderate success. Approaches like:
- Peer learning, where models verify each other's reasoning
- Reinforcement learning, using diverse, real-world feedback
- Expanded verification, making more domains objectively checkable
are showing promise in making models less of a one-trick pony and more broadly useful (the sketch below shows the peer-learning idea in miniature). Here's hoping the next generation of models are better teachers and not just stochastic parrots.
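In this sketch, `generator` and `verifier` are hypothetical stand-ins for calls to two different models, not any vendor's actual API.

```python
from typing import Callable, Optional

Answer = tuple[str, str]  # (final answer, reasoning trace)

def peer_verified_answer(
    question: str,
    generator: Callable[[str], Answer],
    verifier: Callable[[str, str, str], tuple[bool, str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Cross-check one model's reasoning with a peer before trusting it."""
    prompt = question
    for _ in range(max_attempts):
        answer, reasoning = generator(prompt)
        ok, critique = verifier(question, answer, reasoning)
        if ok:
            return answer
        # Feed the peer's critique back so the next attempt can improve
        prompt = f"{question}\nA reviewer rejected the last answer: {critique}"
    return None  # better to abstain than to bluff

# Toy usage with stand-in "models":
gen = lambda q: ("42", "6 * 7 = 42")
ver = lambda q, a, r: (a == "42", "arithmetic doesn't check out")
print(peer_verified_answer("What is 6 x 7?", gen, ver))  # 42
```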
Until then, enjoy the Overfittening. If you need help with Markdown, try the top of a mountain in Mongolia. At least the view is guaranteed to be verifiable.
I would like to acknowledge & thank Richard Nichol for his valued editorial advice & riffing on this post.