TL;DR
AI benchmark scores have become almost worthless, since rampant test set contamination and industry-wide gaming mean we’re just building machines that memorise answers and optimise for flashy numbers—while real intelligence and practical usefulness are left behind.
Remember when we thought teaching to the test was just a problem in secondary schools?
Turns out the AI industry looked at standardised testing's worst practices and said "hold my mountains of VC cash."
The Great Contamination Cover-Up
Benchmark contamination is the industry's worst-kept secret. Between 3.6% and 20.8% of HumanEval coding benchmark solutions are likely present in the training data of major models. When tested on decontaminated variants, GPT-3.5 and GPT-4o drop by 4.75–6.76 percentage points, while Claude models plummet by up to 11.27 points. GSM8K? Models score up to 13% better on the original problems than on similar, freshly-written ones. Now is probably a good time to remind you that LLMs, as they exist today, are effectively lottery tumblers with weighted balls. Even synthetic data generated by these models contains benchmark answers, creating a sort of AI ouroboros in which newer models, trained on ever less human-supervised data and ever more synthetic data, are effectively eating their ancestors' vomit.
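To make "contamination" concrete, here's a minimal sketch of the kind of n-gram overlap check a lab could run before training. The toy corpus, the crude whitespace tokeniser, and the 13-gram default are all illustrative assumptions, not anyone's actual pipeline:

```python
# Minimal sketch: flag benchmark items whose word n-grams appear verbatim in training text.
# The tokeniser and the 13-gram default are illustrative assumptions, not any lab's real pipeline.
from typing import Iterable

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; a real pipeline would use a proper tokeniser."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: Iterable[str], n: int = 13) -> bool:
    """True if any n-gram from the benchmark item appears verbatim in a training document."""
    item_grams = ngrams(benchmark_item, n)
    return bool(item_grams) and any(item_grams & ngrams(doc, n) for doc in training_docs)

# Example: a GSM8K-style question that leaked into a scraped README.
docs = ["natalia sold clips to 48 of her friends in april and then half as many in may"]
question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May."
print(is_contaminated(question, docs, n=8))  # True -> hold this item out or rewrite it
```

None of this is hard. The problem is that nobody has much incentive to run it before publishing a leaderboard number.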
The contamination runs deep. Analysis found that GPT-4's suspicious familiarity with MMLU questions correlates directly with how often those exact questions appear on GitHub.
There's a massive financial incentive for everyone to cheat, so that's exactly what happens. With billions in market share riding on being the hot new thing, why wouldn't you? It's laughably easy to swap out a few names or tweak a prompt, and voilà - models that should ace the test suddenly bomb it.
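It really is that easy. Below is a hedged sketch of a name-swap perturbation test for memorisation; `query_model` is a hypothetical stand-in for whatever model API you're poking at, and the swap table is purely illustrative:

```python
# Minimal sketch of a name-swap perturbation test for memorisation.
# `query_model` is a hypothetical callable (prompt -> answer string), not any vendor's real client.
import re

SWAPS = {"Natalia": "Priya", "April": "March", "May": "June"}  # surface swaps that leave the maths untouched

def perturb(question: str, swaps: dict[str, str] = SWAPS) -> str:
    """Swap entities and dates that have no bearing on the underlying answer."""
    for old, new in swaps.items():
        question = re.sub(rf"\b{re.escape(old)}\b", new, question)
    return question

def memorisation_gap(questions: list[str], answers: list[str], query_model) -> float:
    """Accuracy on the originals minus accuracy on the perturbed copies."""
    orig = sum(query_model(q).strip() == a for q, a in zip(questions, answers))
    pert = sum(query_model(perturb(q)).strip() == a for q, a in zip(questions, answers))
    return (orig - pert) / len(questions)
```

A model that actually reasons scores the same either way; a lottery tumbler with weighted balls does not.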
Who wants to be a bench-illionaire?
Here's the scorecard for MMLU:
| Model | MMLU Score (Most Recent) |
|---|---|
| GPT-4.1 | 90.2% |
| GPT-4o | 85.7% |
| Gemini Ultra | 90.0% |
| Claude 3.5 Sonnet | 88.7% |
Impressive? Not when you realise these models still can't perform basic arithmetic with unusual number formats or work out that "all roses are flowers" implies "some flowers are roses".
Research shows that even the best large language models, despite scoring above 85% on reasoning benchmarks, successfully complete fewer than 30% of expert-level, real-world tasks. Not a 60% success rate. Not even 40%. It's like claiming you're fluent in French because you memorised a phrasebook, then walking into a patisserie, trotting out “Je voudrais un crois-sans, s’il vous plaît”, and not realising you've just asked for an atheist pastry.
Everyone's a winner baby, that's no lie
Top models now score close to or just above 90% on MMLU. GSM8K and HellaSwag scores hit the high 80s to low 90s—though despite what the hype machine wants you to believe, 95%+ across all three benchmarks isn't typical. The tests are so thoroughly gamed that meaningful comparison is impossible. Benchmark creators respond by inventing ever more obscure challenges, while developers focus on gaming the new metrics.
Microsoft researchers recently demonstrated that rephrasing benchmark questions typically drops model performance by 10–20% for strong models. While the most dramatic drops can reach 35%, they're rare and context-dependent. Still, that's not a sign of true intelligence or sentience; it's more an indication that a very expensive y = mx + b gets confused by synonyms.
The Benchmark-Industrial Complex
Labs now employ entire teams dedicated to "benchmark optimisation" - a euphemism for gaming the system.
Consider this timeline:
- 2019: HellaSwag released as "impossible for current models"
- 2023: Multiple models achieve >95% accuracy
- 2024: Researchers discover training sets contain paraphrased HellaSwag examples
- 2025: Everyone pretends to be shocked when models come out that can ace it
Stanford's recent analysis found that benchmark performance improvements correlate more strongly with "optimisation effort" than with actual capability gains. To put it simply - models are getting better at cheating, not thinking.
Human Evaluation Theatre
See also: The Overfittening
"But wait," cry the labs, "we also use human evaluation!"
Anthropic's Constitutional AI relies on human feedback that's about as representative of real users as a Silicon Valley hiring panel is of diversity. OpenAI's reinforcement learning from human feedback (RLHF) optimises for sounding smart to contractors whose working conditions remain conveniently undisclosed in most papers.
The result is models that excel at performative intelligence while failing at tasks any competent intern could handle.
Real solutions exist, but implementing them would require admitting there's been a problem all along:
Dynamic benchmarking: Generate new test questions on the fly, so there's nothing verbatim to memorise (there's a minimal sketch after this list). Allen AI's recent work shows this drops top model scores by 4–13 percentage points. Not as dramatic as you'd hope, but enough to expose the emperor's new clothes.
Adversarial evaluation: Let actual users try to break models. Anthropic's red-teaming exercises consistently turn up failure modes their benchmarks completely missed.
Task completion metrics: Measure whether models can actually do useful, meaningful work, end to end. This feeds straight back into the dynamic generation of benchmark data above.
Contamination detection: Tools like Stanford's data provenance tracker and Microsoft's MMLU-CF could expose training data overlap, if anyone actually wanted to use them.
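As promised, here's a minimal sketch of what dynamic benchmarking can look like: templated, GSM8K-flavoured questions with freshly sampled numbers, so the test set is new every time it's run. The template, names, and answer formula are illustrative assumptions, not Allen AI's actual generator:

```python
# Minimal sketch of dynamic benchmark generation: same skill, fresh surface form every run.
# The template and names are illustrative; this is not Allen AI's actual generator.
import random

NAMES = ["Priya", "Tomasz", "Aisha", "Kenji"]

def make_question(rng: random.Random) -> tuple[str, int]:
    """GSM8K-flavoured word problem with sampled numbers and a computed gold answer."""
    name = rng.choice(NAMES)
    sold_first = 2 * rng.randint(10, 30)   # even, so "half as many" stays a whole number
    sold_second = sold_first // 2
    question = (
        f"{name} sold {sold_first} clips in the first month and half as many in the second. "
        "How many clips were sold in total?"
    )
    return question, sold_first + sold_second

rng = random.Random()  # deliberately unseeded: every evaluation run gets a fresh test set
eval_set = [make_question(rng) for _ in range(500)]
# Score a model against `eval_set`; nothing here can have leaked into last year's training crawl.
```

The obvious objection is that templates only cover narrow skills, which is exactly why the adversarial and task-completion approaches above still matter.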
The Uncomfortable Truth
The fact of the matter is that we've created a multibillion-dollar industry optimising for the wrong target. It's Goodhart's Law incarnate: when a measure becomes a target, it ceases to be a good measure.
The latest results suggest that genuine capability improvements have slowed dramatically while benchmark scores continue climbing. At this point we're burning tens of millions (a conservative guess) effectively building benchmark-passing machines that occasionally LARP as useful tools.
The real takeaway here for me is that benchmarks are made up and the scores don't matter. You should judge LLMs entirely off vibes.
I would like to extend another huge thank you to Richard Nichol for providing editorial advice on this post.