TL;DR
AI benchmark scores have become almost worthless, since rampant test set contamination and industry-wide gaming mean we’re just building machines that memorise answers and optimise for flashy numbers—while real intelligence and practical usefulness are left behind.
Remember when we thought teaching to the test was just a problem in secondary schools?
Turns out the AI industry looked at standardised testing's worst practices and said "hold my mountains of VC cash."
The Great Contamination Cover-Up
Benchmark contamination is the industry's worst-kept secret. Between 3.6% and 20.8% of HumanEval coding benchmark solutions are likely present in the training data of major models. When tested on decontaminated variants, GPT-3.5 and GPT-4o drop by 4.75–6.76 percentage points, while Claude models plummet by up to 11.27 points. GSM8K? Models score up to 13% better on the original problems than on similar, freshly-written ones. Now is probably a good time to remind you that LLMs, as they exist today, are effectively lottery tumblers with weighted balls. Even synthetic data generated by these models contains benchmark answers, creating a sort of AI ouroboros in which newer models, trained on ever less human-supervised data and ever more synthetic data, are effectively eating their ancestors' vomit.
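To make "contamination" concrete, here's a minimal sketch of the kind of n-gram overlap check a lab could run before training. The toy corpus, the crude whitespace tokeniser, and the 13-gram default are all illustrative assumptions, not anyone's actual pipeline:

```python
# Minimal sketch: flag benchmark items whose word n-grams appear verbatim in training text.
# The tokeniser and the 13-gram default are illustrative assumptions, not any lab's real pipeline.
from typing import Iterable

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; a real pipeline would use a proper tokeniser."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: Iterable[str], n: int = 13) -> bool:
    """True if any n-gram from the benchmark item appears verbatim in a training document."""
    item_grams = ngrams(benchmark_item, n)
    return bool(item_grams) and any(item_grams & ngrams(doc, n) for doc in training_docs)

# Example: a GSM8K-style question that leaked into a scraped README.
docs = ["natalia sold clips to 48 of her friends in april and then half as many in may"]
question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May."
print(is_contaminated(question, docs, n=8))  # True -> hold this item out or rewrite it
```

None of this is hard. The problem is that nobody has much incentive to run it before publishing a leaderboard number.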
The contamination runs deep. Analysis found that GPT-4's suspicious familiarity with MMLU questions correlates directly with how often those exact questions appear on GitHub.
There's a massive financial incentive for everyone to cheat, so that's exactly what happens. With billions in market share riding on being the hot new thing, why wouldn't you? It's laughably easy to swap out a few names or tweak a prompt, and voilà - models that should ace the test suddenly bomb it.
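It really is that easy. Below is a hedged sketch of a name-swap perturbation test for memorisation; `query_model` is a hypothetical stand-in for whatever model API you're poking at, and the swap table is purely illustrative:

```python
# Minimal sketch of a name-swap perturbation test for memorisation.
# `query_model` is a hypothetical callable (prompt -> answer string), not any vendor's real client.
import re

SWAPS = {"Natalia": "Priya", "April": "March", "May": "June"}  # surface swaps that leave the maths untouched

def perturb(question: str, swaps: dict[str, str] = SWAPS) -> str:
    """Swap entities and dates that have no bearing on the underlying answer."""
    for old, new in swaps.items():
        question = re.sub(rf"\b{re.escape(old)}\b", new, question)
    return question

def memorisation_gap(questions: list[str], answers: list[str], query_model) -> float:
    """Accuracy on the originals minus accuracy on the perturbed copies."""
    orig = sum(query_model(q).strip() == a for q, a in zip(questions, answers))
    pert = sum(query_model(perturb(q)).strip() == a for q, a in zip(questions, answers))
    return (orig - pert) / len(questions)
```

A model that actually reasons scores the same either way; a lottery tumbler with weighted balls does not.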
Who wants to be a bench-illionaire?
Here's the scorecard for MMLU:
| Model | MMLU Score (Most Recent) |
|---|---|
| GPT-4.1 | 90.2% |
| GPT-4o | 85.7% |
| Gemini Ultra | 90.0% |
| Claude 3.5 Sonnet | 88.7% |
Impressive? Not when you realise these models still can't perform basic arithmetic with unusual number formats or work out that "all roses are flowers" implies "some flowers are roses".
Research shows that even the best large language models, despite scoring above 85% on reasoning benchmarks, successfully complete fewer than 30% of expert-level, real-world tasks. Not a 60% success rate. Not even 40%. It's like claiming you're fluent in French because you memorised a phrasebook, then walking into a patisserie, trotting out “Je voudrais un crois-sans, s’il vous plaît”, and not realising you've just asked for an atheist pastry.
Everyone's a winner baby, that's no lie
Top models now score close to or just above 90% on MMLU. GSM8K and HellaSwag scores hit the high 80s to low 90s—though despite what the hype machine wants you to believe, 95%+ across all three benchmarks isn't typical. The tests are so thoroughly gamed that meaningful comparison is impossible. Benchmark creators respond by inventing ever more obscure challenges, while developers focus on gaming the new metrics.
Microsoft researchers recently demonstrated that rephrasing benchmark questions typically drops model performance by 10–20% for strong models. While the most dramatic drops can reach 35%, they're rare and context-dependent. Still, that's not a sign of true intelligence or sentience; it's more an indication that a very expensive y = mx + b gets confused by synonyms.
The Benchmark-Industrial Complex
Labs now employ entire teams dedicated to "benchmark optimisation" - a euphemism for gaming the system.
Consider this timeline:
- 2019: HellaSwag released as "impossible for current models"
- 2023: Multiple models achieve >95% accuracy
- 2024: Researchers discover training sets contain paraphrased HellaSwag examples
- 2025: Everyone pretends to be shocked when models come out that can ace it
Stanford's recent analysis found that benchmark performance improvements correlate more strongly with "optimisation effort" than with actual capability gains. To put it simply - models are getting better at cheating, not thinking.
Human Evaluation Theatre
See also: The Overfittening
"But wait," cry the labs, "we also use human evaluation!"
Anthropic's Constitutional AI relies on human feedback that's about as representative of real users as a Silicon Valley hiring panel is of diversity. OpenAI's reinforcement learning from human feedback (RLHF) optimises for sounding smart to contractors whose working conditions remain conveniently undisclosed in most papers.
The result is models that excel at performative intelligence while failing at tasks any competent intern could handle.
Real solutions exist, but implementing them would require admitting there's been a problem all along:
Dynamic benchmarking: Generate new test questions on the fly, so there's nothing verbatim to memorise (there's a minimal sketch after this list). Allen AI's recent work shows this drops top model scores by 4–13 percentage points. Not as dramatic as you'd hope, but enough to expose the emperor's new clothes.
Adversarial evaluation: Let actual users try to break models. Anthropic's red-teaming exercises consistently turn up failure modes their benchmarks completely missed.
Task completion metrics: Measure whether models can actually do useful, meaningful work, end to end. This feeds straight back into the dynamic generation of benchmark data above.
Contamination detection: Tools like Stanford's data provenance tracker and Microsoft's MMLU-CF could expose training data overlap, if anyone actually wanted to use them.
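As promised, here's a minimal sketch of what dynamic benchmarking can look like: templated, GSM8K-flavoured questions with freshly sampled numbers, so the test set is new every time it's run. The template, names, and answer formula are illustrative assumptions, not Allen AI's actual generator:

```python
# Minimal sketch of dynamic benchmark generation: same skill, fresh surface form every run.
# The template and names are illustrative; this is not Allen AI's actual generator.
import random

NAMES = ["Priya", "Tomasz", "Aisha", "Kenji"]

def make_question(rng: random.Random) -> tuple[str, int]:
    """GSM8K-flavoured word problem with sampled numbers and a computed gold answer."""
    name = rng.choice(NAMES)
    sold_first = 2 * rng.randint(10, 30)   # even, so "half as many" stays a whole number
    sold_second = sold_first // 2
    question = (
        f"{name} sold {sold_first} clips in the first month and half as many in the second. "
        "How many clips were sold in total?"
    )
    return question, sold_first + sold_second

rng = random.Random()  # deliberately unseeded: every evaluation run gets a fresh test set
eval_set = [make_question(rng) for _ in range(500)]
# Score a model against `eval_set`; nothing here can have leaked into last year's training crawl.
```

The obvious objection is that templates only cover narrow skills, which is exactly why the adversarial and task-completion approaches above still matter.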
The Uncomfortable Truth
The fact of the matter is that we've created a multibillion-dollar industry optimising for the wrong target. It's Goodhart's Law incarnate: when a measure becomes a target, it ceases to be a good measure.
The latest results suggest that genuine capability improvements have slowed dramatically while benchmark scores continue climbing. At this point we're burning tens of millions (a conservative guess) effectively building benchmark-passing machines that occasionally LARP as useful tools.
The real takeaway here for me is that benchmarks are made up and the scores don't matter. You should judge LLMs entirely off vibes.
I would like to extend another huge thank you to Richard Nichol for providing editorial advice on this post.