The Context Window Conspiracy

By Ben Ranford

September 24, 2025

Million-token models and the mathematics of make-believe

TL;DR

Context window announcements are mostly marketing theatre. The mathematics of transformer attention make genuine long‑context reasoning prohibitively expensive, so companies use computational shortcuts that break down under real-world use. We're optimising the wrong metrics instead of admitting we need fundamentally different architectures.


I've been tracking context window announcements like a degenerate gambler watching horse racing odds. OpenAI says 128K. Anthropic counters with 200K. Google raises to a million.

Soon enough, someone will claim infinity tokens and we'll all pretend that means something.

I fed GPT-5 roughly 150K tokens of awful code directly into context and asked it to identify duplicated work. It found exactly one instance. There were 30 (which I know because I wrote them). When I pointed this out, the model apologised and hallucinated another four that didn't exist. This is what passes for "long-context reasoning" in the big '25.


The Quadratic Problem

The mathematics tell a story nobody wants to hear.

Transformer attention, the beating heart of every well-known foundation model, scales quadratically with sequence length. For the non-mathematicians: doubling your context doesn't double your compute. It quadruples it.

Scaling from 32K to 2 million tokens? The maths is brutal:

S = \left(\frac{n_{\text{new}}}{n_{\text{old}}}\right)^2 = \left(\frac{2{,}000{,}000}{32{,}000}\right)^2 \approx 3{,}906

Processing a genuinely full 2-million-token context would consume roughly the same energy as running a small office space heater for nearly 24 hours (34.2 kWh)[4]. Per query.

E_{\text{attention}} = E_{\text{base}} \times \alpha \times S = 0.0056 \times 3{,}906 \approx 21.9\text{ kWh}

The non-attention components (feed-forward networks, embeddings) scale roughly linearly with tokens; attention overtakes them as context grows, but at 2 million tokens they still contribute another ~12.3 kWh. That brings the total to E = 34.2 kWh.
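
For transparency, here's that arithmetic as a runnable back-of-envelope sketch. The constants are the assumptions behind the figures above (an effective 0.0056 kWh attention baseline at 32K, and ~12.3 kWh for the roughly linear layers at 2M); nothing here is a measurement of any real deployment.

```python
# Back-of-envelope reproduction of the figures above. The constants are the
# article's assumptions, not measurements of any real deployment.

N_OLD = 32_000          # baseline context length (tokens)
N_NEW = 2_000_000       # advertised context length (tokens)

E_BASE_ATTENTION = 0.0056   # assumed attention energy at the 32K baseline (E_base * alpha), kWh
E_LINEAR_AT_2M = 12.3       # assumed energy of ~linear layers (FFN, embeddings) at 2M tokens, kWh

# Attention cost scales with the square of sequence length.
scaling_factor = (N_NEW / N_OLD) ** 2            # ~3,906

e_attention = E_BASE_ATTENTION * scaling_factor  # ~21.9 kWh
e_total = e_attention + E_LINEAR_AT_2M           # ~34.2 kWh

print(f"scaling factor:   {scaling_factor:,.0f}")
print(f"attention energy: {e_attention:.1f} kWh")
print(f"total energy:     {e_total:.1f} kWh")
```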

So they don't actually do it.


Corner-Cutting Chronicles

What happens instead is computational corner-cutting. The models employ sliding windows that examine only local token neighbourhoods, creating a myopic reader who can't see the forest because they're examining individual leaves through a magnifying glass.
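
For the curious, here's a minimal sketch of what that looks like mechanically - not any vendor's implementation, just a banded attention mask where each token can only see its nearest neighbours:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where True means 'token i may attend to token j'.

    Each token only sees the `window` tokens immediately before it (plus
    itself), so anything further back is simply invisible to attention.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i              # no peeking at the future
    local = (i - j) < window     # only the last `window` tokens
    return causal & local

mask = sliding_window_mask(seq_len=10, window=3)
# Token 9 can attend to tokens 7, 8, 9 -- everything earlier is masked out.
print(mask[9])
```

The payoff is that attention cost grows with n × w instead of n², which is exactly why vendors reach for it - and exactly why anything outside the window might as well not exist.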

Models also tend to emphasise tokens at the start and end of the prompt over the middle, thanks to position bias and recency effects[1], creating the illusion of long-context capability while much of the middle content receives reduced attention.

They use aggressive KV-caching that stores keys and values from past tokens to avoid recomputation, which works right up until the topic drifts or a pointer-style retrieval no longer matches the cached prefix - then the model confidently returns stale cached results that no longer reflect the evolving conversation state. They prune "less important" tokens based on algorithms that can't distinguish between crucial context and Lorem Ipsum filler.
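
A toy illustration of that failure mode, assuming a deliberately naive cache keyed only by position. Real inference stacks are far more careful than this, but the shape of the problem is the same: reuse cached keys and values after the underlying context has effectively changed, and attention is computed against a context that no longer exists.

```python
# Toy KV cache: keys/values are computed once per position and reused on
# every subsequent step. If an earlier position is later edited, compressed,
# or rewritten, the cache still serves the *old* representations.

cache: dict[int, tuple[str, str]] = {}   # position -> (key, value), toy stand-ins

def encode(token: str, position: int) -> tuple[str, str]:
    # Stand-in for the real projections K = x @ W_k, V = x @ W_v.
    return (f"K({token})", f"V({token})")

def step(tokens: list[str]) -> list[tuple[str, str]]:
    """Return the K/V pairs attention would see, reusing cached entries."""
    kv = []
    for pos, tok in enumerate(tokens):
        if pos not in cache:
            cache[pos] = encode(tok, pos)
        kv.append(cache[pos])            # cached entry wins, even if `tok` changed
    return kv

step(["the", "contract", "says", "yes"])
# The conversation moves on and position 3 now holds a different token...
stale = step(["the", "contract", "says", "no"])
print(stale[3])   # ('K(yes)', 'V(yes)') -- attention still sees the old token
```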

Google's own documentation acknowledges the limitations, stating that while Gemini achieves "high performance across various needle-in-a-haystack retrieval evals" with single needles, "in cases where you might have multiple 'needles' or specific pieces of information you are looking for, the model does not perform with the same accuracy"[3]. Research shows effective context length drops significantly across all models[6], with performance degradation being severe as context windows expand[7]. But the marketing department didn't get that memo, apparently, because here we are with promises of 2 million tokens.


The Turd in a Haystack Delusion

The benchmark frequently waved around as proof is the Needle in a Haystack test, and calling it inadequate would be generous[3]. The test hides a single fact in a massive document, then asks the model to retrieve it. That's the entire evaluation. Finding one sentence. It's the computational equivalent of proving you can read by successfully playing Where's Waldo.

When researchers tested these same models on tasks requiring actual reasoning across long‑contexts - synthesising information from multiple sources, tracking entity states through narratives, understanding how different sections of legal documents interact - performance collapsed entirely[5]. Models that achieved near-perfect Needle scores suddenly couldn't determine whether contract clause 47 contradicted clause 3[5].

The Needle test evaluates exactly one thing: can the model perform an expensive Ctrl+F?

Congratulations, you've built the world's most power-hungry ripgrep alternative. What it doesn't test is whether the model can think about what it finds, connect disparate pieces of information, or maintain coherent understanding across that massive context and multiple turns.

You know what else passes the Needle in a Haystack test? A Python script with a regex matcher.
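
To make the jab concrete, here's roughly what that script looks like. The needle format is invented for illustration, but this is the entire "capability" the benchmark certifies:

```python
import re
from typing import Optional

def needle_in_haystack(haystack: str, needle_pattern: str) -> Optional[str]:
    """Pass the single-needle benchmark with zero reasoning: just pattern-match."""
    match = re.search(needle_pattern, haystack)
    return match.group(0) if match else None

# A couple of million characters of filler with one planted fact.
haystack = (
    "lorem ipsum " * 80_000
    + "The secret ingredient in the pizza is anchovies. "
    + "dolor sit amet " * 50_000
)

print(needle_in_haystack(haystack, r"The secret ingredient in the pizza is \w+\."))
# -> "The secret ingredient in the pizza is anchovies."
```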


Conversational Catastrophe

But the real comedy emerges in multi‑turn conversations, where these supposedly capable systems reveal their true nature. Research published this year confirms what anyone who's actually used these models already knows: "All the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi‑turn conversations than single-turn"[2]. Not marginally lower. Significantly.

An extended conversation typically degrades something like this:

  • By turn three, the model starts forgetting key details
  • By turn five, it's contradicting itself
  • By turn ten, you might as well be talking to a new model entirely

The attention mechanism treats each conversational turn like geological sediment, with earlier layers getting compressed and distorted under the weight of new tokens. The model doesn't have dementia - it was architecturally designed to forget.

I tested this myself, feeding a foundation model made by a company whose name rhymes with "Schmoodle" a simple narrative about three characters over fifteen exchanges. By the end, it had forgotten one character existed and amalgamated the other two into a single entity.


The Economics of Delusion

The economics driving this farce are beautifully perverse. No company can afford to admit their context windows are essentially decorative. Share prices move on these announcements. Anthropic's valuation jumped the day they announced their 1M window. Google's million-token announcement sparked a fresh round of funding that valued them even higher.

Engineering teams know the truth. They're the ones watching quality metrics tank while their managers prepare slide decks about "breakthrough capabilities."

But admitting this would be corporate suicide.

Imagine being the CEO who stands up and says, "Actually, our context window is 100K tokens of actual utility, everything else is marketing fiction." Your stock would crater. Your competitors would claim superiority with their fake numbers. Your board would have your head on a spike by third lunch.

So the charade continues. Companies announce bigger numbers. Customers pay higher prices. Models get more expensive to run while delivering the same context window capability we had two years ago. It's a pyramid scheme where everyone knows it's a pyramid scheme but nobody can afford to admit it.


The Wrong Problem

Here's what really kills me: we're solving the wrong problem. Current transformer architectures will never efficiently handle genuine long‑context reasoning. The mathematics forbid it. The quadratic scaling is the fundamental constraint that defines these systems, not something to be optimised or hand-waved away.

We don't need bigger context windows with cleverer caching strategies. We need to admit that transformers, as currently conceived, are the wrong tool for this job. At this point we're trying to build a skyscraper with Lego blocks. You can stack them pretty high, use special adhesive, add support structures, but eventually physics wins.

The revolution won't come from incremental improvements to attention mechanisms. It'll come from someone brave enough to throw out the transformer playbook entirely and use architectures designed from first principles for long‑context reasoning.

Maybe hierarchical systems that naturally compress information. Maybe something that processes context in semantic chunks rather than token sequences. Maybe something we haven't come up with yet.
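
Purely to illustrate the first of those ideas (and not as a proposal), here's a sketch of hierarchical compression: summarise chunks cheaply, scan only the summaries, and hand the few relevant full chunks to the expensive model. The summariser and the keyword scoring here are placeholders for whatever a real system would use.

```python
# Illustrative only: a hierarchical "compress then retrieve" loop, with a
# placeholder summariser standing in for a real model.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    summary: str

def summarise(text: str) -> str:
    # Placeholder: a real system would use a model (or something smarter) here.
    return text[:80]

def build_hierarchy(document: str, chunk_size: int = 1_000) -> list[Chunk]:
    """Split the document (here: naively, by size) and summarise each piece."""
    pieces = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [Chunk(text=p, summary=summarise(p)) for p in pieces]

def answer(question: str, hierarchy: list[Chunk], budget: int = 3) -> list[str]:
    """Scan the cheap summaries first, then read only the few full chunks that matter."""
    keywords = set(question.lower().split())
    scored = sorted(
        hierarchy,
        key=lambda c: -len(keywords & set(c.summary.lower().split())),
    )
    return [c.text for c in scored[:budget]]   # only these go to the expensive model
```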

But that requires admitting we've been optimising the wrong thing. It requires telling investors that their billions went towards building very expensive toys that can't do what we promised. It requires actual innovation instead of parameter inflation.


References

[1]: "The Context Window Illusion: Why Your 128K Tokens Aren't Working." Dev.to, May 2025.

[2]: Laban, P., Hayashi, H., Zhou, Y., & Neville, J. (2025). "LLMs Get Lost In Multi-Turn Conversation." arXiv:2505.06120.

[3]: "Long context | Gemini API | Google AI for Developers." Google AI, September 2025.

[4]: "Ecotransformer: Attention without Multiplication." arXiv:2507.20096v1, July 2025.

[5]: Liu, S., Li, Z., Ma, R., Zhao, H., & Du, M. (2025). "ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Analysis in Commercial Contracts." arXiv:2508.03080.

[6]: "Effective Context Length and Block Diffusion." The Ground Truth, March 2025.

[7]: Rando, S., et al. (2025). "Evaluating Coding LLMs at 1M Context Windows." arXiv:2505.07897.

All views, opinions etc. here are my own, and do not represent those of any affiliated parties.

© 2025 Ben Ranford