#2312: How Massive Context Windows Are Reshaping AI Workflows

Exploring the real-world impact of massive context windows in AI models, from academic research to codebase analysis.

Episode Details
Episode ID
MWP-2470
Published
Duration
46:56
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Claude Sonnet 4.6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The rapid advancement of AI context windows is reshaping how we interact with large language models. With models like Gemini 3 Pro offering ten million token capacities—equivalent to roughly 7.5 million words—the possibilities for practical applications are expanding dramatically. From academic research to software development, these massive context windows enable tasks that were once unimaginable, such as loading entire theses or codebases into a single prompt.

However, the relationship between context size and performance is not straightforward. While larger windows theoretically allow for more comprehensive analysis, the computational cost of attention mechanisms scales quadratically with sequence length. This means doubling the context window quadruples the computational load, making efficiency a critical concern. Techniques like sparse attention patterns, sliding window attention, and hybrid architectures have emerged to mitigate these challenges, enabling models to process vast amounts of data without overwhelming hardware resources.

Another key consideration is the "lost in the middle" effect, where performance degrades for information located in the middle of a long context. Models tend to perform better on information at the beginning or end of a prompt, raising questions about reliability for tasks requiring synthesis across distant sections. For example, while a thesis might fit within a large context window, the model may struggle to equally weigh and retrieve critical details buried in the middle.

Practical applications of large context windows include document-grounded question answering, where models analyze entire papers or codebases to provide detailed answers. This approach preserves the connective tissue between sections, offering advantages over chunked retrieval methods. However, developers must carefully structure inputs to maximize performance, placing critical information early or late in the context to minimize degradation.

As context windows grow, workflows are evolving from scarcity-driven designs to abundance-driven ones. Prompt engineering, once focused on fitting essential information into limited contexts, is shifting toward managing signal-to-noise ratios in vast inputs. This transition opens new possibilities for enterprise applications, such as analyzing policy manuals or legal contracts, while posing fresh challenges for practitioners navigating this uncharted territory.


#2312: How Massive Context Windows Are Reshaping AI Workflows

Corn
Daniel sent us this one, and it's a question I've been sitting with for a while. He's asking about context windows in state-of-the-art models today, what sizes we're actually working with, how they're being used in practice, whether it's conversations or something like loading an entire academic thesis into a single prompt. And then the part that I think is genuinely underexplored: max output tokens. Will that metric ever catch up? Does it even matter as much as the input side? A lot to get into.
Herman
I'm Herman Poppleberry, and yes, there is a lot to get into. The thesis example Daniel throws in there is not hypothetical, by the way. That's a real use case people are running right now, and the fact that it's even possible is kind of remarkable when you think about where we were even three years ago.
Corn
By the way, today's episode is powered by Claude Sonnet four point six. Our friendly AI down the road doing the heavy lifting on the script.
Herman
Doing fine work, I have to say.
Corn
High praise from a retired pediatrician turned DJ. Okay, so let's actually set the stage here, because I think the numbers alone are worth pausing on before we get into the mechanics. Where does the landscape actually sit right now?
Herman
The headline number at the moment is Gemini three Pro, which is sitting at a ten million token context window. And I want to make sure that lands properly, because I think people hear "tokens" and their eyes glaze over. A token is roughly three quarters of a word in English, so ten million tokens is something in the neighborhood of seven and a half million words. The entire Lord of the Rings trilogy is about five hundred thousand words. You could fit fifteen of those in a single Gemini three Pro prompt and still have room left over.
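Herman's back-of-the-envelope numbers here can be reproduced directly. This sketch uses his rough heuristic of three quarters of an English word per token, which is a common rule of thumb rather than an exact tokenizer property:

```python
# Rough heuristic from the episode: ~0.75 English words per token.
# This is an approximation; actual tokenizers vary by text and language.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int) -> int:
    """Approximate word count that fits in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN)

context_tokens = 10_000_000   # headline Gemini 3 Pro figure from the episode
words = tokens_to_words(context_tokens)
lotr_words = 500_000          # rough length of the Lord of the Rings trilogy

print(words)               # 7500000
print(words // lotr_words) # 15 trilogies, with room left over
```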
Corn
Which raises the obvious question of who is actually doing that, but I take your point.
Herman
Right, the use cases are real, they're just not "load fifteen fantasy novels." But the scale matters because it sets the ceiling on what's even theoretically possible. And then you have Claude Opus four point seven at one million tokens, which is still extraordinary. GPT-5 and the variants in that series are sitting around four hundred thousand tokens. And then the open source tier, things like GLM five point one and Gemma four medium, are coming in around two hundred and fifty-six thousand tokens, with some of those extending to a million or beyond through hybrid attention mechanisms.
Corn
Even the "smaller" open source models are running contexts that would have been considered science fiction not long ago.
Herman
A hundred and twenty-eight thousand tokens is now the floor for edge deployment models. Gemma four small, Mistral NeMo, that tier. The models you run on a phone or a local machine. That's the floor.
Corn
This is where I want to push a little, because there's a version of this conversation where we just recite numbers and everyone feels impressed and goes home. But the more interesting question to me is: what does the size of a context window actually change about how you can use a model? Because I think there's a common assumption that bigger is just better, and I don't think that's quite right.
Herman
It's definitely not right, and this is one of the places where the coverage tends to oversimplify. The framing is usually "model X now has Y million tokens, therefore it can do more things." And that's true in a narrow sense. But the relationship between context length and performance is not linear, and in some cases it actively works against you.
Corn
Say more about that.
Herman
The core mechanism here is attention. Transformers, which is the architecture underlying basically every major model we're talking about, process context through something called self-attention, where every token in the sequence attends to every other token. The computational cost of that operation scales quadratically with sequence length. Double the context, quadruple the compute. So going from one hundred and twenty-eight thousand tokens to a million tokens isn't eight times harder, it's more like sixty-four times harder in raw attention terms.
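The scaling arithmetic Herman describes can be sketched in a few lines. This models only the naive quadratic cost of full self-attention, ignoring the sparse and hybrid optimizations real systems use; it is the intuition, not a performance model:

```python
# Naive self-attention cost scales with the square of sequence length.
# relative_attention_cost(n, base) answers: "how many times more expensive
# is a context of n tokens than one of base tokens?"
def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    return (n_tokens / baseline_tokens) ** 2

# Double the context, quadruple the compute:
print(relative_attention_cost(256_000, 128_000))    # 4.0

# 128k -> 1M is ~7.8x the length, but ~61x the raw attention cost:
print(relative_attention_cost(1_000_000, 128_000))
```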
Corn
Which is why nobody was doing this three years ago.
Herman
Which is exactly why. The hardware wasn't there, the architectural tricks weren't mature enough. What's changed is a combination of things: sparse attention patterns, sliding window attention, hybrid architectures that mix full attention with more efficient local attention, and just raw hardware improvements. Gemma four's ability to hit a million tokens on some configurations is partly because it's using a hybrid approach where not every layer is doing full global attention.
Corn
The models are cheating a little, in a useful way.
Herman
I'd say they're being smart about where they spend their compute. Full attention everywhere is expensive and also, it turns out, not always necessary. The information that matters for answering a given question is usually locally structured. You need to understand the paragraph you're in, you need to grab relevant sections from elsewhere in the document, but you don't need every token constantly aware of every other token.
Corn
This connects to something I've been thinking about, which is that the size of the window and the quality of retrieval within that window are two different things. You can have a ten million token window and still have the model systematically underweight information that's buried in the middle of a long document.
Herman
The "lost in the middle" effect. There's actually been quite a bit of work on this. The pattern is that models tend to perform well on information at the beginning of a prompt and at the end, and performance degrades for material in the middle. It's not uniform retrieval across the whole context. So if you load a two-hundred-page document and the critical fact is on page one hundred and twelve, you're not guaranteed the model will surface it correctly even if the document technically fits in the window.
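The effect Herman describes is usually measured with a needle-in-a-haystack style probe: plant a known fact at different relative depths in filler text and check recall at each position. A minimal sketch of that harness, where `ask_model` is a hypothetical stand-in for whatever model API you use:

```python
# Sketch of a "lost in the middle" probe. `ask_model` is a hypothetical
# callable (prompt -> answer string); everything else is plain Python.
def build_probe(needle: str, filler_sentences: list[str], depth: float) -> str:
    """Insert `needle` at a relative position (0.0 = start, 1.0 = end)."""
    idx = int(len(filler_sentences) * depth)
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)

def run_probe(ask_model, needle, question, filler, depths=(0.0, 0.5, 1.0)):
    results = {}
    for depth in depths:
        context = build_probe(needle, filler, depth)
        answer = ask_model(context + "\n\n" + question)
        # Crude recall check: did the answer surface the planted fact?
        results[depth] = needle.split()[-1] in answer
    return results
```

In published results, recall tends to be high at depths near 0.0 and 1.0 and dips in the middle, which is exactly the page-112 problem Herman raises.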
Corn
Which makes the thesis use case more complicated than it sounds. Yes, you can fit a thesis into a Gemini three Pro prompt. Whether the model reliably reasons over every section of it with equal quality is a different question.
Herman
That's where I think the honest answer is: it depends on what you're asking. If you're asking for a high-level summary, large context handles that well. If you're asking a specific question that requires synthesizing a detail from chapter four with a claim in chapter seven, you're more exposed to context rot. There's actually a paper from earlier this year, a decomposition perspective on long-context reasoning, that breaks down where exactly the performance degrades and why. The short version is that models are much better at retrieval tasks, finding a specific thing you told them, than at multi-hop reasoning tasks that require chaining together information from multiple distant parts of a long context.
Corn
The capability is real, but it's not magic. You're not just feeding the model a thousand pages and getting back perfect synthesis.
Herman
And I think the practical implication for developers and for anyone designing workflows around these models is that you should think carefully about how you structure what goes into the context. Putting the most important information early or late rather than in the middle. Breaking up long documents into chunks and asking targeted questions rather than trying to do everything in one pass. The window size tells you the ceiling, not the floor of what you'll actually get.
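Herman's positioning advice translates directly into how you assemble a prompt: put instructions and critical material at the edges of the context, and push lower-priority background toward the middle. A sketch, with section labels that are illustrative rather than any prescribed format:

```python
# Assemble a long-context prompt so the highest-value material sits at the
# beginning and end, where recall is strongest. The section headers are
# arbitrary; the ordering is the point.
def assemble_context(instructions: str,
                     critical_docs: list[str],
                     background_docs: list[str],
                     question: str) -> str:
    parts = [
        "## Instructions",
        instructions,
        "## Key documents",        # early position: high recall
        *critical_docs,
        "## Background material",  # middle: most at risk of being underweighted
        *background_docs,
        "## Question",             # late position: high recall
        question,
    ]
    return "\n\n".join(parts)
```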
Corn
Let's talk about the thesis case more concretely, because I think it's a good anchor for the real-world applications side of this. What does it actually look like to use a large context model for something like academic research?
Herman
The most common pattern I've seen described is what you might call document-grounded question answering. You load the full text of a paper, or multiple papers, into the context, and then you ask the model questions that require it to draw on that material. "What are the main limitations the authors acknowledge?" "How does the methodology in section three address the critique raised in section two?" That kind of thing. The advantage over retrieval-augmented generation, where you chunk the document and retrieve relevant pieces, is that you don't lose the connective tissue. The model can see the whole argument at once.
Corn
There's something appealing about that. The problem with chunked retrieval is always that the interesting stuff often lives in the relationships between sections, not in any single section.
Herman
A thesis argument is cumulative. The conclusion only makes sense in light of the methodology, which only makes sense in light of the literature review. If you're chunking and retrieving, you're constantly at risk of pulling a piece out of context in a way that misrepresents it. Full-document context, in principle, preserves that.
With the caveat we just discussed about middle-context degradation. But even with that caveat, for many tasks, full-document context outperforms chunked retrieval in practice. The MindStudio write-up on Claude's one million token window made this point pretty clearly for code-related work. When you're working with a large codebase, having the whole thing in context rather than retrieving relevant files means the model can catch interactions between modules that retrieval would miss.
Corn
The code case is interesting because a codebase is exactly the kind of thing where you don't know in advance which pieces are going to be relevant. A bug in one function might be caused by an assumption made somewhere else entirely.
Herman
That's why the one million token window was such a big deal for software development workflows specifically. You can load an entire medium-sized codebase, all the files, all the dependencies, and ask the model to reason about it holistically. Before that was possible, you were always doing a kind of archaeological dig, pulling up files you thought were relevant and hoping you'd found the right ones.
Corn
What about the instructional workflow case? Enterprise settings, that kind of thing.
Herman
This is where I think the knock-on effects start to get really interesting. The obvious application is loading long documents, policy manuals, legal contracts, technical specifications. But the more interesting shift is in how workflows get designed. When you have a hundred and twenty-eight thousand tokens, you're still making choices about what to include and what to leave out. When you have a million tokens, those choices largely go away. You just include everything.
Corn
That changes the workflow architecture pretty fundamentally.
Herman
The whole field of prompt engineering was partly born out of context scarcity. How do you get the most relevant information into a limited window? With very large contexts, some of those techniques become less necessary. You don't need to write elaborate retrieval pipelines if you can just... put it all in.
Corn
Although I'd push back slightly on the idea that context abundance makes prompt engineering obsolete. The structure of what you put in still matters, even if the size constraint is relaxed.
Herman
Structure still matters. Clarity of instruction still matters. The format you use to present information to the model still affects how well it reasons over it. What changes is the bottleneck. Before, the bottleneck was often "I can't fit everything I need." Now it's more often "I'm including too much noise along with the signal."
Corn
Which is a more interesting problem, in a way. It's not a hardware problem, it's a design problem.
Herman
It's a problem that I think practitioners are still figuring out. The instinct when you have a million token window is to stuff it. Load every relevant document, every piece of context, every prior conversation turn. But there's evidence that performance actually degrades when you fill the context with loosely relevant material. The model has to work harder to find what matters, and the attention is more diffuse.
Corn
The optimal strategy is probably somewhere between "include only what you absolutely need" and "include everything you have."
Herman
Curated comprehensiveness, if that's not too much of an oxymoron. Include everything that's relevant, but be disciplined about what counts as relevant. Which sounds obvious but is actually a real skill in workflow design.
Corn
Let me bring in the comparison that I think is useful here, just to ground the historical arc. GPT-4, when it launched, had an eight thousand one hundred and ninety-two token context window. That was the standard model. There was a thirty-two thousand token variant that came later. And people were doing extraordinary things with eight thousand tokens because that was what you had.
Herman
The jump from eight thousand to where we are now happened in roughly two years. GPT-5 is at four hundred thousand tokens. Gemini three Pro is at ten million. That's a factor of over a thousand in about two years. It's one of the faster scaling curves in the history of this technology.
Corn
Which raises the question of whether it continues. Because there's a version of this where we're at ten million tokens and the next step is a hundred million, and then a billion. And there's another version where ten million is roughly the practical ceiling for a while because of hardware constraints.
Herman
I think the honest answer is that ten million tokens is already pushing against what current hardware can do efficiently. The compute requirements for full attention at ten million tokens are staggering, which is why Gemini three Pro is using architectural tricks to make it work. The path to a hundred million tokens probably requires either fundamentally different architectures or hardware advances, specifically around unified memory, where you can hold enormous context in fast-access memory rather than shuttling it back and forth.
Corn
This is where I think the max output tokens question becomes relevant, because it's been almost entirely absent from this conversation so far, and from most of the coverage.
Herman
It really has. And I find that strange because it's a real bottleneck for a lot of workflows. You can have a ten million token input window, but if the model can only generate, say, eight thousand tokens of output, you've created a very lopsided system. You can read a thesis, but you can only write a short response.
Corn
What do the current numbers actually look like on the output side?
Herman
This is where the data gets a bit murkier, because model providers are less forthcoming about max output tokens than they are about context window sizes. The context window is a marketing number, it's a headline. Max output tokens is more of a technical footnote. But the general picture is that most models are generating somewhere between four thousand and thirty-two thousand tokens of output maximum. Some newer models have pushed toward sixty-four thousand. But it's nowhere close to proportional with the input side.
Corn
The asymmetry is enormous.
Herman
You have a model that can read ten million tokens but can only write, at best, a few tens of thousands. If you're asking it to produce a long document, a detailed analysis, a full codebase, you hit the output ceiling long before you hit the input ceiling.
Corn
The obvious workaround is to ask for the output in multiple passes, but that introduces its own problems.
Herman
Coherence across passes is hard. The model doesn't carry a running state in the way a human writer does. Each pass has to re-establish context, and you can lose the thread. The seams show.
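The multi-pass workaround Corn mentions usually takes the shape of a loop that carries a running summary between generation passes, so each pass can re-establish context without re-reading everything verbatim. A sketch, where `generate` and `summarize` are hypothetical stand-ins for model calls; the loop structure is what matters:

```python
# Multi-pass long-form generation with a carried summary. `generate` and
# `summarize` are hypothetical (prompt -> text) callables standing in for
# model API calls; each `generate` pass is bounded by max output tokens.
def write_long_document(section_briefs: list[str], generate, summarize) -> str:
    written = []
    running_summary = ""
    for brief in section_briefs:
        prompt = (
            f"Summary of what has been written so far:\n{running_summary}\n\n"
            f"Write the next section: {brief}"
        )
        written.append(generate(prompt))
        # Refresh the summary so the next pass can pick up the thread.
        running_summary = summarize("\n".join(written))
    return "\n\n".join(written)
```

The seams Herman warns about show up exactly where `summarize` loses detail, which is why this pattern trades coherence for length rather than solving the output ceiling.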
Corn
Why has the output side lagged so badly? Is it a different kind of problem technically?
Herman
Generation is autoregressive, meaning each token is produced one at a time, conditioned on everything that came before. That's just slow, and it gets slower as the output gets longer because the context of the generation itself grows. There's also a quality argument: the longer a model generates, the more likely it is to drift, to become repetitive, to lose coherence. So there's a soft ceiling imposed by quality degradation that's somewhat separate from the hard ceiling imposed by the context window.
Corn
Which suggests that just lifting the max output token limit wouldn't actually solve the problem, because the model would start producing garbage before it hit the new limit anyway.
Herman
For current architectures, probably yes. The quality of very long outputs is an active research problem. There's work on training models to maintain coherence over longer generation spans, but it's not solved. The input side benefited from architectural innovations like sparse attention that made long contexts computationally feasible without sacrificing too much quality. The output side needs something analogous, and I'm not sure we have it yet.
Corn
The near-term expectation is probably that context windows continue to grow, possibly modestly, and max output tokens grows more slowly, and the gap between them persists for a while.
Herman
That's my read. Though I'd be cautious about predicting timelines in this space, because the pace of change has consistently surprised people. A year ago I would have said ten million token context windows were at least two to three years out. And here we are.
Corn
The thing that strikes me about all of this is that the practical implications are still being worked out. The models have gotten ahead of the workflows. People have access to a million token contexts and they're still mostly using them the way they used one hundred and twenty-eight thousand token contexts, because the mental models and the tooling haven't caught up.
Herman
That's a really important point. The tooling for actually using large context well is immature. Most interfaces still present the context window as a thing you fill up progressively during a conversation, rather than something you architect deliberately at the start. The paradigm is still "chat interface" even when the underlying capability is "load your entire knowledge base and reason over it."
Corn
Which is a product design problem as much as a technical one.
Herman
A user education problem. Most people don't know what a token is, let alone how to think about structuring a million of them. The gap between what the models can do and what users know how to ask them to do is probably the biggest practical bottleneck right now, more than the context window size itself.
Corn
I want to come back to something you mentioned earlier, the hybrid attention architectures, because I think there's a misconception worth addressing. The assumption seems to be that if a model advertises a one million token context window, it's doing the same quality of processing across all one million tokens. And that's not quite right.
Herman
It's not right at all, and it matters a lot for how you interpret the capability. Hybrid architectures typically have some layers doing full global attention and other layers doing more local or sparse attention. The full attention layers are where the real integration happens, where the model can connect distant pieces of information. The local attention layers are cheaper and handle more fine-grained sequential processing. The ratio and the arrangement of these layers affects how well the model actually uses long context versus just technically accepting it as input.
Corn
Two models can both claim a one million token context window and have very different actual performance on tasks that require reasoning across the full document.
Herman
And the benchmarks don't always capture this cleanly. The standard long-context benchmarks tend to test retrieval ("find this specific fact that I put in the document") rather than multi-hop reasoning ("connect these three facts from different parts of the document to answer this question"). Models that are good at retrieval but weak at integration can score well on the retrieval benchmarks and still fail on real-world tasks that require the harder kind of reasoning.
Corn
Which is why the Gemini three Pro score on GPQA Diamond, ninety-four point three percent as of February, is interesting but not the whole story for long-context applications.
Herman
GPQA Diamond is a reasoning benchmark, not specifically a long-context benchmark. High performance there tells you the model is a good reasoner. How that reasoning holds up when the relevant information is spread across a million tokens of input is a separate question that requires separate evaluation.
Corn
Most people deploying these models in enterprise settings are not doing that separate evaluation. They're taking the headline numbers and assuming.
Herman
Which is how you end up with workflows that technically use a large context window but don't actually perform better than a smaller context model with well-structured retrieval. The context window size is a necessary condition for certain applications, not a sufficient condition for good performance on those applications.
Corn
Okay, so let's start pulling some of this together into something actionable, because I think there are real things people can do with this understanding. What's the first thing you'd tell someone who's building a workflow around one of these large context models?
Herman
Think about information density before you think about information completeness. The instinct is to include everything because you can. The better instinct is to ask: what does the model actually need to see to do this task well? Include that, in a clear structure, with the most important material positioned deliberately. Don't treat the context window as a dump.
Corn
The structure point is worth emphasizing. Headers, clear section delineation, explicit pointers to what's important. The model processes text sequentially at some level, and helping it navigate a long document is not busywork.
Herman
The second thing I'd say is: test your specific use case, don't rely on benchmark performance. If you're building a legal document analysis workflow, find a representative set of legal documents, construct questions that reflect what you actually need the model to do, and evaluate it. The headline context window size and the general benchmark numbers are starting points, not guarantees.
Corn
On the output side?
Herman
Plan for the ceiling. If your workflow requires generating long outputs, think in advance about how you're going to handle the max output token limit. Can you structure the task so that each generation pass is self-contained? Can you build coherence checks between passes? Don't discover the ceiling in production.
Corn
There's also a monitoring point here. This space is moving fast enough that a model that was the right choice six months ago might not be the right choice now. Not because it got worse, but because something better came out.
Herman
The open source side especially. GLM five point one and Gemma four medium at two hundred and fifty-six thousand tokens are capable models that weren't available a year ago. If you're in a context where you can't use proprietary models for compliance or cost reasons, the open source options are much stronger than they were.
Corn
The hybrid attention innovation is happening in open source too, so the gap between open and proprietary on long-context tasks is narrowing.
Herman
Narrowing, though not closed. The proprietary frontier models are still ahead on the hardest reasoning tasks. But for a lot of practical enterprise workflows, the open source tier is now competitive.
Corn
What's the open question you keep coming back to on all of this?
Herman
The output side. I don't know when we'll see max output tokens scale to match context window sizes, or whether the architecture that enables that will look anything like current transformers. There's a real question of whether you can train a model to maintain coherent, high-quality generation across, say, a million output tokens without the quality degrading in ways that make the capability practically useless. And right now I don't think anyone has convincingly solved that.
Corn
It's the question nobody's asking loudly enough. The input side gets all the attention, the output side is where the real bottleneck is going to land for a lot of applications.
Herman
I suspect that when someone does crack it, it'll change what AI-assisted work looks like at a pretty fundamental level. Right now, the model is a very good reader and a somewhat constrained writer. Flip that, or balance it, and you're in different territory.
Corn
Thanks to Hilbert Flumingtop for producing this one, and to Modal for keeping the compute running. If you want to dig into any of this further, you can find all two thousand two hundred and thirty-four episodes at myweirdprompts. Leave us a review if you've been enjoying the show. This has been My Weird Prompts.
Herman
Until next time.
Corn
Before we get into the mechanics of why any of this works, it's probably worth making sure everyone's on the same page about what a context window actually is, because the term gets thrown around in ways that obscure more than they reveal.
Herman
Right, and the intuition most people have is partially correct but missing something important. The basic idea is that a language model can only see a finite amount of text at once. Everything it reads, everything it generates, the instructions you gave it at the start, the conversation history, the document you pasted in, all of that has to fit within a fixed window measured in tokens. A token is roughly three quarters of a word in English, so a hundred thousand token window is something like seventy-five thousand words. A novel, give or take.
Corn
The reason the window is finite isn't arbitrary. It's a direct consequence of how the attention mechanism works. The model has to compute relationships between every token and every other token in the window, which is why the cost scales the way it does.
Herman
Quadratically with length, yes. Double the context, roughly quadruple the compute. That's the constraint that made anything beyond a few thousand tokens impractical for years, and why the jump to the numbers we're seeing now required genuine architectural innovation, not just more hardware.
Corn
Where does the field actually sit right now? Because the numbers we touched on at the top are worth grounding for anyone who hasn't been tracking this closely.
Herman
The spread is remarkable. Gemini three Pro is at ten million tokens, which is a different category of capability than anything that existed eighteen months ago. Claude Opus four point seven sits at one million. GPT-5 is at four hundred thousand. Then you've got a tier of open source models, GLM five point one, Gemma four medium, around two hundred and fifty-six thousand. And the smaller edge-deployed models like Gemma four small and Mistral NeMo at one hundred and twenty-eight thousand.
Corn
That's a four hundred times spread between the smallest and largest in active deployment.
Herman
Which tells you the field hasn't converged on a consensus about what the right size actually is. Different use cases have different requirements, and the cost curve means you don't always want the biggest window available.
Corn
And the cost curve is doing real work in that sentence. The instinct when you first see the Gemini three Pro number is to think, why wouldn't you just always use the biggest window? Ten million tokens, throw everything in. But that's not how the economics work.
Herman
Not even close. The quadratic scaling means that using a ten million token window for a task that only needs fifty thousand tokens is wasteful in a way that affects latency, cost per query, and in some deployment contexts, throughput. You're paying for attention computations that are doing nothing useful. The right context window is the smallest one that actually fits your task with some headroom.
Corn
Which requires you to know your task well enough to make that judgment, and a lot of teams don't, at least not at the start.
Herman
Right, and there's a tendency to over-provision because it feels safe. If something breaks, nobody gets blamed for having too much context. But over-provisioning has real costs that accumulate at scale, and it can also degrade performance in ways that aren't obvious. If you're feeding a model five hundred thousand tokens of loosely relevant material when the actually relevant material is twenty thousand tokens, you're not helping it, you're introducing noise.
Corn
That's the thing that I think gets underappreciated. The relationship between context size and performance isn't monotonic. More isn't always better even when you have the capacity.
Herman
The research on this is pretty consistent. There's a phenomenon sometimes called context rot, where retrieval and reasoning quality degrades as the distance between relevant pieces of information increases within a long context. The model can technically see everything, but the attention signal gets diluted. The relevant tokens have to compete with a much larger pool of irrelevant tokens for the model's representational budget.
Corn
The lost in the middle effect is the specific version of this that's been documented most carefully. Information in the early and late positions of a long context gets better recall than information in the middle. Which is a strange artifact if you think about it. The model reads everything, but it doesn't treat all positions equally.
Herman
It's a consequence of how positional encoding and attention patterns interact. The beginning of the context tends to carry high weight because it often contains instructions or framing. The end carries high weight because it's most recent. The middle is where you're most at risk of losing things. And for something like an academic thesis, which is typically structured with the important claims distributed throughout the document, that's a real problem.
Corn
Let's actually work through the thesis case because it's a good concrete example of where this gets interesting. A standard doctoral thesis is somewhere between sixty thousand and a hundred thousand words. That's roughly eighty to one hundred and thirty thousand tokens. So it fits comfortably within the GPT-five window, well within Claude Opus, and is almost trivially small relative to Gemini three Pro.
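The arithmetic here rests on a common rule of thumb of roughly 1.3 tokens per English word; the exact ratio varies by tokenizer and text, so treat the constant below as an approximation:

```python
# Rough token estimate for a doctoral thesis, using the common
# rule of thumb of ~1.3 tokens per English word.

TOKENS_PER_WORD = 1.3  # approximation; real tokenizers vary by text

def estimate_tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

low, high = estimate_tokens(60_000), estimate_tokens(100_000)
print(low, high)        # 78000 130000 -- roughly 80k to 130k tokens
print(high <= 400_000)  # True: fits a 400k window with room to spare
```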
Herman
The context window question for a thesis is essentially solved at the current frontier. The harder question is what you actually want to do with it. If you're asking the model to summarize the thesis, that's a retrieval and synthesis task, and it tends to work reasonably well because the model can draw on the full document. If you're asking it to evaluate the logical consistency of the argument across chapters, you're asking for multi-hop reasoning across long distances, and that's where you start to see degradation.
Corn
Because the claim in chapter two that underpins the conclusion in chapter seven might be separated by forty thousand tokens of intervening text.
Herman
The model has to hold the thread of the argument across that gap. Current models can do this to a degree, but it's not reliable enough that you'd want to use it as your primary quality check without human oversight. What works better is a structured decomposition approach, where you break the reasoning task into explicit steps and ask the model to surface specific claims before asking it to evaluate their relationship.
Corn
You're essentially compensating for the lost in the middle problem by making the relevant information explicit and proximate before asking for the reasoning.
Herman
Which is also good prompt engineering independent of the context window issue. Clear structure, explicit pointers, staged reasoning. These practices help regardless of whether you're working with fifty thousand tokens or five hundred thousand.
Corn
The instructional workflow case is different though, because there the challenge isn't a single large document, it's accumulation over time. A multi-step enterprise workflow where the model is receiving instructions, executing tasks, getting feedback, updating its approach, and that whole history needs to stay in context.
Herman
This is where the one million token windows start to matter in a way that the thesis case doesn't quite illustrate. Because in an agentic workflow, the context isn't static. It's growing with every turn. You start with your initial instructions, maybe a few thousand tokens. By turn fifty you've got tool call results, intermediate outputs, error messages, corrections, and the context is now two hundred thousand tokens and climbing.
Corn
If you hit the ceiling mid-workflow, you have a problem that's qualitatively different from just not having enough room for a document.
Herman
It can break the coherence of the entire task. The model loses access to earlier decisions and constraints, and you can end up with outputs that contradict what was established in the first half of the conversation. Managing context in long agentic workflows is one of the harder engineering problems right now, and the larger windows don't eliminate it, they just push the ceiling out further.
Corn
Which buys time but doesn't solve the underlying architecture question of how you maintain coherent state across very long interactions.
Herman
There are approaches, periodic summarization of earlier context, explicit state tracking in structured formats that the model can reference efficiently, retrieval augmented generation where you pull relevant history back in as needed rather than keeping it all in context. None of them are perfect substitutes for a model that can reason across long distances, but they're practical now.
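The periodic-summarization approach can be sketched as a small context manager. The summarizer below is a deliberate stand-in (it just keeps the first sentence of each old turn); a real system would call a model to compress the oldest turns, and the word-count token proxy is likewise an approximation:

```python
# Sketch of periodic summarization for a long agentic workflow: when the
# running transcript nears the window limit, compress the oldest turns
# into a summary so recent turns stay verbatim. The summarizer is a
# placeholder; a real system would call a model here.

def naive_summarize(turns):
    """Placeholder: keep only the first sentence of each old turn."""
    return " | ".join(t.split(".")[0] for t in turns)

def manage_context(history, limit_tokens, count_tokens=lambda s: len(s.split())):
    total = sum(count_tokens(t) for t in history)
    if total <= limit_tokens:
        return history
    # Compress the oldest half of the turns into one summary entry.
    half = len(history) // 2
    summary = "SUMMARY: " + naive_summarize(history[:half])
    return [summary] + history[half:]

history = [f"Turn {i}. Detailed tool output for step {i}." for i in range(10)]
compact = manage_context(history, limit_tokens=40)
print(len(compact), compact[0][:8])  # 6 SUMMARY: -- oldest turns compressed
```

The design choice worth noting is that recent turns survive verbatim while old ones are lossy, mirroring how these workflows actually prioritize state.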
Corn
The retrieval augmented approach is interesting because it's almost inverting the problem. Instead of expanding the context window to fit everything, you're being selective about what goes in and when.
Herman
For a lot of enterprise applications, that's actually the right architecture even if you have a large context window available. The discipline of deciding what's relevant forces you to think more carefully about the task structure, and that clarity tends to improve outputs. The context window is a tool, not a solution.
Corn
And the knock-on effects of that discipline are underappreciated. When teams start treating context as a scarce resource even when it technically isn't, something interesting happens to the quality of their prompts and their overall workflow design.
Herman
You see this in enterprise deployments. The teams that got good results early were often the ones who had worked with tighter context limits and developed habits around precision. When the windows expanded, they applied the same discipline and got compounding returns. The teams that just threw everything in because they could tended to get mediocre results even with access to better models.
Corn
Which is a slightly uncomfortable finding if you're a model vendor selling context window size as a headline feature.
Herman
It's a real tension. The capability is genuine and valuable. But the marketing tends to imply that bigger windows are a substitute for thoughtful architecture, and they're not. They're an enabler of thoughtful architecture. That's a different thing.
Corn
Let's put some numbers on the GPT-four comparison because I think it's useful for calibrating how far the field has actually moved. When GPT-four launched, the standard context window was eight thousand one hundred and ninety-two tokens. The extended version got to thirty-two thousand. That was considered remarkable at the time.
Herman
Now the baseline expectation for anything claiming to be a frontier model is somewhere around a hundred thousand tokens, with the serious players at four hundred thousand and above. That's roughly a twelvefold increase at the low end in about two years, and orders of magnitude more at the high end. The architectural work required to make that happen without the cost curve destroying the economics is impressive.
Corn
The hybrid attention piece is doing a lot of that work, right? You're not running full attention over the entire context for every layer.
Herman
Right, the full self-attention over every token pair is reserved for the layers and positions where it matters most. Local attention windows, sparse attention patterns, linear attention approximations for the bulk of the computation. The model is making bets about where the important relationships are and concentrating its compute there. That's what allows a one million token window to be economically viable in a way that naive scaling never could have been.
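The local-plus-global idea can be illustrated with a toy attention mask. This shows the general shape of a sparse pattern, a sliding window plus a few always-visible prefix positions, and is not the exact pattern of any particular model:

```python
# Illustrative sparse-attention mask: each position attends to a local
# sliding window plus a handful of global positions (e.g. instruction
# tokens at the start). This is the general idea behind hybrid attention,
# not any specific model's exact pattern.

def sparse_mask(seq_len, window, n_global):
    """mask[i][j] is True if position i may attend to position j."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window        # nearby tokens
            global_ = j < n_global              # always-visible prefix
            mask[i][j] = (local or global_) and j <= i  # causal
    return mask

m = sparse_mask(seq_len=1000, window=8, n_global=4)
attended = sum(sum(row) for row in m)
full = 1000 * 1001 // 2            # causal full-attention pairs
print(attended / full)             # a small fraction of the full cost
```

The model's "bet" that Herman describes lives in which positions get the global treatment; the savings come from everything else attending only locally.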
Corn
The bet isn't always right, which is part of why the performance on very long contexts is still uneven. The architecture is making tradeoffs, and some tasks land on the wrong side of those tradeoffs.
Herman
The benchmark data reflects this. On something like needle in a haystack tests, where you hide a specific fact in a long document and ask the model to retrieve it, the frontier models perform well even at very long contexts. But on multi-hop reasoning across long distances, where the model has to connect multiple pieces of evidence distributed through a long document, performance drops off more sharply. The Microsoft Foundry documentation on Claude Opus four point seven specifically calls out codebases and multi-day projects as the sweet spot, which is telling. Those are tasks with relatively well-structured dependencies, not arbitrary long-range reasoning.
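The mechanics of a needle-in-a-haystack test are simple enough to sketch. The retrieval step below is a trivial string search standing in for a model call; the point of the sketch is the harness structure, varying the depth at which the fact is buried:

```python
# Sketch of a needle-in-a-haystack test: bury one fact at a chosen depth
# in filler text, then check whether it can be retrieved. Here the
# "retrieval" is a trivial string search; a real eval would prompt an
# LLM at each depth and grade its answer.

def build_haystack(needle, filler_sentences, depth):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    idx = int(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])

filler = [f"Filler sentence number {i}." for i in range(2000)]
needle = "The magic number is 7481."

for depth in (0.0, 0.5, 1.0):
    doc = build_haystack(needle, filler, depth)
    found = needle in doc          # stand-in for model retrieval
    print(depth, found)
```

A real eval sweeps depth in fine increments and plots recall against position, which is exactly how the middle-of-context dip gets visualized.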
Corn
The practical implication for developers is that the context window number tells you what's possible, but the task structure tells you what's actually going to work.
Herman
That's a good way to frame it. And the task structure point connects directly to how you should think about instructional workflows in enterprise settings. The most successful deployments I've seen described aren't the ones using the largest available context windows. They're the ones that have thought carefully about what information the model actually needs at each step and when.
Corn
The classic enterprise case is something like a legal document review workflow. You have a large corpus of contracts, you want the model to flag inconsistencies, identify non-standard clauses, summarize key terms. That's a task where the large context window is load-bearing.
Herman
It is, but even there the architecture matters. If you're reviewing a thousand contracts, you don't want to put all of them in a single context. You want a workflow where each contract gets its own context, the model produces structured outputs, and those outputs get aggregated at a higher level. The context window enables the per-document analysis. The workflow architecture handles the cross-document synthesis.
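That per-document-then-aggregate architecture is a map-reduce shape, and it can be sketched directly. `review_contract` below is a placeholder for a real model call; the flagging logic is a stand-in heuristic:

```python
# Sketch of the per-document workflow: each contract is analyzed in its
# own context, producing a structured record, and the records are
# aggregated afterwards. `review_contract` stands in for a model call.

def review_contract(text: str) -> dict:
    """Stand-in analysis: flag a contract if it mentions 'indemnity'."""
    return {
        "length_tokens": len(text.split()),   # crude token proxy
        "flagged": "indemnity" in text.lower(),
    }

def review_corpus(contracts: list[str]) -> dict:
    results = [review_contract(c) for c in contracts]  # map: one context each
    return {                                           # reduce: aggregate
        "n_contracts": len(results),
        "n_flagged": sum(r["flagged"] for r in results),
    }

corpus = ["Standard terms apply.", "Broad indemnity clause included."]
print(review_corpus(corpus))  # {'n_contracts': 2, 'n_flagged': 1}
```

The structured record is the load-bearing part: because each map step emits the same schema, the reduce step never needs to re-read any contract.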
Corn
Which means the context window expansion has actually shifted where the interesting engineering problems are. It used to be that fitting your content into context was the hard problem. Now the hard problem is designing workflows that use context effectively at scale.
Herman
That's exactly the shift. And it's a more interesting problem in some ways, because it's an architectural and product design problem, not just a raw capability problem. You can't solve it by waiting for the next model release. You have to think carefully about task decomposition, state management, how outputs from one stage inform the inputs to the next.
Corn
The future workflow question is where this gets speculative, but it's worth pushing on. If context windows continue expanding toward the hundred million token range, and some of the hardware roadmaps suggest that's not implausible, what does that actually change?
Herman
The honest answer is that we don't fully know, because the use cases that become possible at that scale are qualitatively different from anything we're building for now. There's a Daniel Jeffries piece that makes this point about the ten million token range already representing a category shift, not just a quantitative one. At ten million tokens you can hold an entire software project, multiple years of email correspondence, a complete scientific literature review, all in a single context. That's not a bigger version of what we had before. It's a different kind of tool.
Corn
The analogy that comes to mind is the transition from batch processing to interactive computing. It's not that interactive computing was just faster batch processing. The real-time feedback loop enabled entirely different ways of working.
Herman
The context window expansion might be doing something similar for knowledge work. The friction of having to carefully select what information to include, to summarize and compress before you can work with it, that friction shapes how people think about problems. Remove it, and you might get different approaches emerging, not just faster versions of current approaches.
Corn
Though the context rot problem means you can't just assume that a hundred million token window behaves like a perfectly attentive reader who has absorbed everything equally. The limits shift but they don't disappear.
Herman
That's the key caveat. The hardware and architecture improvements are real. The remaining challenge is making sure the model's effective attention actually scales with the nominal context size. Right now there's a gap between what the model can technically receive and what it can reliably reason over. Closing that gap is probably the most important unsolved problem in this space, and it's getting less attention than the headline context window numbers.
Corn
Because the headline number is easier to communicate in a benchmark table than the nuanced question of whether the model actually uses what you gave it.
Herman
Which is why the max output tokens question we flagged at the start is worth coming back to. The input context expansion has been dramatic. The output side has been much quieter. Most frontier models are still capped somewhere between four thousand and sixty-four thousand tokens on the output, even when the input context is orders of magnitude larger.
Corn
That asymmetry has real implications. If you're using a model with a one million token input context to analyze a large codebase, but the model can only produce thirty-two thousand tokens of output, you've got a bottleneck that limits what you can actually do with that analysis.
Herman
The output constraint forces you to be selective about what the model produces, which isn't always bad, but it does mean that certain tasks that seem enabled by large input contexts are still practically limited by the output ceiling. Generating a comprehensive refactor of a large codebase, producing a full-length research synthesis, writing a complete implementation from a detailed specification, these tasks run into the output limit before they run into the input limit.
Corn
The question is whether output token limits are a deliberate design choice or just a capability that hasn't been prioritized.
Herman
Probably both, depending on the model. There are cost and safety reasons to keep output limits conservative. Longer outputs are more expensive to generate and harder to evaluate for quality and safety. But there's also just less competitive pressure on the output side because the input context number is the one that gets cited in benchmarks and marketing materials. Max output tokens is the metric that doesn't show up in the headline.
Corn
Which is exactly why it's worth paying attention to.
Practically speaking, what does someone actually do with all of this? Because we've covered a lot of ground on what the limits are and where the tradeoffs live, but the listener who's building something today needs to make decisions today.
Herman
The first thing I'd say is that prompt design is still where most of the leverage is, and it's the thing most people underinvest in relative to how much time they spend evaluating models. The context window gives you a budget. Prompt design is how you spend it wisely. Front-load the most important information and instructions. Put the material you most need the model to reason over at the beginning and end of your context, not buried in the middle, because the lost-in-the-middle effect is real and it hasn't gone away even in the frontier models.
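The edge-placement advice can be sketched as a prompt-assembly function. The section names and the repeat-key-material-at-the-end tactic are illustrative choices, not a prescribed template:

```python
# Sketch of edge-weighted prompt assembly: instructions and the most
# important material go at the start and end of the context, bulk
# reference material in the middle. Section contents are hypothetical.

def assemble_prompt(instructions, key_material, bulk_material, question):
    parts = [
        instructions,    # start: high-attention position
        key_material,    # still early: likely to be recalled
        bulk_material,   # middle: most at risk of being lost
        key_material,    # repeated near the end as a reminder
        question,        # end: most recent, high-attention
    ]
    return "\n\n".join(parts)

prompt = assemble_prompt(
    instructions="You are reviewing a thesis for internal consistency.",
    key_material="Chapter 2 claim: X causes Y under condition Z.",
    bulk_material="[full thesis text goes here]",
    question="Does the chapter 7 conclusion follow from the chapter 2 claim?",
)
print(prompt.startswith("You are reviewing"))  # True
```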
Corn
That's a counterintuitive piece of advice given how much the headline numbers have grown. You'd think with a million token window the placement of information wouldn't matter much.
Herman
The architecture doesn't care about your intuitions about scale. The attention patterns that cause middle-of-context degradation are baked into how these models were trained. Until that changes at a fundamental level, the practical advice holds: treat the edges of your context as prime real estate.
Corn
The second thing I'd add is on the monitoring side. The gap between what a model nominally supports and what it reliably does is not something vendors advertise clearly. Running your own evals on your specific task is not optional if you're building something that matters. Don't assume that because a model claims a certain context length, your task will perform well at that length.
Herman
That's especially true for anything involving multi-hop reasoning or synthesis across long documents. The needle-in-a-haystack numbers look great. The multi-hop numbers are much more variable, and they're task-specific. A benchmark that doesn't reflect your task structure is not telling you what you need to know.
Corn
On staying informed about where this is going, the honest advice is to watch the output token side as closely as the input side. The input numbers get all the attention. The output ceiling is where a lot of real workflows are actually hitting their limits right now, and that's the metric that's most likely to move in ways that unlock new use cases when it does.
Herman
I'd add: watch what the open-source models are doing. GLM five point one and Gemma four are already at two hundred fifty-six thousand tokens with hybrid attention approaches, and some are extending toward a million via architectural variations. The open-source trajectory tends to lag the frontier by six to twelve months, but it also tells you what's becoming commoditized. When two hundred fifty-six thousand tokens is the open-source baseline, that's a signal about where the floor of expectations is heading.
Corn
The practical upshot being: don't over-engineer for context limits that are going to look conservative in eighteen months. But also don't assume that the limits expanding means the hard thinking about task design becomes less important. It doesn't.
Herman
If anything it becomes more important, because the surface area of what you can attempt gets larger, and the ways you can architect a bad solution scale right along with it.
Corn
It's the paradox at the heart of this whole space. More capability inevitably means more ways to get it wrong at scale.
Herman
Which is a good place to leave it, honestly. The open question that I keep coming back to is whether max output tokens will ever close the gap with input context in any meaningful way. Because right now you've got models that can receive a million tokens and respond with thirty-two thousand. That's roughly a thirty-to-one ratio in the best case, and often much worse. At some point that asymmetry becomes the defining constraint.
Corn
My instinct is that output expansion is coming, just quietly. The competitive pressure on input context has been loud and public. Output limits will probably move the same way memory limits moved in early computing, incrementally and without fanfare, until one day the ceiling that felt permanent just isn't there anymore.
Herman
I hope you're right. Because the workflows that become possible when output limits stop being the bottleneck are different from what we can build today. Full-length synthesis, end-to-end implementation, complete multi-chapter drafts without stitching. That's not a marginal improvement. That's a different class of tool.
Corn
Something worth watching. Thanks to Hilbert Flumingtop for producing this one, and to Modal for keeping our infrastructure running. This has been My Weird Prompts. If you've got a minute, a review on Spotify goes a long way. We'll see you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.