Daniel sent us this one, and honestly it's a category I think most developers are using without realizing it. The question is about fast apply models — a specialized class of LLMs built specifically to merge AI-suggested code edits back into source files at around ten thousand tokens per second. The setup is a two-model pipeline: your frontier model, say Claude or GPT-5, figures out what needs to change. Then a small, purpose-built model does the actual stitching. Tools like Cursor, Windsurf, and Lovable all run this way under the hood. Daniel wants us to use Relace Apply 3 as the worked example but treat this as a broader category explainer. So that's what we're doing.
The reason this category exists at all comes down to a pretty brutal cost-and-latency problem. When a frontier model suggests an edit to a thousand-line file, the naive approach is to have that same model regenerate the entire file from scratch. And the numbers on that are genuinely painful. We're talking over a hundred seconds of wait time and roughly eighteen dollars per edit for a large file. That's not a rounding error, that's a structural problem.
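To make that concrete, here is a rough back-of-envelope sketch. All the numbers in it are assumptions for illustration — the tokens-per-line estimate, the per-token price, and the generation speed are made up to show the shape of the problem, and they won't reproduce the episode's exact eighteen-dollar figure, which presumably folds in input tokens, retries, and real production overhead.

```python
# Back-of-envelope comparison: regenerating a whole file vs. emitting
# only a diff. ALL numbers below are illustrative assumptions, not
# quotes from any provider's pricing page.

def gen_cost_latency(output_tokens: int,
                     usd_per_million: float = 15.0,
                     tokens_per_second: float = 75.0):
    """Return (cost in USD, latency in seconds) for generating output_tokens."""
    cost = output_tokens / 1_000_000 * usd_per_million
    latency = output_tokens / tokens_per_second
    return cost, latency

# Assume a 1,000-line file at ~15 tokens/line if fully regenerated,
# versus a ~300-token diff describing only what changed.
full_cost, full_lat = gen_cost_latency(15_000)
diff_cost, diff_lat = gen_cost_latency(300)
```

Even with these toy numbers, the gap between regenerating everything and emitting only the delta is a factor of fifty in both dimensions, which is the structural point the hosts are making.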
Eighteen dollars per edit. I've had leaf medicine consultations that cost less than that, and those involve actual ancestral wisdom.
I'm not touching that. But yes, eighteen dollars per edit at any kind of volume makes the economics completely unworkable. And the latency is arguably worse for the user experience than the cost. A developer sitting there watching a spinner for a hundred-plus seconds while a frontier model laboriously rewrites code they mostly want to keep — that's the kind of friction that kills a tool.
By the way, today's script is powered by Claude Sonnet four point six — our friendly AI down the road doing the heavy lifting.
Good to know. Appropriate, given the topic. So the insight behind fast apply models is essentially: don't use a sledgehammer for a precision task. The frontier model is extraordinarily good at reasoning about what change needs to happen. It's not particularly optimized for the mechanical work of accurately stitching a targeted edit back into an existing file at speed. Those are different problems. And once you recognize they're different problems, you can specialize.
Which is the architectural pattern we're really talking about here. It's not just about one model being faster. It's about decomposing the coding pipeline into distinct tasks and slotting the right tool into each slot. The frontier model handles the hard reasoning. The apply model handles the high-throughput mechanical merge. And the sum is faster, cheaper, and more reliable than either model doing everything alone.
That's exactly the frame. And it's worth sitting with that for a second because it runs counter to how a lot of people think about capability scaling. The default assumption is that as frontier models get better and faster, specialized models get absorbed. But fast apply is a case where the specialization is load-bearing in a way that doesn't just disappear when the base model improves. We'll get into why that is, but the short version involves Amdahl's law applied to agent pipelines, which is an interesting argument.

The counter-argument — which is worth taking seriously — is that some researchers think this niche is already being eaten by diffusion models and speculative decoding. So there's real tension here about whether fast apply as a category has legs or whether it's a transitional solution on its way to being absorbed.
That tension is real and I don't think it's resolved yet. But let's start with the mechanics of why the problem exists in the first place, because the failure mode of frontier models on diff application is more specific than people usually appreciate.
Because the obvious assumption is that a model smart enough to write the code should be smart enough to apply a patch to it. And that turns out to be wrong in an interesting way.
It's wrong in a very specific way. Frontier models are trained primarily to be helpful and complete. When you ask one to apply a diff — meaning take this existing file, apply these changes, output the updated file — the model's tendency is to either over-generate or under-generate. Over-generation means it rewrites sections it wasn't asked to touch, introduces stylistic changes, occasionally hallucinates small modifications. Under-generation is the more notorious problem, which is the lazy edit: the model outputs something like, slash slash dot dot dot rest of code, or slash slash continue as before. It summarizes instead of completing.
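The under-generation failure is mechanical enough that you can catch some of it with a simple check. A minimal sketch, assuming nothing beyond the standard library — the placeholder patterns here are a small illustrative sample, not an exhaustive production list:

```python
import re

# Heuristic detector for "lazy edits": placeholder comments a model
# emits instead of reproducing unchanged code verbatim. The pattern
# list is illustrative only; real tools maintain much longer lists.
LAZY_PATTERNS = [
    r"//\s*\.\.\.\s*rest of (the )?code",
    r"#\s*\.\.\.\s*rest of (the )?code",
    r"//\s*\.\.\.\s*(existing|unchanged) code",
    r"//\s*continue[sd]? as before",
]

def has_lazy_edit(file_text: str) -> bool:
    """Return True if the merged output contains an elision placeholder."""
    return any(re.search(p, file_text, re.IGNORECASE) for p in LAZY_PATTERNS)
```

A check like this only flags the obvious cases; silent drift in "unchanged" code is the harder failure, and that's the one task-specific training targets.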
The "rest of code" comment. I've seen this in the wild and it's maddening. You ask the model to update a function and it hands you back a file with a placeholder where sixty percent of the original code used to be.
It's not a bug exactly — it's the model doing what it's optimized to do, which is produce a concise, representative response. The problem is that concise and representative is catastrophically wrong in this context. You need verbatim fidelity to the unchanged portions. The model's training doesn't specifically reward that. So you get drift. And the Cursor Composer analysis that circulated a while back put some numbers on this — the cost and latency figures I mentioned, the eighteen dollars and the hundred-plus seconds, those came out of looking at what happens when you let a frontier model handle the full file regeneration in production. It's not theoretical.
The solution isn't to yell at the frontier model harder. It's to not ask it to do that job at all.
You ask the frontier model for a minimal diff. It outputs only the tokens that describe what needs to change — the hard tokens, the semantically meaningful delta. Then you hand that diff to a model that was specifically trained to take a diff plus an original file and produce a correctly merged output at high speed. That's the entire architecture. It sounds simple but the training and the deployment engineering behind it are non-trivial.
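For a sense of what the apply step actually does, here is a deliberately naive sketch. The search/replace pair format is one common convention for minimal diffs, but the exact format varies by tool, and this exact-match version is precisely what a trained apply model improves on — the real model tolerates fuzzy matches, whitespace drift, and ambiguity that this toy cannot:

```python
# Toy stand-in for the apply step: merge search/replace edits into the
# original file. A real fast apply model handles imperfect matches;
# this naive version requires the search text to appear verbatim.

def apply_edits(original: str, edits: list[tuple[str, str]]) -> str:
    """Merge (search, replace) edits into the original file text."""
    merged = original
    for search, replace in edits:
        if search not in merged:
            raise ValueError(f"edit target not found: {search!r}")
        merged = merged.replace(search, replace, 1)
    return merged

source = "def greet(name):\n    return 'Hello ' + name\n"
edits = [("return 'Hello ' + name", "return f'Hello {name}!'")]
merged = apply_edits(source, edits)
```

The frontier model's job in this picture is only to produce the `edits` list; everything else is throughput.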
That's where Relace Apply 3 becomes a useful concrete example, because it's one of the more fully-specified public implementations of this pattern. Two hundred and fifty-six thousand token context window, zero-dollar routing on OpenRouter, which means you can slot it into a pipeline without a dedicated API contract. Those are real production-relevant numbers.
The context window matters more than it might seem. A lot of real codebases have files that are long. Not toy examples — actual production files that run to several hundred lines, sometimes more. If your apply model has a short context window, you hit a ceiling fast. Two hundred and fifty-six thousand tokens gives you substantial headroom for large file edits without chunking, which is its own source of errors when you have to reassemble chunks.
The ten thousand tokens per second throughput figure — how does that compare to what a frontier model is doing when it regenerates a file?
It's roughly an order of magnitude faster in practice for large files. Frontier models doing full file regeneration are running at somewhere in the range of fifty to a few hundred tokens per second depending on the model and the infrastructure. Apply models at ten thousand tokens per second are in a different performance class entirely. And because they're smaller, the cost per token is dramatically lower. The combination of those two factors is what makes the two-model pipeline economically viable at scale.
We've got a clear problem, a clear architectural solution, and a concrete example of a model built for it. What fascinates me is how these models actually get trained — that's where the production traces angle really comes into play.
And that training process is what truly sets these models apart. It’s not just about being smaller or faster; it’s about using production traces — real snapshots of edits happening in real codebases. We’re talking GitHub commits, multi-file bug fixes, the full messy texture of developer work. Morph documented this clearly in their technical writeup: you’re training the model on the ground truth of what a correct merge looks like, at scale, across thousands of real-world cases.
Which means the model learns from the kinds of files and edits that actually show up in production, not from idealized examples someone constructed to be clean and tractable.
Right, and that matters enormously because real code is not clean. It has comments in weird places, inconsistent indentation, variable names that made sense to one engineer three years ago, legacy sections nobody wants to touch. A model trained on synthetic clean examples will perform fine on synthetic clean examples and fall apart on the actual thing. Training on production traces means the model has seen the mess.
In a sense, the training pipeline is itself a form of specialization. You're not just building a smaller model — you're building a model whose entire data diet was the specific task you want it to do.
That's the right way to think about it. And it connects to why this isn't something you can easily replicate by just fine-tuning a general model for a few epochs. The volume and diversity of those production traces, and the specificity of what correct behavior looks like in each case — that's the actual competitive moat. The architecture is not secret. The data is.
Which has some interesting implications for open-source alternatives trying to play in this space. Models like Devstral or Qwen3-Coder can get at the general capability, but assembling that production trace dataset is a different kind of lift — especially when you consider the labeling challenge.
Production traces give you the ground truth merge implicitly — the commit is the label. You don't have to pay annotators to decide what the correct output should be. The correct output already exists in the repository history. That's a significant data engineering advantage.
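The "commit is the label" idea can be sketched in a few lines. This is an illustrative shape only — the field names are made up, and real pipelines work from actual repository history rather than in-memory strings:

```python
import difflib

# Sketch of how a before/after file pair from a commit yields a
# training example "for free": the diff is the input the apply model
# sees, and the post-commit file is the target output. No annotator
# needed — the label is what actually happened in the repo.

def training_example(before: str, after: str) -> dict:
    diff = "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="a", tofile="b",
    ))
    return {"original": before, "edit": diff, "target": after}

ex = training_example("x = 1\ny = 2\n", "x = 1\ny = 3\n")
```

Scale that over thousands of commits and you have the dataset shape the hosts describe, which is why the moat is the data rather than the architecture.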
The moat is partially about volume of traces, partially about not needing to construct the labels artificially. The label is just... what actually happened.
And this connects directly to why the specialization persists even as frontier models improve. Because here's the thing — if you imagine a future where frontier models are twice as fast, the apply problem doesn't go away. It just means you're waiting fifty seconds instead of a hundred. The bottleneck in an agent pipeline isn't uniform. Some parts of the pipeline get faster as the base model improves. The edit application step has its own throughput ceiling that's structurally separate from how good the reasoning model gets.
That's the Amdahl's law argument you flagged earlier. The classic formulation is about parallel computing — how much you can speed up a system by improving one component depends on what fraction of total time that component represents. If edit application is thirty percent of your pipeline time, making your reasoning model infinitely fast still leaves you waiting on that thirty percent. The best you can ever do is a seventy percent reduction in total latency, because the apply step sets the floor.
Right, and in practice the apply step is often a larger fraction of total time than people expect, especially on larger files. So even aggressive improvements to the frontier model don't collapse the value of a specialized apply model. The ceiling just shifts. You still want the fastest possible execution on the mechanical merge, and that's still a different optimization target than reasoning quality.
Which is a slightly uncomfortable argument for the "one big model to rule them all" camp.
And I find it pretty convincing, honestly. The pipeline decomposition isn't just a stopgap. It reflects something real about the structure of the task. Reasoning about what to change and executing the merge faithfully are different enough that optimizing for one doesn't automatically optimize for the other.
The tradeoff worth naming here, though, is that splitting the pipeline introduces a new failure surface. If the diff the frontier model produces is even slightly malformed, the apply model has to handle that gracefully or you get a corrupted output. You've traded one set of errors for a different set.
That's a genuine tradeoff and I don't want to paper over it. The apply model needs to be robust to imperfect diffs, because frontier models do occasionally produce diffs that are ambiguous or partially malformed. The production trace training helps here too — real diffs from real commits include plenty of edge cases — but it's not a complete solution. Tools like Cursor have error recovery logic layered on top of the apply model precisely because of this. The apply model is not the last line of defense.
The full picture is: frontier model generates a minimal diff, apply model merges at speed, and there's error handling around both steps to catch the cases where something goes wrong at the seam.
That's the production architecture, yes. And Relace Apply 3 operating with that two hundred and fifty-six thousand token window means the error surface from chunking is largely eliminated for single-file edits. You're not introducing seam errors from splitting a large file into pieces. The whole file fits in one pass. Still, I know there’s some debate about whether this approach is future-proof.
There’s a counter-argument worth steelmanning here. Some researchers are making the case that fast apply models are already obsolete — that diffusion models and speculative decoding are going to absorb this niche entirely. I don’t think it’s a fringe position.
It's not fringe at all. And I'll be honest — when I first read the arXiv work on block diffusion and the S2D2 self-speculative decoding paper, I found it unsettling for the fast apply thesis. The core claim is that by parallelizing token generation, you can get three to five times speedups on standard autoregressive models without any of the architectural specialization. If that holds at scale, the speed advantage of a dedicated apply model shrinks considerably.
Walk me through the mechanism, because speculative decoding in particular gets described in ways that range from "obvious engineering trick" to "fundamental breakthrough" depending on who's writing about it.
The basic idea is that you use a small draft model to generate several tokens ahead in parallel, and then a larger verifier model checks them all at once. Accepted tokens move forward, rejected tokens get resampled. The win is that verification is cheaper than generation, so if the draft model is right most of the time, you get a significant throughput gain without sacrificing quality. Mercury 2 is the implementation I've seen cited most recently — claims in the range of five to ten times speedups over standard inference.
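The accept/reject loop at the heart of that can be simulated in a few lines. This is a toy, and it simplifies aggressively — the "models" here are just token lookups, whereas real speculative decoding compares probability distributions and verifies a whole draft batch in one forward pass of the large model:

```python
# Toy simulation of speculative decoding's accept/reject step.
# draft_tokens: tokens proposed ahead by the small draft model.
# verify_token_at: stand-in for the large verifier model's choice
# at each position (real systems verify the batch in parallel).

def speculative_step(draft_tokens, verify_token_at):
    """Accept draft tokens until the first disagreement with the verifier.

    Returns (accepted, correction): the verified prefix of the draft,
    plus the verifier's token at the first mismatch (None if all agree).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        expected = verify_token_at(i)
        if tok != expected:
            return accepted, expected  # reject; resample from verifier
        accepted.append(tok)           # cheap acceptance
    return accepted, None

target = ["def", "add", "(", "a", ",", "b", ")"]
draft = ["def", "add", "(", "x", ",", "b", ")"]
accepted, correction = speculative_step(draft, lambda i: target[i])
```

The win is that when the draft is right most of the time, most tokens come through at draft-model speed while keeping the verifier's output distribution, which is where the claimed multi-x speedups come from.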
If you're getting ten times faster on a frontier model, the gap between that and a dedicated apply model running at ten thousand tokens per second starts to look a lot less decisive.
That's the argument. And diffusion models take a different but related approach — instead of generating left to right autoregressively, they start from noise and denoise the whole output in parallel. DFlash and S2D2 are both pursuing this direction. The potential is real. But here's what I think the critics are underweighting: speed is only part of what makes a fast apply model useful. The other part is correctness on the specific task.
Meaning a faster general model is still a general model.
Speculative decoding makes a frontier model faster, but it doesn't change what that model was trained to optimize for. It still has the lazy edit problem. It still has the tendency to drift on unchanged code. You've sped up a model that's fundamentally not trained for verbatim merge fidelity. That's a different issue from latency.
The counterargument isn't actually about whether diffusion models can generate tokens faster. It's about whether faster generation solves the correctness problem.
Which it doesn't, on its own. Now, you could imagine training a diffusion model specifically on production traces for the apply task, and then you'd have something interesting — the speed benefits of parallel generation combined with the task-specific training. But at that point you've essentially reinvented the fast apply model category using a different underlying architecture. The specialization doesn't go away. It just migrates.
The niche survives even if the underlying approach changes. What about the coding tools themselves — Cursor, Windsurf, Lovable? They're all running some version of this pipeline. Do they build their own apply models or route to third-party ones?
Mix of both, from what's been reported. Cursor has a custom apply model trained on their own production traces, which is a significant proprietary advantage given the volume of edits running through their platform. Windsurf has taken a similar direction. Lovable, which is more focused on the prompt-to-app end of the spectrum, leans more heavily on the two-model pipeline pattern but has less publicly disclosed about the specific apply layer. The general pattern is consistent across all of them — nobody is letting a frontier model do the raw file merge at scale.
Because the economics don't work. Eighteen dollars per edit times however many edits per day across a large user base is not a viable cost structure.
The latency is equally disqualifying from a product standpoint. If a user has to wait a hundred seconds every time they apply an edit to a moderately large file, the tool feels broken regardless of how good the reasoning is. The apply speed is load-bearing for the user experience in a way that's easy to underestimate from the outside.
Which is maybe why this category has stayed relatively under the radar. The part of the pipeline that's actually making the product usable isn't the part that gets the press release.
The frontier model gets the announcement, but the apply model is the plumbing — and it's plumbing that the whole system depends on.
Right, and that plumbing is critical for developers building things today. We've been deep in the architectural weeds, but what does this actually mean for someone sitting down to create something practical?
The most immediate thing is just knowing this pipeline exists. A lot of developers using Cursor or Windsurf have no idea there's a two-model system under the hood. They assume the frontier model is doing everything, which leads to misattributing errors. If you get a garbled file merge, that's usually the apply layer, not the reasoning model. Knowing that changes how you debug.
Presumably changes how you structure your prompts.
If you're working with a large file and you want clean edit application, keeping your diffs minimal and unambiguous helps the apply model do its job. The more surgical the frontier model's output, the less the apply model has to interpret. Vague instructions that produce vague diffs compound across both layers.
"Rewrite the whole function" is a worse prompt than "change the return type on line forty-two."
That's the practical heuristic, yes. Smaller, scoped edits produce cleaner diffs, which produce cleaner merges. The pipeline rewards precision at the input stage.
What about choosing tools? If someone's evaluating whether to use Relace Apply 3 on OpenRouter versus whatever's baked into their editor...
The built-in apply model in a tool like Cursor is trained on Cursor's own production traces, which is a genuine advantage if you're using Cursor's workflow. But if you're building your own agent pipeline, routing through something like Relace Apply 3 with that two hundred and fifty-six thousand token window gives you flexibility and avoids chunking errors on large files. The zero-dollar routing on OpenRouter also means you're not paying a premium just to access it. For custom pipelines, that's a real consideration.
The future-proofing angle?
Don't build workflows that assume one big model handles everything end to end. The two-model pattern is durable. Even if the specific apply models change, the architectural pattern of separating reasoning from execution is going to persist. Design for composability and you'll spend less time rearchitecting when the underlying models turn over.
The plumbing changes. The fact that you need plumbing doesn't.
Which is maybe the most honest summary of where this whole category lands.
The open question I keep coming back to is whether the production trace moat holds. Right now, Cursor has an advantage because they have Cursor-scale data. But if open-source models like Devstral or Qwen3-Coder start accumulating comparable trace datasets through community contribution, the proprietary edge erodes. That's not a given, but it's a plausible pressure.
On the architecture side, I'm curious how the diffusion model story develops. Not because I think it kills fast apply models in the near term, but because a diffusion model trained specifically on apply traces would be a different beast. Parallel denoising plus task-specific training. That's an interesting combination that nobody has fully demonstrated yet.
The category might not stay called "fast apply models" in five years, but the function it performs isn't going anywhere. Something has to do the merge, and something that does the merge well will always be worth optimizing for.
The edit application problem is as old as version control. The tooling just keeps getting more interesting.
Thanks to Hilbert Flumingtop for producing, as always. Modal is keeping our inference costs from being eighteen dollars per episode, which we appreciate. This has been My Weird Prompts. If the show has been useful, leaving a review on Spotify goes a long way, and we do read them.
Until next time.