So, I was looking at my desk this morning and I realized I do not have a personal computer anymore. I have a very expensive, very warm roommate who occasionally writes Python scripts for me. We have officially entered the era of the agent computer, and according to Daniel's prompt today, most of us are measuring our hardware all wrong. My R T X fifty ninety is humming, the lights are flickering, and yet, I am still sitting here waiting for my agent to tell me why my C S S is broken. It feels like I have a Ferrari that is stuck in a school zone.
Herman Poppleberry here, and Corn, you are hitting on the most relevant shift in local computing we have seen in a decade. Daniel's prompt is asking about the viability of local A I as a total replacement for cloud A P I s, specifically for coding agents. And he is right to point out that everyone is obsessed with video ram, or V RAM, when that is really just the baseline requirement to get the lights on. As of late March twenty twenty-six, the real bottlenecks have moved to memory bandwidth and prefill throughput. We are no longer in the hobbyist era where just getting a model to load was a victory. We are in the era of agentic infrastructure.
Agentic infrastructure. That sounds like something I would need a permit for. But it is true—I am not just playing with a chatbot anymore. I am using this thing as a production dependency. If my local model goes down or starts acting sluggish, my actual work stops. It is like my compiler suddenly decided to take a coffee break.
That is exactly the distinction. A personal computer is a tool for a human. An agent computer is a high-bandwidth environment designed for an A I agent to live in. Think about the A M D Ryzen A I Max Plus that just started shipping. It has a hundred twenty-eight gigabytes of unified RAM. A human does not need that to check email or even to edit video in the traditional sense. That hardware exists because the primary user is an agent that needs to ingest a hundred thousand lines of code in a heartbeat. If you are still looking at your computer as a machine for you, you are missing the shift.
It is like buying a car based entirely on how big the gas tank is, but forgetting to check if the engine can actually move the wheels faster than a brisk walk. I mean, I have seen people brag about fitting a massive model into their fifty ninety setup, only to have the agent sit there and stare at them for ten seconds before typing a single character. Is that what we are calling hanging now? Because it feels a lot like the old spinning wheel of death.
That is precisely what it is. In the world of agentic infrastructure, the agent is reading your entire codebase, checking documentation via the Model Context Protocol, and trying to reason through a multi-turn debugging session. If your hardware cannot ingest those thousands of tokens of context instantly, the agent hangs. It loses the thread. We have to talk about the frustration threshold.
And when it hangs, I start checking my phone, and suddenly the thirty minutes I saved using an A I agent is gone because I spent forty minutes looking at memes while waiting for the prompt to process. So, let us get into the numbers. What is the actual frustration threshold for a developer trying to do real work locally? Because I feel like my patience is getting shorter as the models get smarter.
The industry standard we are seeing right now in March twenty twenty-six is that fifteen to twenty tokens per second is the absolute floor. That is the frustration threshold. If you are generating code at ten tokens per second, you can literally read faster than the machine can type. It feels like watching a student who did not study try to answer a question on a chalkboard. It is painful. It breaks your cognitive flow. For what we call professional vibe coding, where you are describing a feature and expecting the agent to spit out three full files of boilerplate and logic, you really need to be hitting thirty-five to fifty-plus tokens per second.
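A quick back-of-envelope version of that threshold math. The fifteen-to-fifty tokens-per-second figures come from the discussion; the 1,500-token task size and the 10-token-per-second reading pace are illustrative assumptions, not benchmarks:

```python
# How long a "three files of boilerplate" task takes at different generation
# speeds, versus an assumed human skim-reading pace.
TASK_TOKENS = 1_500   # assumed size of a multi-file boilerplate task
READING_TPS = 10      # assumed developer skim speed, tokens per second

def wait_seconds(generation_tps: float) -> float:
    """Wall-clock time for the model to emit the whole task."""
    return TASK_TOKENS / generation_tps

for tps in (10, 20, 50):
    feel = "slower than you read" if tps <= READING_TPS else "ahead of you"
    print(f"{tps:>3} tok/s -> {wait_seconds(tps):5.1f}s for 1,500 tokens ({feel})")
```

At ten tokens per second you watch the machine type for two and a half minutes; at fifty, the same task lands in thirty seconds, which is why the flow-state threshold sits where it does.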
Fifty tokens per second. That is faster than I can think, which is a low bar, but it is also faster than most people realize their current hardware can handle. If I am running a massive model, like the new Qwen three Coder Next, on an older card, I am probably getting what, eight? Ten?
On older hardware, definitely. But look at the benchmarks for the Nvidia R T X fifty ninety that just stabilized this month. We are seeing two hundred thirteen tokens per second on eight billion parameter models at full brain float sixteen precision. Even on a nine billion parameter model, it is hitting eighty-three tokens per second. That is the gold standard for consumer hardware right now because it keeps you in that flow state. But as I said, the generation speed is only half the story. The hidden killer is prefill speed.
Explain prefill to me like I am a sloth who just woke up from a nap. Because I think a lot of people see the generation speed and think they are good to go, but then they load a large file and everything crawls. I have had moments where I hit enter, go get a glass of water, come back, and it still hasn't started typing.
Prefill is the prompt processing stage. It is the time it takes for the model to read and understand everything you just sent it before it starts generating the first token of the answer. When you are using an agent like Claude Code or OpenCode locally, that agent might be sending twenty thousand tokens of context back to the model every single time you hit enter. It has to re-read the conversation, the file structure, and the specific code you are working on. If your prefill speed is slow, you get that long, awkward pause where the little cursor just blinks at you.
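The blinking-cursor pause is just context size divided by prefill rate. The 20,000-token context figure is from the discussion; the prefill rates are illustrative, and this simple model assumes no prompt caching between turns:

```python
# Time-to-first-token for the prefill stage: the whole context must be
# processed before the first output token appears.
CONTEXT_TOKENS = 20_000  # what an agent might resend on every turn

def time_to_first_token(prefill_tps: float) -> float:
    """Seconds the cursor blinks before generation starts."""
    return CONTEXT_TOKENS / prefill_tps

for prefill in (200, 1_000, 5_000):
    print(f"prefill {prefill:>5} tok/s -> {time_to_first_token(prefill):6.1f}s pause")
```

This is why prefill throughput matters more than most spec sheets admit: even at the two-hundred-token-per-second floor, a full agent context means a long pause, and only high-bandwidth chips that ingest thousands of tokens per second make the response feel instantaneous.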
The blinking cursor of doom. It is like the A I is buffering a movie from two thousand five.
It really is. For a professional workflow, you want a minimum of two hundred tokens per second for prompt ingestion. If you are below that, the multi-turn logic starts to break down. The agent feels sluggish, and you start to lose trust in its ability to handle complex tasks because the feedback loop is too long. This is why the Apple M five Max is such a big deal this month. Apple's new chips are hitting six hundred fourteen gigabytes per second of memory bandwidth. That bandwidth is what allows the prefill to happen almost instantaneously. You can throw a massive context at it, and it starts responding before your finger has even fully left the enter key.
Six hundred fourteen gigabytes per second. I remember when we thought sixty gigabytes was fast. It feels like Apple is basically building a giant memory pipe with a computer attached to it. But here is the thing, Herman. Even if I have the bandwidth, I keep hearing about this lost in the middle problem. If I am throwing a massive repository at a local model, does it actually remember what was in the middle of the file, or is it just nodding along like you do when I talk about sloth conservation?
That is a legitimate concern and it is directly tied to hardware. We are seeing models like Qwen three Coder Next supporting up to a million tokens in their context window, which is incredible for a local model. But the physical hardware hits a performance wall long before the model's math does. For single file analysis, thirty-two K to a hundred twenty-eight K tokens is the sweet spot. Once you push into that two hundred thousand to four hundred thousand token range for a full repository analysis, you need at least sixty-four gigabytes of unified memory or V RAM to avoid massive speed degradation.
So if I have thirty-two gigabytes of V RAM, and I try to load a four hundred thousand token context, what happens? Does it just crash, or does it become a very expensive space heater?
It becomes a very slow space heater. The system starts swapping memory, the attention mechanism becomes inefficient, and suddenly that fifty tokens per second generation speed drops to two. And more importantly, the model starts to lose accuracy. It misses the function definition on line four hundred because it is overwhelmed by the sheer volume of data it is trying to hold in its active memory. It is not just that it is slow; it is that it gets stupider. This is where the sparse attention mechanism in models like DeepSeek V three point two Speciale comes in. They have managed to reduce the key value cache memory usage by up to ninety percent, which is the only reason we can even talk about running these long context agents on local hardware.
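The memory wall Herman describes can be sized roughly. The model shape below is a hypothetical dense model with grouped-query attention; none of these dimensions come from a published spec, and the ninety-percent reduction figure is the one quoted in the discussion:

```python
# Rough KV-cache sizing: two planes (K and V) per layer, per KV head,
# per head dimension, per token, at fp16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 48, 8, 128, 2  # assumed model shape

def kv_cache_gb(context_tokens: int, reduction: float = 0.0) -> float:
    """GB of K/V cache at a given context length, with optional sparse-attention savings."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return per_token * context_tokens * (1 - reduction) / 1e9

print(f"400K dense cache : {kv_cache_gb(400_000):.1f} GB")
print(f"400K, 90% sparse : {kv_cache_gb(400_000, 0.90):.1f} GB")
```

Under these assumptions a dense 400K-token cache alone blows past a sixty-four gigabyte machine before you even count the model weights, while the sparse version fits comfortably, which is exactly why sparse attention is what makes long-context local agents possible at all.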
I love that name, Speciale. It sounds like a limited edition pizza. But it is actually doing some pretty heavy lifting with the memory. I want to go back to something you mentioned earlier, the Model Context Protocol, or M C P. I saw a debate on the forums last week about M C P overhead. People are saying that the tools we are giving these agents are actually eating our context window. It is like I am giving the agent a toolbox, but the toolbox is so heavy it cannot remember what house it is supposed to be building.
They are absolutely right. This is a huge realization from the last two weeks. If you are using M C P servers like Playwright for web browser automation or various database tools, the definitions for those tools have to live in the context window so the agent knows how to use them. In some cases, just defining the tools can consume seven to nine percent of a hundred twenty-eight K context window before you even type your first line of code. If you have ten different tools connected, you are losing a massive chunk of your working memory just to the manual for the tools.
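The context tax is easy to check: tool definitions cost a fixed number of tokens, so the percentage depends entirely on the window size. The per-tool token counts below are illustrative assumptions, not measured MCP definition sizes:

```python
# Fixed-size tool manuals as a share of different context windows.
TOOLS = {  # assumed definition sizes in tokens
    "playwright": 3_500,
    "postgres": 2_800,
    "filesystem": 1_200,
    "git": 1_500,
}

def overhead_pct(window: int) -> float:
    """Percent of the context window consumed before any code is typed."""
    return 100 * sum(TOOLS.values()) / window

for window in (32_000, 128_000):
    print(f"{window:>7}-token window: {overhead_pct(window):.1f}% spent on tool manuals")
```

Nine thousand tokens of definitions is about seven percent of a hundred twenty-eight K window, matching the range quoted, but over a quarter of a thirty-two K window; the tax gets worse the smaller your context is.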
That is like paying a ten percent tax just for the privilege of having a toolbox on your belt. And those definitions do not shrink with the window. If I only have thirty-two K of context, the same set of tools could eat a quarter of it or more before I get to my actual code. That seems like a massive design flaw in how we are building these agents.
It is forcing a shift in how we build these agents. We are moving toward something called progressive disclosure. Instead of loading every tool the agent might ever need at the start of the session, the system only injects the tool definitions when the agent actually expresses a need for them. It is a more dynamic way of managing that precious local memory. It is the difference between carrying a hundred pound backpack of tools everywhere and having a drone drop off the specific wrench you need when you ask for it. L M Studio zero point four point eight actually just launched with native support for this kind of M C P management, which they are calling L M Link.
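A minimal sketch of the progressive-disclosure idea. The registry, tool names, and matching logic here are hypothetical; real implementations, including whatever L M Studio's L M Link does internally, will differ:

```python
# Progressive disclosure: tool definitions are injected into the context
# only when the agent actually asks for the capability.
TOOL_REGISTRY = {
    "browser": "playwright: open pages, click elements, take screenshots ...",
    "database": "postgres: run SQL queries against the local database ...",
}

class Context:
    def __init__(self) -> None:
        self.loaded: dict = {}  # tool definitions currently in the window

    def request_tool(self, capability: str) -> str:
        """Load a definition lazily, the first time the agent needs it."""
        if capability not in self.loaded:
            self.loaded[capability] = TOOL_REGISTRY[capability]
        return self.loaded[capability]

ctx = Context()
print(sorted(ctx.loaded))    # [] -> nothing preloaded at session start
ctx.request_tool("browser")  # agent expresses a need for a browser
print(sorted(ctx.loaded))    # ['browser'] only; the database manual never loads
```

The session starts with an empty toolbelt, and only the wrench that was actually requested ever costs context tokens.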
I like the drone idea. It feels very twenty twenty-six. But let us talk about the hardware again. If I am someone who wants to stop paying for cloud tokens and go fully local for my coding work, what am I actually buying? You mentioned the M five Max, and we talked about the fifty ninety. Is there a middle ground, or are we in a world where you either have an agent computer or you have a toy? Because I look at the prices of these things and my wallet starts crying.
The middle ground is disappearing, honestly, but there are clever ways around it. The A M D Ryzen A I Max Plus is a great example of where the market is going for a single-box solution. But if you do not want to drop five thousand dollars on a single workstation, we are seeing the rise of disaggregated inference. This is where companies like ExoLabs come in. If you have three or four older Mac Minis or a couple of P C s with older cards, ExoLabs lets you link them together over a peer-to-peer network to run a massive four hundred five billion parameter model.
Wait, so I can take the three old laptops in my closet, string them together with some digital duct tape, and suddenly I have a supercomputer that can actually run a decent coding model? That sounds too good to be true. What is the catch?
Digital duct tape is a good way to put it. The catch is latency. You are splitting the memory load across multiple machines, so the data has to travel over your local network. It is not as fast as a single high-bandwidth chip like the M five Max, but it makes these massive models viable for people who already have hardware lying around. It turns your home office into a distributed data center. For an agent that is doing background tasks, that latency might not matter as much as the sheer intelligence of a four hundred billion parameter model.
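A toy model of that latency trade-off. The per-model compute time and per-hop network latency below are made-up illustrative numbers, not ExoLabs measurements; the point is only that each extra machine in a pipeline split adds hops to every token:

```python
# One token must pass through every pipeline stage in sequence, so total
# compute stays the same but each machine boundary adds a network hop.
def per_token_ms(machines: int, compute_ms: float = 25.0, hop_ms: float = 5.0) -> float:
    """Latency per generated token for a model pipelined across N machines."""
    return compute_ms + (machines - 1) * hop_ms

for n in (1, 4):
    ms = per_token_ms(n)
    print(f"{n} machine(s): {ms:.0f} ms/token -> {1000 / ms:.0f} tok/s")
```

A single box wins on interactive speed, but if four old machines are the only way the model fits in memory at all, a modest tokens-per-second penalty is a fine trade for a background agent.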
It is a bit humbling, isn't it? My computer is now being optimized for the thing living inside it, and I am just the guy who provides the electricity and the vague instructions. I am basically the landlord for an A I.
It is a massive shift in architecture. We are seeing this even in the software layer. Ollama zero point five point x just introduced a headless mode called ollama launch. It is designed specifically for these autonomous agents. You do not even need a user interface. The agent just lives in the background, waits for a command from your code editor, spins up the model, does the work, and then stays ready for the next task. It is turning the local machine into a silent, high-performance A P I. The human doesn't even see the model anymore; they just see the results in their I D E.
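What that silent, high-performance A P I looks like from the editor's side. The request shape follows Ollama's documented /api/generate REST endpoint; the model name and prompt are placeholders, and this sketch only builds the payload rather than hitting a live server:

```python
# Build the JSON body an editor plugin might POST to a headless local
# Ollama server at http://localhost:11434/api/generate.
import json

def build_request(model: str, prompt: str) -> str:
    payload = {
        "model": model,        # placeholder model name
        "prompt": prompt,
        "stream": False,       # one JSON response instead of a token stream
    }
    return json.dumps(payload)

body = build_request("qwen3-coder", "Refactor this function to be async:")
print(body)
```

No window, no chat interface; the human only ever sees the result that comes back into the I D E.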
I love the idea of a silent partner. No small talk, no asking me how my weekend was, just high-speed code generation. But I do wonder about the energy cost. We are talking about these fifty nineties and M five Maxes running at full tilt. If I am running this all day, am I going to see my power bill spike? I mean, my fifty ninety already makes my room feel like a sauna.
Efficiency is improving, but there is no free lunch. A fifty ninety under full load is going to pull a lot of power. However, when you compare the cost of that electricity to the cost of a hundred dollars a month in cloud tokens for a high-volume developer, the local hardware usually pays for itself in less than a year. Plus, there is the privacy aspect. If you are working on a proprietary codebase for a client, being able to tell them that not a single line of their code ever left your local network is a huge selling point. In twenty twenty-six, privacy is a luxury that local hardware provides.
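Rough numbers on the electricity side of that comparison. The power draw, daily usage, and electricity price below are assumptions; the hundred-dollar cloud figure is the one quoted in the discussion:

```python
# Monthly electricity cost of a heavily used local GPU versus a cloud bill.
GPU_WATTS = 575          # assumed full-load draw for a top-end consumer card
HOURS_PER_DAY = 8        # assumed working hours at load
PRICE_PER_KWH = 0.15     # USD, assumed
CLOUD_PER_MONTH = 100.0  # cloud-token figure quoted in the discussion

power_per_month = GPU_WATTS / 1000 * HOURS_PER_DAY * 30 * PRICE_PER_KWH
print(f"electricity: ${power_per_month:.2f}/month vs ${CLOUD_PER_MONTH:.2f}/month in cloud tokens")
```

Under these assumptions the sauna effect is real but the power bill is a fraction of the cloud bill; the bigger variable in any payback calculation is the upfront hardware cost.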
That is a big deal. Especially with all the concerns about training data and who owns what. If it stays on my desk, it stays my business. So, if we are looking at the benchmarks Daniel asked about, let us summarize for the people who are ready to go shopping. If I am building my agent computer today, I am looking for at least two hundred tokens per second on prefill, fifty tokens per second on generation, and at least sixty-four gigabytes of high-bandwidth memory. Is that the shopping list?
That is the professional shopping list for March twenty twenty-six. Anything less and you are going to feel the friction. You want to prioritize memory bandwidth over raw compute power. A chip that can do massive floating point operations but has slow memory is going to be a bottleneck for an agent every single time. And keep an eye on that key value cache efficiency. Models that use multi-latent attention, like the DeepSeek R one series, are going to be much more viable on hardware that might feel slightly dated because they are just so much more efficient with how they use their memory.
It is funny how the conversation has changed. Two years ago, we were just happy if the thing could finish a sentence without hallucinating that it was a pirate. Now we are complaining if it takes three seconds to read a ten thousand line repository. We have become very spoiled, very quickly.
We have, but the demands of the work have scaled too. We are not just asking it to write a poem anymore. We are asking it to refactor an entire microservice architecture. That requires a level of throughput that simply didn't exist in the consumer space until very recently. The Qwen three Coder Next eighty billion parameter Mixture of Experts model is really the gold standard for a sixty-four gigabyte setup right now. It only uses about three billion active parameters for any given token, which is why it is so fast, but it has the intelligence of a much larger model.
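The speed of that Mixture of Experts design falls out of a simple bandwidth bound: generation is roughly limited by how many weight bytes must stream from memory per token. The six hundred fourteen gigabytes-per-second figure is the one quoted earlier; the four-bit quantization width is an assumption, and this ignores KV-cache traffic:

```python
# Upper bound on tokens per second from memory bandwidth alone.
BANDWIDTH_GBS = 614    # M5 Max bandwidth figure quoted earlier
BYTES_PER_PARAM = 0.5  # assumed 4-bit quantization

def max_tps(active_params_b: float) -> float:
    """Bandwidth-limited generation ceiling for a given active parameter count."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBS * 1e9 / bytes_per_token

print(f"dense, 80B active : ~{max_tps(80):.0f} tok/s ceiling")
print(f"MoE, 3B active    : ~{max_tps(3):.0f} tok/s ceiling")
```

All eighty billion parameters still have to fit in memory, but only the three billion active ones stream per token, which is the whole trick: library-sized knowledge at a few-librarians reading speed.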
That is the magic of Mixture of Experts, right? It is like having a giant library, but you only ever have two or three librarians working at a time. You get the knowledge of the whole library without having to pay the salaries of a thousand people.
That is exactly the logic. And when you combine that with something like the new Ollama release, it gets even more interesting. We are seeing the hardware and software finally shake hands. For anyone listening who wants to dive deeper into how we got here, we actually covered the start of this throughput gap in episode one thousand seventy-eight. Back then, we were just starting to see the frustration of these agents hitting a wall, and it is amazing to see how the hardware has finally caught up in twenty twenty-six.
Yeah, it is a completely different world. I think the takeaway for me is that I need to stop looking at the gigabytes of V RAM and start looking at the gigabytes per second of bandwidth. It is about the flow, not just the capacity. If the model fits but the prefill is slow, I am still going to be frustrated.
And do not forget the progressive disclosure for those M C P tools. Do not let your toolbelt weigh you down before you even start the job. If you are setting up an agent, look for tools that support dynamic loading. It will save your context window for the actual code.
It really makes me wonder what the next bottleneck will be. If we solve memory bandwidth and prefill speed, what is next? Is it just going to be the speed of the human at the other end? Are we going to be the ones hanging while the A I waits for us to read its output?
Honestly, we are already hitting that. That is why the agentic part is so important. We are moving away from the model waiting for the human, and toward the model just doing the work and presenting the result. The bottleneck then becomes how fast we can review and approve the agent's work. But in terms of hardware, I think we will start seeing specialized A I processors that ignore display output entirely. We do not need a graphics card to show us pretty pictures if the primary user is an agent that only cares about text and logic.
A computer with no monitor. Just a black box that thinks really fast. It sounds a bit like a sci-fi movie from the nineties, but I guess that is just our reality now. I can imagine a future where I just have a stack of these boxes in a closet, humming away, and I just interact with them through my glasses or a simple terminal.
It is already happening. The headless mode in Ollama is the first step toward that. We are decoupling the thinking from the showing. And for anyone who wants to understand the history of this, check out episode six hundred thirty-three, where we talked about the early memory wars. It gives some good historical context to these bandwidth struggles we are seeing today.
Great call. Well, I think my roommate, the agent computer, is starting to get warm, so I should probably go give it some work to do. This has been a fascinating look at where the hardware is actually at right now. I feel a lot better about my fifty ninety purchase now, even if it does mean I have to wear shorts in my office during the winter.
It really is the most exciting time to be building locally. The tools are finally matching the ambition. Just remember: prefill is king, bandwidth is queen, and V RAM is just the floor you walk on.
For sure. If you are looking to upgrade your setup, definitely check out the benchmarks we mentioned. There is a lot of good data out there from this month that can save you a lot of money and frustration. Do not just buy the biggest number on the box; buy the number that actually moves the tokens.
And keep an eye on those sparse attention models. They are the secret sauce for making modest hardware punch way above its weight class.
Well, that is it for our deep dive into the agent computer era. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes and making sure our own prefill speeds are up to par.
And a big thanks to Modal for providing the G P U credits that power this show. They are doing some incredible work in the serverless space that complements everything we talked about today. Even if you go local, having a cloud burst option is always a smart play.
This has been My Weird Prompts. If you are enjoying the show, a quick review on your favorite podcast app really helps us out and helps other people find these deep dives into the guts of our new A I roommates.
You can also find us on Telegram by searching for My Weird Prompts to get notified the second a new episode drops. We post a lot of the raw benchmark data there too.
See you in the next one. I am going to go see if my agent has finished that refactor yet.
Goodbye.