Hey everyone, welcome back to My Weird Prompts. We are coming to you from our usual spot here in Jerusalem. It is a bit of a rainy February morning here, the kind of day where the stone walls of the Old City look almost silver, and we are huddled in the kitchen with way too much coffee. Today, we are pulling back the curtain on something that actually makes this very show possible, and it is a topic that has been sitting in our inbox for a while.
Herman Poppleberry here, and I am particularly excited about this one because it gets into the real plumbing of the internet. Our housemate Daniel, who is usually the one finding these bizarre prompts for us, sent us a voice note earlier today asking about the literal hardware and software that runs our back end. He wanted to know about the serverless Graphics Processing Unit providers like Modal. Specifically, he asked if they are just middlemen sitting on top of Amazon or if they are building their own digital fortresses.
It is a great question because, as Daniel mentioned, we use Modal to run the entire pipeline for this podcast. When he sends in a voice memo, it triggers a sequence of events that would have been impossible for a small team five years ago. We are talking about transcribing audio with Whisper, processing it with the latest large language models, and then synthesizing voices, all happening on these high powered chips that we do not own and could never afford to keep in our apartment.
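For anyone who wants to picture it, we will drop a little sketch of that pipeline in the show notes. To be clear, it is a simplified sketch following Modal's documented patterns as we understand them, not our actual production code, and the app name, GPU choices, timeouts, and file names are all placeholders.

```python
# A simplified sketch of a Modal pipeline, not our real code. The app name,
# GPU types, timeouts, and file names are placeholders; the decorators follow
# Modal's documented API as we understand it.
import modal

app = modal.App("my-weird-prompts-pipeline")

# Container image with ffmpeg and Whisper baked in; Modal builds and caches it.
whisper_image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)


@app.function(gpu="A10G", image=whisper_image, timeout=600)
def transcribe(audio_bytes: bytes) -> str:
    """Run Whisper inside a GPU container and return the transcript text."""
    import tempfile
    import whisper  # installed in the container image above

    model = whisper.load_model("base")
    with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
        f.write(audio_bytes)
        f.flush()
        result = model.transcribe(f.name)
    return result["text"]


@app.function(timeout=600)
def write_script(transcript: str) -> str:
    """Placeholder: turn the raw transcript into an episode script with an LLM."""
    # In the real pipeline this step would call a large language model; we are
    # leaving it as a stub rather than guess at a particular model or client.
    return transcript


@app.local_entrypoint()
def main():
    # Each .remote() call gets its own ephemeral container on someone else's GPU.
    audio = open("voice_note.mp3", "rb").read()  # hypothetical local file
    transcript = transcribe.remote(audio)
    print(write_script.remote(transcript)[:200])
```

The nice part is that nothing in there mentions a server. Each function just declares the container image and the chip it wants, and the platform figures out where to run it.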
Oh, absolutely not. If we tried to run a single Blackwell B two hundred in this kitchen, we would blow every fuse in the building and probably melt the floor tiles. Daniel was hitting on a really interesting tension in his prompt. On one hand, you have the massive cloud giants like Amazon Web Services and Google Cloud Platform, and on the other, you have these specialized, developer focused platforms that feel much more agile. He wanted to know how they actually manage that elasticity without just being a thin wrapper around a more expensive service.
Yeah, and the economics of it in early twenty twenty six are just wild. If you look at the price of an Nvidia B two hundred today, you are looking at forty thousand to fifty thousand dollars for a single card. That is for the chip itself. That does not include the server rack, the specialized power delivery, or the cooling requirements. For a small startup or a couple of guys in Jerusalem, buying that hardware is a non starter.
It really is. And it is not just the hardware cost; it is the expertise. These data center grade chips are not like the ones in a gaming laptop. A single B two hundred can draw up to one thousand two hundred watts of power. That is a big jump from the previous generation, the H one hundred, which topped out around seven hundred watts. So, let us dig into Daniel’s first question. Where does the hardware actually live?
From what I have gathered looking into the architecture of companies like Modal, Replicate, and Core Weave, the answer has shifted significantly over the last two years. In the early days, say twenty twenty three, a lot of serverless companies were essentially reselling instances from Amazon or Google. They would wrap a nice user interface around a standard cloud virtual machine and charge a premium for the convenience.
But that model had a major flaw, which was the margin. If you are paying Amazon’s retail prices for a Graphics Processing Unit, which can be as high as fourteen dollars an hour for a Blackwell instance, and then trying to sell it in smaller slices to developers, you are basically subsidizing your customers until you go out of business. Plus, the big cloud providers are notorious for having limited availability of the newest chips. They keep the best stuff for their own internal projects first.
Exactly. So what we have seen is the rise of the Tier Two clouds. Companies like Lambda Labs or Core Weave have built their own massive data centers specifically optimized for artificial intelligence workloads. Core Weave in particular has been fascinating to watch. They raised over a billion dollars in their Series C recently just to build out more capacity. They do not try to offer every service under the sun like Amazon does. They just offer massive amounts of compute. A provider like Modal often partners with these Tier Two giants to get bare metal access to the hardware.
That is a crucial distinction. Bare metal means they are not running on top of another company’s virtualization layer. They are talking directly to the silicon. This is how they manage to get those lightning fast boot times that Daniel was talking about. If you were sitting on top of a standard Amazon instance, you would have to wait for their operating system to boot, then your container to start, and by then, the user has already refreshed the page. Modal has built their own custom, AI native runtime that they claim is one hundred times faster than standard Docker.
I was reading their technical blog recently, and they mentioned that they use a specialized version of the Linux kernel to handle this. But let us talk about the orchestration, because that was the second part of Daniel’s question. How do they handle the elasticity? Because at any given second, they might have ten thousand developers hitting their endpoints, and then ten seconds later, it might be only one hundred. How do you keep those chips busy without wasting money?
This is where the engineering gets really impressive. The secret sauce is something called bin packing and cold start optimization. Imagine a giant warehouse full of Graphics Processing Units. Each of those chips has a finite amount of memory, or V RAM. For a B two hundred, that is one hundred and ninety two gigabytes. The orchestrator has to be like a grandmaster of Tetris, but the game is played in milliseconds.
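If that sounds abstract, here is a toy version for the show notes, with completely made-up job sizes and GPU names, just to show the shape of the problem. Real orchestrators are far more sophisticated than a first-fit loop, but the constraint is the same: every job has to fit inside some chip's memory.

```python
# A toy version of the scheduling problem, nothing like a real orchestrator:
# pack incoming jobs (each needing some amount of VRAM) onto GPUs with fixed
# memory, using a simple first-fit heuristic. All numbers are invented.
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    vram_gb: int
    jobs: list = field(default_factory=list)

    def free_gb(self) -> int:
        return self.vram_gb - sum(need for _, need in self.jobs)


def first_fit(gpus: list[Gpu], job_name: str, needed_gb: int) -> Gpu | None:
    """Place the job on the first GPU with enough free memory, if any."""
    for gpu in gpus:
        if gpu.free_gb() >= needed_gb:
            gpu.jobs.append((job_name, needed_gb))
            return gpu
    return None  # no capacity anywhere: queue the job or bring up another machine


fleet = [Gpu("b200-0", 192), Gpu("b200-1", 192)]
for job, need in [("whisper", 12), ("llama-70b", 140), ("tts", 30), ("diffusion", 80)]:
    placed = first_fit(fleet, job, need)
    print(job, "->", placed.name if placed else "queued")
```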
So the orchestrator is looking at the incoming requests and deciding which physical machine has enough free memory to handle that specific task?
Exactly! But it is even more complex than that. They are often using micro virtual machines. We have talked about the Firecracker technology that Amazon open sourced a few years ago. It allows you to launch a virtual machine in a fraction of a second. But for Graphics Processing Units, it was historically very difficult because you had to pass the hardware through to the container. The container needs to think it has direct access to that Nvidia chip, and until recently, Firecracker did not support that very well.
I remember seeing a GitHub issue about that. It looks like they finally cracked the PCIe passthrough for micro virtual machines in late twenty twenty four and early twenty twenty five. That was a huge breakthrough for the industry.
It was the missing link. Now, a provider like Modal can spin up a secure, isolated environment with full access to a Graphics Processing Unit in less than a second. But even with fast boot times, you still have the problem of the model itself. If you are running a massive model like Llama three or Mistral, you have to load gigabytes of data into the chip’s memory before you can even start processing. That is the real bottleneck.
Right, and that leads to what they call the cold start problem. If the model is not already sitting in the memory of a chip, the first user has to wait for it to load. I saw that Modal recently launched something they call Graphics Processing Unit memory snapshotting. Have you looked into that?
It is a game changer. Instead of loading the model from scratch every time, they basically take a snapshot of the chip’s memory state after the model is loaded. When a new request comes in, they can restore that snapshot almost instantly. They have shown that for some models, it can reduce the cold start time from two minutes down to just ten or twelve seconds. It is like being able to pause a video game and then resume it on a different console instantly.
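We will put a tiny sketch of the underlying pattern in the show notes: load the weights once per container, not once per request, so only the cold start pays the loading cost. We are not going to guess at the exact snapshotting flags here; the GPU string and the summarization model are arbitrary choices, and the class-based layout follows Modal's documented pattern as we understand it.

```python
# Sketch of the load-once pattern that snapshotting accelerates. The GPU type
# and the summarization model are arbitrary illustrative choices; see Modal's
# docs for the actual snapshotting options, which we are not guessing at here.
import modal

app = modal.App("warm-model-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")


@app.cls(gpu="A100", image=image)
class Summarizer:
    @modal.enter()
    def load_model(self):
        # Runs once when a container boots. This is the slow, cold-start step
        # that memory snapshotting is designed to shortcut.
        from transformers import pipeline

        self.pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    @modal.method()
    def summarize(self, text: str) -> str:
        # Every request after the first reuses the weights already in memory.
        return self.pipe(text, max_length=60)[0]["summary_text"]
```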
That is incredible. It basically makes the distinction between a warm and a cold container much smaller. But let us talk about the demand side for a second. How do they ensure they do not just run out of chips? Because if a major client suddenly decides to scale up to ten thousand concurrent functions, that could theoretically wipe out the available supply for everyone else.
This is where the multi cloud strategy comes in. While they have their primary partners like Core Weave, they also have the ability to spill over into other providers if they hit a capacity limit. They use a global scheduler that looks at the availability across multiple data centers. If Virginia is full, your request might get routed to a data center in Oregon or even Europe, depending on the latency requirements.
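For the show notes, here is a deliberately simple version of that spillover rule, with invented regions, latencies, and capacities. The real schedulers weigh far more than two variables, but the rule of thumb is the same: the closest region that still has free chips wins.

```python
# A toy spillover rule, not anyone's real scheduler: prefer the lowest-latency
# region that still has free GPUs, and fall back outward if the closest is full.
# All regions, capacities, and latency figures are made up.
regions = [
    {"name": "us-east", "latency_ms": 120, "free_gpus": 0},
    {"name": "us-west", "latency_ms": 190, "free_gpus": 4},
    {"name": "eu-west", "latency_ms": 60, "free_gpus": 2},
]


def place(regions, needed_gpus=1):
    candidates = [r for r in regions if r["free_gpus"] >= needed_gpus]
    if not candidates:
        return None  # nothing free anywhere: queue the request and wait
    best = min(candidates, key=lambda r: r["latency_ms"])
    best["free_gpus"] -= needed_gpus
    return best["name"]


print(place(regions))  # eu-west: the closest region that still has capacity
```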
And they use pricing as a throttle, right? I noticed that the price for a B two hundred on Modal is around six dollars and twenty five cents per hour, which is actually very competitive compared to the on demand prices at the big clouds.
It is, and that is because of the utilization. If you rent a Graphics Processing Unit for a month, it might cost you three thousand dollars. But if you only use it for ten minutes a day to process a few podcast episodes, you are wasting ninety nine percent of what you paid for. The serverless providers can sell those same minutes to a thousand different Daniels. They are arbitraging the idle time of these chips. It is the ultimate sharing economy for silicon.
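The arithmetic is worth seeing written down, so here is a quick back-of-the-envelope calculation for the show notes, using the rough numbers we just threw around. The prices are rounded and purely illustrative.

```python
# Back-of-the-envelope math from the numbers mentioned above; everything here
# is rounded and illustrative, not a real quote.
dedicated_per_month = 3000.00            # renting one GPU for a whole month
serverless_per_hour = 6.25               # the rough B200 rate we just mentioned
minutes_per_day = 10                     # a few podcast episodes' worth of work
days = 30

hours_used = minutes_per_day * days / 60                 # 5 hours a month
serverless_cost = hours_used * serverless_per_hour       # about 31 dollars
utilization = hours_used / (24 * days)                   # under one percent

print(f"serverless: ${serverless_cost:.2f} vs dedicated: ${dedicated_per_month:.2f}")
print(f"a dedicated box would sit idle {100 * (1 - utilization):.1f}% of the time")
```

Five hours of actual compute comes out to roughly thirty one dollars, against three thousand for the dedicated machine that sits idle more than ninety nine percent of the month.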
I also think about the transparency that Daniel mentioned. He likes being able to see the logs in real time. When he runs a command on his laptop here in Jerusalem, and it executes on a chip three thousand miles away, it feels like it is running locally. How do they handle that streaming data without it feeling laggy?
That is mostly clever networking. They use specialized protocols like g R P C and high speed backbones to tunnel the standard output of the container back to the local terminal. They are also using edge locations for the initial handshake. So your request hits a server near you, which then tunnels it through a high speed line to the actual compute cluster. It is a very layered cake of technology.
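To make the streaming part concrete, here is a tiny local-only illustration for the show notes. It is not g R P C and it is certainly not Modal's actual plumbing, just the general idea of reading output line by line as it is produced instead of waiting for the whole job to finish.

```python
# Not gRPC, obviously, but the same idea in miniature: read a process's stdout
# line by line as it is produced, rather than waiting for it to exit, which is
# what makes remote logs feel local.
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import time\nfor i in range(3):\n    print('step', i, flush=True)\n    time.sleep(1)"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:            # each line arrives as soon as it is printed
    print("remote:", line, end="")  # in the real thing this hop is a gRPC stream

proc.wait()
```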
Let us talk about the hardware limitations again. We touched on this back in episode twenty five, but it seems even more relevant now. The V RAM is still the biggest constraint, right? If you run out of memory on a Graphics Processing Unit, the whole process just crashes. You cannot just swap to the disk like you can with a standard processor.
Exactly. And that is why these providers have to be so precise with their bin packing. If they put two tasks on the same chip and they both try to use eighty percent of the memory at the same time, the whole machine goes down. So they use custom drivers and specialized container runtimes that can monitor the memory usage in real time and move tasks around if things get too crowded. It is a constant balancing act.
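For the curious, here is what peeking at a single chip's memory looks like with Nvidia's management library. You need an actual Nvidia card and driver on the machine for it to run at all, and a real orchestrator tracks this across a whole fleet rather than one host at a time.

```python
# A small taste of the monitoring side, using NVIDIA's management library
# (pip install nvidia-ml-py). Requires an NVIDIA GPU and driver on this host.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU on this machine
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

used_gb = mem.used / 1e9
total_gb = mem.total / 1e9
print(f"VRAM: {used_gb:.1f} / {total_gb:.1f} GB in use")

if used_gb / total_gb > 0.9:
    # A real scheduler would stop placing work here well before this point,
    # because going over the top does not swap to disk, it crashes.
    print("warning: this GPU is nearly full")

pynvml.nvmlShutdown()
```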
It is funny that Daniel mentioned the mystery of A P I providers versus the purity of serverless. I think he is onto something there. When you use an A P I like Open A I, you have zero visibility into what is happening. You send a request, you wait, and you get a response. You do not know if they are running on an A one hundred, an H one hundred, or a cluster of older chips. You also do not know why it is slow. Is it the model? Is it the network? Is it their database?
With the serverless container model, you are in control of the code. You can see exactly where the bottleneck is. If your image processing takes five seconds, you can look at the logs and see that four seconds were spent just downloading the weights of the model from a storage bucket. That transparency is why developers love it. It takes the magic out of it and replaces it with engineering.
That is a great point. The data transfer problem is often overlooked. These models are huge. A standard Stable Diffusion model might be five gigabytes, but some of the newer multimodal models are fifty or even one hundred gigabytes. If you have to pull that much data from a hard drive to the Graphics Processing Unit memory every time someone makes a request, it is never going to be fast.
Right, so the providers build these massive high speed internal networks. They use N V M e drives on the host machines to cache the most popular models. So when you say I want to run this specific model, the orchestrator tries to send your job to a machine that already has those files sitting in its local cache. They are essentially building a globally distributed file system just for model weights.
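Here is a very stripped-down sketch of that cache check for the show notes, with a stand-in cache directory and a stubbed-out download, since the real thing involves content-addressed storage and very fast internal networks rather than a folder on one machine.

```python
# A stripped-down cache check, with a stand-in cache directory and a stubbed
# download. None of these paths or names reflect anyone's actual layout.
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "model-weights"  # stand-in for the host's NVMe cache


def fetch_weights(model_name: str) -> Path:
    local = CACHE_DIR / model_name
    if (local / "weights.bin").exists():
        print(f"cache hit: {model_name} is already on this machine")
        return local

    print(f"cache miss: pulling {model_name} from object storage...")
    local.mkdir(parents=True, exist_ok=True)
    # Placeholder for the multi-gigabyte download from a storage bucket.
    (local / "weights.bin").write_bytes(b"")
    return local


print("loading from", fetch_weights("whisper-large-v3"))
```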
I also want to go back to the physical infrastructure for a second. Daniel asked if they own it. While many start by renting, as they scale, they almost always move toward owning or at least co locating their own hardware. Because when you own the hardware, you can modify the B I O S, you can optimize the power delivery, and you can choose exactly which networking cards to use. If you can squeeze five percent more efficiency out of a rack of servers, that is pure profit at scale.
And that is where we get into the liquid cooling. I have seen some of these newer data centers using liquid immersion cooling, where they literally dunk the entire server into a non conductive fluid to keep it cool. It sounds like science fiction, but it is becoming necessary. A rack of Blackwell chips can generate so much heat that traditional air cooling just cannot keep up. You would need a hurricane of air to move that much thermal energy.
That is wild. It makes Daniel’s point about the cooling system in his apartment even more relevant. Even if he could afford the fifty thousand dollar chip, he could not keep it cool without his electricity bill doubling and his room feeling like a sauna. By moving that problem to a serverless provider, he is essentially outsourcing the thermodynamics of the project.
He really is. And it allows for these weird, experimental workflows. Like what we are doing here. We do not need a supercomputer all the time. We just need a supercomputer for three minutes at four o'clock on a Tuesday. It democratizes the brains of the internet. We have seen people building everything from medical imaging tools to automated video editors on these platforms. Things that would have required a whole team of infrastructure engineers five years ago are now being built by solo developers in their bedrooms.
It also changes how you think about architecture. Instead of building one big, monolithic application, you build a series of small, ephemeral functions. It is much more resilient. If one part of your pipeline fails, it does not take down the whole system. And you only pay for what you use. We can run this entire show's back end for less than the cost of a couple of cups of coffee in Jerusalem.
It is a good time to be a builder, that is for sure. And I think Daniel’s curiosity about the mystery of it is something a lot of people feel. We are so used to technology being a black box. But the serverless model, at least the way companies like Modal do it, is surprisingly transparent if you know where to look. It feels more like a collaborative tool than a service you are just consuming. You feel like you are working directly with the hardware, even if that hardware is in a different time zone.
So, what do you think the future looks like for these providers? Do they eventually get swallowed up by the big three cloud giants, or do they become the new giants themselves?
I think we are going to see a split. Some will definitely be acquired because Amazon and Google want that developer experience. They want the cool factor and the ease of use. But I also think there is room for a few of these specialized providers to become massive independent companies. Because the needs of artificial intelligence are so different from the needs of a standard web server. A web server needs high availability and low latency for small packets of data. An artificial intelligence workload needs massive bandwidth, huge memory, and long running compute cycles. They are fundamentally different engineering problems.
Exactly. It is like the difference between a city bus and a heavy duty freight train. They both move things, but you would not use a bus to move ten thousand tons of coal. Well, I hope that sheds some light on the mystery for Daniel and for everyone else listening. It is a fascinating world under the hood.
It really is. And if you are out there building something with these tools, we would love to hear about it. The way people are stitching these different services together is where the real innovation is happening right now. It is not just about the models; it is about the plumbing.
Absolutely. And hey, if you have been enjoying these deep dives into the weird world of prompts and the tech that powers them, we would really appreciate it if you could leave us a review on Spotify or whatever podcast app you use. It genuinely helps more people find the show. We are trying to grow this thing, and every review counts.
Yeah, it really does make a difference for a show like ours. We love doing this, and the more people we can bring into the conversation, the better. You can find all our past episodes, including the ones we mentioned today about Graphics Processing Units and artificial intelligence inference, at our website, myweirdprompts.com.
There is a contact form there too if you want to send us your own weird prompts. Or just come find us in Jerusalem. We are the ones arguing about container orchestration over a plate of hummus at the market.
Speak for yourself, Herman. I am usually the one just trying to get the code to run while you explain the difference between a hypervisor and a micro virtual machine.
Fair enough. Well, this has been another episode of My Weird Prompts. Thanks to Daniel for the prompt and thanks to all of you for listening. We will see you next time.
Catch you later. Stay curious.
And keep those prompts coming. Bye for now.
Wait, Corn, before you stop the recording, I just thought of something. If we did get that liquid cooling setup, we could probably use the waste heat to keep the hummus warm.
Herman, no. We are not dunking servers in the kitchen. The landlord already thinks we are running a crypto mine because of the fan noise.
Seriously though, thanks for listening everyone. We will be back next week with another one. We might even talk about the specific trade offs of scaling text to speech workloads, which connects perfectly to what Daniel was asking today.
Oh, that is a good one. We can look at episode three hundred and forty six for that. Alright, now I am actually hungry. Let's go get some lunch.
Lead the way, Herman Poppleberry. I think it is your turn to pay.
Was it? I thought we were arbitraging the lunch bill today.
Nice try, Herman. Nice try.
Worth a shot. Goodbye everyone!
Goodbye!
This has been My Weird Prompts.
A production of two brothers and a very busy housemate.
Indeed.
Okay, now I am really stopping the recording. Done.
Wait, is it still on?
No, I am... wait, now it is. Just kidding. Bye!