Daniel sent us this one — and it's one of those topics where I think a lot of developers just kind of nod along but haven't actually sat down and understood what's happening under the hood. He's working with audio pipelines, sending voice data to transcription APIs, and he keeps running into this question: what actually is Base64 encoding, why do so many APIs want it, and what are the real limits when you're pushing audio files through it? He mentions figuring out he could send about an hour of audio through a Base64 payload, which surprised him. He wants clarity for people building with these things for the first time.
Before we dive in — fun fact, DeepSeek V four Pro is writing our script today. So credit where it's due, the model's doing the heavy lifting on this one.
DeepSeek, appreciate it. Now, Daniel also threw in a quick caution about DeepAgent, the framework we talked about yesterday. He's noticed recursive tool-calling loops where agents keep hitting the same URL over and over — he mentioned nytimes.com as an example — and said you have to put in guardrails. But he wants us to focus on Base64, and I think that's the right call because this is one of those things that's everywhere once you start looking.
It really is. And the thing I want to say right at the top, because I think it's the single biggest misconception people have — Base64 is not compression. It's the opposite, actually. You're taking binary data and expanding it by about thirty-three percent. So when Daniel says it seemed insane to take an audio file and turn it into this huge glob of numbers, he was exactly right to be suspicious. It is inefficient in terms of size. But the reason we do it is about safety and compatibility, not efficiency.
Expand on that. What do you mean by safety?
Binary data is just raw bytes — and some of those bytes, when you try to send them through systems designed for text, will break things. Email protocols, JSON payloads, XML, even some HTTP headers — they were all built assuming the data is readable text. A null byte, a carriage return, a byte that looks like an end-of-file marker — any of those can corrupt your transmission or get stripped out entirely. Base64 solves this by taking any arbitrary binary data and representing it using only sixty-four characters that every text-based system can handle safely. Uppercase A through Z, lowercase a through z, digits zero through nine, plus and forward slash. That's it. Sixty-four characters. Nothing in that set will confuse a text parser.
It's essentially a safe transport encoding. You're not making it smaller, you're making it survivable.
And the name tells you what it is — Base64, meaning base sixty-four. You're representing data in a numbering system with sixty-four symbols instead of the two symbols of binary or the two hundred fifty-six values of a raw byte. Every three bytes of input becomes four characters of output. That's where the thirty-three percent overhead comes from. Three bytes is twenty-four bits. You split those twenty-four bits into four groups of six bits each. Each six-bit group maps to one of the sixty-four characters. So three bytes in, four characters out. If your input isn't a multiple of three bytes, you get padding with equals signs at the end.
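To make that rule concrete, here's three-bytes-in, four-characters-out using Python's standard base64 module — the sample inputs are arbitrary:

```python
import base64

# Three bytes (24 bits) always encode to exactly four characters.
print(base64.b64encode(b"Man"))  # b'TWFu' -- no padding needed

# Shorter inputs get '=' padding to fill out the final four-character group.
print(base64.b64encode(b"Ma"))   # b'TWE=' -- two bytes in, one '='
print(base64.b64encode(b"M"))    # b'TQ==' -- one byte in, two '='
```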
This is why when you look at a Base64 string, it looks like gibberish — but it's not random. It's a direct, reversible mapping.
Every Base64 string is just a different representation of the same underlying bits. It's like writing a number in hexadecimal instead of decimal — same value, different notation. Base64 is just a more compact text representation than hex, which would use two characters per byte. Base64 uses about one point three characters per byte on average. Hex uses two characters per byte. So Base64 is more efficient than hex for text transport, even though it's still bigger than the raw binary.
Daniel's audio pipeline — he records a voice note on his phone or whatever, and that audio file is binary data. If he's sending it to an API that expects JSON, he can't just dump raw bytes into a JSON string. The JSON parser would choke on it. So he Base64-encodes the audio, wraps it in a JSON field, and sends it. The server decodes it back to binary and processes it.
That's exactly the flow. And the question he asked about limits — how much audio can you actually send this way — that's where things get interesting. It's not really a Base64 limit. It's about the limits of the systems at both ends. Base64 itself can encode arbitrarily large data. You could encode the entire Library of Congress in Base64 if you had enough memory and time. The practical limit comes from API payload size restrictions, memory constraints on the server, and timeouts.
Daniel mentioned he calculated about an hour of audio. Let's walk through that math.
Let's take a typical compressed audio format. If you're recording voice for transcription, you're probably using something like MP3 or Opus or AAC. For voice, you don't need high bitrate. A mono MP3 at sixty-four kilobits per second is perfectly intelligible. Let's do the math. Sixty-four kilobits per second. That's eight kilobytes per second. For one minute, that's about four hundred eighty kilobytes. For an hour, that's about twenty-eight point eight megabytes. Now, Base64 encode that, you get about thirty-eight point four megabytes of text. That's the payload size.
Most modern APIs — REST APIs, JSON endpoints — what's their typical payload limit?
It varies, but ten megabytes is a very common limit for a lot of web servers and API gateways. Some go up to fifty, some to a hundred. AWS API Gateway, for example, has a hard limit of ten megabytes for REST APIs. If you're using HTTP APIs on API Gateway, you can go up to three hundred kilobytes by default, but you can increase that. So a thirty-eight megabyte Base64 payload would blow past a lot of default limits. But Daniel said he's using Cloudflare R2 for storage, and he's probably not sending the Base64 through API Gateway — he's likely uploading directly to R2, which handles much larger files via presigned URLs or direct uploads.
The pipeline might be: audio gets recorded, Base64 encoded, sent as a JSON payload to a webhook, the webhook uploads it to R2, and then a transcription service picks it up from R2. The Base64 step is just the bridge from the client to the first server.
And if the transcription service itself accepts Base64 — which many do — you might skip the intermediate storage entirely and just send the Base64 directly to the transcription endpoint. OpenAI's Whisper API, for instance, accepts multipart form data with the audio file, not Base64 in JSON. But other services, especially older or more enterprise-focused ones, do accept Base64 inline. It really depends on the API design.
Let's talk about the alternatives Daniel mentioned. He brought up S3 buckets — or R2 in his case — as the other main approach. Direct file upload versus Base64 inline. What are the tradeoffs?
Direct file upload is almost always better for large files. You get a presigned URL from your server, the client uploads directly to object storage, and then you pass a reference — a URL or an object key — through your pipeline. The audio data itself never goes through your application server as a Base64 string. That means your server doesn't have to hold a thirty-megabyte string in memory, decode it, and then do something with it. For large files, that memory pressure is real. If you have a hundred concurrent uploads and each one is a thirty-eight megabyte Base64 string, you're looking at nearly four gigabytes of memory just for the strings, plus the decoded binary buffers.
The direct upload approach scales much better. But there's a complexity cost, right? Now you're managing presigned URLs, you've got an additional network round trip, you have to handle the upload lifecycle.
And for small payloads — say, a short voice command that's a few seconds long, maybe a hundred kilobytes of audio — Base64 in a JSON payload is actually simpler. One request, one response, done. No presigned URL dance, no separate upload step, no cleanup. The simplicity is real. The line where it stops making sense is when the payload size starts hitting API limits or causing noticeable latency from the encoding and decoding overhead.
There's a middle ground that some people miss — multipart form data. You're sending the raw binary as part of an HTTP request, but it's not Base64 encoded. The binary data goes in its own part of the multipart message, with proper content-type headers. The server receives it as a file upload, same as if you'd submitted an HTML form with a file input. That's what OpenAI's transcription endpoint uses.
Multipart is underappreciated. It gives you the simplicity of one request without the thirty-three percent overhead of Base64. The binary data travels as binary. The text fields — your API key, your model selection, your parameters — travel as text. Everyone's happy. The only limitation is that not all client libraries make multipart uploads as ergonomic as a simple JSON body. And if you're going through certain proxies or middleware, multipart can get mangled.
We've got three approaches. One: Base64 inline in JSON — simple, compatible, but thirty-three percent overhead and memory-hungry. Two: direct file upload to object storage — scalable, efficient, but more moving parts. Three: multipart form data — efficient, single request, but less universally supported. Is that a fair summary of the landscape?
I'd add a fourth, actually. If your audio is being generated in real time — like from a microphone in a browser — you can stream chunks to a WebSocket or use something like WebRTC. The audio arrives at the server as a stream of binary chunks. No encoding, no temporary files, no presigned URLs. Deepgram's API supports this really well. So does AssemblyAI. You open a WebSocket connection, start sending audio chunks as they're captured, and you get transcription results back incrementally. For real-time use cases, it's the only approach that makes sense.
That's where a lot of the agentic audio pipelines are heading — real-time voice agents that don't wait for a full recording before they start processing. But Daniel's use case seems to be more batch-oriented. He records a prompt, sends it in, it gets processed. So the streaming approach might be overkill for his pipeline.
But it's worth knowing about because once you've built the batch pipeline and it's working, someone's going to ask, "Can we make this real-time?" And the answer is yes, but the architecture is completely different.
Let's go back to Base64 for a minute. There's a nuance I think a lot of developers miss when they're first working with it. Daniel mentioned looking at a Base64 string and seeing a huge glob of numbers. But actually, the characters aren't just numbers — they're a specific alphabet. And there's more than one version of Base64.
This trips people up all the time. The standard Base64 alphabet uses plus and forward slash as the last two characters. But if you're putting Base64 in a URL, those characters are problematic — plus becomes a space in query strings, forward slash looks like a path separator. So there's a URL-safe variant that uses minus and underscore instead. And then there's the padding question. Some implementations strip the trailing equals signs because they can be inferred from the string length. Others require them. If you're working with multiple systems that each have slightly different Base64 expectations, you can spend hours debugging why your perfectly valid Base64 string is being rejected.
You've got to know which flavor your API expects. And the documentation isn't always clear about it.
The documentation is almost never clear about it. You usually find out by trial and error. Send a Base64 string with padding — rejected. Strip the padding — accepted. Or vice versa. It's one of those things where reading the source code of the client library is faster than reading the docs.
Another thing that catches people — and this relates to Daniel's question about limits — is that when you Base64-encode something, you're creating a string in memory. In most programming languages, strings are immutable and stored in a particular way. If you're working in JavaScript or Python and you Base64-encode a fifty-megabyte audio file, you're creating a string that's about sixty-seven megabytes long. That's going to put pressure on the garbage collector, it's going to take time to allocate, and if you're doing it in a serverless function with limited memory, you might just crash.
This is why for larger files, you want to stream the encoding. You don't load the entire file into memory, encode it, and then send it. You read chunks, encode each chunk, and stream the output. Most Base64 libraries support this, but the default examples usually show the all-at-once approach because it's simpler to demonstrate.
On the decoding side, same thing. If your server receives a sixty-seven megabyte Base64 string in a JSON body, the JSON parser has to allocate that entire string before you can even start decoding it. Then you decode it, which creates another fifty-megabyte buffer. You've now got over a hundred megabytes allocated just to handle one audio file. That's fine for one request, but at scale it adds up fast.
This is actually why a lot of production systems use the direct upload approach even for relatively small files. It's not that Base64 doesn't work — it's that it shifts the burden onto your application servers, which are often the most expensive and least scalable part of your infrastructure. Object storage is cheap and scales horizontally almost effortlessly. Your API server, not so much.
The architectural principle here is: keep your application servers doing application logic, not acting as file transfer intermediaries. Base64 is a bridge, not a highway.
Use it when you need to get binary data through a text-only channel, and the data is small enough that the overhead doesn't matter. For everything else, find a way to move the binary data directly.
Let me ask you something about the encoding itself. You said every three bytes becomes four characters. What happens if your input is exactly one byte? Or two bytes?
If you have one byte — eight bits — you pad it to twelve bits with zeros, split into two six-bit groups, output two characters, and then add two equals signs for padding. If you have two bytes — sixteen bits — you pad to eighteen bits, split into three six-bit groups, output three characters, and add one equals sign. The equals signs are how the decoder knows how much padding was added. If you see two equals signs, you know the original input was one byte. One equals sign means two bytes. No padding means the input was a multiple of three bytes.
If you strip the padding, the decoder can still figure it out from the string length. A Base64 string with no padding that has a length not divisible by four — you can work backward to determine the original length.
If the unpadded length modulo four is two, the final group encoded one byte; if it's three, two bytes; if it's zero, the input was an exact multiple of three bytes. The padding is redundant information, which is why it can be safely stripped in contexts where string length is preserved.
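Here's that working-backward rule as a small helper — a hypothetical function, just to make the length arithmetic explicit:

```python
def decoded_length(unpadded_len):
    # Each full group of four characters decodes to three bytes; a trailing
    # group of two or three characters decodes to one or two bytes.
    full_groups, remainder = divmod(unpadded_len, 4)
    if remainder == 1:
        raise ValueError("not a valid unpadded Base64 length")
    return full_groups * 3 + {0: 0, 2: 1, 3: 2}[remainder]

print(decoded_length(2))  # 1 -- one input byte
print(decoded_length(3))  # 2 -- two input bytes
print(decoded_length(4))  # 3 -- one full three-byte group
```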
This is the kind of thing that, once you understand it, you stop being mystified by those equals signs at the end of Base64 strings. They're not random. They're telling you something specific about the original data.
The whole scheme is elegant in its simplicity. It's from the early days of email attachments — MIME, the Multipurpose Internet Mail Extensions standard, specified Base64 as a way to send binary files through email, which at the time was purely text-based. That was in the early nineties. We've been using this for over thirty years, and it's still everywhere. Every time you look at a PNG image embedded in an HTML page as a data URL, that's Base64. Every time you see a certificate file in PEM format, that's Base64. Every JWT token you decode has Base64-encoded segments. It's one of those quiet workhorses of the internet.
People keep reinventing it badly. I've seen developers try to roll their own encoding schemes because Base64 "seems wasteful." They usually end up with something that breaks on a different character set or chokes on a null byte. The thirty-three percent overhead is a feature, not a bug — it's the price of universal compatibility.
There's actually a variant called Base85, used in PDF files and some other formats, that uses eighty-five characters and only has about twenty-five percent overhead. But it's more complex to implement and the characters it uses include things like parentheses and angle brackets that can cause issues in some contexts. Base64 hit the sweet spot of simplicity and safety.
Let's circle back to Daniel's specific use case — audio transcription pipelines. He's got this setup with Cloudflare R2. Walk me through what you think the ideal architecture looks like for someone building something similar.
You've got a few different patterns. If you're building a mobile app or a web app where users record audio and you need to transcribe it, the cleanest approach I've seen is: client records audio, gets a presigned upload URL from your backend, uploads directly to R2 or S3, and then your backend gets notified — either via a webhook or by polling — that a new file is available. The backend then sends the file to your transcription service. If the transcription service can read directly from your object storage, even better — you just pass it the URL and it fetches the file itself.
Where does Base64 fit into that? It might not.
It might not. And that's fine. Base64 is not something you should feel obligated to use. It's a tool for a specific job. The job is: I have binary data and I need to put it somewhere that only accepts text. If your architecture doesn't have that constraint, you don't need Base64. Daniel mentioned that a lot of APIs want Base64 — and that's true for some older or more enterprise APIs — but increasingly, the modern approach is direct binary uploads.
There's also the question of what happens when something goes wrong. With a direct upload, if the upload fails, the client can retry just the upload. With Base64 inline, if the request fails, you're re-sending the entire payload, including the encoded audio. For a thirty-megabyte payload on a flaky mobile connection, that's painful.
If you're building for mobile, you should also think about what happens when the app goes into the background. On iOS, you've got a limited time to finish your network request before the system suspends you. If you're Base64-encoding a large audio file on the main thread and then trying to send it, you might not finish before the suspension kicks in. Direct uploads with background upload tasks are much more reliable for that scenario.
For mobile, direct upload is almost always the right call.
For anything over maybe one or two megabytes, yes. Below that, the simplicity of Base64 inline can be worth it. You save the round trip for the presigned URL, you save the complexity of handling the upload lifecycle, and the memory and latency overhead is negligible.
There's one more thing I want to touch on. Daniel mentioned looking at a Base64 string and thinking, "This is a huge glob of numbers." And that's the other thing about Base64 — it's not human-readable in any meaningful sense. You can't look at a Base64 string and tell if it's an audio file or an image or a PDF. It all looks the same. That can make debugging harder. If something goes wrong in your pipeline, you can't just glance at the payload and say, "Oh, that's clearly a corrupted MP3."
Although — and this is a small thing — you can sometimes identify the file type from the first few bytes after decoding. Many file formats have magic bytes at the beginning. A PNG file always starts with a fixed eight-byte signature that includes the ASCII characters P, N, G. An MP3 file starts with a frame sync word of eleven set bits, or an ID3 tag. If you decode the first few characters of the Base64 and look at the raw bytes, you can identify the format. But you're right, in the encoded form, it's opaque.
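A toy sniffer along those lines — the signatures are real (PNG's header, MP3's 0xFF 0xFB frame sync or ID3 tag), but the function itself is just an illustration, not production-grade format detection:

```python
import base64

def sniff(b64_payload):
    # Decoding the first eight encoded characters yields the first
    # six raw bytes -- enough to check common magic numbers.
    head = base64.b64decode(b64_payload[:8])
    if head.startswith(b"\x89PNG"):
        return "png"
    if head.startswith(b"\xff\xfb") or head.startswith(b"ID3"):
        return "mp3"
    return "unknown"

png_header = b"\x89PNG\r\n\x1a\n"
print(sniff(base64.b64encode(png_header)))  # png
```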
If you're debugging a pipeline, you want logging that captures the content type and file size before the Base64 encoding step, not after.
Ideally, you want to validate the decoded data before you send it downstream. If someone accidentally sends you a text file instead of an audio file, you want to catch that early, not after it's been sent to the transcription service and you get back an error or, worse, garbage output.
Let's talk about the security angle for a moment. Base64 is not encryption. It's encoding. Anyone who has the Base64 string can decode it back to the original binary. I've seen junior developers confuse these two concepts — they think Base64 is somehow protecting the data. It's not. It's like writing something in a different alphabet. If someone knows the alphabet, they can read it.
This is a really important point. If you're sending sensitive audio data — say, a confidential business meeting recording — and you Base64-encode it before sending it over the network, you have not added any security. You need actual encryption for that, which means TLS for the transport layer, and possibly additional encryption at rest. Base64 provides zero confidentiality. It's purely a format conversion.
The other security consideration: if your server is accepting Base64-encoded data and decoding it, you need to be careful about the size of the decoded output. This is a vector for denial-of-service attacks. Unlike a compression bomb, Base64 can't amplify — decoding always shrinks the data by about a quarter — but a huge payload still forces your server to hold both the encoded string and the decoded buffer in memory at the same time. Someone sends you an enormous Base64 field, your server tries to allocate for it, and it falls over.
This is why you should always check the size of the decoded output before allocating the buffer. Most Base64 libraries let you calculate the decoded size from the encoded string length before you actually decode. If the decoded size exceeds your limit, reject the request early.
We've got encoding overhead, memory pressure, API limits, debugging opacity, and security considerations. Base64 is simple in concept but has a lot of edges in practice.
Yet it's survived for thirty-plus years because it solves a real problem elegantly. The internet is a text-based system at its core. HTTP headers are text. JSON is text. XML is text. Email is text. Whenever you need to move binary data through these text-based channels, you need something like Base64. Until we rebuild the entire internet to be binary-native — which is not happening — Base64 is going to be with us.
There's an interesting parallel here with how we think about AI agents, actually. Daniel mentioned the DeepAgent framework and those recursive tool-calling loops. In both cases, you've got a mismatch between what the system expects and what you're giving it. With Base64, the mismatch is binary versus text. With agent loops, the mismatch is between the agent's goal and the tool's actual behavior. The agent expects that calling a URL will give it new information, but if it's the same URL with the same parameters, it gets the same result, and without a guardrail, it doesn't know to stop. In both cases, the fix is understanding the underlying mechanism, not just treating it as a black box.
That's a really good connection. And the guardrail Daniel mentioned for DeepAgent — what he called "standoff things" — those are essentially the equivalent of validating your Base64 before you send it. You're checking: have I already called this URL? Has anything changed since last time? Should I actually make this call or am I in a loop? It's the same principle of understanding the failure modes of your transport layer, whether that transport is an HTTP request or an agent's tool-calling interface.
For someone building with these things for the first time, what's the one thing you want them to take away about Base64?
That it's a tool for a specific job, not a default. Before you Base64-encode something, ask yourself: does this data actually need to go through a text-only channel? If the answer is no, you probably have a better option. Multipart form data, direct upload, or streaming. Base64 is your fallback when those aren't available.
If you do use it, understand the overhead. Thirty-three percent size increase, memory pressure from string allocation, and the fact that debugging encoded data is harder than debugging raw binary. None of these are reasons to avoid Base64 entirely — they're reasons to use it deliberately.
Test your limits. Daniel did the right thing by actually calculating how much audio he could send. A lot of developers just assume it'll work until it doesn't, and then they're debugging a production outage at two in the morning. Sit down, do the math on your bitrate, your expected recording length, the Base64 overhead, and your API's payload limit. It's ten minutes of arithmetic that can save you hours of debugging.
There's one more edge case worth mentioning. Some systems impose limits on Base64-encoded data specifically, not just on raw payload size. For example, certain API gateways will reject a JSON body if any single string field exceeds a certain length, even if the total body size is within limits. If you're embedding Base64 in a JSON field called "audio_data" and that field is sixty-seven megabytes, you might hit a field-level limit that isn't documented anywhere obvious.
Some JSON parsers have limits on string length. In Python, the default JSON library doesn't have a hard limit, but if you're using a streaming parser or a parser with security limits enabled — which you should be, in production — it might cap strings at a few megabytes. If your sixty-seven megabyte Base64 string hits that cap, the parser throws an error and your request fails. That's another reason to prefer direct uploads for large files.
The limits aren't just about Base64 itself. They're about the entire stack — the client's memory, the network's reliability, the server's JSON parser, the API gateway's field size limits, the application's memory allocation. Base64 is just one link in a chain, and the chain breaks at its weakest point.
That weakest point is usually not the Base64 encoding algorithm. It's the assumption that everything else in the pipeline can handle a giant text string. Most systems are optimized for lots of small strings, not a few enormous ones.
Let's zoom out for a second. Daniel's original question was essentially: I'm building audio pipelines, Base64 seems weird and inefficient, what are the limits, what are the alternatives, help me understand this. I think we've covered the alternatives and the limits pretty thoroughly. But I want to make sure we've actually answered the "what is Base64" part in a way that's useful for someone who's encountering it for the first time.
Let me try to distill it. Base64 is a way to represent any binary data — an audio file, an image, a PDF, anything — using only sixty-four text characters that are safe to transmit through any text-based system. It works by taking your binary data in groups of three bytes, splitting them into four six-bit chunks, and mapping each chunk to a character in the Base64 alphabet. The output is about thirty-three percent larger than the input. It's not compression, it's not encryption — it's a format translation. You use it when you have to send binary data through a channel that only accepts text. If you don't have that constraint, you have better options.
That's clean. I'd add one thing: Base64 is everywhere, and once you know what to look for, you'll see it constantly. Those long strings of apparently random letters and numbers that end with equals signs? That's Base64. In API responses, in web pages, in email headers, in configuration files. It's one of those foundational technologies that most of the internet runs on but almost nobody thinks about.
Now Daniel — and our listeners — can be among the people who actually understand it.
Before we wrap, I want to touch on something Daniel mentioned at the very beginning of his prompt. He said he lumps these technical deep-dive episodes under "continuous professional development." And I think that's exactly right. The difference between a developer who struggles with these pipelines and one who builds them confidently is often just understanding what's actually happening under the hood. Base64 isn't complicated once you spend ten minutes with it. But if you never do, it remains this mysterious thing that sometimes works and sometimes doesn't.
The recursive tool-calling issue with DeepAgent that he flagged — same thing. Once you understand that an agent doesn't have inherent judgment about when to stop, you start building guardrails. The technical details matter. They're not just trivia — they're the difference between a system that works reliably and one that fails in production.
To summarize the practical advice for someone building audio pipelines: for small audio clips — a few seconds, maybe up to a minute — Base64 inline in a JSON payload is fine. Keep it simple. For anything larger, use direct upload to object storage with a presigned URL. For real-time transcription, use WebSockets and streaming. And regardless of which approach you pick, understand the limits of every component in your pipeline — not just the one you're focusing on.
If you're working with an API that requires Base64, check which variant it expects. Standard or URL-safe? Padding or no padding? These small differences will waste hours if you don't catch them early.
Should we do the fun fact?
Let's do it.
Now: Hilbert's daily fun fact.
Hilbert: The average cumulus cloud weighs about one point one million pounds — roughly the same as one hundred elephants — and stays aloft because its weight is distributed across millions of tiny water droplets spread over a vast volume of rising warm air.
A hundred elephants floating above us at all times. Actually kind of unsettling.
I'm going to think about that every time I look up now.
And thanks to our producer, Hilbert Flumingtop, for keeping this show running.
This has been My Weird Prompts. If you want more episodes like this one — technical deep dives that actually explain what's happening — you can find us at myweirdprompts.com or on Spotify.
We'll be back soon. Until then, keep questioning your assumptions about how things work.
Especially the ones that end with equals signs.