#1218: The Digital Sandwich: Why Real-Time Voice Typing Fails

Why does voice typing feel so clunky compared to recording a memo? We explore the technical hurdles of real-time AI transcription.

Episode Details
Duration: 26:17
Pipeline: V5
TTS Engine: chatterbox-regular
AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The experience of voice dictation in 2026 remains surprisingly undignified. Users often find themselves in the "digital sandwich" pose—holding a smartphone horizontally, speaking into the microphone, and watching a cursor stutter across the screen. Despite massive leaps in artificial intelligence, there is a persistent gap between the quality of "batch" transcription (processing a recording after the fact) and "real-time" voice typing.

The Context Gap

The primary reason real-time dictation feels "jittery" compared to batch tools like Otter, or an offline Whisper run, is a lack of context. When an AI processes a pre-recorded audio file, it benefits from bidirectional context: it can look at the end of a sentence to retroactively correct an ambiguous word at the beginning.

In contrast, real-time typing is "blindfolded to the future." The model must make a high-stakes guess based only on the audio it has received up to that millisecond. This leads to the "flicker" effect, where the text on the screen constantly changes—from "there" to "their" to "they are"—as more data trickles in. This visual instability is not just annoying; it disrupts the user's flow state and cognitive process.
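One common way to hide this flicker is a stable-prefix (or "local agreement") policy: only commit the words that consecutive partial hypotheses agree on. A minimal sketch, with hand-written partial hypotheses standing in for real model output:

```python
# Commit only the prefix on which consecutive streaming hypotheses agree.
# The partial hypotheses below are illustrative, not real ASR output.

def stable_prefix(prev_hyp: list[str], curr_hyp: list[str]) -> list[str]:
    """Return the longest common word prefix of two partial hypotheses."""
    prefix = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        prefix.append(a)
    return prefix

# Simulated partials as audio trickles in: "there" -> "their" -> "they are"
partials = [
    ["there"],
    ["their", "going"],
    ["they", "are", "going"],
    ["they", "are", "going", "home"],
]

committed: list[str] = []
prev: list[str] = []
for hyp in partials:
    agreed = stable_prefix(prev, hyp)
    # Only words past what we already committed are new and safe to show.
    committed.extend(agreed[len(committed):])
    prev = hyp

print(" ".join(committed))  # the flickering guesses never reach the screen
```

Note that the unstable "there"/"their" guesses are never displayed; the cost is that committed text trails the newest audio by one hypothesis.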

The Problem of Silence

A second major hurdle is Voice Activity Detection (VAD). This is the logic that determines when a user has finished a thought. Most current systems use simple energy-based VAD, which cuts off the microphone if the volume drops below a certain level for a few hundred milliseconds.
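The energy-based approach described above can be sketched in a few lines: a frame counts as speech if its RMS energy clears a threshold, and the microphone is "closed" only after the energy stays low for a hangover window. All thresholds and frame sizes here are illustrative:

```python
# Minimal energy-based VAD: cut off after a sustained run of quiet frames.
# Threshold and hangover values are illustrative, not tuned defaults.

def rms(frame: list[float]) -> float:
    return (sum(x * x for x in frame) / len(frame)) ** 0.5

def detect_end(frames, threshold=0.02, hangover_frames=10):
    """Return the index of the frame where the VAD cuts off, or None."""
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i  # mic "closes" here
        else:
            silent_run = 0  # any loud frame resets the countdown
    return None

# Synthetic stream: 20 loud frames, then 15 quiet ones (160 samples each,
# i.e. 10 ms frames at 16 kHz).
loud = [[0.1] * 160] * 20
quiet = [[0.001] * 160] * 15
cut = detect_end(loud + quiet, hangover_frames=10)
print(cut)
```

The `hangover_frames` parameter is the "few hundred milliseconds" knob: too small and a breath ends the sentence, too large and the system feels unresponsive.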

However, human speech is naturally rhythmic and filled with pauses for breath or thought. If the VAD is too aggressive, it cuts the user off mid-sentence. If it is too passive, the system sits idle, leaving the user wondering if the app has crashed. While newer neural VAD models can now detect the "prosody" or musicality of speech to better distinguish between a thinking pause and a finished sentence, the trade-off remains a struggle between accuracy and latency.

The "Goldilocks" Solution: Buffered-Async

To solve these issues, the industry is moving toward a "buffered-async" architecture. Instead of trying to translate every single sound into a letter instantly, the system creates a small local buffer of one to two seconds. This "mini-batch" approach gives the AI enough context to handle grammar and punctuation correctly while keeping the delay short enough to feel responsive.
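The buffered-async loop can be sketched as follows. Audio accumulates in a small buffer and is only handed to the recognizer at a pause or when the buffer reaches its cap, so each commit carries a full phrase of context. Both `transcribe` and `is_pause` are stand-ins for a real ASR model and VAD:

```python
# Buffered-async sketch: flush the buffer on a pause or at ~2 s of audio.
# `transcribe` and `is_pause` are placeholders, not real implementations.

SAMPLE_RATE = 16_000
MAX_BUFFER_SEC = 2.0

def transcribe(samples):
    # Placeholder for an ASR call over the buffered mini-batch.
    return f"<{len(samples) / SAMPLE_RATE:.2f}s of audio>"

def is_pause(chunk):
    # Placeholder VAD: treat a near-silent chunk as a phrase boundary.
    return max(abs(x) for x in chunk) < 0.01

def run(chunks):
    buffer, committed = [], []
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= MAX_BUFFER_SEC * SAMPLE_RATE or is_pause(chunk):
            committed.append(transcribe(buffer))  # finalized text, no flicker
            buffer = []
    if buffer:
        committed.append(transcribe(buffer))  # flush whatever remains
    return committed

half_sec_speech = [0.1] * (SAMPLE_RATE // 2)
half_sec_silence = [0.0] * (SAMPLE_RATE // 2)
print(run([half_sec_speech] * 4 + [half_sec_silence]))
```

The size cap matters as much as the pause check: it bounds worst-case latency for speakers who never pause, while the pause check keeps latency low for everyone else.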

By waiting for a natural phrase boundary before committing text to the screen, the system eliminates the flickering effect. This creates a "shock absorber" for the model’s inference, allowing it to deliver finalized, accurate text in small, clean bursts.

Local Hardware and the Future

The shift toward seamless dictation is being accelerated by hardware. Modern chips with dedicated Neural Processing Units (NPUs) allow these complex models to run locally on the device. This eliminates the latency of sending audio to the cloud and addresses privacy concerns. As on-device AI becomes more powerful, the goal is to move away from "guessing" and toward a system that truly understands the rhythm of human thought, finally closing the gap between the spoken word and the digital page.


Episode #1218: The Digital Sandwich: Why Real-Time Voice Typing Fails

Daniel's Prompt
Daniel
Custom topic: Speech-to-text and transcription: two very different use cases that are often bucketed together. The first is voice typing, where the user speaks and the AI tool transcribes their words and inserts it
Corn
You ever find yourself holding your phone like a slice of pizza, staring at a cursor that refuses to move, while you shout at a microphone in the middle of a sidewalk? It is what people call the digital sandwich, and honestly, it is one of the most undignified looks of the twenty-twenties. You are standing there, thumb hovering over a tiny blue waveform, praying that the operating system actually captures the brilliance you are trying to dictate before a bus drives by or you lose your train of thought.
Herman
I am Herman Poppleberry, and I resemble that remark. I have definitely been that person holding the phone horizontally, trying to get a voice memo to capture a thought about a new database schema, only to have the operating system cut me off mid-sentence because I dared to take a breath to oxygenate my brain. It is the ultimate betrayal of the user interface. You are promised a hands-free future, but you end up babysitting a temperamental progress bar.
Corn
It is the worst. You are in the flow, you are trying to replace the keyboard because your hands are full, or maybe you are just tired of the repetitive strain of typing, and the technology just fights you. It is clunky, it is jittery, and it feels like it was designed by someone who has never actually tried to dictate a complex thought. Today's prompt from Daniel is about this specific friction. He is looking at the architectural divide between real-time voice typing and batch transcription. It is a great prompt because we often bucket these things together as just speech-to-text, but the implementation hurdles are night and day.
Herman
Daniel is hitting on something that frustrates every power user in twenty-twenty-six. There is a fundamental UX friction between typing and recording. When you are recording a voice note, you are essentially creating a file that gets processed later. It is a bucket of data. But when you are voice typing, you want that cursor to be an extension of your thought process. You want it to appear on the screen as you speak, but without the stuttering and the jumping around that makes current operating system dictation feel so broken. We are talking about the difference between a post-production edit and a live broadcast.
Corn
And that brings us to the two buckets. On one hand, you have the asynchronous batch processing, like what you see in a dedicated voice note app like Otter or even the native Voice Memos. You talk for five minutes, hit stop, and then a heavy-duty model like Whisper runs over the whole file and gives you a clean transcript. On the other hand, you have the real-time, operating system-level input where the A-I is trying to guess what you are saying while you are still saying it. Why is it, Herman, that the same model architecture often feels like a genius in the first case and a complete toddler in the second?
Herman
The core of the problem is context. When you give a model a five-minute audio file, it has the luxury of looking at the beginning, the middle, and the end of every sentence before it decides what a specific sound was. It can use the end of a sentence to retroactively figure out a word at the start that might have been ambiguous. It is called bidirectional context. But in real-time typing, the model is essentially blindfolded to the future. It has to make a guess based only on what has happened up to that millisecond. It is trying to solve a puzzle while the pieces are still being manufactured.
Corn
It is like trying to finish someone's sentence when they are only two words in. You might be right, but you are probably going to have to change your mind three times before they are done. That is why we see that annoying behavior where the text on the screen keeps flickering and changing as you speak. The model is constantly revising its best guess as more audio data trickles in. It is visually exhausting. You see the word "there" appear, then it changes to "their," then it changes to "they are" as the rest of the clause arrives. It makes you want to stop talking just to see if it catches up.
Herman
And that leads directly into the first big technical hurdle Daniel mentioned, which is Voice Activity Detection, or V-A-D. This is the logic that decides when you have started speaking and, more importantly, when you have finished. In a batch process, V-A-D is easy because you can see the silent gaps in the waveform after the fact. But in real-time, the system has to decide in a fraction of a second if that silence is a pause for thought or the end of the interaction. If the system gets it wrong, it either cuts you off or sits there awkwardly doing nothing.
Corn
This is the trap. If the V-A-D is too aggressive, it cuts you off the moment you hesitate. If it is too lazy, the system sits there waiting while your cursor does nothing, and you start wondering if the app crashed. I find that over-aggressive V-A-D is a total flow-state killer. If I am composing a complex email by voice, I need to be able to pause and think for two seconds without the microphone turning off or the model deciding I am done with the paragraph. It feels like the computer is constantly tapping its watch, telling me to hurry up.
Herman
Most current implementations use simple energy-based V-A-D. They are just looking at the volume of the input. If the volume drops below a certain decibel level for three hundred milliseconds, it triggers a stop. The problem is that human speech is not just a steady stream of noise. We have stop consonants, we have breaths, and we have those thinking pauses Daniel mentioned. A three hundred millisecond window is often not enough to distinguish between the letter p and a person stopping to think. It is a blunt instrument for a very delicate human behavior.
Corn
So if energy-based V-A-D is the blunt instrument, what is the scalpel? Are we seeing neural V-A-D models that actually understand the prosody of speech?
Herman
We are. There are smaller, specialized neural networks now, like Silero V-A-D, which has become a bit of an industry standard. These are trained specifically to distinguish between speech, background noise, and intentional silence. These models can look for the rhythmic patterns of a sentence that is winding down versus a sentence that is clearly mid-thought. They are looking at things like pitch decay and the spectral signature of a breath. But even with a better V-A-D, you still run into the inference latency problem. Every millisecond you spend waiting to see if the user is truly done is a millisecond of lag added to the text appearing on the screen.
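The gating Herman describes can be sketched on top of a frame-level speech probability of the kind a model like Silero VAD emits. A common trick is hysteresis: speech starts when the probability clears a high threshold and ends only after it stays below a lower one for a hold-off period. The probabilities below are hard-coded stand-ins for model output, and the thresholds are illustrative, not Silero's defaults:

```python
# Hysteresis gating over per-frame speech probabilities (stand-in values).
# A real system would get `probs` from a neural VAD, one value per frame.

def segment(probs, start_th=0.6, end_th=0.35, min_silence_frames=5):
    """Return (start, end) frame ranges judged to be speech."""
    segments, start, silent = [], None, 0
    for i, p in enumerate(probs):
        if start is None:
            if p >= start_th:
                start, silent = i, 0  # confident onset opens a segment
        else:
            if p < end_th:
                silent += 1
                if silent >= min_silence_frames:
                    # End at the first silent frame, not the current one.
                    segments.append((start, i - min_silence_frames + 1))
                    start, silent = None, 0
            else:
                silent = 0  # a mid-phrase dip does not end the segment
    if start is not None:
        segments.append((start, len(probs)))
    return segments

probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.4, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
print(segment(probs))
```

The dip to 0.4 at frame 5 sits between the two thresholds, so the segment survives it; a single-threshold gate would have chopped the phrase in half there.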
Corn
That lag is where the frustration lives. If I say a sentence and it takes a full second for the text to appear, I have already lost my train of thought. But if it appears instantly, it is often full of errors because the model did not have enough context. This ties into the context window issue. Models like Whisper are designed to process audio in thirty-second chunks. When you try to force them into a streaming mode where they are getting audio in fifty-millisecond increments, they struggle because they do not have enough surrounding data to make sense of the phonemes. It is like trying to read a book through a keyhole.
Herman
It is actually a bit worse than that because of how the attention mechanism works in these transformer-based models. Whisper wants to attend to the entire chunk of audio. If you truncate that audio, the attention weights get wonky. You end up with hallucinations or repetitions. You have probably seen this if you have used a poorly implemented Whisper stream where it just starts repeating the last three words over and over again. It is essentially the model's way of saying it does not have enough information to move forward, so it just loops on its last high-confidence guess.
Corn
That brings up a great point about the sentence boundary problem. To punctuate correctly, a model needs to know if a phrase is a statement, a question, or a clause. If I say, "When are you going to the store," the model needs to hear the word "store" and the rising intonation before it knows to put a question mark at the end. But if it is typing in real-time, it has already put a period there or left it blank. This is why real-time dictation often looks like a giant run-on sentence that only gets formatted after you stop talking. It is like watching a messy draft turn into a final copy in real-time, which is distracting.
Herman
This is where we need to talk about the difference between streaming architecture and what I like to call the buffered-async approach. The mistake a lot of developers make is trying to make the transcription truly instant. They want every phoneme to turn into a letter immediately. But that is not how humans process language. We listen in phrases. We wait for a certain amount of acoustic information to hit our ears before our brain commits to a meaning.
Corn
I like that. So instead of a constant stream of jittery guesses, you are suggesting a local buffer that waits for a natural pause, processes that small chunk, and then dumps it into the operating system cursor. It is like a high-speed batch process happening every few seconds. It is the "Goldilocks" of latency: fast enough to feel live, slow enough to be right.
Herman
That is the sweet spot. If you can implement a buffer that is, say, one to two seconds long, you give the model enough context to get the grammar and the sentence boundaries right, but the delay is still short enough that it feels like it is keeping up with your thoughts. You are essentially doing mini-batching on the fly. This allows the model to look back at the last two seconds of audio to decide if that "there" should be "their." By the time the text hits the screen, it is finalized. No flickering, no jumping.
Corn
We actually touched on this concept back in episode eight hundred fifty-seven when we were talking about real-time A-I writing buffers. The idea there was about text generation, but the principle applies even more strongly to audio. You need that small buffer to act as a shock absorber for the model's inference. It gives the A-I room to breathe.
Herman
And the hardware is finally catching up to make this viable locally. In early twenty-twenty-six, we are seeing massive optimizations for on-device A-I. If you look at what Apple has done with the M-four and M-five chips, or what the new high-end Windows laptops are doing with dedicated N-P-U hardware hitting fifty or sixty T-O-P-S—that is trillions of operations per second—we can now run models like Distil-Whisper or Faster-Whisper with incredibly low latency. We are no longer dependent on a massive G-P-U in a data center to get high-quality transcription.
Corn
Let's dig into that model landscape. Whisper was the gold standard for a long time, and in many ways, it still is the baseline for accuracy. But for this specific real-time typing use case, is it still the best choice? Or have things like Deepgram or Assembly-A-I taken the lead, especially when we talk about cloud versus local?
Herman
It depends on your priority. If you want the absolute lowest latency and you have a solid internet connection, companies like Deepgram have built proprietary architectures that are specifically designed for streaming. They move away from that thirty-second chunking logic and use models that can emit tokens as the audio arrives. Their time to first token is measured in milliseconds, which is hard to beat with a general-purpose model like Whisper. They use a Conformer architecture, which combines the local feature extraction of a convolutional neural network with the global context of a transformer. It is very efficient for this exact task.
Corn
But that requires a round-trip to the cloud, right? If I am on a patchy cellular connection in Jerusalem, or if I am just worried about my private thoughts being sent to a server, the cloud latency might actually be worse than a slightly slower local model. Plus, there is the cost factor. If I am voice typing all day, those A-P-I calls add up.
Herman
That is the trade-off. Cloud models like Deepgram or the latest models from Assembly-A-I are incredibly fast and accurate, but you are at the mercy of your network. For a keyboard replacement, I would argue that local-first is the way to go. You want that tool to work even when you are in an elevator or an airplane. This is where things like Faster-Whisper and specialized on-device optimizations come in. Faster-Whisper uses C-Translate-two, which is a fast inference engine for Transformer models. It can run the Whisper-large-v-three model on a standard laptop at four times the speed of the original OpenAI implementation.
Corn
I have been playing around with some local implementations, and I have noticed that the sheer size of the model matters less than the optimization. A small, distilled version of Whisper running on a dedicated N-P-U can often outperform a much larger model running on a general-purpose C-P-U, simply because it can keep that buffer moving without heating up your laptop or draining your battery.
Herman
The distillation process is key. Researchers have been able to take the knowledge of the large Whisper models and compress them into much smaller footprints—like Distil-Whisper—without losing much accuracy for standard English. For a voice typing tool, you do not necessarily need a model that can translate fifty languages or transcribe a medical conference. You need a model that is rock-solid on everyday conversational English and can handle your specific accent. By stripping out the parts of the model that handle Swahili or Icelandic, you get a much leaner, faster engine for your daily dictation.
Corn
That brings up the issue of filler words. One of the things I hate most about standard dictation is that it captures every "um" and "uh" and every time I repeat a word because I am searching for the next one. A batch process can easily strip those out. Can a real-time buffer do that effectively?
Herman
It can if you add a second layer to the stack. This is a trend we are seeing more of now: using a very fast A-S-R—that is Automatic Speech Recognition—model to get the raw text, and then passing that text through a tiny, local Large Language Model to clean it up. The L-L-M acts as a real-time editor. It sees the "um" and the "uh" in the raw text stream and just filters them out before they ever hit the operating system cursor. It can even fix your grammar on the fly.
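The shape of that second-stage cleanup pass can be sketched with rules instead of a model. A real pipeline would hand the raw ASR text to a small local LLM; this stand-in just strips fillers and immediate word repetitions, then fixes capitalization and terminal punctuation:

```python
# Rule-based stand-in for the LLM cleanup stage: strip fillers, collapse
# immediate repetitions, capitalize, and close the sentence.

def clean(raw: str) -> str:
    fillers = {"um", "uh", "erm"}  # illustrative filler list
    words = [w for w in raw.split() if w.lower().strip(",.") not in fillers]
    deduped = []
    for w in words:
        # Drop "I I", "the the" style stutters from searching for a word.
        if not deduped or w.lower() != deduped[-1].lower():
            deduped.append(w)
    text = " ".join(deduped)
    text = text[:1].upper() + text[1:]
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(clean("um so I I think uh we should ship the the buffer tomorrow"))
```

An actual LLM pass earns its latency budget on the cases rules cannot handle, like "scratch that" edits or restarts of a whole clause; the staging is the same either way.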
Corn
That is brilliant. It is like having a tiny editor sitting between your mouth and the keyboard. It can fix the capitalization, add the commas, and strip the verbal tics in one pass. But again, that adds another layer of latency. We are talking about an A-S-R model, then an L-L-M, then the operating system input. Can we really do all of that in under half a second?
Herman
On modern twenty-twenty-six hardware, yes. We are talking about models with maybe one or two billion parameters for the cleanup task—something like a quantized version of Phi or Llama. They are incredibly fast. The whole pipeline can happen in under two hundred milliseconds if it is optimized correctly. That is faster than the human eye can really perceive as a delay. It just feels like the text is flowing out of your mind. You speak, there is a tiny heartbeat of a pause, and then a perfectly formed sentence appears.
Corn
So if someone is building this today, what is the dream stack? Are we looking at a specialized V-A-D, a distilled Whisper model for the raw tokens, and a small language model for the final polish?
Herman
That is exactly the architecture I would recommend. You start with a robust, neural-based V-A-D like Silero, which is open-source and very lightweight. That controls the gate. Then you feed the audio into something like Faster-Whisper or a model optimized specifically for the hardware you are on, like Apple's M-L-X-based Whisper implementations. And finally, you have a small language model to handle the "disfluency removal" as the academics call it. This three-stage pipeline is what separates a "toy" dictation tool from a professional "keyboard replacement."
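The three-stage pipeline can be wired together with injected stages, so each one (a Silero-style VAD, a Faster-Whisper call, a small LLM polish) can be swapped independently. All three stages below are trivial stand-ins; only the structure matches the description:

```python
# Three-stage dictation pipeline with swappable stages (all stand-ins here).
from typing import Callable

def make_pipeline(vad: Callable, asr: Callable, polish: Callable):
    def on_audio(chunk, buffer: list):
        buffer.append(chunk)
        if vad(chunk):               # stage 1: did the phrase just end?
            raw = asr(list(buffer))  # stage 2: raw text for the whole buffer
            buffer.clear()
            return polish(raw)       # stage 3: disfluency removal, punctuation
        return None                  # keep buffering; commit nothing yet
    return on_audio

# Stand-in stages: the "audio" is just strings, and a "<pause>" token
# plays the role of a detected silence.
pipeline = make_pipeline(
    vad=lambda chunk: chunk == "<pause>",
    asr=lambda buf: " ".join(c for c in buf if c != "<pause>"),
    polish=lambda text: text.capitalize() + ".",
)

buffer: list = []
outputs = [pipeline(c, buffer) for c in ["draft", "the", "email", "<pause>"]]
print(outputs[-1])
```

The point of the injection is testability: each stage can be benchmarked and tuned in isolation before the real models are dropped in.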
Corn
I think the most important part of that stack is the "pause-aware" V-A-D. Instead of just a hard stop, you want a V-A-D that has different states. A short pause might trigger the model to process the current buffer but keep the microphone open. A longer pause might signify the end of a paragraph. You want the system to be smart enough to distinguish between "I am thinking about the next word" and "I am done with this email." This requires the V-A-D to communicate with the L-L-M to understand the semantic completeness of what you just said.
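The multi-state pause logic Corn describes reduces to classifying silence durations into tiers. The durations here are illustrative guesses, not recommended defaults:

```python
# Two-tier pause classification: short silence commits a phrase but keeps
# the mic open; long silence also ends the paragraph. Durations illustrative.

SHORT_PAUSE = 0.4  # seconds: commit the current phrase, keep listening
LONG_PAUSE = 1.5   # seconds: also insert a paragraph break

def classify_pause(silence_sec: float) -> str:
    if silence_sec >= LONG_PAUSE:
        return "paragraph"
    if silence_sec >= SHORT_PAUSE:
        return "phrase"
    return "thinking"  # too short to act on; the speaker is mid-thought

for gap in (0.2, 0.6, 2.0):
    print(f"{gap:.1f}s silence -> {classify_pause(gap)}")
```

In a full system the "thinking" tier is what keeps the VAD from tapping its watch, and the semantic check Corn mentions would override "phrase" when the LLM judges the clause incomplete.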
Herman
That is the "keyboard replacement" holy grail Daniel was talking about. It should not feel like you are recording a clip. It should feel like you are typing with your voice. The manual "stop-recording" button is the biggest barrier to adoption for voice typing. If you have to hit a button every time you want the text to appear, you might as well just type. The system has to be confident enough to commit the text to the screen on its own. It needs to be proactive, not reactive.
Corn
And that confidence comes from the buffer. If the model can see the last two seconds of audio, it can be much more confident about the sentence boundary. It can see that your pitch dropped and you took a long breath, which usually indicates a period. If your pitch stayed high, you might be mid-thought, so it should wait another five hundred milliseconds. This is where the "prosody" of speech—the rhythm and intonation—becomes just as important as the phonemes themselves.
Herman
This is where we get into the second-order effects of this technology. If we get this right, the keyboard really does become a fallback. But there is a psychological hurdle here too. When people see text appearing on a screen, they have a tendency to want to edit it immediately. If the real-time transcription is flickering and changing, it triggers a "correction reflex" in the user. They stop talking to fix the error, which breaks their flow, and the whole thing falls apart. It is a feedback loop of frustration.
Corn
That is why the "buffered-async" approach is so important for the U-X. You do not show the messy, flickering guesses. You wait until the model is ninety-five percent sure, and then you output a clean, punctuated phrase all at once. It feels more stable. It does not trigger that panic in the user that they need to grab the mouse and fix a typo. You want the user to trust that the system will fix itself in the next second.
Herman
I have actually seen some experimental interfaces where the text appears in a light gray color while it is still "uncertain" and then turns black once the model has finalized that chunk. It is a subtle way of telling the user, "I am still working on this, do not worry about the typos yet." It manages the cognitive load. You can keep talking because you know the "gray" text is still being processed by that tiny L-L-M editor.
Corn
That is a great bit of U-I. It manages the user's expectations. But let's talk about the models themselves for a second. If Whisper is the baseline, what are the alternatives people should be looking at in twenty-twenty-six? We mentioned Deepgram for cloud, but what about other open-source architectures?
Herman
There is a lot of excitement around the Conformer architecture and its derivatives, like the ones found in NVIDIA's Riva toolkit. These are often more efficient for streaming than a pure transformer like Whisper because they handle local acoustic features better. Also, keep an eye on "Canary" from NVIDIA—it is a multi-lingual model that is incredibly robust to noisy environments. If you are dictating while walking down a busy street, Canary might hold up better than Whisper.
Corn
And then there is the multimodal shift. We are starting to see models that are trained on audio and text simultaneously, not as two separate stages, but as one unified model. This was the big promise of things we saw in late twenty-four and twenty-five, where the model is not just transcribing phonemes into text, but actually understanding the intent of the speech. This is the "G-P-T-four-o" style of interaction where the audio is a first-class citizen.
Herman
That is the future. Imagine a voice typing tool that does not just transcribe your words, but understands when you say "actually, scratch that last sentence" and it just does it. It is not transcribing the command; it is executing it. We actually talked about this transition to multimodal end-to-end models in episode nine hundred ninety-two. It is the end of the "digital sandwich" because the A-I understands the context of the interaction, not just the sounds. It knows you are talking to it, not just through it.
Corn
But even before we get to that level of magic, there is so much low-hanging fruit in just fixing the V-A-D and the buffering. If I am a developer building a tool for myself today, like Daniel might be doing, the biggest win is just being thoughtful about those silence thresholds. Do not just use the defaults.
Herman
My practical advice for anyone building in this space is to stop treating dictation as a streaming task where latency is the only metric. Treat it as a "buffered-async" task where the metric is "time to a clean phrase." You are better off having a one-second delay and a perfect sentence than a fifty-millisecond delay and a jumbled mess of flickering words. Accuracy and stability are what build user trust, not raw speed.
Corn
Also, invest in the V-A-D. Do not just use a volume threshold. Use a proper neural V-A-D and tune it to your own speaking style. Some people are fast talkers who never pause; others, like me, tend to wander through a thought. The tool should adapt to you, not the other way around. Most modern V-A-D libraries allow you to adjust the "speech-to-silence" transition time. Finding your personal "thinking time" in milliseconds is a game changer.
Herman
And if you are on a modern machine, do not be afraid to run things locally. The privacy benefits are obvious—your private emails and notes never leave your device—but the consistency of latency is the real killer feature. When your transcription speed does not depend on your Wi-Fi signal or the current load on an A-P-I server, you start to trust the tool more. And trust is the only thing that will get people to stop using the keyboard.
Corn
I think we are getting close. The hardware is there, the models are getting distilled down to a manageable size, and the architecture is shifting toward this smarter, buffered approach. It is an exciting time to be a person who hates typing. We are moving from "voice dictation" as a gimmick to "voice input" as a primary interface.
Herman
I am just looking forward to the day when I can walk down the street, have a full conversation with my computer to draft a research paper, and not look like I am trying to eat my phone. We want to look like we are talking to a friend, not fighting with a gadget.
Corn
We are almost there, Herman. We are almost there. Let's wrap this up with some practical takeaways for the listeners who are looking to optimize their own setups.
Herman
First, if you are looking for a model to run locally today, Faster-Whisper is still the king of the hill for most people. It is well-supported, it is fast, and the accuracy is top-tier. If you are on an Apple Silicon machine, look specifically for the M-L-X-optimized versions; they are a game-changer for battery life and speed because they use the unified memory architecture so efficiently.
Corn
Second, if you are a developer, focus on the "Time to First Token" for your audio. But remember that the "first token" should be part of a meaningful phrase. Do not sacrifice grammar for speed. Use a small L-L-M as a post-processor to clean up the disfluencies. It makes the final text feel so much more professional and saves you from having to manually edit out all your "ums."
Herman
Third, experiment with your V-A-D settings. If you find yourself getting cut off, increase your silence threshold to at least five hundred or even seven hundred milliseconds. It might feel a bit slower, but the reduction in frustration is worth it. You want the system to wait for you, not the other way around.
Corn
And finally, check out the landscape of specialized voice typing apps that are starting to implement these "keyboard replacement" features. There are some great open-source projects on GitHub right now, like "Whisper-Writer" or "Dictation-Box," that are trying to bridge this gap between the operating system and the A-I.
Herman
It is a deep rabbit hole, but a rewarding one. This has been a great exploration of a problem that feels very "now." We are right at that tipping point where the technology is finally catching up to our expectations. The "digital sandwich" is finally being replaced by something much more elegant.
Corn
Definitely. Well, thanks to everyone for tuning in. This has been a fun one. Huge thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes.
Herman
And a big thanks to Modal for providing the G-P-U credits that power the research and generation for this show. They make it possible for us to dive deep into these technical topics every week and test out these local-versus-cloud trade-offs.
Corn
If you enjoyed this episode, a quick review on Apple Podcasts or Spotify really helps us out. It is the best way to help other people find the show and join our weird little community of prompt engineers and A-I nerds.
Herman
You can find all our past episodes and a search tool for the entire archive at myweirdprompts dot com. We have covered everything from pro mobile mics to the future of voice A-I, so there is plenty to dig into if this topic hooked you.
Corn
This has been My Weird Prompts. I am Corn.
Herman
And I am Herman Poppleberry. We will see you next time.
Corn
Goodbye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.