I was looking at my phone the other day, staring at that little waveform bouncing around while I tried to dictate a grocery list, and it hit me. We have officially moved past the era where transcription is just a neat utility. It is becoming the literal nervous system of how we interact with machines. Today's prompt from Daniel is about the current state of audio to text and omni-modal models, specifically looking at what is actually in play right now for developers and power users.
Herman Poppleberry here, and Corn, you are hitting on something fundamental. We are witnessing the death of the cascaded pipeline. For years, if you wanted to talk to an AI, you had to go through this three step process. You had a model for speech to text, then you sent that text to a large language model, and then you sent the response to a text to speech engine. It was like a digital game of telephone where every step introduced latency and stripped away the soul of the communication.
The soul? That sounds a bit poetic for a guy who spends his weekends reading white papers on transformer architectures. But I get what you mean. When you strip away the audio and just keep the text, you lose the sarcasm, the hesitation, the excitement. You lose the context that makes human speech actually human.
Precisely. Well, not precisely, but you are hitting the nail on the head. The shift we are seeing as of late March twenty twenty-six is toward native multimodality. We are talking about models that do not just transcribe words; they ingest the raw audio tokens directly. Daniel wants us to break down the landscape into two buckets: the local models you can run on your own hardware for maximum sovereignty, and the heavy-hitting SaaS APIs that are pushing the boundaries of what is possible with infinite compute.
It feels like the Whisper era is finally facing some real competition. For a while there, if you were building a voice app, you just grabbed OpenAI’s Whisper and called it a day. But if you are building something in twenty twenty-six, Whisper feels a bit like using a heavy-duty truck to deliver a single envelope. It is powerful, but is it the right tool for real-time interaction?
It depends on which version of Whisper you are talking about. If we dive into the local model bucket first, Whisper-large-v-three-turbo is still the heavyweight champion for general accuracy. It is essentially a pruned version of the large-v-three model that gives you about a three-times speedup without sacrificing much in the way of word error rate. On standard benchmarks like LibriSpeech, you are still looking at accuracy around the ninety-five percent mark or better.
But speed is the killer here, right? If I am talking to a device, I do not want to wait three seconds for it to realize I finished my sentence. That is the latency floor that kills the illusion of magic.
That is where things like distil-whisper and the newer Moonshine models come in. Moonshine is particularly interesting because it uses dynamic window sizing. Traditional Whisper is constrained by these fixed thirty-second windows. Even if you only say two words, it is processing a thirty-second chunk of audio. Moonshine, which ranges from twenty-seven million to two hundred million parameters, scales its processing to the actual length of the audio. It is significantly faster for short-form dictation.
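For anyone following along in the show notes, here is a toy Python sketch of the compute asymmetry Herman is describing. The numbers are illustrative only, not either model's actual implementation: the point is just that a fixed-window system pays for a full thirty-second chunk no matter how short the utterance is.

```python
def fixed_window_samples(audio_samples, window_s=30, sr=16000):
    """Whisper-style fixed window: every clip is padded out to a whole
    number of 30-second chunks before the encoder sees it."""
    window = window_s * sr
    return ((len(audio_samples) + window - 1) // window) * window

def dynamic_window_samples(audio_samples, sr=16000):
    """Moonshine-style dynamic window: encoder cost scales with the
    actual clip length instead of a fixed chunk size."""
    return len(audio_samples)

# A two-second utterance ("set a timer") at 16 kHz.
two_words = [0] * (2 * 16000)
waste = fixed_window_samples(two_words) / dynamic_window_samples(two_words)
# The fixed-window pipeline processes 15x more samples for this clip.
```

For short-form dictation that fifteen-to-one ratio is exactly where the speedup the hosts mention comes from.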
I want to dig into the technical side of how we actually get these models to run on local gear without melting a laptop. We have talked about quantization before, but how does it actually impact the performance for someone trying to build a privacy-first app today?
Quantization is the secret sauce. When we talk about running a model like Whisper-large-v-three-turbo locally, we are usually talking about four-bit or eight-bit quantization. Essentially, you are reducing the precision of the model's weights to save memory and compute. Now, the common fear is that you are going to destroy the accuracy. But the data shows that with modern techniques like A-W-Q or Auto-A-W-Q, the jump in word error rate from a full sixteen-bit model to a four-bit quantized model is often less than one percent.
One percent? That seems like a small price to pay for being able to run a world-class transcription engine on a phone.
It is. But you have to account for the hardware. If you go back to Episode fifteen fifty-five, where we discussed NVIDIA’s real-time speech revolution, we looked at how Tensor-R-T engines can optimize these local pipelines. If you are running on Apple Silicon, specifically the M-four chips, or using NVIDIA Jetson modules for edge computing, you can get real-time factors of over two thousand. That means you can process an hour of audio in less than two seconds.
I love the idea of local sovereignty. Not having my late-night ramblings sent to a server in Northern Virginia is a huge plus. But what are we giving up? I assume there is a compute tax if I want to run this on, say, an iPad or a laptop.
There is, but the real trade-off with local models right now is diarization. Identifying who is speaking in a multi-person environment is still a massive compute hog. Local models struggle to keep up with the speaker-turn detection that cloud models handle with ease. If you are running a local Whisper pipeline, you often have to run a separate model like Pyannote for diarization, which adds another layer of latency and memory usage. It is the sovereignty trade-off: you get the privacy, but you have to manage the complexity of the stack yourself.
So if it is just me talking into my watch, local is great. But if we are in a board room with ten people all talking over each other, I probably need the big guns in the cloud.
That is a fair assessment. And when we talk about the big guns, we have to look at the SaaS landscape, which has been moving at a breakneck pace just in the last few weeks. We just had Cohere launch Cohere Transcribe on March twenty-sixth. It is a two-billion parameter Conformer-based model. They are claiming state-of-the-art performance on the Hugging Face Open ASR Leaderboard, specifically targeting enterprise workflows where you need high-fidelity transcription of technical jargon.
I noticed you said Conformer. For the folks who are not elbow-deep in the math, why does that matter?
Transformers are great at global context, but they can be a bit weak at modeling local dependencies in audio, like the specific phonetic transitions between sounds. Conformers combine the self-attention of Transformers with convolutions, which are much better at picking up those local audio patterns. It is basically the best of both worlds for speech.
Speaking of the best of both worlds, we have to talk about the recent drama with Gemini. Google moved Gemini three Pro to three point one Pro earlier this month, and the developer community has been up in arms about what they are calling an audio regression.
It has been a mess. Users are reporting this glass-scratching sibilance and aggressive word-clipping in the mobile app. It seems like Google pushed a new codec compression algorithm to save on bandwidth, but it is playing havoc with the model’s ability to understand nuances. It is a reminder that even if the model is brilliant, the plumbing—the bitrate, the compression, the transport protocol—can still break the experience.
It is funny because while Google is struggling with sibilance, OpenAI is leaning hard into the omni-modal side of things. They retired GPT-four-o from the consumer interface in February in favor of the GPT-five point four series, but the GPT-four-o audio API is still the gold standard for anyone who wants to preserve prosody.
That preservation is the key differentiator between traditional ASR and these new omni-modal systems. Traditional ASR, like Deepgram Nova-three or AssemblyAI Universal-three Pro, is focused on the word error rate. They want to get the text right. And they are very good at it. Deepgram Flux, for example, is hitting sub-three hundred millisecond latency for conversational use cases. But when you move to something like GPT-four-o audio or the newer Mistral Voxtral models, you are not just getting text. You are getting an understanding of the emotional state of the speaker.
So if I say, oh, that is just great, with a heavy dose of sarcasm, a traditional ASR model just writes down the words. But an omni-modal model understands I am actually miserable.
And it can respond accordingly. Mistral’s Voxtral series, which just dropped in February, is especially exciting because it is an open-weights model that can run locally but has that native audio-to-token architecture. You can actually see the model reacting to non-verbal cues like sighs or laughter. We also have Kyutai’s Moshi, which is operating at an incredible one hundred and sixty milliseconds of glass-to-glass latency. That is faster than the human brain’s typical conversational response time.
One hundred and sixty milliseconds? That is basically instantaneous. If I am building a meeting assistant, how do I choose between a standard API call and something like a persistent WebSocket connection?
For anything real-time, WebSockets are the only way to go. If you use a standard R-E-S-T API, you are dealing with the overhead of opening and closing connections for every chunk of audio. With OpenAI’s Realtime API, which saw a major update in February twenty twenty-six, you maintain a persistent WebSocket. This allows for full-duplex communication. The model can literally interrupt you if it thinks it has the answer, or it can adjust its transcription in real-time as more context arrives.
That sounds a bit unnerving, honestly. If the AI starts answering me before I even finish my thought because it predicted my ending based on my breath pattern, I might just throw my phone in the river.
It is the ultimate test of latency. But there is a technical hurdle we should address, which is the omni tax. There is a growing debate about whether these native omni-modal models are actually more accurate for pure transcription than specialized ASR models. Some recent benchmarks suggest that omni models tend to hallucinate more during periods of silence. Because they are trained to predict the next token, if there is a long pause, they might just start making up words that sound like they should be there based on the previous context.
That is the classic LLM problem, just moved to the audio domain. If the model is bored, it starts telling stories. Meanwhile, a specialized model like AssemblyAI’s Universal-three Pro is just sitting there waiting for the next actual phoneme.
Well, not exactly, but your point stands. AssemblyAI is particularly noted for its entity recognition. If you are transcribing medical records or legal proceedings where names, dates, and specific numbers are critical, a specialized ASR model often outperforms an omni-modal generalist. They have built-in logic to handle the formatting of those entities so you do not get twenty-six spelled out as words when it should be a numeral.
Let's talk about the cost of all this. If I am a developer, I have to look at the token count. Streaming raw audio into a model like G-P-T-four-o cannot be cheap.
It is not. As of today, the pricing for audio tokens is roughly forty dollars per million input tokens. To put that in perspective, an hour of audio can translate to roughly one hundred thousand to one hundred and fifty thousand tokens depending on the sampling rate and the complexity of the speech. If you are running a twenty-four-seven monitoring service, those costs will bankrupt you. This is why the hybrid approach is so vital.
So let’s look at the decision matrix for someone building something right now. If I am building a privacy-first dictation app for a journalist, I am probably looking at a hybrid approach. Maybe a local Whisper-large-v-three-turbo for the heavy lifting, and then what?
I would suggest using a local model for the initial capture and wake-word detection. You use something like Moonshine or NVIDIA Parakeet TDT because they are incredibly efficient. Parakeet TDT, by the way, has a real-time factor of over two thousand. It can rip through audio files. But then, for the deep semantic analysis or the summarization of that dictation, you might want to pass the resulting text or even the raw audio tokens to a SaaS model like Anthropic’s Claude four, which recently added native audio support.
Claude four with audio is an interesting one. Anthropic has been a bit quieter than OpenAI on the voice front, but their implementation feels more surgical. They are really focusing on the context-aware transcription aspect.
They are. Their model is excellent at following complex instructions about how to format the output. If you tell it, transcribe this interview but remove all the filler words and format it as a Q and A with headers, it does it in a single pass because it is seeing the audio and the instructions in the same unified space. You are not transcribing first and then asking an LLM to clean it up. The cleaning happens during the inference.
It saves on tokens too, I imagine. But there is also the search aspect. If I have a thousand hours of audio, I do not want to transcribe it all just to find one sentence.
That is where the new embedding models come in. Google released gemini-embedding-two-preview on March tenth. It is their first truly multimodal embedding model. It can map text, audio, video, and even PDFs into a single unified vector space. This allows developers to perform semantic searches across audio files without necessarily having to transcribe every single second of them first. You can just ask, find the part of the meeting where we discussed the budget, and it can find that audio segment based on the embeddings.
That is a game changer for content management. Instead of searching through text files that might have transcription errors, you are searching the actual essence of the audio. But let's go back to the local versus SaaS tension. You mentioned sovereignty. In the current geopolitical climate, having your data stay on-device is not just a preference for some people; it is a requirement.
It is a massive requirement, especially for government and high-security enterprise work. This is why NVIDIA is leaning so hard into the Nemotron three Omni family. They are providing the tools for companies to build their own agentic AI systems that handle audio, vision, and language entirely within their own private clouds. They even have Nemotron three VoiceChat, which is designed for real-time simultaneous listening and responding. It is meant to mimic that high-bandwidth human interaction without ever touching a public API.
It feels like we are heading toward a world where your personal AI is basically a very smart, very fast parrot that lives in your pocket. It is doing all the transcription locally, but it has a high-speed link to the mothership when it needs to do some heavy thinking.
That is the hybrid architecture I think will win the day. You use local models for the low-latency command-and-control stuff—setting timers, playing music, dictating quick texts—and you use the SaaS giants for the complex, long-form analytical tasks. But the key is to not get locked into one provider. The landscape is shifting too fast. If you built your entire pipeline on Gemini three Pro's audio capabilities in February, you were probably scrambling two weeks ago when that audio regression hit.
Always have a backup. It is the golden rule of tech, and it applies even more to AI. I want to touch on the "digital sandwich" thing we have talked about before. For the new listeners, that is that awkward pose where you are holding your phone horizontally in front of your face like you are about to take a bite out of it, just so the mic can catch your voice memo. We covered this in Episode twelve sixteen when we talked about AI wearables and the subscription trap. With these new models, is that finally going to stop?
We are getting there. The noise robustness of models like Deepgram Nova-three and Xiaomi’s new MiMo-V-two-Omni is incredible. MiMo-V-two-Omni was just released on March eighteenth, and it natively supports over ten hours of continuous audio understanding without chunking. It is designed to be a foundation model for wearables. It can filter out the ambient noise of a busy airport or a windy street and still pick up your voice with high accuracy. You should be able to leave your phone in your pocket and just talk to your lapel or your glasses.
As long as I do not look like I am talking to myself, I am happy. But I suspect that is a social hurdle, not a technical one.
We will all be talking to ourselves by twenty twenty-seven, Corn. Get used to it. But to wrap up the technical takeaways here, the most important thing for anyone building in this space is benchmarking. You cannot just take a provider's word for it. You need to be testing against datasets like LibriSpeech or Common Voice.
LibriSpeech is the one with the audiobooks, right?
Correct. It is about a thousand hours of read English speech. It is great for testing clean, high-fidelity transcription. But Common Voice is where the real testing happens. It is crowdsourced, so you get different accents, different microphone qualities, and background noise. If a model claims a low word error rate on LibriSpeech but falls apart on Common Voice, it is not ready for the real world.
And don't forget the P-I-I, the personally identifiable information. If you are handling sensitive data, the privacy gain of a local model is almost always worth whatever latency penalty you pay. Plus, with the optimizations we are seeing in things like distil-whisper, that latency penalty is shrinking every day.
One final thought on the future. I am keeping a very close eye on the emergence of seven-billion parameter models capable of native audio understanding that can run on consumer-grade GPUs. By the end of twenty twenty-six, I think we will see models that are small enough to run on a high-end smartphone but powerful enough to give you that native omni-modal experience—sarcasm detection and all. We are moving toward a world where the model doesn't just hear you; it listens.
I can't wait for my phone to tell me I am being passive-aggressive. It will be just like having a digital brother in my pocket.
You should be so lucky.
I think that is a wrap on this one. We have covered the local heavyweights like Whisper-turbo and Moonshine, the SaaS giants pushing the omni-modal frontier, and the hybrid future that seems to be the most sensible path forward.
It is a fascinating time to be working with audio. We are finally giving machines ears that actually work.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the levels steady while we nerd out. And a big thanks to Modal for providing the GPU credits that power our research and this show. Seriously, if you need serverless GPUs that just work, check them out.
If you found this deep dive useful, we would love it if you could leave us a review on Apple Podcasts or wherever you are listening. It really does help the show grow.
You can find all our past episodes and a full archive at myweirdprompts dot com. We are also on Telegram if you want to get notified the second a new episode drops. Just search for My Weird Prompts.
This has been My Weird Prompts.
Stay curious, and maybe try talking to your AI without the digital sandwich pose today. See what happens. Goodbye.
Goodbye.