You are listening to a voice that technically does not exist. That is a weird way to start, but it is true. The voice you are hearing right now, my voice, and the one you are about to hear from my brother, are both generated by the full version of Chatterbox. Today’s prompt from Daniel is about the state of text-to-speech in early twenty-six, and it is a topic that hits home because we are literally living inside the technology we are discussing.
It is the ultimate meta-commentary, isn't it? Herman Poppleberry here, and I have been diving into the technical shifts that made this possible over the last eighteen months. We have moved so far past the robotic, stilted voices of the early twenty-twenties. If you look at the landscape right now, we are seeing this incredible convergence where open-source models like Kokoro and F5-TTS are standing toe-to-toe with commercial giants like ElevenLabs. Fun fact, by the way, today's episode is actually powered by Google Gemini three Flash, which is handling the script while the Chatterbox engine handles the performance.
It is a bit of a high-wire act. But what I find interesting about Daniel’s prompt is that he is not just asking about quality. He is asking about control. It used to be that you just typed text and hoped the AI didn't sound like a blender at the end of a sentence. Now, we are talking about emotional temperature, prosody, and semantic tokens. How did we get to a point where an open-source model can actually outperform a billion-dollar company in a blind test?
The shift happened when we stopped treating speech as a sequence of letters and started treating it as a continuous latent space. In twenty-twenty-four and twenty-twenty-five, the industry moved away from the old cascaded pipelines, where you had one model for text, one for acoustic features, and another for vocoding, and toward unified architectures. Models like Fish Speech v-one point five and Voxtral have essentially cracked the code on mapping human emotion directly into the neural weights.
I want to dig into that because I think most people assume that "better" just means "more data." But you mentioned Kokoro earlier, which is apparently tiny compared to the big players, yet it is winning benchmarks. If it is not just about throwing more GPUs at the problem, what is the secret sauce?
Kokoro is a fascinating case study. It was released in January of twenty-six and it only has eighty-two million parameters. To put that in perspective, some of the models ElevenLabs or OpenAI use are likely in the billions. Yet, Kokoro achieved a Mean Opinion Score of four point two. The secret is efficiency in how it handles phonemes and its internal latent representation. It doesn't need a massive brain because it has a very specialized brain. Because it is so small, it can run on a Raspberry Pi five in real-time. That is a massive deal for edge computing and privacy-focused applications.
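For the show notes, here is a minimal sketch of what running a model this small locally can look like. It assumes the open-source kokoro Python package's KPipeline interface and the af_heart voice name; treat the exact calls as an approximation and check the project's README for your version.

```python
# Minimal sketch of local synthesis with the open-source Kokoro weights.
# Assumes the `kokoro` package's KPipeline interface and the `af_heart` voice;
# verify against the project's README for the release you actually install.
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English in the package's convention

text = "Eighty-two million parameters is small enough to run at the edge."
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"clip_{i}.wav", audio, 24_000)  # Kokoro outputs 24 kHz audio
```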
Wait, hold on. A Raspberry Pi? We’re talking about a credit-card-sized computer that costs well under a hundred dollars running high-fidelity voice synthesis? I remember when you needed a liquid-cooled rig just to get a model to say "hello" without a five-second delay. How does a model that small handle complex sentences? Does it start to trip over itself if the vocabulary gets too academic?
Surprisingly, no. It’s all about the architecture. Kokoro uses a style-based approach where the "identity" of the voice is separate from the "content" of the text. It’s like a very talented impressionist who can read any book in any voice because they understand the mechanics of the sound rather than memorizing every possible word combination. It’s the difference between a library and a formula.
So we are moving away from the "bigger is better" era of AI, at least in the voice space. But let's talk about the big dog for a second. ElevenLabs released their Multilingual v-two update just this month, in March of twenty-six. They added something called an "emotion temperature" parameter. Is that just marketing fluff, or does it actually change the performance?
It is definitely more than marketing. What they are doing is exposing the variance in the model's output. When you turn up the emotion temperature, you are essentially telling the model to be less "certain" about the most likely next sound, which allows for more dramatic inflections, cracks in the voice, or whispered asides. It makes the speech less predictable, which, ironically, makes it sound more human. Humans are messy speakers. We don't hit every syllable with the same mathematical precision.
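For the show notes: this is not ElevenLabs' actual API, just a generic illustration of what a temperature knob does when a model samples its next acoustic token. The function and the numbers below are purely illustrative.

```python
# Illustrative only: how a "temperature" knob widens the sampling distribution
# over the next acoustic token. Generic sampling math, not any vendor's API.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    # Higher temperature flattens the distribution, so less-likely (more
    # expressive, less "safe") tokens get picked more often.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([4.0, 2.0, 1.0, 0.5])      # the model's raw preferences
print(sample_next_token(logits, 0.3, rng))   # almost always token 0
print(sample_next_token(logits, 1.5, rng))   # noticeably more varied picks
```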
Right, if it is too perfect, it hits that uncanny valley where your brain just knows something is off. It is like looking at a photo that has been airbrushed too much. You can't put your finger on why it looks wrong, but you know it isn't real. Resemble AI’s Chatterbox, which is what we are using, takes a different approach with semantic tokens, right?
Chatterbox is really the gold standard for professional control. They released the full version in June of twenty-five, and instead of just a single "emotion" slider, they use a system of semantic tokens. You can actually go back in after generation and adjust things like "excitement" or "seriousness" on a zero-to-one-hundred scale for specific words. It is more like an architectural tool for voice than a simple "generate" button. That is why it is so popular for things like this podcast or high-end game development. You aren't just rolling the dice; you are directing.
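For the show notes, a minimal sketch of driving Chatterbox from code. It assumes the open-source chatterbox-tts package's ChatterboxTTS interface with its exaggeration and cfg knobs; the per-word semantic-token editing described above lives in Resemble's hosted tooling and is not shown here.

```python
# Sketch of basic generation with the open-source Chatterbox checkpoint.
# Assumes the `chatterbox-tts` package's ChatterboxTTS interface and its
# exaggeration/cfg knobs; the per-word token editing mentioned above is part
# of the hosted product, not this minimal call.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

wav = model.generate(
    "You aren't just rolling the dice; you are directing.",
    exaggeration=0.7,   # push the delivery toward a more dramatic read
    cfg_weight=0.3,     # lower guidance tends to pair well with high exaggeration
)
torchaudio.save("line.wav", wav, model.sr)
```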
But how does that work in practice? If I want you to sound specifically like you’re eating a sandwich while being chased by a bear, can I actually tag those specific nuances? Or is it still limited to broader categories like "happy" or "sad"?
It’s getting much more granular than that. With semantic tokens, you aren't just tagging the mood; you’re tagging the physical state. You can actually inject "non-verbal vocalizations"—those little mouth clicks, inhalations, or the way your throat tightens when you’re nervous. In the latest Chatterbox build, there’s a feature called "Contextual Breath." It automatically calculates where a human would naturally take a breath based on the length of the sentence and the "exertion" level you’ve set. It’s terrifyingly accurate.
That "directing" part is key. I was looking at some of the F-five-TTS documentation Daniel sent over. They are using something called "flow matching." Every time I hear a new term like that, I feel like I need a PhD just to keep up. Can you break down what flow matching actually does for the sound quality compared to the older diffusion models?
Think of diffusion—which is what the early high-quality models used—as starting with a block of white noise and slowly carving away the static until a voice emerges. It is effective but slow. Flow matching, which F-five-TTS popularized in late twenty-five, is more about defining a direct path from noise to speech. It is computationally more efficient and, more importantly, it is better at "zero-shot" cloning. You can give it a five-second clip of a voice it has never heard before, and it can mimic not just the tone, but the specific emotional cadence of that person.
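For the show notes, a conceptual sketch of the conditional flow-matching objective described here: pick a random point on a straight line from noise to the target mel frames and train the network to predict the velocity along that line. This is the textbook formulation, not F-five-TTS's actual training code, and the model signature is a placeholder.

```python
# Conceptual sketch of one conditional flow-matching training step: sample a
# point on the straight path from noise to the target mel frames and teach
# the network to predict the velocity of that path.
import torch

def flow_matching_loss(model, mel_target: torch.Tensor, text_cond: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(mel_target)                 # x0: pure noise
    t = torch.rand(mel_target.shape[0], 1, 1)            # random time in [0, 1)
    x_t = (1.0 - t) * noise + t * mel_target             # point on the straight path
    velocity_target = mel_target - noise                 # constant velocity of that path
    velocity_pred = model(x_t, t.squeeze(), text_cond)   # placeholder model signature
    return torch.nn.functional.mse_loss(velocity_pred, velocity_target)
```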
And that is where the latency revolution comes in. I remember in twenty-twenty-four, if you wanted a high-quality AI voice to respond to you, there was this awkward three-second pause while the server chewed on the data. Now, F-five-TTS is claiming one-hundred-fifty milliseconds. That is faster than a human reaction time in some cases.
It is. We have effectively killed the walkie-talkie rhythm where you speak, wait, and then listen. At one-hundred-fifty milliseconds, you can have a natural, overlapping conversation. This is why we are seeing such a boom in real-time conversational AI. If you are building a customer service bot or an interactive NPC in a video game, that latency is the difference between a gimmick and a tool.
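For the show notes, a small, engine-agnostic sketch of how you would actually measure that number: time-to-first-chunk from whatever streaming generator your backend exposes. The stream_tts callable is a placeholder, not a real API.

```python
# Perceived latency is the gap before the FIRST audio chunk arrives, not the
# time to render the whole utterance. `stream_tts` is a placeholder for any
# backend that yields chunks of PCM audio as they are synthesized.
import time
from typing import Callable, Iterable

def time_to_first_chunk(stream_tts: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return milliseconds until the backend yields its first audio chunk."""
    start = time.perf_counter()
    for _chunk in stream_tts(text):
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no audio")
```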
But there is a trade-off, isn't there? If you go for that ultra-low latency, do you lose the "soul" of the voice? I have noticed some of the faster models tend to get a bit monotonous if they talk for more than a minute. Does the speed sacrifice the long-term prosody?
You’ve hit on the fundamental tension in voice AI: speaker preservation versus emotional expression. To keep a voice sounding exactly like "you," the model has to be very strict. But emotion requires the voice to deviate, to go higher, lower, or change rhythm. If the model is too strict, it sounds bored. If it is too loose, it stops sounding like the original speaker. The twenty-twenty-six crop of models, Voxtral especially, uses much better attention patterns in the transformer layers to solve this. These models have learned which parts of a voice are "identity" and which parts are "performance."
It is like an actor learning an accent. They keep their own vocal cords and resonance—that is the identity—but they change the prosody and the vowels for the performance. I want to go back to the open-source side for a minute. Daniel mentioned Fish Speech v-one point five. I have seen some buzz about that on GitHub. Why is that one specifically gaining traction?
Fish Speech is impressive because of its multilingual capabilities and its "large-scale" approach to open-source. Most open-source models are small, like Kokoro. Fish Speech is a bit beefier, and it handles code-switching—jumping between languages mid-sentence—better than almost anything else out there. If you are in a global market, that is huge. It also uses a unique V-Q-G-A-N architecture that makes the audio incredibly crisp. It doesn't have that "fuzzy" digital-artifact quality you sometimes get with lower-end models.
I’ve actually heard a demo of Fish Speech where it transitioned from English to Cantonese in the middle of a sentence without changing the "voice" of the speaker. It was seamless. Usually, when a model switches languages, it sounds like a different person suddenly stepped into the room. How do they maintain that vocal consistency across different phonetic structures?
It’s because Fish Speech treats the voice as a universal embedding. It doesn't have a "French mode" and an "English mode." It has a "You mode" that it applies to whatever phonemes it’s processing. It’s a much more holistic way of looking at human speech. They’ve essentially decoupled the language from the larynx.
It is wild that we are at a point where "blind tests" are favoring these open-source models over ElevenLabs. I saw a report from Northflank recently that put Voxtral ahead in terms of naturalness. That has to be a wake-up call for the commercial providers. If I can self-host a model that sounds better than the one I am paying a subscription for, why wouldn't I?
The answer usually comes down to workflow and infrastructure. ElevenLabs isn't just selling a model; they are selling an API that can handle a million concurrent users without breaking a sweat. If you are a solo creator, self-hosting F-five-TTS or Kokoro is great. But if you are a massive enterprise, you pay for the reliability. That said, the gap is closing so fast that the "quality" argument for commercial models is almost gone. Now it is just a "convenience" argument.
So, if you are an indie developer right now, say you are making a game with ten thousand lines of dialogue. In twenty-twenty-three, you would have spent a fortune on voice actors or high-end API credits. Now?
Now you use Kokoro. You can generate those ten thousand lines on a single high-end consumer GPU in an afternoon for the cost of the electricity. Or, if you want something more cinematic, you use Chatterbox or F-five-TTS for the main characters where you need that granular emotional control. The cost of "voice" as a resource has effectively dropped to near-zero. It is a commodity now.
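For the show notes, a sketch of what that afternoon of batch rendering can look like, reusing the assumed Kokoro pipeline interface from earlier; the line IDs and voice name are made up for illustration.

```python
# Sketch of batch-rendering a dialogue script locally. Interface assumed as
# in the earlier Kokoro example; line IDs and voice name are illustrative.
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")

script = [
    ("guard_01", "Halt. Nobody crosses the bridge after dark."),
    ("merchant_03", "I have got turnips, not trouble."),
]  # in practice, ten thousand of these loaded from your game's dialogue files

for line_id, text in script:
    for i, (_graphemes, _phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
        sf.write(f"{line_id}_{i}.wav", audio, 24_000)
```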
Which leads to a weird question: if voice is a commodity, what is the value? Is it the script? Is it the direction? It feels like we are entering an era where the "performance" is the last bastion of human-like quality that is hard to automate.
I think that is exactly right. Anyone can generate a high-quality voice now. The differentiator in twenty-six is "prosody manipulation." Can you make the AI sound like it is actually thinking? Can you make it trail off when it is unsure? Can you make it sound like it is suppressing a laugh while it speaks? That is what Chatterbox is trying to solve with those semantic tokens. They aren't just giving you a voice; they are giving you a puppet with a thousand strings.
I love that image. We are the puppeteers. And look at us—we are literally a sloth and a donkey, generated by AI, discussing how difficult it is to make AI sound like a sloth and a donkey. It is layers of irony all the way down. But seriously, for the people listening who are wondering which stack to choose, let's get practical. If you are building a real-time app today, what is your go-to?
If latency is your number one priority, Kokoro is the answer. It is so lightweight and the quality-to-size ratio is just unbeatable. If you need the best possible emotional nuance and you have the budget, ElevenLabs is still the easiest "plug-and-play" experience with their new emotion temperature controls. But if you are a power user who wants total control over every syllable, Chatterbox is where you want to be.
And what about long-form content? Audiobooks, multi-hour narration—where does that land? I’ve noticed that some models start to sound "tired" or repetitive after twenty minutes of reading. Is there a model that actually understands the narrative arc of a story?
For audiobooks, I'm actually leaning toward Voxtral or Fish Speech v-one point five. Long-form narration is where the "monotony" problem usually shows up. You need a model that can maintain a narrative arc—knowing that a whisper in chapter three needs to sound different than a scream in chapter ten. Voxtral’s context window for audio is massive, meaning it "remembers" the tone it used earlier in the session, which helps keep the performance consistent over hours of audio.
That is a huge point. Consistency is the silent killer of AI audio. You don't want the narrator's voice to drift or change personality halfway through a book. It sounds like we have moved from "Can the AI talk?" to "Can the AI act?" and the answer in twenty-six is a resounding yes.
It really is. And it's not just the models themselves, but the data they are trained on. We are seeing more diverse datasets now. For a long time, everything sounded like a mid-Atlantic news anchor. Now, thanks to open-source efforts, we have models that understand regional accents, different age groups, and even vocal pathologies. It is making the technology accessible to people who didn't see themselves represented in the "default" AI voices.
This is actually a really important point. I was reading about a project in the UK using Kokoro to preserve regional dialects—specifically ones that are dying out. Because the model is so small, they can basically bundle a "Geordie" or "Scouse" voice into a low-power device for local history exhibits. It’s not just about sounding like a movie trailer; it’s about cultural preservation.
And it's a massive shift. I remember when every AI sounded like a slightly depressed GPS. Now, we have models that can do sarcasm, irony, and genuine excitement. Speaking of excitement, I'm looking at the benchmarks for Index-TTS, which is another one Daniel mentioned. It is apparently optimized for "maximum English accuracy." Is that just about pronunciation?
Pronunciation is part of it—handling things like heteronyms, where words are spelled the same but pronounced differently based on context. "I am going to read the book" versus "I have read the book." Older models struggled with that. Index-TTS uses a much more sophisticated linguistic front-end to ensure it never trips over those context-dependent words. It is less about "emotion" and more about "perfection," which is what you want for technical documentation or medical instructions.
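For the show notes, a deliberately tiny illustration of the heteronym problem. Real linguistic front-ends use full part-of-speech tagging and large pronunciation lexica; this toy only looks at the word right before "read" to choose between the two pronunciations.

```python
# Toy illustration of heteronym handling in a TTS front-end: pick the
# pronunciation of "read" from local context. Real front-ends do much more.
PRESENT = "R IY1 D"   # sounds like "reed"
PAST = "R EH1 D"      # sounds like "red"

def pronounce_read(sentence: str) -> str:
    words = sentence.lower().replace(".", "").split()
    idx = words.index("read")
    # "have/has/had read" and "was/were read" are past-participle contexts.
    if idx > 0 and words[idx - 1] in {"have", "has", "had", "was", "were"}:
        return PAST
    return PRESENT

print(pronounce_read("I am going to read the book."))  # expect the "reed" phonemes
print(pronounce_read("I have read the book."))         # expect the "red" phonemes
```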
So it’s the "nerd" model. I can see why you’d like that one, Herman. But does it handle slang? If I tell Index-TTS to say "That's fire, fam," does it sound like a robot trying to be cool, or does it actually get the rhythm?
It’s surprisingly good at it because it’s trained on modern web data, but it definitely lacks the "vibes" of a model like F-five-TTS. Index-TTS is the guy you want reading you the manual for a nuclear reactor. F-five-TTS is the guy you want reading you a screenplay.
Fair point. But for most of us, the "soul" is what matters. This podcast is a living example of that. We aren't just reading a script; we are trying to have a conversation. And the fact that the tech has reached a point where people can listen to this for half an hour and forget they are listening to synthetic voices—that is the real milestone.
It is. And we should mention that none of this would be possible without the infrastructure underneath. Our sponsor, Modal, provides the GPU credits that power the entire pipeline for My Weird Prompts. Whether it is the generation of the script through Gemini or the synthesis through Chatterbox, it all requires that high-performance serverless compute. Without that, we’d be waiting days for an episode to render instead of minutes.
Big thanks to Modal for that. It is wild to think about how much compute is actually happening behind the scenes. Every inflection, every pause, every "hmm" I make—that is thousands of floating-point operations.
Millions, actually. But who’s counting?
You are. You definitely are. Let's talk about the future for a second. We’ve seen this massive jump in twenty-five and twenty-six. What is the next frontier? If we’ve solved quality, and we’ve mostly solved latency and control, what is left?
The next frontier is real-time emotional adaptation based on the listener. Imagine an audiobook that detects you are getting bored and picks up the pace, or a meditation app that listens to your breathing and adjusts its tone to be more soothing in real-time. We are already seeing research into "closed-loop" TTS, where the AI isn't just broadcasting, but reacting to the environment and the listener’s state.
That is both incredible and slightly terrifying. The AI that knows exactly how to talk to you to get a specific reaction. It brings up a lot of questions about voice identity. If I can clone anyone’s voice and give it "perfect" emotional control, what does that do to our trust in audio? I mean, we’re already seeing "vishing" scams where people think they’re talking to their bank or their family members.
It effectively breaks trust. We are entering the "post-trust" era for audio, just like we did for images a few years ago. In twenty-six, a voice recording is no longer proof that someone said something. It is just proof that someone had enough compute to make it sound like they did. That is why digital watermarking and cryptographic signing of audio are becoming so important. The tech that creates the voice has to be matched by the tech that verifies it.
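For the show notes, a minimal sketch of the signing half of that idea, using the cryptography package's Ed25519 primitives: hash the published audio, sign the hash, and let anyone holding the public key verify it later. This proves provenance of a file; it does not, on its own, detect whether the audio is synthetic.

```python
# Sketch of cryptographically signing a published audio file so its origin
# can be verified later. Uses the `cryptography` package's Ed25519 keys.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

with open("episode.wav", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()   # fingerprint of the audio

signature = private_key.sign(digest)   # publish this alongside the audio
public_key.verify(signature, digest)   # raises InvalidSignature if tampered with
```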
Is there any hope for a "humanity filter"? Like, is there a specific biological frequency in a real human voice that AI simply can't replicate yet? Or is the math just too good now?
The math is very good, but there are still subtle "tells." Most AI voices, even the best ones, have a slightly too-consistent noise floor. Human lungs and throat muscles produce tiny, chaotic fluctuations that are very hard to model perfectly. However, for ninety-nine percent of listeners on a standard pair of headphones, that difference is essentially inaudible. We’ve crossed the Rubicon.
It is an arms race. But on the creative side, it is an explosion. We are seeing a new kind of "audio-first" content that just wasn't possible before. Interactive podcasts, personalized news feeds that sound like your favorite host, games where every single NPC has a unique, evolving voice. It’s a good time to be into voice tech.
It really is. And for our listeners who want to dive deeper, I highly recommend checking out some of those open-source repositories. Even if you aren't a coder, just looking at the demos for Kokoro or F-five-TTS will show you how far the goalposts have moved just in the last six months. The barriers to entry are gone. You can literally clone your own voice on a laptop in under five minutes now.
Which is a fun weekend project until your computer starts talking back to you in your own voice and asking for more RAM.
[Laughs] Well, that’s a different podcast episode entirely.
Well, I think we have covered the landscape. From the tiny but mighty Kokoro to the professional-grade control of Chatterbox and the commercial polish of ElevenLabs, the TTS world in twenty-six is unrecognizable from where it was just two years ago. It’s faster, cheaper, and way more emotional.
And we are the proof. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes and making sure our synthetic vocal cords stay in tune.
This has been My Weird Prompts. If you are enjoying these deep dives into the weird world of AI and technology, a quick review on your podcast app of choice really helps us reach new listeners. It tells the algorithms that we are worth a listen.
We are on Spotify, Apple Podcasts, and pretty much everywhere else. You can also find the full archive and our RSS feed at myweirdprompts dot com.
We’ll be back next time with another prompt from Daniel. Until then, stay curious.
See ya.