#2982: Why Your TTS Model Nails "Shabbat" but Not "Keren Hishtalmut"

Why multilingual TTS models handle loanwords but fail at niche vocabulary — and what you can do about it.

Episode Details

Episode ID: MWP-3152
Published: May 22
Duration: 27:25
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: text-to-speech, tokenization, fine-tuning

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

A fascinating asymmetry plagues multilingual TTS and STT systems: common loanwords like "Shabbat" or "gracias" sound natural, but niche vocabulary like "Keren Hishtalmut" gets mangled into English phonetics. The culprit isn't model intelligence — it's training data distribution and subword tokenization.

Words like "Shabbat" appear frequently enough in English-language text to be absorbed as single tokens with learned non-English pronunciation. But "Keren Hishtalmut" — an Israeli financial term — appears only a handful of times in billion-token corpora. Byte Pair Encoding fragments it into pieces like "Hish," "tal," and "mut," each pronounced with English rules because those fragments appear predominantly in English contexts during training.

The problem compounds in speech-to-text, where the decoder's strong English language prior overrides accurate acoustic phoneme capture, producing word salad transcripts. Zipf's law ensures most niche vocabulary lives in the long tail where exposure alone can't train robust mappings. Current workarounds include SSML phoneme overrides and phonetic spelling hacks, both functional but requiring manual effort for every instance.

#2982: Why Your TTS Model Nails "Shabbat" but Not "Keren Hishtalmut"

Daniel sent us this one — he's asking about code-switching in TTS and speech-to-text models, specifically the weird asymmetry where a word like Shabbat comes through perfectly but Keren Hishtalmut gets butchered. The model defaults to English phonetics for anything it hasn't seen enough times in Hebrew context. He's wondering whether these models will ever handle the long tail of niche multilingual vocabulary without per-use-case fine-tuning, and what the most likely architectural solution looks like.

The timing on this is perfect, because I've been pulling papers on exactly this failure mode. What's wild is how cleanly the problem splits — Shabbat works, Keren Hishtalmut doesn't — and that split tells you almost everything about how these models are actually built under the hood.

Let's unpack why a model that nails Shabbat can't handle Keren Hishtalmut — and what that tells us about how these models actually work.

So code-switching in the TTS and STT context is when you embed tokens from a second language into your primary language stream. You're speaking English, but you drop in a Hebrew word written in Latin characters. The model sees the same twenty-six letters it always sees — there's no script change to signal that something different is happening — but the pronunciation rules are completely different.

Which is the phonetic trap. The word looks English but isn't.

The grapheme-to-phoneme converter (that's the G2P module, the thing that maps written characters to sounds) sees Keren Hishtalmut and applies English rules. Keren becomes "kare-in" instead of "keh-ren," and Hishtalmut becomes whatever mess the tokenizer produces. And here's the thing: this isn't a bug in the model's reasoning. It's a consequence of how the training data is distributed.

Walk me through the distribution problem.

Think about what's in the training corpus. A word like Shabbat appears constantly in English-language text. It's in news articles, in cookbooks, in interfaith dialogue transcripts. It's effectively been absorbed into English as a loanword. The model has seen Shabbat in thousands of English sentences, often with surrounding context that cues the Hebrew pronunciation. It's not foreign vocabulary anymore — it's English vocabulary with non-English phonetics, and the model has learned that mapping.

Like chutzpah or kibbutz.

Chutzpah is so thoroughly absorbed that most English speakers don't even think of it as foreign. But Keren Hishtalmut? That's an Israeli financial term — a type of savings fund. It appears almost exclusively in Hebrew-language financial documents or in niche English-language content from Israel. In a one-billion-token corpus, you might see it three or four times. And that's not enough for the model to learn a robust phoneme mapping.

The problem isn't that the model can't handle Hebrew phonetics. It's that it doesn't know this particular word belongs to Hebrew.

And the answer to why starts with something deceptively simple: how the model chops up your words before it even tries to say them.

So most modern TTS and STT systems use subword tokenization — Byte Pair Encoding or SentencePiece, things like that. The idea is you don't tokenize at the word level because that gives you an enormous vocabulary with terrible coverage. And you don't tokenize at the character level because that loses all the useful structure. You split words into common subword units. The tokenizer is trained to maximize coverage of the training corpus with a fixed vocabulary size — Whisper uses about fifty thousand tokens, for example.

Shabbat makes the cut.

Shabbat appears as a single token in most multilingual tokenizers. It's frequent enough that the BPE algorithm never splits it. So when the model sees that token, it can associate it with a learned pronunciation that includes the Hebrew phonetics. But Keren Hishtalmut? Let me walk through what actually happens. In a standard fifty-thousand-token BPE vocabulary, Hishtalmut gets fragmented into something like Hish plus tal plus mut. Maybe even more granular depending on the exact tokenizer.

Three separate tokens, each pronounced with English phoneme rules.

Here's the crucial thing: each of those tokens appears in thousands of English words. Hish appears in "hish" — well, that's not a real English word, but the tokenizer doesn't care. It's seen Hish in fragments. Tal appears in "total" and "mental" and "vital." Mut appears in "mutter" and "mutation." So when the acoustic model generates speech, it's pulling from English phoneme distributions for each fragment. You get Hish like "fish" without the F, tal like the first syllable of "talent," mut like "mutt." None of the Hebrew phonology survives.
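As a rough illustration of the fragmentation described above, here is a toy Byte Pair Encoding trainer. The word counts are invented for the example and the code is a minimal sketch of the general BPE idea, not the tokenizer of any particular TTS or STT system. It shows how a frequent word can end up as one token while a rare word is left as pieces learned from unrelated frequent words.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical counts: a frequent loanword, common English words, one rare term.
TOY_COUNTS = {"shabbat": 5000, "total": 3000, "mutter": 2000,
              "fish": 2500, "hishtalmut": 3}
merges = train_bpe(TOY_COUNTS, num_merges=18)
print(encode("shabbat", merges))     # the frequent word collapses to one token
print(encode("hishtalmut", merges))  # the rare word stays in several pieces
```

The merge budget plays the role of the fixed vocabulary size: every merge spent on a frequent word is one that is not available for the rare one.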

The attention mechanism doesn't rescue this, because...

Because the attention mechanism in something like a Tacotron2-derived architecture — which is what a lot of these systems are built on — doesn't have an explicit language identification head. It's not looking at a token and asking, "what language is this?" It's looking at a token and its surrounding context and trying to predict the next mel-spectrogram frame. The language signal is implicit in the training data. If a token never appeared in a Hebrew phonetic context during training, the attention mechanism has no way to route it to Hebrew pronunciation rules.

Even though cross-attention is attending across the whole input sequence, it's attending with English-shaped weights.

The key and query matrices were trained predominantly on monolingual English or language-separated multilingual data. In standard multilingual training, you don't mix languages in a single utterance — you train on English sentences, then French sentences, then Hebrew sentences. The model learns to produce English phonemes, French phonemes, Hebrew phonemes, but it never learns to switch between them mid-sentence. Code-switching is an out-of-distribution scenario.

That's the misconception I think a lot of people have — that multilingual models handle code-switching automatically because they "know" multiple languages.

They don't. They know multiple languages in parallel. It's like having three separate radio stations and being able to tune to any of them, but code-switching is asking the radio to play two stations simultaneously. The architecture isn't built for it unless you explicitly train for it.

What about the few cases where it does seem to work? Like, I've heard models handle "gracias" in an English sentence reasonably well. Is that just because "gracias" has crossed the loanword threshold?

That's exactly what's happening. "Gracias" appears in English contexts constantly — in TV shows, in casual conversation transcripts, in social media. It's crossed over. The model has effectively learned it as an English word with Spanish phonetics. But try a less common Spanish word — say, "aguinaldo," which is a Christmas bonus in some Latin American countries. The model has never seen it in an English sentence. It'll pronounce it as "ag-win-al-doe" instead of "ah-gee-nahl-doh." The loanword threshold is the invisible line, and most niche vocabulary falls well below it.

The loanword club has a very exclusive membership.

And it's not just about raw frequency — it's about frequency in the right context. A word could appear thousands of times in monolingual Hebrew training data, but if it never appears in English-Hebrew code-switched utterances, the model never learns the transition. It knows the word in Hebrew and it knows English, but it doesn't know how to stitch them together.

What about the speech-to-text side? Same problem in reverse?

Same problem with an extra twist. In STT, you've got the acoustic model converting audio to features, and then the decoder — which in Whisper is an autoregressive language model — generates the text transcript. That decoder has a strong language model prior toward the dominant language. So when you say "I contributed to my Keren Hishtalmut today" with correct Hebrew pronunciation, the acoustic model might actually capture the phonemes reasonably well. But then the decoder looks at that phoneme sequence and asks: what's the most probable English text that produced these sounds?

The probability of "Keren Hishtalmut" given an English context is near zero.

In the decoder's probability distribution, P of Keren Hishtalmut given English context is vanishingly small. It's going to reach for the nearest English-sounding phrase. Maybe "Karen is tell me to" or something equally nonsensical. I've seen transcripts where Hebrew financial terms get rendered as complete word salad.
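The way a strong language prior can override good acoustic evidence can be sketched with a two-candidate rescoring example. All the probabilities below are made up for illustration, and real decoders score token by token inside a beam search rather than over whole phrases, so treat this as a simplified picture of the effect only.

```python
import math

def pick_transcript(candidates):
    """Return the candidate with the highest combined log score."""
    return max(
        candidates,
        key=lambda c: math.log(c["acoustic"]) + math.log(c["lm_prior"]),
    )

# Hypothetical scores: the correct phrase matches the audio far better,
# but the English-dominated prior considers it almost impossible.
candidates = [
    {"text": "Keren Hishtalmut", "acoustic": 0.90, "lm_prior": 1e-9},
    {"text": "Karen is tell me to", "acoustic": 0.05, "lm_prior": 1e-5},
]

best = pick_transcript(candidates)
print(best["text"])
```

With these numbers the English-sounding phrase wins even though its acoustic match is much worse, because the prior differs by four orders of magnitude.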

Of course there are.

The decoder's confidence scores for those tokens are abysmal, but the beam search doesn't have a better alternative, so it just picks the least improbable option. The Zipfian distribution is absolutely brutal here.

Explain the Zipfian part.

Zipf's law says that in a natural language corpus, the frequency of a word is roughly inversely proportional to its rank. The most common word appears about twice as often as the second most common, three times as often as the third, and so on. What this means for code-switched vocabulary is that in a one-billion-token corpus, roughly fifty percent of unique word types appear exactly once. And that's where all the niche vocabulary lives. Keren Hishtalmut, Pikuach Nefesh, Bituach Leumi: these are all in the long tail. You can't train a model to handle them through exposure because there isn't enough exposure to train on.
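The rank-frequency relationship can be written down directly. The sketch below assumes an idealized Zipf curve with exponent 1 over a finite vocabulary; the vocabulary size and corpus size are placeholder values, and real corpora only follow this curve approximately.

```python
def zipf_expected_count(rank, vocab_size, corpus_tokens):
    """Expected occurrences of the word at `rank` under an ideal Zipf curve."""
    # Normalizing constant: the harmonic number of the vocabulary size.
    harmonic = sum(1.0 / r for r in range(1, vocab_size + 1))
    return corpus_tokens / (rank * harmonic)

# Placeholder sizes chosen only for illustration.
VOCAB_SIZE = 100_000
CORPUS_TOKENS = 1_000_000_000

top = zipf_expected_count(1, VOCAB_SIZE, CORPUS_TOKENS)
second = zipf_expected_count(2, VOCAB_SIZE, CORPUS_TOKENS)
third = zipf_expected_count(3, VOCAB_SIZE, CORPUS_TOKENS)
print(round(top / second, 6), round(top / third, 6))
```

Plugging in larger ranks shows how quickly the expected count falls off as a word moves into the tail.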

Fine-tuning on a few hundred examples per word seems like the obvious fix, but that's not actually practical.

That's the second big misconception. People think, "oh, I'll just fine-tune on my podcast scripts and it'll learn my vocabulary." But to actually shift the model's phoneme distribution for a specific rare word, you need hundreds of examples of that word in context — properly pronounced, properly transcribed. And each new word requires new data. If your podcast uses fifty niche Hebrew terms, you need hundreds of examples for each of them. That's thousands of annotated training examples. And you have to repeat that every time the base model updates.

Which for Chatterbox or ElevenLabs — proprietary APIs — you can't even do.

You can't touch the model weights at all. Your only options are what you can do at inference time. And the current workarounds are... let's say they're functional but not elegant.

Walk me through what's available right now.

Option one: SSML tags with phoneme overrides. SSML is Speech Synthesis Markup Language — it lets you embed pronunciation instructions directly in the text you send to the TTS API. You write something like: open bracket phoneme alphabet equals ipa ph equals forward slash keh-ren heesh-tahl-moot forward slash close bracket, then the word Keren Hishtalmut, then close phoneme. And the model uses your specified pronunciation instead of guessing.
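Written out rather than spoken, the tag described here is the standard SSML phoneme element. The small helper below builds one. The IPA string is an approximate transcription supplied for illustration, and whether a given TTS engine honors the phoneme element is something to check against that engine's documentation.

```python
from xml.sax.saxutils import escape, quoteattr

def phoneme_tag(text, ipa):
    """Wrap `text` in an SSML phoneme element carrying an IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(text)}</phoneme>"

# Approximate IPA, given as an assumption rather than an authoritative transcription.
tag = phoneme_tag("Keren Hishtalmut", "ˈkeʁen hiʃtalˈmut")
print(f"<speak>I contributed to my {tag} today.</speak>")
```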

It works perfectly. For that one occurrence. But you have to do it manually for every single instance of every niche word in your script. If you're producing a thirty-minute podcast with twenty code-switched terms, that's a lot of manual annotation. And you need to know the IPA for each word, which most people don't have memorized.

Covering the covers.

There are tools that help — Phonemizer is an open-source library that can convert Hebrew text to IPA automatically. So you could build a preprocessing pipeline: detect Hebrew words in your English script, run them through Phonemizer, wrap them in SSML tags, then send the annotated text to the TTS API. It's clunky but it works.

Option two: phonetic spelling hacks. You write "Keren Heeshtahlmoot" instead of "Keren Hishtalmut" and hope the English G2P rules accidentally produce something close to the Hebrew pronunciation. It's unreliable and it makes your scripts unreadable. You're basically guessing at how the model will interpret your respelling, and different models will interpret it differently.

The glockenspiel of corporate approachability.

I don't even know what that means, but I agree.

Option three — fine-tuning, which we've established doesn't scale for proprietary models.

For open-source models like Coqui TTS, you could theoretically fine-tune on a thousand synthetically generated code-switched sentences. But even that is a significant engineering effort, and you have to maintain your fine-tuned fork as the base model evolves. For most podcasters and content producers, it's just not viable.

What's the actual solution on the horizon? You mentioned a Google paper.

Google published a paper at ICASSP twenty twenty-four — that's the International Conference on Acoustics, Speech, and Signal Processing — proposing a code-switching TTS system with a language-tagging frontend. The idea is elegant: before the text hits the G2P converter, a lightweight classifier tags each token or phrase with a language ID. Then the G2P converter routes the token to the appropriate language-specific phoneme generator. They got about a fifteen percent improvement in pronunciation accuracy on code-switched utterances.

Fifteen percent is meaningful but not solved.

It's progress, not a solution. The limitation is that the language tagger still needs to have seen the word before to know which language it belongs to. For truly rare vocabulary, the tagger itself might misclassify.

Meta's SeamlessM4T?

SeamlessM4T came out in twenty twenty-three, supports over a hundred languages, and was explicitly designed for speech-to-speech and speech-to-text translation. It uses a unified multilingual encoder, which means it processes all languages through the same representation space rather than routing to language-specific encoders. That's actually better for code-switching in theory, because the model isn't forced to pick a language upfront.

In practice, it still shows about a thirty percent word error rate on code-switched utterances with rare vocabulary. The encoder can handle the mixed audio, but the decoder still has that language model bias toward the dominant language. It's a better architecture for the problem, but it hasn't solved the long tail.

We've got incremental improvements from language tagging and unified encoders, but the fundamental issue — the model hasn't seen the word during training — persists.

That's why the most interesting direction, in my view, is retrieval-augmented generation for TTS.

The idea is that at inference time, when the model encounters a token it doesn't recognize — or more precisely, a token with low confidence in its phoneme prediction — it queries an external pronunciation knowledge base. That knowledge base could be a multilingual pronunciation dictionary, or it could be a separate G2P model trained specifically on low-resource languages. The model retrieves the correct phoneme sequence and injects it into the decoder.
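A minimal sketch of that lookup logic follows. The `model_g2p` function is a hypothetical stand-in for a neural G2P that reports a confidence score, and the lexicon, threshold, and IPA strings are illustrative assumptions. No shipping TTS system is claimed to work this way.

```python
def model_g2p(word):
    """Hypothetical stand-in for a neural G2P: returns (phonemes, confidence)."""
    known = {"today": ("təˈdeɪ", 0.98)}
    # Anything the stub has not "seen" comes back as a low-confidence guess.
    return known.get(word.lower(), ("?", 0.10))

def resolve_pronunciation(word, lexicon, g2p=model_g2p, threshold=0.6):
    """Use the model's guess unless it is unsure and the lexicon has an entry."""
    phonemes, confidence = g2p(word)
    if confidence < threshold and word.lower() in lexicon:
        return lexicon[word.lower()], "lexicon"
    return phonemes, "model"

# Illustrative external pronunciation store with approximate IPA.
LEXICON = {"hishtalmut": "hiʃtalˈmut"}

for word in ["today", "Hishtalmut", "aguinaldo"]:
    print(word, resolve_pronunciation(word, LEXICON))
```

A word missing from both the model and the lexicon still falls through to the low-confidence guess, which mirrors the long-tail limitation discussed in the episode.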

Instead of trying to memorize every rare word during training, you offload that to a database lookup at runtime.

And this is conceptually similar to what retrieval-augmented generation does for large language models — instead of storing all factual knowledge in the model weights, you let the model query an external knowledge store when it needs specific information. For TTS, the "knowledge" is pronunciation, and the "store" is a phoneme dictionary.

Has anyone shipped this?

Not in a major TTS system as of May twenty twenty-six. There are research prototypes — there was a paper from a team at Carnegie Mellon last year that demonstrated a proof of concept for Mandarin-English code-switching using a retrieval-augmented pronunciation module. But it added about two hundred milliseconds of latency per query, which is a problem for real-time TTS.

Two hundred milliseconds is noticeable.

It's the difference between natural-feeling speech and that slight delay that makes the listener feel like something's off. For podcast production where you're generating audio offline, that latency doesn't matter at all — you'd happily trade two hundred milliseconds per rare word for correct pronunciation. But the big TTS providers are optimizing for real-time use cases like voice assistants, where latency is critical.

The podcast use case gets deprioritized.

But here's what I think happens: within two to three years, we'll see TTS systems that offer a hybrid mode. Real-time for common vocabulary with cached pronunciations, and an async mode that does RAG lookups for rare terms. Or they'll precompute pronunciations for known rare vocabulary before generation starts, so there's no runtime latency penalty.

There's another dimension to this that Daniel's prompt mentions — the character set issue. When Hebrew words are written in Latin script, the model loses the orthographic signal.

This is a huge factor that doesn't get enough attention. If you're writing Arabic words in an English sentence and you preserve the Arabic script, even just for those words, the model immediately knows something different is happening. The character set is a strong language signal. But when you transliterate Hebrew into Latin characters, the model sees the same A through Z it sees everywhere else. There's no visual cue that says "switch phoneme rules now."

It's like writing French without the accents and expecting the model to know it's French.

That's actually a perfect analogy. French without diacritics is ambiguous: "eleve" could be élève or élevé, and the pronunciation changes. Hebrew transliterated into Latin script is even worse because there's no standardized transliteration system that everyone uses. Is it Shabbat or Shabbos? Keren Hishtalmut or Keren Hishtalmoot? Chanukah or Hanukkah? The same Hebrew word can have five different Latin-script representations depending on who's writing it.

Which means even if you built a perfect pronunciation knowledge base, you'd still have the matching problem — mapping the variable transliteration to the canonical Hebrew form.

You'd need a transliteration normalizer as part of the pipeline. And that's a whole separate research problem. There are tools in this space, such as the open-source uroman romanizer from USC's Information Sciences Institute, which converts text from a wide range of scripts into Latin characters, but integrating one into a TTS pipeline is non-trivial.
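To make the matching problem concrete, here is a deliberately naive normalizer that folds a few spelling variants onto one lookup key. The three rules are ad hoc assumptions chosen to cover the examples mentioned in the conversation. They are not a general Hebrew transliteration standard, and they do not use uroman.

```python
import re

def normalize_translit(text):
    """Fold a few common Latin-script spelling variants onto one key."""
    key = text.lower()
    key = key.replace("oo", "u")          # Hishtalmoot -> hishtalmut
    key = key.replace("ch", "h")          # Chanukah -> hanukah
    key = re.sub(r"(.)\1+", r"\1", key)   # collapse doubled letters
    return key

# Hypothetical canonical store keyed by normalized spelling.
CANONICAL = {
    normalize_translit("Hanukkah"): "חנוכה",
    normalize_translit("Keren Hishtalmut"): "קרן השתלמות",
}

for spelling in ["Chanukah", "Hanukkah", "Keren Hishtalmoot"]:
    print(spelling, "->", CANONICAL.get(normalize_translit(spelling)))
```

Rules like these cover spelling variation only. A pair such as Shabbat and Shabbos reflects a difference in pronunciation tradition, and these rules leave the two as separate keys.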

We've got three layers of difficulty. The tokenizer fragments rare words. The attention mechanism has no language routing. And the Latin-script transliteration strips the orthographic signal.

The solutions map to those layers. Language-aware tokenization addresses the fragmentation. Language embeddings or tagging addresses the routing. And RAG-based pronunciation addresses the missing training data. The fully solved system probably needs all three.

Let me ask the practical question. If I'm producing a podcast today with code-switched Hebrew vocabulary, what's my actual workflow?

The most reliable approach right now is the SSML preprocessing pipeline. You maintain a pronunciation dictionary for your niche vocabulary — Hebrew word on the left, IPA phoneme sequence on the right. You write a script that scans your episode text for those words, wraps them in SSML phoneme tags, and sends the annotated text to Chatterbox or whatever TTS API you're using.
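The dictionary-driven workflow described here could look roughly like the following. The lexicon entry and its IPA value are illustrative, and the output is generic SSML. How a specific TTS API accepts SSML, or whether it does at all, is an assumption to verify.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def annotate_script(script, lexicon):
    """Wrap every lexicon term found in `script` in an SSML phoneme tag."""
    if not lexicon:
        return "<speak>" + escape(script) + "</speak>"
    # Longest terms first so multi-word entries win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): ipa for term, ipa in lexicon.items()}
    parts, pos = [], 0
    for match in pattern.finditer(script):
        parts.append(escape(script[pos:match.start()]))
        ipa = lookup[match.group(0).lower()]
        parts.append(
            f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>"
            f"{escape(match.group(0))}</phoneme>"
        )
        pos = match.end()
    parts.append(escape(script[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"

# Illustrative dictionary: term on the left, approximate IPA on the right.
LEXICON = {"Keren Hishtalmut": "ˈkeʁen hiʃtalˈmut"}
print(annotate_script("I contributed to my Keren Hishtalmut today.", LEXICON))
```

Terms that are not in the dictionary pass through untouched, which matches the caveat that unlisted words still get mispronounced.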

For words I haven't added to the dictionary yet, they still get butchered.

They still get butchered. But the dictionary grows over time. After twenty episodes, you've probably covered ninety percent of the Hebrew vocabulary you regularly use. The long tail of one-off terms remains a problem, but it's manageable.

Phonemizer automates the IPA conversion?

Phonemizer can take Hebrew text in Hebrew script and output IPA. But you need the word in Hebrew script first. So if your script has "Keren Hishtalmut" in Latin characters, you need an intermediate step — either a lookup table that maps the transliteration to Hebrew script, or a transliteration tool. It's a multi-step pipeline, but it's all automatable.

The advice is: build the pipeline once, maintain a growing pronunciation dictionary, and accept that one-off rare terms will need manual attention.

For developers building custom TTS systems, there's a slightly different approach. You can do a two-pass pipeline. First pass: run a lightweight language detector — something like langid dot py or FastText — on each sentence or phrase to identify code-switch boundaries. Second pass: route each segment to a language-specific G2P converter before feeding everything to the acoustic model. This works well if you're building on open-source components like Coqui TTS or Piper.
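The two-pass idea can be sketched without any external dependencies. In the sketch below, `detect_language` is a hypothetical word-list stub standing in for a real detector such as langid.py or fastText, and the per-language G2P functions are placeholders that only label their input.

```python
from itertools import groupby

# Hypothetical word list standing in for a trained language detector.
HEBREW_TERMS = {"keren", "hishtalmut"}

def detect_language(token):
    """Stub detector: tag a token as Hebrew if it is in the word list."""
    return "he" if token.lower().strip(".,!?") in HEBREW_TERMS else "en"

def segment_by_language(text):
    """First pass: group consecutive tokens that share a language tag."""
    tokens = text.split()
    return [(lang, " ".join(group))
            for lang, group in groupby(tokens, key=detect_language)]

# Placeholder G2P converters; real ones would return phoneme sequences.
G2P = {
    "en": lambda chunk: f"[en-g2p:{chunk}]",
    "he": lambda chunk: f"[he-g2p:{chunk}]",
}

def route_to_g2p(text):
    """Second pass: send each segment to its language-specific converter."""
    return [G2P[lang](chunk) for lang, chunk in segment_by_language(text)]

print(route_to_g2p("I contributed to my Keren Hishtalmut today"))
```

The quality of the whole pipeline rests on the first pass. A detector that has never seen a rare term will tag it as the dominant language, and the routing will then be wrong.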

What about the STT side for a podcast producer? If I want my code-switched podcast to have accurate transcripts?

That's harder because you don't control the STT model at inference time the way you do with SSML for TTS. With TTS, you're generating the audio, so you can annotate the input. With STT, you're transcribing existing audio, and you can't inject pronunciation hints.

You're stuck with post-processing.

Post-processing is your main tool. You run Whisper or whatever STT engine you're using, then you have a second pass that identifies likely code-switched terms and corrects them. You can do this with a simple find-and-replace dictionary for your known vocabulary, or with a more sophisticated language model that understands your domain. Some podcasters I know run their transcripts through a second LLM that's been prompted with their common Hebrew vocabulary and told to fix transliterations.
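The simple find-and-replace version of that second pass might look like this. The correction table is hypothetical. In practice it would be built up from the mis-transcriptions a given STT engine actually produces for a show's vocabulary.

```python
import re

def fix_transcript(text, corrections):
    """Replace known mis-transcriptions with the intended terms."""
    # Longest wrong phrases first so overlapping entries resolve predictably.
    for wrong in sorted(corrections, key=len, reverse=True):
        right = corrections[wrong]
        text = re.sub(
            r"\b" + re.escape(wrong) + r"\b",
            lambda _match, r=right: r,  # callable avoids backslash handling
            text,
            flags=re.IGNORECASE,
        )
    return text

# Hypothetical table of observed errors mapped to the intended spelling.
CORRECTIONS = {"Karen is tell me to": "Keren Hishtalmut"}

raw = "I contributed to my Karen is tell me to today"
print(fix_transcript(raw, CORRECTIONS))
```

A table like this only catches errors that have been seen before, which is why the LLM-based second pass is mentioned as the more flexible option.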

Which is a very "twenty twenty-six" solution — use an LLM to clean up after another model.

The "AI sandwich" approach. One model generates, another model fixes. It works, but it's not elegant.

Let's zoom out to the prediction question. Will TTS and STT models catch up so that the long tail of code-switched vocabulary just works?

I think "just works" is a high bar, but I think we'll get to "works well enough for most use cases" within three to five years. The trajectory is clear: language-aware tokenization is already being integrated into next-generation models. Meta's Seamless architecture shows that unified multilingual encoders can handle code-switching better than language-separated training. And RAG for pronunciation is a natural extension of the retrieval-augmented paradigm that's sweeping through AI right now.

The question is whether the industry standardizes on one approach or fragments.

That's the open question. Language-aware tokenization and RAG-based pronunciation solve slightly different parts of the problem. Tokenization improvements help the model know that a word might be foreign. RAG gives it the correct pronunciation even for unseen words. You could build a system with either approach and get decent results, but the best system would use both.

The latency tradeoff means different solutions for different use cases.

Real-time systems like voice assistants will probably go the language-aware tokenization route because it doesn't add inference-time latency. Offline systems like podcast TTS generators will adopt RAG because they can afford the lookup time. We might end up with a bifurcated market where the same underlying model has different operating modes.

Which is already how a lot of AI systems work — different inference configurations for different latency budgets.

And I think there's an interesting parallel to how humans handle this. When a fluent bilingual speaker encounters an unfamiliar word from their second language embedded in their first language, they don't always get the pronunciation right either. They might approximate it using the phonetics of their dominant language. The model's behavior isn't that different from human behavior. It's just that humans have the metacognitive ability to recognize when they're uncertain and ask for clarification.

The model just barrels ahead with wrong pronunciation and full confidence.

Which is, in its own way, the most human thing about it.

Alright, so to pull this together — the core problem is that code-switching exposes the monolingual assumptions baked into tokenization, attention, and training data distribution. The solutions are incremental but real: language tagging, unified encoders, and retrieval-augmented pronunciation. For podcasters today, the practical path is SSML annotation with a growing pronunciation dictionary. And the prediction is that within a few years, RAG-TTS systems will make fine-tuning for niche vocabulary unnecessary.

I'd add one thing for listeners who produce code-switched content: try the SSML approach. Build that pronunciation dictionary. It's front-loaded effort, but after a few episodes it becomes routine. And we'd actually love to hear from people who've built these pipelines — what worked, what broke, what tools you used.

Because code-switching isn't a bug in multilingual models. It's a feature of human language that models are only beginning to learn.

Now: Hilbert's daily fun fact.

Hilbert: During the Cold War, the Wai Wai people of Guyana traditionally dyed cotton fibers a vivid crimson using the crushed shells of a specific river snail, but they could only harvest the snails during the dry season when capybaras — which are the snails' primary predator — migrated away from the riverbanks, creating a brief window where snail populations were both abundant and accessible to human gatherers.

...right.

This has been My Weird Prompts. Our producer is Hilbert Flumingtop. Find us at myweirdprompts dot com or wherever you get your podcasts. If you've built a code-switching TTS pipeline, we want to hear about it — drop us a review and tell us what worked.

I'm Herman Poppleberry.

I'm Corn. See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
