When we talk about non-Western AI, the conversation almost always starts and ends with China. Which is fair—they've got the scale, the investment, the headlines. But today's prompt is making us look elsewhere, specifically at Russia, India, and Japan, and asking what's actually growing in those gardens. Are these models inherently multilingual, or are we seeing a lot of stubbornly monolingual development?
It's a great question, and one that gets to the heart of a major shift in the economics of AI. The default assumption has been that you train on English, maybe add some multilingual data as a bonus feature, and that's the product. But that calculus is changing fast. When you have a population the size of Russia's, or India's, the value of a natively proficient model starts to outweigh the cost of building it.
And by the way, fun fact for the day—today's script is being powered by Xiaomi MiMo v2 Pro. So if any of our Russian or Hindi translations sound particularly eloquent, you know who to thank.
Or blame. Let's start with Russia, because their approach is fascinating. You have GigaChat, which is Sber's big play. It was trained on a staggering one point three trillion tokens of Russian-language data. That's their core asset. But—and this is the key architectural choice—they built it with parallel English corpora from day one.
So not a Russian model that later had English bolted on, but a model designed to be bilingual from the ground up.
Precisely. That's the crucial distinction. And the reason this matters is Cyrillic tokenization. Russian morphology is incredibly rich. You have case endings, verb aspects, a level of inflection that Latin scripts just don't have to deal with in the same way. If you try to use a standard tokenizer trained on English web text, you get awful performance. Words get chopped into meaningless subword units that lose all grammatical context. Imagine trying to understand English if every time you saw the word "understanding," it was split into "under," "stand," and "ing" as three completely separate tokens with no grammatical relationship. That's the kind of degradation we're talking about.
So they had to build a custom tokenizer just to handle the structure of the language properly. That sounds like a massive foundational undertaking.
They used SentencePiece as a base, but with a heavily customized vocabulary optimized for Russian morphology. It's not just about having Russian words in the dictionary; it's about how the model breaks down and reconstructs meaning at a sub-lexical level. The result is a model that doesn't just speak Russian fluently, it thinks in Russian structures. And because it was trained with English in parallel, it can bridge between the two without that awkward translation layer you see in models where multilingualism is an afterthought.
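For the curious, the tokenization problem being described can be sketched in a few lines. This is a toy greedy longest-match subword tokenizer with made-up vocabularies, not GigaChat's actual SentencePiece setup; it just shows how an English-centric vocabulary shatters a Russian word while a Russian-aware one keeps it whole:

```python
# Toy greedy longest-match subword tokenizer. The vocabularies are
# hypothetical, chosen only to illustrate the contrast described above.

def tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown characters fall back to single-char tokens."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

# An English-centric vocab has essentially no multi-character Cyrillic pieces...
english_vocab = {"under", "stand", "ing"}
# ...while a Russian-aware vocab keeps stems and endings as meaningful units.
russian_vocab = {"понима", "ние", "понимание"}

print(tokenize("понимание", english_vocab))  # shatters into single characters
print(tokenize("понимание", russian_vocab))  # ['понимание'] — one meaningful unit
```

With the English-centric vocabulary, "понимание" ("understanding") falls apart into nine isolated characters, which is exactly the degradation described above.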
The parallel training isn't just for capability, it's for efficiency in the bridging process. You're avoiding a middleman.
Right. It avoids what I call the "digital sandwich" problem. That's where you have a speech model, then a text translation model, then another speech model stacked together. Each layer adds latency and error propagation. GigaChat's unified architecture means the translation happens inside the model's latent space, not between discrete components. Think of it like a person who is natively bilingual, versus someone who has to constantly consult a dictionary. The native speaker's flow is seamless.
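The "digital sandwich" point can be made with back-of-envelope arithmetic: errors compound multiplicatively across stacked components. The accuracy figures below are illustrative assumptions, not measurements of any real system:

```python
# Toy error-propagation model for a cascaded pipeline
# (e.g. speech model -> translation model -> speech model).
# Stage accuracies are assumed numbers for illustration.

def cascade_accuracy(stage_accuracies):
    """End-to-end accuracy if each stage's errors compound independently."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

# Three 95%-accurate stages compound to roughly 86% end-to-end...
print(cascade_accuracy([0.95, 0.95, 0.95]))  # 0.857375
# ...so a single unified model at 93% beats the whole stack.
print(cascade_accuracy([0.93]))
```

The independence assumption is a simplification, but the direction of the effect is why unified architectures avoid the sandwich.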
Okay, so that's Russia. Now, India is a completely different beast linguistically. You're not dealing with one or two languages, you're dealing with twenty-two official languages and hundreds of dialects. How does Sarvam AI approach that?
This is where it gets really clever, and also where you see the tradeoffs starkly. Sarvam uses what they call a "language routing" system. When a query comes in, it first goes through a classifier that identifies the language. That classifier has about ninety-four percent accuracy, which sounds high until you realize what that six percent error rate means in practice.
Six percent of queries get sent to the wrong specialized sub-model. That could be a significant user experience headache.
And that creates these latency spikes and accuracy drops that are really noticeable to users. But the architecture itself is fascinating. Instead of trying to build one giant model that knows all twenty-two languages equally well—which current research shows is nearly impossible without massive compromises—they have specialized sub-models for language clusters. So there's a Dravidian language cluster for Tamil, Telugu, Kannada. An Indo-Aryan cluster for Hindi, Bengali, Marathi. And a routing layer that directs traffic.
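A minimal sketch of the routing idea, loosely in the spirit of what's described above: real routers use trained classifiers, whereas the Unicode-script heuristic here is a simplifying assumption, and the cluster names are just labels for illustration:

```python
# Minimal script-based language router. Real systems use trained
# classifiers; matching Unicode code-point ranges is a stand-in.

SCRIPT_RANGES = {
    "indo_aryan": [(0x0900, 0x097F), (0x0980, 0x09FF)],  # Devanagari, Bengali
    "dravidian":  [(0x0B80, 0x0BFF), (0x0C00, 0x0C7F)],  # Tamil, Telugu
}

def route(text):
    """Return the sub-model cluster whose script dominates the query."""
    counts = {cluster: 0 for cluster in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for cluster, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[cluster] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "fallback"

print(route("नमस्ते दुनिया"))  # indo_aryan
print(route("வணக்கம்"))       # dravidian
print(route("hello"))          # fallback
```

Even this toy version shows where the error budget lives: any query the router misreads gets a sub-model that was never trained for it.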
So it's more like a federation of specialized models than a single polyglot entity. A union of experts, each with their own domain.
That's a perfect way to put it. And this gets to what researchers call the "multilingual paradox." The data consistently shows that when you train a model on multiple languages, its performance on any individual language is almost always worse than a specialized model trained just on that language. It's the jack-of-all-trades problem. You're sharing parameter capacity across too many tasks.
Which makes intuitive sense. If I'm trying to learn Japanese and Arabic at the same time, I'm probably not going to be as good at either as if I focused on one. But how does that manifest technically in the model? Is it just a general "fuzziness," or are there specific failure modes?
The neural network faces similar capacity constraints. Parameters that could be perfectly tuned for Hindi syntax get partially allocated to Tamil morphology, and neither reaches its full potential. You see it in subtle ways: the model might confuse postpositions common in Indo-Aryan languages with the agglutinative suffixes of Dravidian ones. Or it might generate grammatically correct but stylistically awkward sentences because it's averaging patterns. Sarvam's routing approach is an attempt to get the best of both worlds—shared infrastructure for common reasoning capabilities, but dedicated capacity for linguistic nuance. It’s like having a team of specialists who all attended the same general training bootcamp for core principles, but then each went deep in their own field.
Now, Japan's Sakana AI takes yet another approach, right? They're not even trying to be multilingual. They're going all-in on monolingual specialization.
That's right. And this is where the economics get really interesting. Sakana focuses exclusively on Japanese technical documentation and scientific papers. They have a one hundred and twenty-five million parameter model—which by today's standards is tiny—that outperforms much larger general models on domain-specific Japanese tasks.
How is that possible? It seems counterintuitive that a smaller, narrower model could beat a giant like GPT-4 at anything.
Three reasons. First, the data quality is exceptional. They're not training on random web scrapes; they're training on curated, high-quality Japanese technical literature. It's the difference between learning a subject from a stack of peer-reviewed journals versus a pile of random blog posts and social media comments. Second, the tokenizer is, again, optimized specifically for Japanese, which has its own unique challenges with kanji, hiragana, and katakana all mixed together. A good tokenizer for Japanese needs to understand that a single conceptual word might be represented by a compound of multiple kanji characters, and that the grammatical glue is often in hiragana attached at the end. And third—and this is the economic insight—training a monolingual model costs dramatically less. We're talking about forty percent cheaper than training an equivalent multilingual model.
Forty percent? That's not a rounding error. For a startup or a research lab, that's the difference between launching a project and shelving it.
Not at all. And the savings come from multiple places. Your vocabulary is smaller, so your embedding layer is smaller. Your alignment complexity is reduced because you're not trying to map concepts across dozens of languages. And your data pipeline is simpler. For a country like Japan, which has a massive domestic market but where English proficiency is relatively low, the business case for a hyper-specialized Japanese model is compelling. They're not trying to serve the world; they're trying to serve Japan, and do it exceptionally well.
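The embedding-layer part of those savings is easy to quantify. The vocabulary sizes and hidden dimension below are assumed figures for illustration, not Sakana's actual numbers, and embeddings are only one slice of total training cost:

```python
# Back-of-envelope on where monolingual savings come from: a smaller
# vocabulary directly shrinks the embedding matrices. All figures
# here are assumptions for illustration.

def embedding_params(vocab_size, hidden_dim):
    """Parameter count for input embeddings plus an untied output projection."""
    return 2 * vocab_size * hidden_dim

multilingual = embedding_params(vocab_size=250_000, hidden_dim=4096)
monolingual  = embedding_params(vocab_size=48_000,  hidden_dim=4096)

print(f"multilingual embeddings: {multilingual / 1e9:.2f}B params")
print(f"monolingual embeddings:  {monolingual / 1e9:.2f}B params")
print(f"embedding params saved:  {1 - monolingual / multilingual:.0%}")
```

Under these assumed numbers the embedding matrices alone shrink by about eighty percent; add the simpler alignment and data pipeline and an overall saving in the forty-percent ballpark stops looking surprising.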
This makes me think about data moats. We talk about those in the Western context—companies with unique datasets having advantages. But Russia's Runet, the Russian-language internet, is essentially a walled garden from a data perspective. Western models have limited access to that.
It's a genuine competitive advantage. And India's vernacular internet is even more extreme. There are hundreds of millions of people online in India who primarily interact in languages like Tamil, Telugu, or Bengali. That data simply doesn't exist in the English web scrapes that power most Western models. If you want to serve those users well, you either need to partner with local companies who have that data, or you need to build it yourself. It's not just about translating English concepts; it's about understanding the cultural context embedded in the language—how people ask for help, what metaphors they use, what constitutes a polite or formal request.
Which creates these interesting geopolitical dynamics in AI. It's not just about who has the best algorithms, but who has access to which linguistic data pools. Data becomes a form of digital territory.
And we're seeing this play out in benchmarks. Yandex published a study showing GigaChat outperforming GPT-4 on Russian legal document analysis. Not by a little—by a meaningful margin. And that's not because GigaChat is a "better" model in some absolute sense, but because it understands the nuances of Russian legal terminology, the structure of Russian contracts, the specific phrasing used in Russian jurisprudence. It knows that a certain clause format is standard, or that a particular archaic term is still used in property law.
So for a Russian law firm, the choice isn't even close. It's like hiring a local lawyer who knows the courts versus a brilliant foreign lawyer who has to look everything up.
Not close at all. And we're seeing similar patterns in India. Sarvam's models handle Hindi-English code-switching—that natural mixing of languages that happens in Indian conversation—far better than other large multilingual models. Because they've trained specifically on that pattern, they understand when and how speakers switch between languages mid-sentence. For example, they know that in casual conversation, a content word often stays in English ("maine assignment complete kar liya") while the grammatical frame around it is Hindi. A Western model might get confused by the sudden switch.
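At the surface level, even detecting that a sentence is code-switched is its own task. Here's a tiny token-level script tagger as a sketch; it's a heuristic for illustration, nothing like what a trained model actually does internally:

```python
# Tiny sketch of token-level script tagging for Hindi-English
# code-switching. A heuristic for illustration only.

def tag_tokens(sentence):
    """Label each whitespace token as 'hi' (Devanagari), 'en' (Latin), or 'other'."""
    tags = []
    for token in sentence.split():
        if any(0x0900 <= ord(ch) <= 0x097F for ch in token):
            tags.append((token, "hi"))
        elif any(ch.isascii() and ch.isalpha() for ch in token):
            tags.append((token, "en"))
        else:
            tags.append((token, "other"))
    return tags

# A typical code-switched sentence: Hindi frame, English content words.
print(tag_tokens("मैंने assignment complete कर लिया"))
```

The output alternates hi/en tags mid-sentence, which is exactly the pattern a code-switch-aware model has to treat as one fluent utterance rather than as noise.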
That's a perfect example. It's not just vocabulary; it's the rhythm of how people actually talk. What about the second-order effects here? If we're moving toward a world of specialized regional models, what does that mean for global AI accessibility? Does it create more fragmentation?
That's the million-dollar question. On one hand, you get better service for local populations. An elderly person in rural Tamil Nadu can interact with AI in their native language with real fluency. That's transformative for education, healthcare access, government services. On the other hand, you might get silos where models don't communicate well across linguistic boundaries. Imagine a Russian model and a Japanese model trying to collaborate on a global supply chain problem—their underlying assumptions about contract law or quality control metrics might be subtly different.
Or you get a new kind of digital divide where the quality of AI assistance depends on which language you speak. If the best medical diagnostic AI is only truly fluent in English and Mandarin, speakers of other languages are at a disadvantage.
Which is already happening. The research shows that using AI in your non-native language often costs more computationally and delivers worse results. We actually covered this in an episode about why it costs more to talk to AI in your native tongue. The regional model trend could either alleviate that—by making native-language AI more available—or exacerbate it, if the best capabilities remain locked in English-centric models. The hope is that the economics of specialization make high-quality native-language AI viable for more and more languages.
Let's talk about the practical takeaways for our listeners. If you're a developer building for a specific market, say the Russian market, what's the calculus?
If you're building for Russia, and you need deep linguistic competence—legal, medical, technical documentation—a specialized Russian model like GigaChat will almost certainly outperform a multilingual generalist. And you'll save maybe forty percent on compute costs. The tradeoff is you lose some of the broad world knowledge that comes from being trained on English web data. But for domain-specific applications, that's often an acceptable trade. You're not asking it to write a poem about Shakespeare; you're asking it to parse a patent application.
And for researchers?
The multilingual paradox tells us we need better cross-lingual transfer learning, not just bigger datasets. Throwing more languages at a model isn't the solution. We need architectural innovations that allow models to share abstract reasoning capabilities while maintaining linguistic specialization. The routing approach is one direction, but there are others being explored—mixture of experts, modular architectures, things like that. The goal is a model that can learn from a Tamil text and apply that conceptual understanding in Bengali, without getting the grammar mixed up.
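The mixture-of-experts direction mentioned here boils down to a gating network that softly weights expert outputs instead of hard-routing a query. A stripped-down scalar sketch, with made-up logits and expert outputs:

```python
# Hedged sketch of the mixture-of-experts gating idea: a gate assigns
# softmax weights to experts and the outputs are blended. Scalars stand
# in for full expert networks; all numbers are illustrative.
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_output(gate_logits, expert_outputs):
    """Weighted sum of (scalar) expert outputs under the gate's softmax weights."""
    weights = softmax(gate_logits)
    return sum(w * y for w, y in zip(weights, expert_outputs))

# Two 'experts' (say, a Dravidian and an Indo-Aryan specialist); the gate
# strongly prefers the first for this hypothetical query.
print(moe_output([2.0, -1.0], [0.9, 0.3]))
```

Unlike the hard router described earlier, a misjudged gate degrades the blend gradually rather than sending the whole query to the wrong specialist, which is part of the appeal.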
For businesses looking at international expansion?
Regional data moats are becoming real competitive advantages. If you want to enter the Indian market, partnering with a company like Sarvam that has deep vernacular data is increasingly strategic. Trying to build that capability from scratch with a Western model will leave you at a significant disadvantage. It's not just a technical hurdle; it's a cultural one. These local models understand local humor, local references, local customer service expectations.
It feels like we're moving from a world where AI was essentially "English plus" to a world where it's a genuine mosaic of linguistic specializations. A patchwork quilt rather than a single, large blanket.
And the economics are driving it. As compute costs drop—and they are dropping fast—the barrier to entry for training native-language models is vanishing. Five years ago, only the biggest tech giants could afford to train a competitive model. Now, with more efficient architectures and cheaper compute, regional players can enter the market and serve their linguistic communities effectively. It's the democratization of AI development, in a sense.
So the future isn't just a handful of global models that happen to speak multiple languages. It's potentially hundreds of specialized models, each optimized for their specific linguistic and cultural context. That's a radically different vision.
And that's not a bad thing, necessarily. It's more democratic in a way. It decentralizes AI capability away from Silicon Valley and Beijing and distributes it to wherever there's a sufficiently large language community with the technical capability to build for themselves. It means AI that truly understands a Kenyan farmer's questions in Swahili, or a Brazilian doctor's notes in Portuguese.
Though it does raise questions about interoperability and standards. If I'm using a Russian model and you're using an Indian model, can we collaborate effectively? How do we ensure they're speaking the same conceptual language?
That's the next challenge. We'll need translation layers, not just of language but of conceptual frameworks. Because these models aren't just speaking different languages—they're thinking in different cultural contexts. A Russian model's understanding of "privacy"—shaped by a different history of state surveillance and personal space—might differ subtly from an Indian model's understanding, trained on different legal traditions and social norms. Bridging that is a profound challenge.
So the real multilingual challenge isn't just linguistic, it's conceptual. It's about aligning worldviews, not just words.
And that's where the interesting research will be in the coming years. Not just in better tokenizers or bigger datasets, but in creating models that can navigate these conceptual differences while maintaining their cultural specificity. It's about building meta-models that can orchestrate these specialists, like a UN translator who doesn't just translate words, but mediates meaning.
Well, Daniel, this was a fascinating rabbit hole. Thanks for sending us down it. It's a reminder that the AI landscape is far bigger and more varied than the headlines suggest.
Indeed. And for our listeners, if you're enjoying these deep dives into the global AI landscape, a quick review on your podcast app helps us reach new listeners who might be interested in exactly this kind of analysis. It really does make a difference.
We'll put links to some of the research we mentioned—the Yandex study, Sarvam's architecture papers—in the show notes. This has been My Weird Prompts, and we'll catch you next time.
Thanks as always to our producer Hilbert Flumingtop, and big thanks to Modal for providing the GPU credits that power this show. Until next time.