Why can BERT answer questions about a text with surgical precision but fail to write a simple three-paragraph story, while GPT can weave a complex narrative but occasionally struggles with basic retrieval tasks? It sounds like a riddle, but the answer actually lies in the structural DNA of the models themselves. We are talking about the three distinct transformer architectures that define almost everything we do in AI today.
It is a classic case of form following function. You have these three lineages that all sprouted from the same 2017 research, but they evolved to solve fundamentally different problems. I am Herman Poppleberry, and today we’re performing a bit of digital taxonomy. We’re looking at the Transformer Trinity: Encoder-only, Decoder-only, and the original Encoder-Decoder setup.
And we have a great prompt from Daniel to guide us through this. He wants us to walk through each one in order, look at where they live in the wild today, and then settle the ultimate debate: why the decoder-only models essentially ate the world while the others are still holding onto very specific, very important niches. By the way, today’s episode is powered by Google Gemini 3 Flash, which is actually a perfect meta-example of the very things we’re discussing.
It really is. And to understand why Gemini or Claude or Llama work the way they do, we have to go back to that original 2017 paper, "Attention Is All You Need." That paper introduced the world to the Transformer, but the version they described back then was actually the most complex of the three: the Encoder-Decoder. Since then, the industry has sort of "deconstructed" that original design into specialized variants.
Right, so let’s start with the first specialist: the Encoder-only architecture. This is the world of BERT, RoBERTa, and DeBERTa. If the AI world were a construction site, these would be the inspectors. They aren't building the house; they’re looking at the blueprints from every possible angle to make sure they understand exactly what’s going on.
That’s a great way to frame it. The "secret sauce" of the Encoder-only model is something called Bidirectional Attention. When an encoder processes a sentence, it doesn’t read left-to-right like a human. It looks at every single word in the sequence simultaneously. It sees the beginning, the middle, and the end all at once. If you have the sentence "The bank was closed because of the river flood," an encoder looks at the word "bank" and immediately sees "river" at the end of the sentence. It uses that context to instantly realize we’re talking about a geographic feature, not a financial institution.
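That "look at everything at once" behavior is easy to sketch in a few lines. This is a toy numpy illustration of the idea, not any real model's code: with no mask applied, every token's attention row spreads over the entire sequence, later words included.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for a 5-token sentence (dim 4); random values, shapes only.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Bidirectional self-attention: queries, keys, and values all come from the
# same sequence, and NO mask is applied, so every token attends to every
# position, including ones later in the sentence.
scores = x @ x.T / np.sqrt(x.shape[1])   # (5, 5) similarity matrix
weights = softmax(scores, axis=-1)       # each row sums to 1

# Every entry is nonzero: a token like "bank" gets signal from the last word
# ("flood") just as easily as from its left neighbor.
```

The point is what is absent: there is no triangular mask zeroing out the future, which is exactly what the decoder adds.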
It’s basically the "spoilers allowed" version of reading. It knows the ending before it’s even finished processing the first word. And that’s why it’s so dominant in things like embeddings and classification, right? Because to turn a sentence into a mathematical vector, you need that total, holistic understanding.
Exactly, you’ve hit the nail on the head. Because it has this global view, it creates superior vector representations. This is why, even in 2026, if you are building a RAG system—Retrieval-Augmented Generation—you are almost certainly using an encoder-only model for your embeddings. When you search for a concept in a database, you want a model that understands the deep, multi-directional context of your query. A decoder-only model, which only looks backward, often misses those subtle forward-looking cues that define the true meaning of a sentence.
I think people forget how revolutionary BERT was when Google dropped it in 2018. It fundamentally changed how search engines work. Before BERT, search was a lot of keyword matching. After BERT, search became about intent. If I search for "can you get medicine for someone else at the pharmacy," the word "for" is doing a lot of heavy lifting there. An encoder-only model sees that "for" and understands the relational direction between the two people mentioned.
And the way they train these things is fascinating. They use Masked Language Modeling, or MLM. They take a massive corpus of text, hide about fifteen percent of the tokens—mostly by putting a literal [MASK] tag over them, occasionally by swapping in a random word instead—and tell the model, "Figure out what’s under the mask." To do that successfully, the model has to learn the relationships between all the surrounding words. It’s like a giant crossword puzzle where the model gets better and better at understanding the clues provided by the rest of the sentence.
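That masking game can be sketched in plain Python. This is a toy illustration of the idea, not BERT's actual pipeline, which tokenizes into subwords and only replaces 80% of the chosen positions with [MASK]:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Toy BERT-style masking: choose ~15% of positions as prediction targets.
    (Real BERT replaces only 80% of chosen tokens with [MASK], swaps 10% for
    random tokens, and leaves 10% unchanged; here we mask all of them.)"""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model is scored on recovering this
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sentence = "the model learns context by filling in the blanks".split()
masked, targets = mask_tokens(sentence)
# Loss is computed only at the masked positions; every other word is a clue.
```

No labels beyond the raw text are needed, which is what lets this objective run over web-scale corpora.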
But this leads us to the big "Why is BERT not a chatbot?" question. If it understands language so well, why can’t I just ask it to write me a poem about a sloth and a donkey starting a podcast?
Because it’s a prisoner of its own perfectionism. Since it’s trained to look at the whole sentence at once, it doesn’t know how to generate text one word at a time. If you try to make BERT generate text, it gets into this weird feedback loop because it’s expecting to see the future words to understand the current word. It’s like trying to walk by only looking at your destination five miles away; you’ll trip over the curb right in front of you. It lacks "causal" logic. It can fill in a blank, but it can’t continue a thought into the unknown.
It’s essentially a very smart editor that can’t actually write the book. It can tell you everything that’s wrong with a paragraph, but it can’t give you the next one. Which is the perfect transition to the architecture that actually did write the book: the Decoder-only models. This is the GPT family, the Llama family, Claude, and just about every "AI" that the general public interacts with today.
This is the lineage that truly "ate the world." And the reason is almost ironically simple compared to the encoder. While the encoder looks at everything, the decoder is strictly, militantly "causal." It uses Causal Self-Attention, which is a fancy way of saying it is "blind" to the future. When it’s predicting the next word, it can only look at the words that came before it. It’s reading the world exactly like we do—one word at a time, left to right.
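That "blindness to the future" is literally a triangular matrix. A toy numpy sketch, illustrative only:

```python
import numpy as np

T = 5  # sequence length
# Causal mask: position i may attend to positions 0..i and nothing beyond.
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.zeros((T, T))        # toy scores: uniform before masking
scores[~mask] = -np.inf          # the future is literally minus infinity
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Token 0 attends only to itself; token 4 spreads attention over all five.
```

Setting a score to minus infinity before the softmax drives its attention weight to exactly zero, which is the standard way causal masking is implemented.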
It’s funny because, on paper, that sounds like a handicap. You’re telling me that by restricting the model’s view, we actually made it more powerful?
In terms of generation, yes. Because its training objective is Causal Language Modeling—predicting the next token. By doing this over trillions of tokens, these models didn’t just learn how to predict the next word; they accidentally learned logic, reasoning, and world knowledge. It turns out that to accurately predict the next word in a complex physics paper or a legal brief, you actually have to understand the underlying concepts.
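The objective itself is almost embarrassingly simple to sketch. Toy Python, just to show the shape of the training data:

```python
# Causal language modeling: the training data is just text shifted by one.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

inputs  = tokens[:-1]   # what the model has seen so far
targets = tokens[1:]    # what it must predict next

pairs = list(zip(inputs, targets))
# ("the", "cat"), ("cat", "sat"), ... : no human labels needed, which is
# exactly why this objective scales to trillions of tokens of raw text.
```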
And this is where the "Scaling Laws" come in, right? Because I remember you telling me that these decoder-only models scale way more predictably than the other types.
That was the big discovery around 2020 and 2021. Researchers realized that if you just keep adding parameters and feeding more data to a decoder-only architecture, the loss falls along a remarkably smooth power law, which shows up as a straight line on a log-log plot. It’s a very "clean" architecture. You don't have to worry about the complex interactions between an encoder and a decoder stack. You just have one big stack of blocks, and you keep making the stack taller.
There’s also the practical side of running these things. I’ve heard you talk about KV caching—Key-Value caching. Why is that such a big deal for decoders but not for the others?
It’s the secret to why ChatGPT can handle long conversations without melting the server. Since the model only looks at the past, it doesn't need to re-process the entire history of the conversation every time it generates a new word. It can "cache" the mathematical representations of the previous words and just refer back to them. It’s like having a perfect short-term memory of everything that’s been said so far. In an encoder-only model, if you changed one word at the beginning, the entire mathematical representation of every other word would change because it’s bidirectional. In a decoder, adding a word at the end doesn't change the past; it just builds on it.
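A stripped-down sketch of the caching idea, assuming a hypothetical single-head, single-layer setup; real implementations cache per layer and per head, but the shape of the trick is the same:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Toy single-head cache: keys/values for past tokens are stored once and
    reused, so each new token costs one attention row instead of a full
    re-encode of the conversation history."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)            # the past never changes, only grows
        self.values.append(v)
        K = np.stack(self.keys)        # (t, d)
        V = np.stack(self.values)
        w = softmax(q @ K.T / np.sqrt(len(q)))
        return w @ V                   # attention output for the newest token

rng = np.random.default_rng(1)
cache = KVCache()
for _ in range(4):                     # "generate" four tokens
    q = k = v = rng.normal(size=8)
    out = cache.step(q, k, v)
# Four steps, four cached keys; nothing earlier was ever recomputed.
```

This only works because the mask is causal: an old token's key and value never depend on tokens that arrive later, so they can be frozen and reused.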
So, it’s computationally efficient for long-form generation, it scales like a dream, and it turns out that "next-token prediction" is basically the skeleton key for general intelligence. It’s no wonder Meta went all-in on it with Llama. We just saw Llama 4 drop in January 2026, and it’s still sticking to that core decoder-only philosophy, just refined to an insane degree.
It’s the simplicity that won the war. In a decoder-only model, the prompt and the response are treated as one single continuous sequence. There’s no "hand-off" between different parts of the brain. But, we shouldn't act like the original "middle child" is dead. The Encoder-Decoder architecture is still very much alive, even if it’s not the one making the flashy headlines every day.
Right, the third variant. The "Classic" Transformer. This is T5, BART, and the original model from the 2017 paper. It’s got an encoder stack and a decoder stack, and they talk to each other through something called "Cross-Attention."
Think of it as a specialized translation team. The Encoder is the person who reads the English sentence and takes detailed notes. The Decoder is the person who takes those notes and writes the German translation. The "Cross-Attention" is the constant communication between them. As the decoder is writing the German sentence, it’s constantly glancing over at the encoder’s notes to make sure it hasn't missed a nuance.
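That "glancing at the notes" step can be sketched directly. Toy numpy with illustrative shapes only, not any library's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
enc_notes = rng.normal(size=(7, 4))   # encoder output: 7 source tokens
dec_state = rng.normal(size=(3, 4))   # decoder states: 3 target tokens so far

# Cross-attention: queries come from the DECODER, keys and values come from
# the ENCODER. Each target token consults all 7 source tokens on every step.
scores = dec_state @ enc_notes.T / np.sqrt(4)   # (3, 7)
weights = softmax(scores, axis=-1)
context = weights @ enc_notes                   # (3, 4) source-aware context
```

The asymmetry is the whole point: the two stacks stay separate, and cross-attention is the only bridge between them.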
This seems like the "proper" way to do things. It feels more deliberate. So why did it lose the "General AI" crown to the simpler decoder-only models?
It’s mostly about the overhead. When you’re training on trillions of tokens with hundreds of billions of parameters, having two separate stacks that have to be perfectly synchronized adds a lot of engineering complexity. However, for specific tasks—what we call Sequence-to-Sequence or Seq2Seq—the Encoder-Decoder is still arguably superior. Translation is the obvious one. Google Translate still relies heavily on this architecture because it’s fundamentally a task of mapping one complete thought to another complete thought.
And summarization too, right? If I’m giving the AI a twenty-page document and asking for a three-bullet point summary, the Encoder-Decoder architecture makes a lot of sense. The encoder can "digest" those twenty pages holistically, and the decoder can then produce the summary based on that total understanding.
Precisely. And that’s why Google still uses T5—the Text-to-Text Transfer Transformer—for so many of its internal pipelines. It’s a workhorse for things like grammatical error correction or data cleaning. It’s very good at taking a "messy" input and producing a "clean" output of a similar or shorter length. But when it comes to "open-ended creativity" or "reasoning from a prompt," the decoder-only models just have this emergent "magic" that the more rigid encoder-decoder setups struggle to match.
It’s almost like the Encoder-Decoder is a professional translator, while the Decoder-only model is a brilliant conversationalist who just happened to learn how to translate along the way.
That’s a very sharp way to put it. And it brings us back to Daniel’s question about the "comparison" and the "niches." Because while the decoder-only models "ate the world," they haven't actually replaced the others. They’ve just become the face of the movement.
Let’s talk about that "dominance" of BERT in embeddings. I think this is where a lot of modern developers get tripped up. They think, "I have access to the GPT-4 API, why would I bother with a tiny BERT model from 2018 for my search engine?"
And the answer is usually "because you like your money and your users’ time." BERT models are tiny. We are talking 110 million to 340 million parameters. Compare that to the 70 billion or 400 billion parameters of a top-tier Llama or GPT model. BERT is 100 times faster and 100 times cheaper to run. If you have ten million documents that you need to turn into vectors so you can build a search engine, using a massive decoder-only model is like using a rocket ship to go to the grocery store.
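The back-of-envelope arithmetic, treating per-token cost as roughly proportional to parameter count (a common rule of thumb, not an exact benchmark):

```python
# Per-token inference cost scales roughly with parameter count
# (about 2 FLOPs per parameter per token for a dense forward pass).
bert_base = 110e6      # BERT-base parameters
llama_70b = 70e9       # a mid-size modern decoder-only model

cost_ratio = llama_70b / bert_base
# Roughly 636x more compute per token, so "100x cheaper" is conservative.
```

Multiply that by ten million documents and the rocket-ship-to-the-grocery-store comparison stops being a joke.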
Plus, as you mentioned, the bidirectionality. If I have a document about "The impact of trade on the economy," I want the vector to represent the whole thought. A decoder-only model might give too much weight to the end of the sentence because that’s the last thing it saw. The encoder sees the whole thing as a single, unified "vibe," which makes for much better search results.
It’s the difference between a "summary" and a "trajectory." A decoder-only representation is essentially a trajectory of where the sentence was going. An encoder-only representation is a summary of where the sentence is. For retrieval, you want the summary. For generation, you want the trajectory.
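The summary-versus-trajectory distinction can be sketched as two different poolings over the same hidden states. Illustrative numpy, not any specific embedding library's code:

```python
import numpy as np

rng = np.random.default_rng(3)
hidden = rng.normal(size=(6, 4))   # per-token hidden states, 6-token text

# Encoder-style "summary": mean-pool over all positions, so every word
# contributes equally to the final vector.
summary_vec = hidden.mean(axis=0)

# Decoder-style "trajectory": take the last token's state, so the vector is
# biased toward wherever the sentence happened to end.
trajectory_vec = hidden[-1]
```

Real embedding models often use a special [CLS] token or learned pooling rather than a plain mean, but the contrast between "all positions equally" and "last position only" is the one that matters for retrieval.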
So we’ve got this map now. If you’re building a search engine or a sentiment analyzer—anything where you need to "understand" a fixed piece of text—you grab an Encoder-only model like BERT or RoBERTa. If you’re building a chatbot, a creative writer, or a general-purpose reasoning engine, you go Decoder-only like GPT or Llama. And if you’re doing high-fidelity translation or specific "input-to-output" mapping, you might still look at an Encoder-Decoder like T5.
That’s the triage. And it explains why the landscape looks the way it does in 2026. We haven't converged on one single "God Model" architecture. Instead, we’ve specialized. Even within a single application, like a modern AI assistant, you’re often using multiple architectures. When you talk to an AI, an encoder model might be used to vectorize your query and search a database, and then a decoder model takes the results of that search and writes the response.
It’s a team effort. The sloth does the deep thinking and the donkey does the heavy lifting. I’ll let you decide which architecture is the sloth in this scenario.
Oh, the encoder is definitely the sloth. It sits there, takes in the whole environment at once, doesn't rush to any conclusions until it’s seen the whole picture. The decoder is the donkey—it’s just constantly moving forward, one step at a time, predicting the next hoof-fall based on the last one.
I actually really like that. It’s surprisingly accurate. But it does make me wonder about the future. We’re sitting here in 2026, and these three variants have been the standard for years. Do you see a "v4" on the horizon? Something that blends these together even more seamlessly?
There is a lot of research into "PrefixLM" designs that try to blur these lines, alongside "Mixture of Experts" approaches that change how capacity scales. For example, some models use a "prefix" where they act like an encoder for the prompt—looking at it bidirectionally—and then switch to a decoder mode to generate the response. It’s an attempt to get the best of both worlds: the deep understanding of BERT with the generative power of GPT.
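That prefix trick is just a hybrid attention mask. A toy numpy sketch, illustrative only, not any production model's masking code:

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """PrefixLM-style mask: the prompt (prefix) is fully visible to itself in
    both directions, while generated tokens remain strictly causal."""
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # causal base
    mask[:prefix_len, :prefix_len] = True   # prefix attends bidirectionally
    return mask

m = prefix_lm_mask(prefix_len=3, total_len=6)
# Token 0 (in the prompt) can see tokens 1 and 2: encoder-like.
# Token 4 (generated) sees positions 0..4 but not 5: decoder-like.
```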
Which makes sense. Because as these models get more complex, the "hand-off" between an embedding model and a generation model becomes a bottleneck. If you can do it all in one "brain," you save a lot of latency.
Latency is the big one. And it’s actually a good moment to thank Modal for sponsoring the show. They’re the ones providing the GPU credits that allow us to run these kinds of experiments. When you’re trying to figure out if an encoder-decoder setup is faster than a decoder-only setup for a specific task, you need that instant access to compute.
And speaking of "doing it all," we should probably look at some of the practical takeaways for people who aren't necessarily training these models but are building with them. Because the choice of architecture really does dictate what your "app" is going to be capable of.
The biggest takeaway is: don't over-engineer. If your task is "is this email spam or not?" do not use a generative LLM. It’s overkill, it’s expensive, and it’s often less accurate than a small, fine-tuned BERT model. BERT was literally designed for classification. It’s a specialist. Use the specialist for the specialized task.
And conversely, don't expect a specialist to be a generalist. I see so many people trying to "prompt engineer" their way into making an embedding model act like a chatbot. It’s like trying to teach a cat to bark. It’s just not what the underlying math is built for. If you need a narrative, you need a causal model.
Another interesting takeaway is for the RAG builders. If your search results are coming back "weird," it’s often because your embedding model—your encoder—doesn't match the "style" of your generation model. There’s a lot of value in using "paired" architectures where the encoder and decoder were trained on similar datasets.
It’s about the vocabulary of the "latent space," right? If the encoder thinks the word "bank" means a river, but the decoder thinks "bank" means a building, you’re going to get some very confused AI responses.
That’s the "alignment" problem at a very granular level. And it’s why understanding these three variants is so crucial. You’re not just picking "an AI"; you’re picking a specific way of processing information. Do you want the "whole-sentence" perspective, the "next-word" perspective, or the "translator" perspective?
It really does come back to that Daniel prompt. We finally have a clear answer to "Why is BERT not a chatbot?" It’s not a lack of "intelligence"; it’s a lack of a "future." It sees everything, so it has nowhere to go. GPT sees only the past, so it’s forced to create the future.
That’s almost poetic, Corn. I didn't know you had that in you.
I have my moments. Usually between naps. But I think we’ve really demystified the "Trinity" here. We’ve gone from the bidirectionality of the encoder, to the causality of the decoder, to the cross-attention of the original transformer. It’s a logical progression of how we’ve learned to "slice" the problem of human language.
It’s an evolution of efficiency. We took a very complex, "do-it-all" machine from 2017 and realized we could make it a thousand times more effective by just picking the parts we actually needed for a given task. The "Decoder-only" dominance isn't a sign that it’s the "best" math; it’s a sign that it’s the most "scalable" math for the things we care about right now—which is talking to machines.
But if the focus shifts back to "understanding" and "knowledge retrieval"—which it might, as we get buried in AI-generated content—we might see a "BERT Renaissance." We might suddenly find that we need those inspectors more than we need the builders.
I think that’s already happening. The "quality" of the data going into these models is becoming the bottleneck, and guess what we use to filter and categorize that data? Encoders. The builders rely on the inspectors to give them the right materials.
It’s the circle of life. Or the circle of transformers. Anyway, I think we’ve covered the ground here. We’ve looked at the three variants, we’ve settled the "chatbot war" debate, and we’ve given BERT its flowers for still being the king of embeddings.
It’s been a good deep dive. I always love getting back to the architectural basics because it makes the "magic" of the high-level apps feel a lot more grounded in reality. It’s not magic; it’s just very clever attention masking.
Well, on that note, we should probably wrap this one up. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thanks to Modal for powering our research and development.
This has been My Weird Prompts. If you found this breakdown of the Transformer Trinity useful, the best thing you can do is leave us a review on Apple Podcasts or Spotify. It actually helps more than you’d think in getting these deep dives in front of more people.
Or you can just find us at myweirdprompts dot com for the full archive. There are nearly two thousand episodes in there now, which is a lot of talking for a sloth and a donkey.
We’ve got a lot to say, Corn.
Clearly. Alright, we’ll see you all in the next one.
Take care, everyone.