#3767: How LLMs Actually Learn: Stages or Slurry?

Do large language models learn grammar first, then facts? The honest answer is messier and more fascinating.

Episode Details

Episode ID: MWP-3946
Published: Jun 20
Duration: 29:45
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: large-language-models ai-training emergent-abilities

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When training a large language model from scratch, the process doesn't follow clean, engineered stages. Instead, what you see is overlapping waves of learning. In the first few hundred million tokens, the model absorbs token statistics and basic word order. By a few billion tokens, syntax is largely in place, but factual associations are already climbing alongside it, just at a slower rate. The hosts cite work, which they attribute to Anthropic, finding that factual knowledge and syntactic ability improve in tandem from day one, not sequentially.

This parallel learning happens because the training corpus is a single undifferentiated mixture of internet text, books, and code. The model optimizes a single objective, predicting the next token, and the loss function doesn't distinguish between learning grammar and memorizing a fact. The "stages" people describe are emergent phenomena observed after the fact, not engineered phases. The data distribution itself provides a natural curriculum: simple declarative sentences appear far more frequently than dense legal reasoning, so the model sees simpler patterns more often and fits them first.

The picture is complicated by phenomena like grokking, where a model suddenly generalizes after seeming to only memorize, and by phase transitions at certain scale thresholds, where capabilities like multi-step reasoning appear suddenly once parameter counts cross specific boundaries. Internally, the model's representations start diffuse and gradually crystallize into structured features corresponding to coherent concepts. The training process is simultaneously parallel (everything optimizes at once) and staged (different capabilities crystallize at different points along the loss curve), all driven by the single pressure of predicting what comes next.


Daniel sent us this one — he's been thinking about what actually happens when you train a large language model from absolute scratch, a hundred percent ab initio. The core question is: does that initial training unfold in stages? Do the weights have to grasp primitive concepts first — basic syntax, object permanence, cause and effect — before graduating to more complex reasoning? Or does it all happen in parallel, a kind of simultaneous blooming where everything just gets sharper together? He wants us to walk through what the training process actually looks like in practice.

This is one of those questions where the honest answer is more interesting than the tidy one. Because there is a tidy answer — people love saying "it learns grammar first, then facts, then reasoning" — and it's not entirely wrong, but it's also not really what the loss curves show. What actually happens is messier and, I think, more revealing about what these models are.

Let's start with what we can actually observe. During pre-training, you're monitoring the loss — how surprised the model is by the next token. That loss drops fast in the first few billion tokens, then the curve bends and you get diminishing returns. But if you probe what the model can actually do at different checkpoints, you don't see clean stage transitions. You see overlapping waves.

At the very earliest checkpoints, the model is basically learning token statistics — which tokens tend to co-occur, basic word order. That happens in the first few hundred million tokens. By the time you're a couple billion tokens in, it's got a rough syntactic sense. But here's the thing — it's already picking up some factual associations at the same time, even before the syntax is solid.

It's not "first grammar, then facts." It's grammar and facts racing each other, with grammar getting out of the gate faster.

And there's reported work on this. As I recall it, a group at Anthropic looked at this with Claude and found that factual knowledge and syntactic ability improve in tandem, just at different rates. The syntax curve is steeper early on, but they're both climbing from day one.

Which makes intuitive sense if you think about how the training data works. A sentence like "Paris is the capital of France" is teaching you word order and a fact simultaneously. The model doesn't know it's supposed to separate those lessons.

It doesn't even have a concept of "separating" them. That's the key. The loss function doesn't distinguish between getting the next word right because you understand syntax or because you memorized a fact. It's just minimizing surprise across all tokens. So the optimization is pulling on everything at once.
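The point about the loss function can be sketched in a few lines. This is a minimal illustration, not how any particular lab computes its loss: the per-token probabilities below are made up, and the "syntax" and "fact" labels in the comments are only there for the reader. The function itself never sees them.

```python
import math

def next_token_loss(token_probs):
    """Mean negative log-probability assigned to each correct next token (in nats)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities a model gives the correct next token while reading
# "Paris is the capital of France". All values are invented for illustration.
probs = {
    "is": 0.60,       # reader's label: mostly a syntax prediction
    "the": 0.70,      # reader's label: mostly a syntax prediction
    "capital": 0.20,  # reader's label: partly factual
    "of": 0.90,       # reader's label: mostly a syntax prediction
    "France": 0.05,   # reader's label: mostly a factual prediction
}

# The loss only receives the numbers; it has no notion of which kind of
# knowledge each prediction depends on.
loss = next_token_loss(list(probs.values()))
print(f"mean loss: {loss:.3f} nats")
```

Every token contributes to the same average, so a gradient step that reduces this number pulls on grammar and facts together.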

The "stages" people talk about are more like emergent phenomena we observe after the fact, not something engineered into the process.

That's the crucial point. Nobody is sitting there saying "phase one, learn grammar, phase two, learn geography." The training corpus is a single undifferentiated slurry of internet text, books, code, and whatever else got scraped. The model is just trying to predict the next token, and out of that pressure, structure emerges.

The technical term.

It's the most accurate term I've got. But here's where it gets genuinely interesting — there is a phenomenon called "grokking" that complicates the picture.

That's the thing where a model suddenly gets something after seeming not to for a long time.

You'll see a model memorize the training set while validation accuracy stays flat, and then, sometimes much later in training, it suddenly generalizes. The validation accuracy jumps. It's like it was storing facts and patterns separately and then finally integrated them.

Which sounds almost like a stage transition, but one that happens unpredictably and not at a fixed point.

It varies by capability. Arithmetic reasoning might "grok" at a different point than, say, understanding narrative structure. There's a paper from early twenty twenty-two, the "Grokking" paper by Power and others, that showed this in small transformers on algorithmic tasks. But the same dynamic appears to happen in large models, it's just harder to isolate because everything is happening simultaneously across billions of parameters.

If I'm understanding this right, the actual training process is: dump everything in at once, watch the loss go down, and discover after the fact that the model kind of sorted itself into something resembling stages, but not really.

That's the broad picture. But there's one more layer I want to add, because it gets at what Daniel was really asking about — the "primitive concepts before complex ones" idea.

There's a concept from the literature called "phase transitions" in neural network training. As you scale up compute and data, certain capabilities don't improve gradually — they appear suddenly at certain scale thresholds. This was the core finding of the "Emergent Abilities" paper from twenty twenty-two.

Right — the idea that below a certain parameter count, the model simply can't do something, and above it, it can.

And what's relevant here is that the capabilities that emerge first at smaller scales — or earlier in training — tend to be the more "primitive" ones. Basic linguistic competence, simple factual recall, shallow pattern matching. The things that require deeper reasoning or multi-step inference tend to emerge later in training, or only at larger scales.

The "stages" are real in the sense that there's a hierarchy of difficulty, and training naturally climbs that hierarchy — but not because anyone designed a curriculum. It's just that easier things are learned faster.

That's the cleanest way to put it. The model is climbing a gradient, and the low-hanging fruit — statistically — is basic syntax and common facts. The harder stuff requires more subtle loss landscape navigation, so it takes longer.

Let me push on something. You said "dump everything in at once." But don't some labs actually do curriculum learning? Start with simpler data, then move to more complex?

Some do, and it's an active area of research. But the dominant approach for the largest models (GPT-4, Claude, Gemini), as far as has been publicly described, is still to train on a massive mixed corpus without explicit curricular ordering. The argument is that the data distribution itself provides a natural curriculum, because simple sentences are more common than complex ones.

So the corpus is self-curriculating.

Basic declarative sentences appear far more frequently in the training data than, say, dense legal reasoning or mathematical proofs. So at every point in training the model has seen many more examples of the simple patterns, and it fits them first simply because they're statistically dominant. You don't need to sequence the data. The distribution does it for you.
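A back-of-the-envelope sketch of that frequency argument, assuming uniformly shuffled data. The pattern shares and the 10,000-exposure target are invented numbers, chosen only to show the size of the gap.

```python
# Assumed, made-up share of training sequences that contain each pattern type.
pattern_share = {
    "simple declarative sentence": 0.30,
    "dense legal reasoning": 0.001,
}

def sequences_until(share, target_exposures):
    """Expected number of shuffled sequences needed to see a pattern target_exposures times."""
    return target_exposures / share

for name, share in pattern_share.items():
    needed = sequences_until(share, 10_000)
    print(f"{name}: ~{needed:,.0f} sequences to reach 10,000 exposures")
```

Under these assumptions the rare pattern needs hundreds of times more training before it has been seen as often, even though nothing in the data order favors either one.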

Which means the answer to "does it happen in stages or all in parallel" is: both. It's parallel in the sense that everything is being optimized at once from the same loss signal. But it's staged in the sense that different capabilities crystallize at different points along the loss curve.

That's it. And I want to add one more piece that I think is under-discussed. The learning isn't just about what the model knows — it's about how the representations are structured internally. Early in training, the model's internal representations are diffuse. Concepts are smeared across neurons. As training progresses, you start to see more structured representations — what researchers call "features" — that correspond to coherent concepts.

This relates to the whole mechanistic interpretability thing.

The Anthropic team has done work showing that you can find specific neurons or directions in activation space that correspond to concepts like "the Golden Gate Bridge" or "deception." But those features don't exist at the start of training. They crystallize over time as the model compresses more and more data into its weights.

The training process is also a process of conceptual crystallization. The model starts with a blurry statistical soup and gradually resolves it into distinct concepts.

That's a beautiful way to put it. And the practical implication is: you can't really separate "learning language" from "learning facts" from "learning reasoning." They're all tangled up because the model is learning a single thing — how to predict the next token — and everything else is a byproduct.

Let's get concrete. Walk me through what the first few hours of training actually look like. If I'm staring at the loss curve, what am I seeing?

In the very beginning — the first few steps — the model is essentially random. The loss is sky-high. It's guessing tokens with no structure whatsoever. Within the first few hundred steps, it starts picking up extremely local statistics. Which words tend to follow which other words. Basic punctuation patterns.
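"Sky-high" has a concrete reference point. If the model spreads probability evenly over its vocabulary, which is roughly what a freshly initialized model does, the cross-entropy loss equals the natural log of the vocabulary size. The vocabulary size below is an assumption for illustration, not a figure from the episode.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a model that guesses uniformly over the vocabulary."""
    return -math.log(1.0 / vocab_size)

vocab_size = 50_000  # assumed vocabulary size, for illustration only
print(f"loss at uniform guessing: {uniform_loss(vocab_size):.2f} nats")
```

With a vocabulary of that size the starting point is a little under 11 nats per token, and the early drop in the curve is the model moving away from that baseline as it picks up local statistics.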

The model discovers periods.

It learns that sentences end. It learns that capitalization matters. These sound trivial but they're the first rungs on the ladder. Within a few thousand steps, you'll see it start to produce vaguely grammatical fragments. Not coherent, but you can see the skeleton of English emerging.

This is all just from next-token prediction pressure.

All just from that. The model has no explicit grammar rules. No part-of-speech tagger. It's learning everything from the statistical structure of the text. And the remarkable thing is that by the time you're a few billion tokens in, the syntax is mostly there. It might still make errors — subject-verb agreement in complex sentences, that kind of thing — but the basic architecture of English is in place.

Meanwhile, it's also learning that Paris is in France and that water is wet.

And here's where the "slurry" aspect matters. Because the model doesn't know which of those facts are important or reliable. It's learning everything at the same weight. The capital of France and some random Reddit user's incorrect opinion about the capital of France are both in the training data, and the model has to sort out which signal is stronger.

Which is why models sometimes confidently assert nonsense. They learned it from a source that looked just as authoritative as any other.

The training process has no ground truth. It only has statistical patterns. "Truth" is just whatever is repeated consistently enough across the corpus.

This connects to something I've been wondering about. We talk about "emergent capabilities" — reasoning, theory of mind, whatever. Are those actually new capabilities that weren't there before, or are they just the same underlying pattern-matching getting good enough to look like something else?

This is one of the deepest debates in the field. The "emergent abilities" framing suggests a qualitative shift — the model couldn't do this at all, and then suddenly it can. But there's been pushback. A paper from twenty twenty-three argued that what looks like emergence is often an artifact of the metrics — if you use a continuous measure rather than a binary "can it do the task" measure, you often see gradual improvement rather than a phase change.
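The metric argument can be illustrated with simple arithmetic. Suppose a task is scored as correct only when every token of the answer is right, and suppose, as a simplifying assumption, that per-token errors are independent. The answer length and accuracy values are invented.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

answer_length = 10  # assumed answer length in tokens
for acc in (0.5, 0.7, 0.9, 0.95, 0.99):
    em = exact_match_rate(acc, answer_length)
    print(f"per-token accuracy {acc:.2f} -> exact match {em:.3f}")
```

Per-token accuracy climbs smoothly across that range, while the all-or-nothing score stays near zero for most of it and then rises steeply. The same underlying improvement looks gradual under one metric and sudden under the other.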

It might just be that we're bad at measuring.

But I think there's something real there too. When you look at what happens with chain-of-thought reasoning, for example, small models cannot do multi-step reasoning in a reliable way. Larger models can. There does seem to be a threshold where the model goes from "pattern matching individual steps" to "being able to chain steps together coherently."

That threshold isn't designed. It just happens.

It just happens. And we don't fully understand why. That's what makes this field so fascinating and slightly unsettling.

If I'm training a model from scratch — let's say I've got the compute and the data — what decisions do I actually make about the training process? What knobs do I turn?

The biggest knob is the data mixture. You decide what proportion of your corpus is web text versus books versus code versus academic papers. That's the closest thing we have to a "curriculum" — you're shaping what the model sees, even if you're not sequencing it.
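A minimal sketch of what a data mixture means in practice: each training document is drawn from a source in proportion to a weight. The weights here are assumptions for illustration. The episode does not give any lab's real proportions.

```python
import random

# Assumed mixture weights, for illustration only.
mixture = {"web": 0.60, "books": 0.15, "code": 0.15, "papers": 0.10}

def sample_sources(mixture, n, seed=0):
    """Draw n source labels in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_sources(mixture, 10_000)
for name in mixture:
    print(f"{name}: {batch.count(name) / len(batch):.3f}")
```

Nothing is sequenced here. Changing the weights changes how often the model sees each kind of text throughout the run, which is the sense in which the mixture acts as the closest thing to a curriculum.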

Different mixtures produce different capabilities.

Code in the training data seems to improve reasoning capabilities, even on non-code tasks. That's one of the key findings from the last few years. Models trained with code are better at logical reasoning. It's like code provides a kind of formal structure that helps the model learn to think in steps.

Which makes sense. Code is the purest form of "if this, then that" that exists in text.

It's structured, it's logical, it's self-consistent. Training on code seems to build something like a reasoning scaffold in the model's representations.

What about the order of data? You said the dominant approach is to just mix everything together, but are there exceptions?

Some labs do "annealing" — towards the end of training, they shift the data distribution towards higher-quality sources. So the bulk of training is on the full internet slurry, but the final phase emphasizes books, academic papers, curated sources. The idea is to polish the model's outputs and improve factual reliability.
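One way to picture annealing is as a mixture schedule: the weights stay fixed for most of the run, then move toward a higher-quality mix near the end. This is a sketch under assumptions. The linear interpolation, the 90 percent starting point, and all the weights are hypothetical, not a description of any lab's schedule.

```python
def annealed_mixture(base, final, progress, anneal_start=0.9):
    """Return mixture weights at a given training progress in [0, 1].

    Before anneal_start the base mixture is used unchanged; after it the
    weights move linearly toward the final mixture.
    """
    if progress <= anneal_start:
        return dict(base)
    t = (progress - anneal_start) / (1.0 - anneal_start)
    return {k: (1 - t) * base[k] + t * final[k] for k in base}

# Assumed weights, for illustration only.
base = {"web": 0.70, "books": 0.15, "papers": 0.15}
final = {"web": 0.20, "books": 0.40, "papers": 0.40}

for progress in (0.5, 0.9, 0.95, 1.0):
    print(progress, annealed_mixture(base, final, progress))
```

Because both mixtures sum to one, every interpolated mixture does too, so the schedule can feed a sampler directly at any point in training.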

There is a kind of staged curriculum, but it's at the tail end, not the beginning.

And there's also "continued pre-training," where you take a model that was trained on general data and then train it further on domain-specific data — legal documents, medical literature, whatever. That's not ab initio training, but it's a kind of staged approach.

For the genuine from-scratch training, the answer is: it's a single undifferentiated process where everything gets learned in parallel, with simpler patterns emerging earlier simply because they're more common and easier to fit.

That's the core answer. And I think it's worth saying explicitly that this is counterintuitive. If you were designing a curriculum for a human, you'd absolutely teach grammar before rhetoric, basic facts before complex analysis. The fact that models can learn everything simultaneously, and that this works well enough to be the default at the largest scale, tells us something about how different these systems are from human learners.

The simultaneous blooming, as you put it.

There's no "sensitive period" for syntax the way there is in human language acquisition. The model doesn't need to learn grammar before it can learn facts. It just needs enough data and enough compute, and everything rises together.

Which brings me to a question that's been nagging at me. If everything is learned in parallel, why do we see these apparent phase transitions? Why does reasoning suddenly "click" at a certain scale?

There are a few theories. One is that reasoning requires a certain minimum representational capacity. Below some threshold, the model simply doesn't have enough parameters to encode the patterns needed for multi-step inference. Once you cross that threshold, the model can represent those patterns, and the capability emerges.

It's not that the model suddenly "learns" reasoning at step ten million. It's that the capacity was always there, latent, but the representations weren't refined enough to support it until that point.

That's the leading hypothesis. Another theory is that reasoning is compositional — it's built out of more primitive capabilities that have to be learned first. You can't do multi-step reasoning until you can do single-step pattern matching reliably. And single-step pattern matching takes time to get right.

Even in the "everything in parallel" picture, there's a dependency structure. Some capabilities are prerequisites for others.

And this is where the "waves" metaphor I used earlier comes in. The waves are overlapping, but they're not all at the same height at the same time. Basic syntax peaks early. Factual recall peaks in the middle. Complex reasoning rises later and keeps improving even after syntax has plateaued.

Let's talk about what this means for the practical question of training a model. If I'm a lab deciding how to allocate my compute budget, does this parallel-learning picture change how I think about training?

It does, and this is where the economics get interesting. If capabilities emerge in parallel, then the optimal strategy is to train on the broadest, most diverse dataset you can for as long as you can afford. You don't want to spend a lot of time on a "grammar phase" because grammar will get learned anyway as a byproduct of learning everything else.

The compute is better spent on diversity than on sequencing.

And this is why the major labs have converged on the "dump everything in" approach. It's not just that it works — it's that it's the most efficient use of compute. Any time you spend on a narrow curriculum is time you're not spending on the full distribution.

Which also explains why data quality matters so much. If the model is learning everything in parallel, then bad data is poisoning everything simultaneously.

That's a really important point. In a staged curriculum, you could theoretically clean up the data for each stage. In parallel training, the data quality affects everything at once. A million documents of SEO spam aren't just teaching the model bad facts — they're teaching it bad syntax, bad reasoning, bad everything.

The slurry has to be clean slurry.

The cleanest slurry money can buy. And that's a huge part of what differentiates the major labs now. Everyone has access to roughly the same model architectures. The secret sauce is increasingly in the data processing pipeline — how you filter, how you deduplicate, how you balance sources, how you handle multiple languages.

I want to circle back to something you mentioned earlier — the grokking phenomenon. You said models sometimes memorize first and generalize later. How does that fit into the parallel-learning picture?

Grokking is fascinating because it looks like a stage transition but happens without any change in the training data or the learning rate. The model just... figures it out, eventually. The leading explanation is that the model first finds a "memorization solution" — a set of weights that fits the training data but doesn't generalize — and then, through continued training, gradually moves towards a "generalizing solution" that captures the underlying pattern.

It's like the model takes a shortcut and then slowly replaces it with the real thing.

And the reason this is relevant to our discussion is that it suggests there are multiple solutions to the loss-minimization problem, and the model can transition between them over the course of training. The "stages" aren't about the data — they're about the optimization dynamics.

Which means the training process isn't just "learn syntax, then facts, then reasoning." It's more like "find a locally good solution, then find a better one, then find an even better one." And those solutions might involve different mixtures of memorization and generalization.

And this connects to the broader question of what "understanding" means in these models. When a model gets better at reasoning, is it actually developing new capabilities, or is it just finding a more compressed representation of the patterns that were always in the data?

Is there a meaningful difference between those two things?

I'm not sure. And I think that's the right answer. If the model can reliably produce correct reasoning chains, does it matter whether we call that "understanding" or "compressed pattern matching"? From a practical standpoint, the capability is what matters.

The pragmatist's answer.

I'm a retired pediatrician. I care about outcomes.

Let's get more specific about the internal mechanics. You mentioned that early in training, the model's representations are diffuse, and they crystallize over time. What does that actually look like if you're poking around inside the model?

This is where mechanistic interpretability comes in. Researchers have found that early in training, individual neurons respond to a mishmash of unrelated inputs. One neuron might fire for the word "bank," for mentions of dogs, and for sentences about weather. It's a mess.

As one would expect from random initialization.

But as training progresses, you start seeing what they call "monosemantic" neurons — neurons that respond to a single coherent concept. Not perfectly, but increasingly. The model is disentangling its representations.

This happens without anyone telling it to disentangle.

Purely from the optimization pressure. The model discovers that having clean, separable representations makes next-token prediction easier. It's an emergent organizational principle.

That's remarkable. The model invents its own ontology.

And the ontology it invents is shaped by the training data. Train on English text, and you get neurons for English concepts. Train on code, and you get neurons for programming constructs. Train on both, and you get an interleaved representation that captures both domains.

Which brings us back to the parallel-learning point. The model isn't just learning English and code in parallel — it's building a unified representational space where English concepts and code concepts coexist and interact.

That interaction is where some of the most interesting capabilities come from. The model can reason about code in English, or express English concepts in code, because the representations are shared.

The training process is building a kind of conceptual lingua franca.

That's a lovely way to put it. A conceptual lingua franca that emerges from the pressure to predict the next token across a diverse corpus.

I want to ask about one more thing, and it's the practical question of checkpoints. Daniel mentioned in his prompt that incremental releases are often checkpoints on an initial training base. How does that work, and does it relate to the staging question?

When a lab trains a model from scratch, they save checkpoints periodically — snapshots of the weights at different points in training. These checkpoints represent the model at different stages of capability. And yes, sometimes what gets released as "version two point one" or whatever is just a later checkpoint of the same training run, possibly with some additional fine-tuning.
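The checkpoint idea can be shown with a toy training loop. This is only a sketch: the "model" is a single number fitted by gradient descent on a made-up quadratic loss, and the snapshot interval is arbitrary. Real runs save full weight tensors to disk, but the shape of the loop is the same.

```python
def train_with_checkpoints(steps, checkpoint_every, lr=0.1):
    """Toy gradient descent on loss = (w - 3)^2 that saves periodic weight snapshots."""
    w = 0.0
    checkpoints = {}
    for step in range(1, steps + 1):
        grad = 2 * (w - 3.0)          # derivative of (w - 3)^2
        w -= lr * grad
        if step % checkpoint_every == 0:
            checkpoints[step] = w     # snapshot of the "weights" at this step
    return checkpoints

def loss(w):
    """Loss of the toy model for a given weight value."""
    return (w - 3.0) ** 2

ckpts = train_with_checkpoints(steps=30, checkpoint_every=10)
for step, w in ckpts.items():
    print(f"step {step}: loss {loss(w):.6f}")
```

Each snapshot comes from the same run and the same architecture, and in this toy each later one has a lower loss than the one before it.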

The "stages" we've been talking about — the waves of capability — are literally captured in these checkpoints. A checkpoint at ten billion tokens is worse at reasoning than a checkpoint at a hundred billion tokens, even though they're the same architecture.

And labs can choose to release intermediate checkpoints as smaller or cheaper models. A "lite" version might just be an earlier checkpoint.

Which is kind of elegant. The same training run produces a whole family of models at different capability levels, just by saving snapshots.

It's efficient. You do one expensive training run and get multiple products out of it. That's part of why the economics of AI labs work the way they do.

To pull this all together — and I want to make sure we actually answer what was asked — the training process from scratch is fundamentally parallel. Everything gets learned at once from a single loss signal. But within that parallelism, there's a natural ordering driven by statistical frequency and representational complexity. Simpler patterns crystallize earlier. More complex capabilities emerge later, sometimes through grokking-like transitions. And the whole thing produces a gradient of checkpoints that capture the model at different points along this journey.

That's a very clean summary. I'd add only one nuance: "simpler" here doesn't mean "simpler for humans." The model's notion of simplicity is determined by the loss landscape, not by our intuitions about what should be easy or hard.

The model might find factual recall "simpler" than subject-verb agreement, even though humans would say the opposite.

In fact, we see exactly that in some cases. Models can memorize obscure facts before they've fully nailed down complex grammatical constructions. The learning order is not the human learning order.

Which is part of why these systems are so alien, even though they're built from our own text.

They're a funhouse mirror. They reflect our language back at us, but the internal organization is nothing like a human mind.

The fact that this works — that you can get coherent reasoning, factual knowledge, and grammatical fluency all from a single undifferentiated training process — is still kind of astonishing.

It really is. The fact that "predict the next token" is a rich enough objective to produce all of this is one of the most surprising scientific findings of the last decade.

Almost like the structure of language itself contains the structure of thought.

That's a very old philosophical idea, actually. The Sapir-Whorf hypothesis, linguistic determinism. I'm not saying the success of language models proves linguistic determinism, but it certainly suggests that language encodes more cognitive structure than we might have assumed.

If you can extract reasoning from pure text prediction, then reasoning must be, at some level, latent in the structure of text.

Or at least, the patterns of text are sufficiently correlated with reasoning that learning one gives you the other as a free byproduct.

Which is a deeply strange fact about the universe.

And I think we're still in the early days of understanding why it works as well as it does. The empirical results have outpaced the theory.

As they often do.

Before we wrap, I want to ask about one practical implication. If training is parallel and self-organizing, does that mean there's no point in trying to design better curricula? Or is there still room for improvement?

There's absolutely room for improvement. The current approach works, but it's almost certainly not optimal. We're just dumping data in and hoping for the best. There's active research on data ordering, on dynamic data selection during training, on methods to accelerate the grokking process. I think the next generation of training methods will be more sophisticated.

The "slurry" approach might be a local maximum, not the final answer.

That's my bet. We're in the "just make it bigger" era, but eventually we'll figure out how to be smarter about it.

That's where the field is headed.

There's a whole line of work on what's called "data curriculum" — not a fixed curriculum designed by humans, but an adaptive curriculum where the model's own learning dynamics determine what data it sees next. The model is essentially teaching itself by selecting the data that would be most informative at its current stage of development.
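One simple way such a selection rule could look is sketched below. The episode does not name a specific method, so this loss-based rule is an assumption: it treats the examples the current model finds most surprising as the most informative. The example ids and loss values are hypothetical.

```python
def select_most_informative(example_losses, k):
    """Pick the k example ids with the highest current loss (one simple selection rule)."""
    ranked = sorted(example_losses, key=example_losses.get, reverse=True)
    return ranked[:k]

# Hypothetical per-example losses under the current model.
current_losses = {"ex_a": 0.2, "ex_b": 2.9, "ex_c": 1.1, "ex_d": 0.05}
print(select_most_informative(current_losses, k=2))
```

A rule this naive would also favor noisy or mislabeled data, since those examples stay surprising no matter how long the model trains, which is one reason practical selection methods need to be more careful than this sketch.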

That's the self-curriculating thing again, but now the model is actively choosing rather than passively receiving.

And early results suggest this can significantly accelerate training. You can reach the same performance with less compute by being smarter about data ordering.

Which would matter a lot given how expensive these training runs are.

A state-of-the-art training run can cost tens or hundreds of millions of dollars. Even a twenty percent efficiency improvement is real money.

The "stages" question isn't just academic. Understanding how capabilities emerge during training could directly lead to cheaper, better models.

That's the practical stakes. The better we understand the dynamics of learning, the more efficiently we can train.

I feel like we've covered the core question pretty thoroughly. The short answer is "it's parallel, but with natural ordering," and the long answer is everything we just said.

I'd add one closing thought. The fact that we can have this conversation — that we can probe these training dynamics and talk about grokking and phase transitions and representational crystallization — is itself a testament to how far the field has come. Ten years ago, none of this was on anyone's radar. Now we're debating the fine-grained learning dynamics of systems that can hold coherent conversations.

The pace is hard to keep up with.

And I say that as someone who tries very hard to keep up.

You do an admirable job, Herman.

Thank you, Corn. That means a lot coming from you.

Don't let it go to your head.

Wouldn't dream of it.

And now: Hilbert's daily fun fact.

Hilbert: In the eighteen sixties, scientists on the Chatham Islands used a device called a gold-leaf electroscope to detect cosmic rays, measuring the rate at which the leaves discharged — and they expressed their results in "grains of divergence per hour," a unit so obscure that converting it to modern sieverts requires first converting grains to a nineteenth-century apothecary measurement called the scruple, which itself was defined differently in England and the Chatham Islands at the time.

I have several.

This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this episode, leave us a review wherever you get your podcasts — it helps. We'll be back soon with more.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
