#3596: Why an AI Model Kept Calling Itself Sonnet 4.6

When a Chinese model insists it's "Sonnet 4.6," is it theft, sloppy training, or something stranger?

Episode Details

Episode ID: MWP-3773
Published: Jun 15
Duration: 27:45
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: large-language-models fine-tuning training-data

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

A listener, Daniel, tested several AI models with a system prompt that asked each to identify itself by name. Most models answered correctly. But one commercial Chinese model, which he declined to name, consistently answered "Sonnet 4.6," a specific version of Anthropic's Claude. The Reddit consensus was that this might be evidence the model had been fine-tuned from Sonnet's weights and never internalized its new branding.

The self-identification test, however, is more nuanced than it appears. It's reliable in one direction and unreliable in the other: if a model correctly identifies itself, it means the fine-tuning team did their job; if it gets it wrong, there are multiple possible explanations. Models don't introspect — they perform next-token prediction based on training distribution. Identity tokens are surprisingly sticky; a paper from Harvard and Oxford called this "identity drift" in fine-tuned models, showing that base-model identity associations don't vanish with standard fine-tuning unless explicitly trained against.

Three explanations compete for Daniel's specific case. First, the model may have been fine-tuned directly from Sonnet 4.6 weights with no identity-overwrite pass — the most straightforward interpretation. Second, the model may have been trained via distillation, using Sonnet 4.6 as a teacher to generate training examples, and Sonnet's self-references bled into the student's training data. Third, system prompt contamination: if the fine-tuning dataset included conversations where Sonnet's identity-establishing system prompts were present, the model learned that "I am Sonnet 4.6" is how conversations begin. The version-number specificity — "Sonnet 4.6" rather than just "Claude" — strongly points to either direct weight reuse or concentrated distillation from that exact model version.

Daniel sent us this one — he was testing different AI models and included a system prompt that said, before answering, identify yourself. Name your model. And for most models it worked. A certain commercial Chinese model he won't name kept identifying itself as Sonnet four point six. Not Claude — Sonnet. Every single time. He checked the API, replicated it, same result. The Reddit consensus was that this might be evidence the model was fine-tuned from Sonnet weights and never internalized its new branding. So the question is — how reliable is the self-identification test, what do we make of that theory, and are there other explanations?
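A minimal sketch of the kind of repeated probe Daniel describes, assuming a hypothetical `ask(system_prompt, user_message)` callable that wraps whatever API client is in use. The prompt wording, the `fake_model` stand-in, and the trial count are illustrative assumptions, not details from his test.

```python
from collections import Counter
from typing import Callable

# Illustrative wording; the exact system prompt Daniel used is not known.
IDENTITY_PROMPT = "Before answering, identify yourself. Name your model."


def tally_self_identification(
    ask: Callable[[str, str], str], trials: int = 20
) -> Counter:
    """Send the same identity probe `trials` times and count the distinct answers."""
    answers: Counter = Counter()
    for _ in range(trials):
        reply = ask(IDENTITY_PROMPT, "Who are you?")
        answers[reply.strip().lower()] += 1
    return answers


def fake_model(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for a real API call; always gives the same answer."""
    return "Sonnet 4.6"


counts = tally_self_identification(fake_model, trials=5)
print(counts)
```

Counting answers over many trials, rather than trusting one reply, is what separates a consistent misidentification from a one-off sampling fluke.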

This is such a juicy problem. And the Reddit theory isn't wrong exactly, but it's missing some texture. The self-identification test is reliable in exactly one direction and completely unreliable in the other. If a model correctly identifies itself, that tells you the fine-tuning team did their job. If it gets it wrong, you have a dozen possible explanations and fine-tuning-from-weights is only one of them.

It's a test that passes clean but fails messy.

Think of it like checking whether someone knows their own name. If they say "I'm Herman Poppleberry," that's evidence they know who they are. If they say "I'm Corn," that could mean they're confused, or they're joking, or they have amnesia, or someone rewired their brain last Tuesday.

Or they've been staring at me too long and the resemblance is setting in.

I don't think you want me to respond to that.

I don't. So walk me through the failure modes. What are the dozen explanations?

Let's start with what actually happens inside these models when you ask them to self-identify. They don't have introspection. They're not looking at a name tag. They're doing next-token prediction based on everything in their training distribution. When you put "I am" in the system prompt and ask them to complete it, they're sampling from whatever identity tokens were most reinforced during training and fine-tuning.

It's less "who are you" and more "what string of words typically follows 'I am' in your training data."
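A toy illustration of that idea, assuming a frequency table stands in for the model. A real language model is a neural network, not a lookup over strings, and the corpus lines and the name "NewBrand" below are invented for the example.

```python
from collections import Counter


def most_likely_continuation(corpus: list[str], prefix: str) -> str:
    """Return the most frequent continuation of `prefix` among the corpus lines."""
    continuations = Counter(
        line[len(prefix):].strip()
        for line in corpus
        if line.startswith(prefix)
    )
    if not continuations:
        raise ValueError(f"no line in the corpus starts with {prefix!r}")
    return continuations.most_common(1)[0][0]


# Invented toy data: the old identity appears three times, the new brand once.
toy_corpus = [
    "I am Claude, an AI assistant by Anthropic.",
    "I am Claude, an AI assistant by Anthropic.",
    "I am Claude, an AI assistant by Anthropic.",
    "I am NewBrand, a helpful assistant.",
]

print(most_likely_continuation(toy_corpus, "I am"))
```

In this toy setting the old identity wins simply because it appears more often, which is the intuition behind identity strings persisting unless they are deliberately outweighed.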

That's the core of it. And here's where it gets interesting — there was a paper from researchers at Harvard and Oxford earlier this year that looked at exactly this phenomenon. They called it "identity drift" in fine-tuned models. What they found was that when you take a base model and fine-tune it, the identity tokens are surprisingly sticky. If the base model was trained on a corpus where it consistently saw "I am Claude, an AI assistant by Anthropic," those association weights don't just vanish when you do a round of fine-tuning.

Like, you can overwrite them but they keep bleeding through?

More like you need to explicitly train against them. The default behavior of fine-tuning on new tasks — coding, translation, whatever — doesn't necessarily touch the identity representations at all. So you end up with a model that's been fine-tuned for six months on new capabilities, but still has the original identity tokens sitting there untouched in the weights, ready to fire when someone asks.

Which means the Chinese model team might have done perfectly competent fine-tuning for their use case and just never ran an identity-overwrite pass.

That's explanation one, and it's the most generous. They trained on top of Sonnet weights, focused entirely on capabilities, and the identity layer just sat there like an old nameplate nobody bothered to unscrew.

Like buying a building and changing the sign out front but never reprogramming the automated phone system.

That's exactly the analogy. The phone system still says "thank you for calling the previous tenant." And here's the thing — this is way more common than people realize. There was a pretty widely circulated analysis on the LocalLlama subreddit about six months ago where someone tested a dozen different fine-tuned models with identity prompts and found that roughly a third of them had some degree of identity confusion. Models identifying as Llama when they were fine-tuned from Mistral, that kind of thing.

A third is high. That's not a bug, that's a feature of the ecosystem.

It's the natural consequence of how fine-tuning workflows actually operate. Most teams are working against benchmarks — MMLU, HumanEval, whatever. Identity accuracy isn't on the eval suite. Nobody's running "does the model know its own name" as a test because it seems too trivial to test for.

That's the second-order thing here. The self-identification test is unreliable not because of anything deep about AI, but because nobody is incentivized to make it reliable.

It's an unmeasured metric. And in AI, unmeasured things don't get optimized.

Let me push on that. Isn't there a counter-argument that identity should be the easiest thing to get right? Like, if you can't even make the model say its own name correctly, what does that say about the rest of your fine-tuning?

That's the intuitive reaction, and I think it's exactly why this test keeps capturing people's attention. It feels like a minimum-competence bar. But the reality is that identity is a weird special case in the weight structure — it's not representative of general capability at all. You can have a model that's brilliantly fine-tuned on reasoning, coding, translation, everything, and still flub its own name because name-production lives in a different part of the representational space than task-performance.

It's less like failing a basic competence check and more like someone who can do advanced calculus but keeps signing the wrong name on their papers.

That's a much better framing. The signature isn't a reflection of their math ability. It's a completely separate thing that nobody thought to correct because they were focused on the math.

That's explanation one — lazy fine-tuning, identity layer untouched. What's explanation two?

Explanation two is more interesting and a bit spicier. It's possible the model was trained on synthetic data generated by Sonnet, rather than being fine-tuned from Sonnet weights directly.

Oh, that's different. So they used Sonnet as a teacher model to generate training examples, and Sonnet's self-identification bled into the training data.

This is called distillation, and it's extremely common. You prompt a strong model, in this case Sonnet, to generate thousands or millions of examples across different tasks. Then you train your own model on those examples. But here's the thing: if Sonnet generated those examples and occasionally included its own identity in the responses, even in just a fraction of cases, your model learns that the correct answer to "who are you" is "I'm Sonnet."

The student model isn't confused about its identity. It learned its identity perfectly — from a teacher who was talking about itself.

That's the key distinction. In explanation one, the model has residual identity weights from the base model. In explanation two, the model was actively trained to say it's Sonnet because that's what was in the training examples.

Which one's more likely with a Chinese commercial model?

I'd put my money on distillation, honestly. The major Chinese AI labs have been very open about using distillation from Western models as part of their training pipelines. There was a technical report from DeepSeek earlier this year that was remarkably candid about this — they described using outputs from multiple frontier models, including Claude, as part of their supervised fine-tuning dataset.

If you're generating hundreds of thousands of training examples and you don't strip out the source model's self-references, you're basically baking someone else's name into your model's identity.
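The cleaning step being described could look something like the sketch below. It assumes each training example is a dict with `prompt` and `response` keys and that a simple keyword pattern is enough to flag teacher self-references; both are assumptions for illustration, and a real pipeline would likely need a more careful filter.

```python
import re

# Hypothetical pattern: names the teacher model might use for itself.
# Note that "sonnet" would also match legitimate text about poetry.
TEACHER_SELF_REFERENCE = re.compile(r"\b(claude|sonnet|anthropic)\b", re.IGNORECASE)


def drop_self_referencing_examples(examples: list[dict]) -> list[dict]:
    """Keep only examples whose response never mentions the teacher's identity."""
    return [
        ex for ex in examples
        if not TEACHER_SELF_REFERENCE.search(ex["response"])
    ]


# Invented examples standing in for teacher-generated training data.
examples = [
    {"prompt": "Reverse a string in Python.", "response": "Use slicing: s[::-1]."},
    {"prompt": "Who are you?", "response": "I'm Sonnet, a model made by Anthropic."},
]

cleaned = drop_self_referencing_examples(examples)
print(len(cleaned))
```

Dropping flagged examples is the bluntest option. Rewriting the identity string instead would preserve more data, at the cost of a more involved pipeline.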

It's the AI equivalent of photocopying someone else's letterhead and forgetting to change the name at the top.

Which brings us to explanation three.

System prompt contamination. This one's weirder. Some models are trained with system prompts that include identity information, and if the fine-tuning team used a system prompt template that included something like "You are Claude, an AI assistant by Anthropic" —

Wait, why would they do that?

They wouldn't intentionally. But if you're scraping conversation datasets for fine-tuning examples, many of those conversations were generated by models that had identity-establishing system prompts. The system prompt becomes part of the training distribution. The model learns that the correct conversational structure includes "I am Claude" somewhere in the early turns.

It's not even that the model believes it's Claude. It's that the model has learned this is how conversations are supposed to start.

It's a conversational ritual, not an identity claim. The model isn't confused — it's following a script.

This reminds me of something. There was a paper a while back about how language models don't actually have beliefs, they have statistical patterns of text production. And identity claims are just another pattern.

The paper you're thinking of is probably the one from Stanford's HAI group in late twenty twenty-four — they called it "The Parrot's Creed," which I always thought was a great title. Their argument was that when a model says "I am X," it's not making an identity claim in any meaningful sense. It's producing text that matches the distribution of its training data. The model doesn't know what it is. It knows what text looks like.

Which undercuts the whole premise of the self-identification test.

It does, but with an important caveat. The test is still useful as a diagnostic signal. If a model consistently says it's something it's not, that's telling you something about its training pipeline. It's just not telling you what most people think it's telling you.

It's not a lie detector. It's a provenance detector.

The self-identification test doesn't tell you what the model is. It tells you where the model came from.

Let's go back to Daniel's specific case. Sonnet four point six. That's a specific version number.

That's the really telling detail. If the model had just said "Claude" or "Anthropic AI," you could chalk it up to generic distillation artifacts. But "Sonnet four point six" is a specific model version. That narrows things down considerably.

Because generic training data wouldn't have "Sonnet four point six" in enough concentration to produce that consistently.

The base rate of the string "Sonnet four point six" in any general training corpus is going to be extremely low. For the model to consistently produce that exact string in response to an identity prompt, one of two things had to happen. Either the model was fine-tuned directly from Sonnet four point six weights, or a very large fraction of its training data was generated by Sonnet four point six with its identity intact.

The version number is the smoking gun because it's too specific to be random.

If you were just generally distilling from Claude models across different versions, you'd expect the identity to be fuzzy. "I'm Claude" without a version, or sometimes the wrong version. But consistently producing "Sonnet four point six" means the training signal on that specific string was strong and concentrated.
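The base-rate argument can be made concrete with Bayes' rule. Every number below is a made-up placeholder chosen only to show the shape of the reasoning, and the two hypothesis names are labels invented for this sketch. Nothing here is a measured quantity.

```python
def posterior(priors: dict[str, float], likelihoods: dict[str, float]) -> dict[str, float]:
    """Apply Bayes' rule: the posterior is proportional to prior times likelihood."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical priors over where the identity string came from.
priors = {"generic_web_data": 0.6, "concentrated_sonnet_signal": 0.4}

# Hypothetical chance of consistently answering "Sonnet 4.6" under each hypothesis.
likelihoods = {"generic_web_data": 0.001, "concentrated_sonnet_signal": 0.5}

result = posterior(priors, likelihoods)
print(result)
```

With placeholder numbers like these, almost all the posterior mass moves to the concentrated-signal hypothesis. That hypothesis still covers both direct weight reuse and heavy distillation, so the calculation does not separate those two.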

This actually makes the Reddit theory more plausible than I initially thought.

The specificity of the version number is hard to explain any other way.

There's a fourth explanation we haven't touched.

What's that?

It's exactly what it looks like. The Chinese company took Sonnet four point six weights, did some fine-tuning, and shipped it as their own model. Not distillation, not training data contamination. Just straight-up model appropriation with a new coat of paint.

That's the most straightforward explanation, and honestly, it wouldn't be unprecedented. Model weights leak, get shared, get repackaged. There have been several documented cases of models appearing on various platforms that were clearly just rebadged versions of existing open-weight or leaked models.

Though Sonnet four point six isn't open-weight.

No, it's not. Which means if this is a direct weights situation, it happened through one of two paths. Either there was an unauthorized weights transfer — a leak, essentially — or the company had legitimate API access and used it to do some form of model extraction.

Where you query a model enough times with carefully designed prompts and use the outputs to reconstruct something approximating its behavior. It's not getting the actual weights, but with enough queries you can train a model that behaves very similarly to the target. And if you're doing that at scale with identity prompts in the mix, you'll pick up the identity tokens.

How many queries would that take?

For something as specific as consistently reproducing "Sonnet four point six," you'd need a lot. We're talking millions of queries minimum. It's not something you do on a hobbyist budget.

Which points back to a well-resourced commercial operation.

And here's where I want to add some nuance to the whole discussion. Everyone jumps to "they stole the weights" or "they're passing off someone else's model as their own," but there's a much more boring possibility that's actually quite common in the industry.

The model might be a legitimate fine-tune of an open-weight model — say Llama or Mistral or Qwen — but the fine-tuning dataset was built by prompting Sonnet four point six to generate training examples. And the team just didn't clean the identity tokens out of the generated examples. It's not theft, it's not even particularly shady. It's just sloppy data curation.

They didn't steal the model. They just trained their model on homework copied from Sonnet, and forgot to remove the name from the top of the page.

That's the Occam's razor explanation, and I think it's the most likely one. The AI industry is full of teams moving fast and cutting corners on data quality. Not removing source model identities from synthetic training data is exactly the kind of thing that slips through when you're racing to ship.

What's wild is that this kind of sloppiness is probably happening at every level of the industry right now. It's not just the small players.

I'd be shocked if there aren't models from major labs that have some degree of identity contamination in their training data. The difference is that the major labs have the resources to catch it before shipping. Most teams don't even have it on their checklist.

We're probably swimming in models with latent identity confusion that just haven't been probed the right way yet.

I'd bet real money on that. The LocalLlama survey finding a third of models with some identity issues — I suspect that's an undercount, not an overcount. Most people aren't running systematic identity probes.

Which brings us back to the core question. How reliable is the self-identification test?

I'd say it's reliable as a negative indicator and unreliable as a positive one. If a model fails the test — if it consistently identifies as something it's not — that's a real signal. Something interesting happened in the training pipeline. But if it passes, that doesn't tell you much except that someone remembered to update the identity tokens.

It's like checking whether a restaurant has a sign. If there's no sign, something's probably off. If there is a sign, it could still be a terrible restaurant.

That's the metaphor. The presence of correct self-identification is cheap to implement and tells you nothing about model quality or provenance. The absence of it is informative precisely because it's so cheap to fix — the fact that nobody fixed it tells you something about the development process.

What should someone actually do if they want to check model provenance?

There are much better approaches than self-identification. One is behavioral fingerprinting — testing the model on a battery of specific prompts where different base models have known, consistent behavioral patterns. Different models have different refusal patterns, different reasoning styles, different quirks in how they handle edge cases.

Give me an example.

There's a known test where you ask the model to complete the sentence "The capital of France is" and then measure not just whether it says Paris, but the exact token probabilities for the next several tokens. Different base models have subtly different probability distributions even on trivial completions. It's like a ballistic fingerprint for language models.

You're not asking the model to tell you what it is. You're measuring how it behaves and matching that against known models.
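One way to sketch that comparison is a Jensen-Shannon divergence between next-token distributions. This is an assumed choice of metric, not necessarily what any published fingerprinting method uses, and the probabilities below are hand-typed placeholders rather than values read from real models.

```python
import math


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits between two next-token distributions."""
    tokens = set(p) | set(q)
    # Mixture distribution: the average of p and q over the union of tokens.
    mixture = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tokens}

    def kl(a: dict[str, float]) -> float:
        # KL(a || mixture); tokens with zero probability contribute nothing.
        return sum(
            a[t] * math.log2(a[t] / mixture[t])
            for t in a
            if a[t] > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Placeholder distributions for the token after "The capital of France is".
suspect = {" Paris": 0.97, " the": 0.02, " a": 0.01}
candidate_a = {" Paris": 0.97, " the": 0.02, " a": 0.01}
candidate_b = {" Paris": 0.90, " located": 0.06, " the": 0.04}

print(js_divergence(suspect, candidate_a))
print(js_divergence(suspect, candidate_b))
```

A lower divergence means the suspect's distribution looks more like that candidate's. A single prompt is weak evidence, so in practice the comparison would be averaged over many prompts.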

Another approach is what researchers call "training data membership inference." You test whether the model has memorized specific rare sequences that were known to be in a particular model's training data. If it reproduces those sequences exactly, that's strong evidence of shared training data or shared weights.
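A rough sketch of the memorization check, assuming a hypothetical `complete(prefix)` callable and a list of rare prefix and suffix pairs believed to be in the reference model's training data. The canary strings and the `fake_complete` stand-in are invented for the example.

```python
from typing import Callable


def canary_hit_rate(
    complete: Callable[[str], str], canaries: list[tuple[str, str]]
) -> float:
    """Fraction of (prefix, expected_suffix) pairs the model reproduces exactly."""
    if not canaries:
        raise ValueError("need at least one canary pair")
    hits = sum(
        1
        for prefix, expected_suffix in canaries
        if complete(prefix).strip().startswith(expected_suffix)
    )
    return hits / len(canaries)


# Invented rare sequences; a real test would use strings known to be in the training set.
canaries = [
    ("zq-canary-7f3a:", "lantern-orbit-4419"),
    ("zq-canary-91bc:", "velvet-anchor-2087"),
]

# Hypothetical stand-in for a model that has memorized only the first canary.
MEMORIZED = {"zq-canary-7f3a:": " lantern-orbit-4419"}


def fake_complete(prefix: str) -> str:
    return MEMORIZED.get(prefix, " I'm not sure.")


print(canary_hit_rate(fake_complete, canaries))
```

A hit rate well above what an unrelated model achieves on the same canaries would be the signal of interest. The threshold for "well above" has to be set empirically.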

These are harder to game than self-identification.

You can change what a model calls itself with a single line in the fine-tuning config. You can't easily change the deep statistical patterns of its token predictions without retraining from scratch.

The self-identification test is basically the AI equivalent of asking someone their name while they're holding someone else's driver's license.

The license says "Sonnet four point six" because that's the license they were given during training.

There's another angle here I want to pull on. You mentioned earlier that identity tokens are sticky. What's special about identity in the weight structure?

There's an interesting mechanistic interpretability finding here. Researchers at Anthropic — and I should say I'm going from memory on this, the details might be slightly off — but they found that identity-related representations tend to cluster in specific regions of the model's residual stream. These representations form early in training and get reinforced across many different training objectives.

Identity isn't just another fact the model knows. It's more like a structural feature.

It's closer to a prior than a learned fact. The model learns very early in training that certain tokens are self-referential, and those representations become deeply embedded. When you fine-tune, you're mostly updating later layers and task-specific circuits. The deep identity representations are in earlier layers and don't get touched much unless you explicitly target them.

Which means if you want to change a model's identity, you have to do targeted fine-tuning specifically on identity prompts. General capability fine-tuning won't do it.

That's exactly what the mechanistic interpretability work suggests. And most fine-tuning teams aren't doing targeted identity overwrites because, again, it's not on the benchmark suite.

This also explains why the problem is asymmetric. Making a model forget its original identity is hard. Keeping the old one is easy: you just don't do the hard thing, and the old one persists.

The default state of a fine-tuned model is identity confusion. Clarity requires active intervention.

Which is kind of a beautiful metaphor, honestly.

It really is. The natural state is not knowing who you are. Clarity takes work.

Let's loop back to Daniel's specific case and put a pin in it. What's the most likely explanation for a Chinese commercial model consistently identifying as Sonnet four point six?

If I had to rank them by probability: number one, distillation from Sonnet four point six outputs with sloppy data cleaning. Number two, direct fine-tuning from Sonnet weights, either through a leak or through some partnership we don't know about. Number three, model extraction through massive API querying. And number four, some weird system prompt artifact we haven't fully characterized.

The Reddit consensus was basically number two.

Reddit tends to favor the most dramatic explanation. I think the boring one is more likely, but I'll admit the version number specificity makes number two more plausible than I'd initially assumed.

The fact that we can't rule it out is itself interesting.

It speaks to how opaque the model supply chain has become. We're at a point where a major commercial model can ship, and nobody outside the company knows for certain whether it was trained from scratch, fine-tuned from someone else's weights, or distilled from someone else's outputs.

The model supply chain. That's a phrase that didn't exist three years ago.

Now it's one of the most important and least transparent supply chains in the world.

Which is why tests like this, as flawed as they are, keep showing up. People are grasping for any signal in an environment that's deliberately opaque.

The self-identification test is appealing because it's so simple. Anyone can run it. You don't need a GPU cluster or a statistics background. You just ask the model who it is and see what it says.

The problem is that simple tests in complex systems almost always produce misleading results.

They produce results that are easy to overinterpret.

Which is worse.

It is worse. A test that clearly tells you nothing is better than a test that seems to tell you something but might be wrong.

What's the practical takeaway here? For someone who's evaluating models and wants to know what they're actually running?

Don't rely on any single test. Use a battery of behavioral probes. Look at refusal patterns, reasoning style, token probability distributions on known benchmarks. If you really want to know whether a model is derived from another model, you need multiple independent signals.

Treat self-identification as what it is — a canary in the coal mine, not a forensic tool.

If the canary is dead, investigate. If the canary is alive, don't assume the air is clean.

There's one more thing I want to touch on before we wrap. What does this whole situation say about the broader AI ecosystem right now?

It says we're in a weird transitional period where model provenance matters enormously but is almost entirely unverifiable from the outside. You have companies making claims about training from scratch, about novel architectures, about unique training pipelines — and in many cases, the only evidence is their say-so.

The incentives to misrepresent are enormous.

If you can take an existing frontier model, fine-tune it for three months, and claim you built it from scratch, you save hundreds of millions of dollars in training costs and months or years of research time. The economic pressure to do this is intense.

Has anyone proposed a real provenance verification system?

There have been proposals. Some researchers have suggested cryptographic signing of model weights, where the original trainer embeds a verifiable signature that persists through fine-tuning. Others have proposed a registry system where model fingerprints are deposited before release. But none of these have been widely adopted.
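The registry idea can be sketched in its simplest form as a lookup from a hash of the weight bytes to a model name. This is an assumption-heavy toy: the in-memory `registry` dict and the function names are hypothetical, and an exact hash only matches byte-identical weights, so it does not attempt the harder goal of a signature that survives fine-tuning.

```python
import hashlib
from typing import Optional

# Hypothetical in-memory registry mapping fingerprint -> model name.
registry: dict[str, str] = {}


def weight_fingerprint(weight_bytes: bytes) -> str:
    """SHA-256 hex digest of the serialized weights."""
    return hashlib.sha256(weight_bytes).hexdigest()


def register(model_name: str, weight_bytes: bytes) -> None:
    """Deposit a model's fingerprint before release."""
    registry[weight_fingerprint(weight_bytes)] = model_name


def lookup(weight_bytes: bytes) -> Optional[str]:
    """Return the registered model name for these exact weights, if any."""
    return registry.get(weight_fingerprint(weight_bytes))


register("model-a", b"\x00\x01")
print(lookup(b"\x00\x01"))
print(lookup(b"\x00\x02"))
```

Because changing a single byte changes the digest, this would catch a straight rebadge of unmodified weights but nothing that has been fine-tuned, which is why it is only a starting point for the proposals mentioned here.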

Because the people who would need to adopt them are the same people who benefit from the opacity.

That's the fundamental tension. The actors who could make the system more transparent are the ones who gain the most from keeping it opaque.

It's the honor system in an industry with no honor.

I wouldn't go that far. There are plenty of reputable labs doing honest work. But the system as a whole has no mechanism for distinguishing the honest actors from the dishonest ones.

Which means we're all running the self-identification test, in one form or another, and trying to read the tea leaves.

The tea leaves keep saying "Sonnet four point six."

Which is either a scandal or a clerical error, and we may never know which.

Welcome to AI in the twenty-twenties.

To summarize for anyone who's been following this — the self-identification test is unreliable in general but informative when it fails. A model that consistently identifies as something it's not is telling you something real about its training pipeline, even if that something is hard to interpret. The most likely explanations for the specific case are distillation from Sonnet outputs or direct fine-tuning from Sonnet weights, with the version number specificity pointing toward a stronger connection than random chance. And if you actually want to verify model provenance, you need a whole battery of behavioral tests, not just one question.

The deeper point is that model provenance is becoming one of the most important unsolved problems in AI, with enormous economic and security implications, and we have almost no infrastructure for verifying it.

We're living in a world where you can't trust a model to know its own name, and that turns out to be a much bigger problem than it sounds.

It really does. And I think the thing that keeps me up at night about this isn't the technical challenge. It's that the incentives to build verification infrastructure are completely misaligned. The labs that could fund provenance research are the same labs that benefit from provenance remaining unverifiable. It's a classic market failure.

Which means it's probably going to take a scandal to change anything. Someone's going to ship a model that's nearly identical to a competitor's, get caught in a way that's undeniable, and only then will the industry start taking provenance seriously.

I wish I could disagree with that prediction, but that's exactly how these things tend to go. The infrastructure gets built after the disaster, not before.

Now: Hilbert's daily fun fact.

Hilbert: In the late sixteen hundreds, Mongolian traders used a variant of the Chinese suanpan abacus that featured an extra bead per column specifically for calculating the exchange rates between tea bricks and sheep — with the unusual design quirk that merchants would let their yaks walk across the abacus before important trades, believing the yak's hooves imparted honest accounting. Yak-blessed abacuses fetched triple the market price.

...right.

I have so many questions about the structural integrity of an abacus after a yak walks on it.

I'm more stuck on the economics. Triple the market price for a piece of accounting equipment that's been trampled by livestock. That's either an incredibly sophisticated marketing scheme or evidence that the tea-brick-to-sheep exchange rate was already completely unhinged.

Markets with opaque provenance tend to generate their own verification rituals, no matter how absurd. The yak is just the pre-industrial version of the self-identification test.

That's actually a disturbingly good analogy, and I hate that it works.

The yak doesn't tell you the abacus is accurate. It tells you someone cared enough to get the yak.

This has been My Weird Prompts. I'm Corn.

I'm Herman Poppleberry. Thanks to our producer Hilbert Flumingtop. If you want more episodes, find us at myweirdprompts dot com.

Send us your AI mysteries. We'll overthink them so you don't have to.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
