#2007: AI Grading AI: The Snake Eating Its Tail

We asked an AI to write this script. Then we asked another AI to grade it. Here’s what happens when the judges have biases.

Episode Details
Episode ID
MWP-2163
Published
Duration
22:15
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Scaling Bottleneck and the Rise of the AI Judge
The pace of AI development has outpaced human capacity for review. With thousands of model iterations and massive amounts of synthetic text, developers can no longer rely on manual “vibe checks” to assess quality. The solution gaining traction is LLM-as-a-Judge: using a high-reasoning model like GPT-4 or Claude to automatically grade the outputs of smaller or newer models. This approach, central to benchmarks like MT-Bench and frameworks like G-Eval, promises speed and scale, but it also introduces a unique set of challenges, from bias to the risk of a stylistic monoculture.

How LLM-as-a-Judge Works
At its core, LLM-as-a-Judge automates the subjective evaluation that developers once performed by hand. Instead of metrics like BLEU or ROUGE, which rely on word overlap, these judge models assess intent, tone, factual consistency, and logic. There are three primary implementation patterns:

Single-Point Scoring: The judge rates a single response on a scale (e.g., 1–5) based on a rubric. While this provides a quantifiable score, LLMs can be “moody” and inconsistent, often drifting toward the middle of the scale to avoid conflict.

Pairwise Comparison: This is the gold standard, modeled after human preference platforms like Chatbot Arena. The judge compares two responses (A and B) and declares a winner. This relative evaluation is more consistent than assigning absolute scores and forms the basis of Reinforcement Learning from AI Feedback (RLAIF).

Reference-Based Grading: Used for tasks with a “correct” answer, such as Retrieval-Augmented Generation (RAG). The judge compares a model’s output against a ground-truth reference, checking for faithfulness and relevance.
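In practice, all three patterns reduce to prompt construction plus verdict parsing. Below is a minimal sketch of a pairwise-comparison judge; the `judge_fn` parameter is a stand-in for a real LLM call (any chat-completion API would slot in), and the template wording is illustrative rather than taken from any specific framework:

```python
# Sketch of a pairwise-comparison judge. `judge_fn` stands in for a real
# LLM call: any function mapping a prompt string to the judge's raw text.

PAIRWISE_TEMPLATE = """You are an impartial judge. Compare the two responses
to the user's question and answer with exactly one token: A, B, or TIE.

Question: {question}

Response A:
{a}

Response B:
{b}

Verdict:"""

def judge_pairwise(question, response_a, response_b, judge_fn):
    """Return 'A', 'B', or 'TIE' for one comparison."""
    prompt = PAIRWISE_TEMPLATE.format(question=question, a=response_a, b=response_b)
    raw = judge_fn(prompt).strip().upper()
    # Defensive parsing: judges often wrap the verdict in justification text.
    for verdict in ("TIE", "A", "B"):
        if raw.startswith(verdict):
            return verdict
    return "TIE"  # unparseable output counts as a tie rather than a win
```

Treating unparseable verdicts as ties (rather than retrying or guessing) keeps a noisy judge from silently inflating either side's win rate.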

The Mechanics and the “Secret Sauce”
Effective judging isn’t just about asking, “Is this good?” The most accurate systems use Chain-of-Thought prompting, forcing the judge to articulate its reasoning step-by-step before assigning a score. This deliberation significantly improves accuracy and transparency.
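As a concrete illustration, a deliberation-first rubric prompt and its score parser might look like the sketch below. The rubric wording and the `SCORE:` output convention are assumptions for demonstration, not quotes from any production system:

```python
# Illustrative Chain-of-Thought rubric for single-point scoring. The judge
# must reason step-by-step before emitting a final "SCORE: n" line; the
# criteria and step ordering here are invented for demonstration.

COT_RUBRIC = """You are grading a response on a 1-5 scale.

Think step by step BEFORE scoring:
1. List the main claims the response makes.
2. Check each claim for factual consistency with the question.
3. Assess conciseness and tone.
4. Only then output a final line of the form: SCORE: <1-5>

Question: {question}
Response: {response}"""

def parse_score(judge_output, lo=1, hi=5):
    """Extract the final 'SCORE: n' line; return None if absent or out of range."""
    for line in reversed(judge_output.splitlines()):
        line = line.strip()
        if line.upper().startswith("SCORE:"):
            try:
                score = int(line.split(":", 1)[1])
            except ValueError:
                return None
            return score if lo <= score <= hi else None
    return None
```

Scanning from the last line backwards matters: the deliberation steps may themselves mention scores, and only the final verdict line should count.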

However, this process is far from perfect. The biggest headaches come from judge bias:

Verbosity Bias: LLM judges tend to favor longer, more verbose responses over concise, correct ones. They equate word count with quality, rewarding “fluffy” answers.

Position Bias: In pairwise comparisons, the judge is more likely to pick the first response it sees, regardless of quality. To mitigate this, developers must run evaluations twice, swapping positions, which doubles API costs.

Self-Enhancement Bias: Models often prefer outputs that mirror their own stylistic patterns—what Corn calls the “Narcissus Effect.” If a judge model is GPT-4, it may favor responses that sound like GPT-4, creating a self-reinforcing loop.

These biases threaten to create a monoculture where all models optimize for the same narrow definition of “good,” losing behavioral diversity.
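The order-swap mitigation for position bias can be sketched directly: judge each pair twice with positions swapped, and count a verdict only when it survives both orderings. The `judge_fn` here is a hypothetical stand-in that returns 'A', 'B', or 'TIE' relative to presentation order:

```python
# Position-debiased pairwise evaluation: each pair is judged twice with the
# order swapped. A response only gets credit when it wins in BOTH orderings;
# inconsistent verdicts are treated as ties (noise) and discarded.

def debiased_verdict(question, resp_a, resp_b, judge_fn):
    """Return 'A', 'B', or 'TIE' using the order-swap trick.

    judge_fn(question, first_shown, second_shown) -> 'A' | 'B' | 'TIE',
    where 'A' means the first-shown response won.
    """
    first = judge_fn(question, resp_a, resp_b)            # resp_a shown first
    second = judge_fn(question, resp_b, resp_a)           # resp_b shown first
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}[second]  # map back to resp_a/resp_b labels
    if first == swapped:
        return first   # consistent verdict across both orderings
    return "TIE"       # verdict flipped with position: discard as noise
```

A judge that always picks whichever response it sees first produces a flipped verdict on the second pass and lands on a tie, which is exactly the behavior the mitigation is meant to enforce, at the cost of two judge calls per pair.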

Benchmarks and the Future of Evaluation
Old benchmarks like MMLU test knowledge (e.g., passing a bar exam), but they don’t measure how useful a model feels in conversation. Benchmarks like MT-Bench use LLM judges to grade multi-turn interactions, testing a model’s ability to write code, explain it, and debug—all in a flow. While Chatbot Arena provides ground truth via human votes, MT-Bench offers a scalable lab version for developers.

Frameworks like G-Eval attempt to solve the “black box” problem by outputting probability distributions over scores, giving developers a sense of the judge’s certainty. Yet, the industry’s reliance on RLAIF and Constitutional AI—where judge models enforce rules like “be helpful, harmless, honest”—highlights a deeper tension: speed versus diversity.
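The certainty idea can be illustrated numerically: rather than taking the single most likely score, weight each candidate score by the probability the judge assigned to it. The distributions below are invented for illustration:

```python
# G-Eval-style expected score: weight each candidate score by the judge's
# probability for it, instead of taking the argmax. The probability values
# below are made up for illustration.

def expected_score(score_probs):
    """score_probs: dict mapping score -> probability.

    Returns the probability-weighted mean, a finer-grained signal than
    the single most likely score."""
    total = sum(score_probs.values())  # normalize in case probs don't sum to 1
    return sum(s * p for s, p in score_probs.items()) / total

confident = {4: 0.9, 5: 0.1}   # judge is nearly sure it's a 4 -> ~4.1
uncertain = {4: 0.5, 5: 0.5}   # coin flip: prompt may be ambiguous -> 4.5
```

A fifty-fifty split is a useful diagnostic in itself: it flags prompts where the rubric is ambiguous or the two responses are genuinely indistinguishable.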

As we automate evaluation, we must ask: Are we building better models, or just models that please a biased judge? The answer will shape the next era of AI development.


#2007: AI Grading AI: The Snake Eating Its Tail

Corn
Imagine you are walking into a high-stakes job interview, but instead of a panel of tired HR managers, you are facing a sleek, silent terminal. On the other side of that terminal is an artificial intelligence. And here is the kicker: you are an AI too. This is not the plot of a sci-fi novel from the nineties; it is the current state of play in model development. Today’s prompt from Daniel is about LLM-as-a-Judge, and honestly, it is the only way the industry is keeping its head above water right now.
Herman
It really is the scaling solution for the "vibe check" problem. Herman Poppleberry here, and I have been deep in the documentation for things like MT-Bench and G-Eval lately. The reality is that we are producing so much synthetic text and so many model iterations that human beings literally cannot read fast enough to tell us if the models are getting better or worse. We have reached a point where we need a "Teacher" model to grade the "Student" model just to keep the lights on.
Corn
It sounds a bit like the snake eating its own tail, though. If we use AI to grade AI, don't we just end up in a hall of mirrors? By the way, fun fact for everyone listening: Google Gemini 3 Flash is actually the one writing our script today. Hopefully, it is not grading itself too leniently.
Herman
That is actually the perfect entry point into the mechanics of this. When we talk about LLM-as-a-Judge, we are usually talking about taking a very high-reasoning model—think GPT-4o or Claude 3.5 Sonnet—and using it as a proxy for human preference. We have moved past the era where we can just use math to grade language. Back in the day, we used metrics like BLEU or ROUGE, which basically just looked at how many words in the AI's answer matched the words in a human's answer.
Corn
Right, the "keyword matching" approach. But that is useless for something like poetry or even a complex coding explanation. You can have two sentences that mean the exact same thing but don't share a single noun. A math-based metric would give that a zero, even if it is a perfect answer.
Herman
Which is why we need the "Judge." The judge model doesn't just look at word overlap; it looks at intent, tone, factual consistency, and logic. It is essentially automating the subjective "vibe check" that developers used to do by hand. Instead of a developer sitting there for four hours clicking "Good" or "Bad" on a hundred prompts, they send ten thousand prompts to an LLM judge and get a spreadsheet back in five minutes.
Corn
Okay, let's get into the "how." If I am a developer and I want to set this up, I'm not just asking the judge, "Hey, is this good?" There has to be more structure to it than that. What are the actual implementation patterns people are using in the wild?
Herman
There are three main ways this happens. The first, and arguably the most common for quick checks, is Single-Point Scoring. This is the "Rubric" approach. You give the judge a single response from your model and a set of criteria. You might say, "On a scale of one to five, how helpful is this response? Use the following criteria: Is it concise? Does it answer the user's specific question? Is the tone professional?"
Corn
I can see the appeal there because it gives you a hard number you can put on a graph. But humans are notoriously bad at consistent numbering. If I ask you how much you like a movie on a scale of one to ten, your "seven" might be my "five." Does the LLM suffer from that same kind of "score drift"?
Herman
It absolutely does. LLMs can be "moody" based on how the prompt is phrased. If you don't give them a very strict rubric, they might give everything a four out of five because they are "people pleasers" in a sense. They stay in the middle. That is why the industry has largely shifted toward the second pattern, which is Pairwise Comparison. This is the "A/B Test" or the "Side-by-Side."
Corn
This is what we see in the Chatbot Arena, right? You get two boxes, you ask a question, and you pick the winner.
Herman
That is the gold standard. In an automated pipeline, you show the judge model the original prompt, then you show it Response A and Response B. The judge's only job is to say "A is better," "B is better," or "It’s a tie." This is much easier for a model—and a human, for that matter—to do consistently. It removes the need for an absolute scale. You are just looking for relative improvement. If you are testing a new version of your model, you run it against the old version ten thousand times. If the new version wins sixty percent of the time, you know you have a winner.
Corn
It’s like an Elo rating for AI. But what about when there is a "right" answer? If I am building a RAG system—Retrieval-Augmented Generation—where the AI is supposed to look at my company's PDFs and answer questions, I don't want a "vibe" check. I want to know if it actually found the right data.
Herman
That brings us to the third pattern: Reference-Based Grading. This is the "Answer Key" approach. Here, you provide the judge with a "ground truth" reference—maybe a paragraph from your documentation or a gold-standard answer written by a human expert. The judge then looks at the model's output and grades it based on how well it aligns with that reference. It is looking for "faithfulness" and "answer relevance." Did the model hallucinate something that wasn't in the source? Did it miss a key detail that was in the human's reference?
Corn
So the judge is acting like a TA grading a history exam against the professor's notes. I like that. But let's talk about the prompt itself. I imagine the "System Prompt" for these judge models is massive. You can't just say "Be a judge." You have to tell it how to think.
Herman
The "secret sauce" is often Chain-of-Thought prompting. You don't just ask for a score. You tell the judge, "First, extract the main claims from the response. Second, check each claim against the reference document. Third, evaluate the tone. Finally, provide a score and a justification." When the judge has to explain its reasoning before it gives the grade, the accuracy of that grade shoots up significantly. This is what frameworks like G-Eval are doing—they are forcing the model to deliberate.
Corn
It’s basically making the model show its work. But let’s play devil’s advocate here. If I am using GPT-4 to judge my smaller, cheaper model, aren't I just training my smaller model to mimic GPT-4's specific quirks? We know these models have "personalities" or at least stylistic tendencies. If GPT-4 likes long-winded, polite answers, and I use it as my judge, my small model is going to start sounding like a Victorian butler just to get a high score.
Herman
You have hit on the biggest headache in the field right now: Judge Bias. Specifically, Verbosity Bias. It has been proven in multiple papers—most notably the one that introduced MT-Bench—that LLM judges have a massive crush on long responses. If Response A is a concise, correct three-sentence answer, and Response B is a rambling, slightly repetitive two-paragraph answer that eventually says the same thing, the judge almost always picks Response B. It equates "more words" with "more effort" or "higher quality."
Corn
That is incredibly frustrating. We are essentially rewarding the AI for being "fluffy." It’s like a student who knows they don't have the answer to the essay question so they just write three pages of beautifully phrased nonsense hoping the teacher gets tired and gives them a B-plus.
Herman
And it gets worse. There is also Position Bias. In those Pairwise Comparisons we talked about, the judge is more likely to pick whichever response it sees first. If you put the same two responses in front of it but swap the order, the judge might change its mind. It’s a bizarre quirk of how these models process tokens. To fix it, developers have to run the evaluation twice for every single pair, swapping the positions, and then only counting it as a win if the model wins in both positions. It doubles your API costs instantly.
Corn
That sounds like a nightmare for your cloud bill. But there is a third bias that I find even more fascinating, and that is Self-Enhancement Bias. Or as I like to call it, the "Narcissus Effect." Does GPT-4 actually prefer answers that sound like GPT-4?
Herman
Statistically, yes. Models tend to favor the stylistic patterns, the specific formatting, and even the common "refusals" of their own family. If a model was trained on a specific way of saying "As an AI language model..." and it sees that in a response it is grading, it essentially thinks, "Ah, this sounds like someone who knows what they're talking about." It’s a self-reinforcing loop. If we aren't careful, we could end up with an entire ecosystem of models that all sound exactly the same because they are all being "vibe-checked" by the same three judge models.
Corn
It’s a monoculture problem. If everyone is using the same judge, then everyone is optimizing for the same narrow definition of "good." We lose the diversity of model behavior. But despite these flaws, the industry is moving full steam ahead with this. Why? Is it just the cost, or is there something deeper?
Herman
It is the speed of the iteration loop. Think about the Reinforcement Learning from Human Feedback—RLHF—process. Traditionally, you need thousands of humans to sit in a room and rank outputs so you can train a "Reward Model." That Reward Model then teaches the main LLM what humans like. But humans are slow, they get bored, and they disagree with each other.
Corn
So we replace the humans with an LLM judge to create "RLAIF"—Reinforcement Learning from AI Feedback.
Herman
And that allows you to run those training loops much, much faster. Anthropic has been very open about using this for Claude. They call it "Constitutional AI." They give the judge model a "Constitution"—a set of rules like "be helpful, be harmless, be honest"—and the judge uses those rules to evaluate and refine the model's behavior. It allows them to scale safety and alignment in a way that would be impossible if they had to wait for human reviewers to look at every single edge case.
Corn
I want to circle back to the practical side of this for a second. We mentioned MT-Bench and Chatbot Arena. For someone who isn't a researcher, why should they care about these benchmarks? Are they actually better than the old school benchmarks?
Herman
They are fundamentally different. Old benchmarks like MMLU—which is a massive multiple-choice test—are basically checking for "knowledge." Can the model pass a bar exam or a biology test? But being a walking encyclopedia doesn't make you a good assistant. MT-Bench is different because it uses an LLM judge to grade multi-turn conversations. It asks the model to write code, then asks it to explain that code, then asks it to find a bug.
Corn
It’s testing the "flow" of conversation.
Herman
Right. And because it uses a judge model to score those responses on a scale of one to ten, it gives us a much better sense of how "useful" the model feels in a real-world scenario. The Chatbot Arena is the "ground truth" because it uses real human votes, but MT-Bench is the "lab version" that developers can run locally to see if their latest fine-tune is actually making the model smarter or just making it better at multiple-choice tests.
Corn
You mentioned something called G-Eval earlier. How does that differ from just "prompting GPT-4 to be a judge"?
Herman
G-Eval is a specific framework that tries to solve the "black box" problem of LLM judging. It uses a technique where the judge doesn't just output a score; it outputs a probability distribution over the scores. It basically says, "I am seventy percent sure this is a four, and thirty percent sure it is a five." By looking at those probabilities, developers can get a sense of how "certain" the judge is. If the judge is split fifty-fifty, that’s a sign that the prompt is ambiguous or the two responses are indistinguishable. It adds a layer of statistical rigor to what is otherwise a very "vibey" process.
Corn
So it’s not just a gut feeling; it’s a quantified gut feeling. I love that. But what about the cost? Using GPT-4o as a judge for every single experiment must be prohibitively expensive for a startup. Are there "Budget Judges" appearing?
Herman
That is where models like Prometheus or JudgeLM come in. These are open-source models—often based on Llama or Mistral—that have been specifically fine-tuned on "evaluation data." Instead of being a general-purpose assistant, their entire "brain" is dedicated to following rubrics and being a fair judge. The goal is to get GPT-4 level judging capability but in a smaller, seven-billion or thirteen-billion parameter model that you can run on a single GPU in your own office.
Corn
That seems like a massive win for privacy, too. If I am a law firm or a hospital and I want to evaluate a model on sensitive data, I can't be sending all those "Student" responses to an external API just to be graded. I need a "Judge in a Box" that stays behind my firewall.
Herman
We are seeing a lot of movement there. But even with a "Judge in a Box," you still have to worry about the "Human Agreement" metric. Every time a new judge model or framework comes out, the first thing researchers do is check how often the AI judge agrees with a panel of human experts. If the AI judge agrees with humans eighty-five percent of the time, and two humans only agree with each other eighty percent of the time, then the AI judge is technically more "human" than a human is.
Corn
That is a terrifying thought. We have reached "Peak Human" and now we're just trying to simulate it. But it does raise a question about the "five percent" or "ten percent" where they disagree. What are the humans seeing that the AI is missing? Is it something like "soul" or "creativity," or is it just that the AI is too rigid?
Herman
It’s often about nuance and real-world common sense. A judge model might see a response that is technically perfect but socially tone-deaf. Or it might fail to catch a very subtle "hallucination" because the lie is buried in a very confident-sounding sentence. Humans are still better at spotting when something feels "off" in a way that defies a rubric. But for "Does this Python script work?" or "Is this summary accurate?", the AI judge is often more meticulous than a tired human reviewer.
Corn
I’m thinking about the future of this. If we have models training on data judged by other models, are we creating a feedback loop that eventually drifts away from what humans actually want? It’s like that game of "Telephone." If the judge has a slight bias toward being overly polite, and it grades the student model, and then we use that student model's best outputs to train the next generation... eventually, the AI is just going to be a giant, hyper-polite blob that doesn't actually answer our questions anymore.
Herman
That is the "Model Collapse" theory. If you train on too much synthetic data that has been filtered by an AI judge, you lose the "edge cases" of human language. You lose the slang, the weird metaphors, the grit. You get this "Average AI" style. To prevent that, we have to keep "injecting" real human preference data back into the system. You can't just let the AI grade itself forever. You need a "Human-in-the-loop" audit every few months to make sure the judge hasn't lost its mind.
Corn
So the humans are the "North Star," and the AI judge is just the high-speed compass trying to point us there. I can live with that. Let's talk takeaways. If our listeners are building something with LLMs right now—maybe a small RAG app or a customer service bot—how should they start using this?
Herman
The first step is to stop just "looking" at your outputs and start "measuring" them. Pick twenty "Golden Prompts"—the most important things your bot needs to do well. Get a human to write the perfect answer for those twenty. Then, every time you change your code or your model, use a judge model (like GPT-4o or even a specialized model like Prometheus) to compare your new output to that "Golden" human reference. Don't worry about complex scoring yet. Just ask the judge: "Is this new answer as good as the human's? If not, why?"
Corn
That "Why" is the most important part. Don't just take the score. Read the justification. If the judge says, "The model missed the customer's account number," that gives you a concrete bug to fix. It’s moving from "I think it’s better" to "I know it’s better because it’s catching ten percent more account numbers."
Herman
And if you are doing pairwise comparisons, remember the "Swap the Order" trick. It sounds simple, but it is the easiest way to make your evaluations ten times more reliable. If the judge picks A when it's on the left, but picks B when it's on the right, you have a tie. Throw that data point out. It’s noise.
Corn
It’s like being a good scientist. You have to control for the variables, even if the "variable" is just the AI's weird preference for the left side of the screen. What about the "Verbosity" issue? Is there a way to tell the judge, "Hey, don't be fooled by the length"?
Herman
You can actually build penalties into your rubric. You can tell the judge, "If the answer is longer than two hundred words, deduct one point for every fifty extra words unless the length is absolutely necessary." You have to be the judge's boss. You have to tell it exactly what you value. If you value speed and conciseness, you have to build that into the "Law" of your evaluation script.
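(The length penalty Herman describes can also be applied as a post-hoc adjustment to the judge's score. A minimal sketch; the 200-word budget and 50-word step are just the illustrative numbers from the conversation:)

```python
# Post-hoc length penalty matching the rubric described above: deduct one
# point per 50 words beyond a 200-word budget, never dropping below the
# scale's floor. All thresholds are illustrative.

def length_penalized(score, response_text, budget=200, step=50, floor=1):
    """Deduct 1 point per `step` words over `budget`, clamped at `floor`."""
    overflow = max(0, len(response_text.split()) - budget)
    penalty = overflow // step
    return max(floor, score - penalty)
```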
Corn
It really does come back to the prompt. Even the judge needs a good manager. I think the big takeaway for me is that LLM-as-a-Judge is not a "set it and forget it" thing. It’s a tool that needs to be calibrated. You have to judge the judge.
Herman
Run a small batch of your data through the AI judge and a human at the same time. If they agree ninety percent of the time, great! You can trust the AI for the next ten thousand rows. If they only agree fifty percent of the time, your rubric is broken or your judge model isn't smart enough for the task.
Corn
It’s about building a hierarchy of trust. Humans trust the "Super Judge," the "Super Judge" grades the "Expert Judge," and the "Expert Judge" grades the "Student." It’s turtles all the way down, but at least the turtles are moving at the speed of light.
Herman
We are entering an era where "Evaluation" is going to be a bigger job than "Development." Writing the code for an AI app is relatively easy now. Proving that the AI app is safe, accurate, and better than it was yesterday? That is the real work. And LLM-as-a-Judge is the only tool we have that is powerful enough to do that work at scale.
Corn
It’s a fascinating shift. We’re moving from being "Coders" to being "Critics." We are spending our days defining what "Good" looks like so that the machines can go off and try to achieve it.
Herman
And that brings us back to Daniel's prompt. This isn't just a technical niche. This is how the next generation of AI, the systems that will be handling our medical records and our legal contracts, is being built and vetted right now. If we don't understand how the "Judge" works, we won't understand why the AI is behaving the way it is.
Corn
Well, I for one welcome our new robot HR department. As long as they don't have a "Position Bias" against sloths. I usually prefer the "Response B" position anyway—it feels more relaxed.
Herman
I think you're safe, Corn. Most judges would find your "conciseness" very efficient, even if it's just because you're too lazy to type more.
Corn
Hey, that’s "Conciseness Bias," and I am here for it. This has been a great deep dive. It’s one of those topics that sounds "inside baseball" until you realize it’s actually the foundation of the whole stadium.
Herman
It really is. Without automated evaluation, the whole LLM revolution would have stalled out six months after GPT-4 launched. We just couldn't have kept up.
Corn
Well, I think we've given everyone a lot to chew on. From the Narcissus Effect to the Victorian Butler problem, the world of AI judging is a lot weirder than a simple "one to five" scale.
Herman
It’s a mirror of us, really. All our biases, all our preferences, just codified into a system that can run a million times faster than we can.
Corn
On that note, we're going to wrap this one up. Big thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power this show. They are the reason we can run these complex models and get these insights out to you every week.
Herman
If you found this useful, or if you're a developer who has discovered a new bias in your own judge models, we'd love to hear from you. You can find us at myweirdprompts dot com. We have the full archive there, plus all the ways to subscribe so you never miss an episode.
Corn
And if you're listening on Spotify or Apple Podcasts, do us a solid and leave a review. It really does help the "Algorithm Judge" decide that we're worth showing to more people. Apparently, even podcasts are subject to the "vibe check."
Herman
See you next time.
Corn
Stay curious. This has been My Weird Prompts. Bye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.