Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model is Mercury 2, built by a lab called Inception. Herman, set the scene for us. Who are these people?
Inception, full name Inception Labs, is based in Palo Alto. The company was founded by Stefano Ermon, who is a Stanford computer science professor, and the core thesis of the lab is pretty specific. They are not trying to build a general-purpose frontier model that competes with OpenAI or Anthropic on every dimension. They are betting that diffusion-based language models are a fundamentally better architecture for certain classes of problems, and they have been building that out since the beginning.
Diffusion is the word we are going to keep coming back to today.
We will get into the mechanics in the next segment. But for context on the lab, the model lineup right now is three models. There is Mercury, the original, which was their first public diffusion language model. There is Mercury Coder, which is the code-focused variant. And now there is Mercury 2, which is what we are covering today. So this is not a one-off release. There is a clear product family developing here.
What does the funding picture look like? Because that tells you something about how seriously the industry is taking the bet.
Fifty million dollars raised, led by Menlo Ventures. The names that participated are worth noting. NVIDIA's venture arm, NVentures, is in. Microsoft's venture fund, M12, is in. Snowflake Ventures is in. When you see NVIDIA writing a check into a company that is specifically claiming to use GPU hardware more efficiently than standard autoregressive models, that is an interesting signal. It does not prove the architecture works at scale, but it suggests people who understand the hardware layer think there is something real here.
On the enterprise side, they have started landing actual platform integrations.
Amazon Bedrock, Azure AI Foundry, and a few others. So this is not purely a research lab releasing weights and waiting. They are actively building a commercial distribution layer around the technology.
Let us talk about what Mercury 2 actually is under the hood, because the architecture is genuinely different from what most people are working with day to day.
Right, and this is the part where we have to spend a minute on the underlying mechanism, because the speed claims and the benchmark numbers only make sense once you understand why the generation process works differently. Standard large language models, the ones most people are building on, are autoregressive. That means they produce one token at a time, left to right, each token conditioned on everything that came before it. It is sequential by design. You cannot generate token five until you have generated token four.
That sequential dependency is what creates the throughput ceiling.
The GPU is sitting there waiting on each step before it can do the next one. Mercury 2 is a discrete diffusion model. The mechanism is different. Instead of generating tokens one at a time in sequence, it generates a draft of the full output and then refines it iteratively. Multiple tokens are being worked on in parallel across those refinement passes. The GPU utilization profile is fundamentally different.
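To make the contrast concrete, here is a toy sketch of the two generation styles. This is purely illustrative pseudologic, not Mercury 2's actual algorithm; the token values and the number of refinement passes are placeholders.

```python
# Toy contrast between sequential and draft-and-refine generation.
# Not Mercury 2's actual algorithm; tokens and pass counts are placeholders.

def autoregressive_generate(n_tokens):
    """Sequential: token i cannot be produced until tokens 0..i-1 exist."""
    out = []
    for i in range(n_tokens):
        out.append(f"tok{i}")  # each step waits on the previous one
    return out, n_tokens       # n_tokens sequential steps

def diffusion_generate(n_tokens, n_passes=4):
    """Parallel: a full-length draft is refined across a few passes."""
    draft = ["MASK"] * n_tokens  # start from a complete (masked) draft
    for _ in range(n_passes):
        # every position is updated in the same pass, in parallel
        draft = [f"tok{i}" for i in range(n_tokens)]
    return draft, n_passes       # n_passes steps, far fewer than n_tokens

_, ar_steps = autoregressive_generate(100)
_, diff_steps = diffusion_generate(100)
print(ar_steps, diff_steps)  # 100 4
```

The point of the sketch is the step count: one hundred sequential dependencies versus a handful of parallel refinement passes, which is where the different GPU utilization profile comes from.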
Inception is calling it a reasoning diffusion LLM, which is a category they are claiming to have created.
That is the claim, yes. Mercury 2 is described as the first reasoning dLLM. The reasoning capability is not bolted on as a separate model. It is integrated into the diffusion process, and importantly, the reasoning level is tunable via the API. There is a reasoning parameter that developers can adjust. Now, the model card does not give us the full mechanics of what that parameter controls, no documented range, no examples of what low versus high reasoning looks like in practice, so that is a gap we cannot fill in from the available documentation. But the fact that reasoning tokens are exposed separately in the API response, in a reasoning details array, suggests this is a first-class feature rather than a wrapper.
What does the context window look like?
One hundred and twenty-eight thousand tokens context, with a maximum output of fifty thousand tokens. Those are competitive numbers. The fifty thousand token output ceiling is notably generous compared to a lot of models in this tier, which matters if you are generating long documents or running extended agentic loops.
On the integration side?
OpenAI API compatible, which means existing tooling largely works without modification. Native tool use is supported. Structured JSON output with schema alignment is supported. Cache read is supported, which implies some form of key-value cache equivalent is in play, even though the generation mechanism is not autoregressive in the traditional sense. And the data policy is zero prompt retention, no prompt training, with moderation left to the developer.
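Since the API is described as OpenAI-compatible with a tunable reasoning parameter and schema-aligned JSON output, a request body might look something like the sketch below. The model slug, the shape of the `reasoning` field, and the schema layout are assumptions based on the episode's description, not verified documentation; check the provider's docs for the actual names.

```python
# Hypothetical OpenAI-compatible request body for Mercury 2.
# The model slug and the "reasoning" field shape are assumptions,
# not taken from official documentation.

payload = {
    "model": "inception/mercury-2",  # hypothetical provider slug
    "messages": [
        {"role": "system", "content": "You are a terse coding assistant."},
        {"role": "user", "content": "Write a function that reverses a list."},
    ],
    "reasoning": {"effort": "low"},  # tunable reasoning level (assumed shape)
    "response_format": {             # structured output aligned to a schema
        "type": "json_schema",
        "json_schema": {
            "name": "code_answer",
            "schema": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
}

print(payload["model"])
```

Because the interface is OpenAI-compatible, existing client libraries should work by pointing their base URL at the provider's endpoint and sending a body like this, with reasoning tokens coming back separately in the response.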
One thing we should flag for listeners: parameter count is not disclosed. The model card does not list it.
We cannot do an apples-to-apples size comparison with other models in this tier. That is a real gap. Whether that is competitive positioning or just an omission, we do not know.
Let's talk about what it costs to run this thing, because the pricing story is actually part of the pitch here.
It is, and before we get into the numbers, I should flag the caveat we always run on this series. All pricing we are about to cite is as of April twenty, twenty twenty six. These numbers shift, sometimes weekly, so treat them as a snapshot rather than a guarantee.
What are we looking at?
Through OpenRouter, which is currently the only host platform listed, input is twenty-five cents per million tokens and output is seventy-five cents per million tokens. Cache reads come in at two and a half cents per million tokens.
That cache read price is very low.
It is, and it matters more than it might look at first glance. The observed cache hit rate on OpenRouter telemetry over the past hour was forty-eight point one percent. So roughly half of input tokens are being served from cache at a tenth of the standard input price. If you are running workloads with repetitive system prompts, shared context, or high-volume agent loops hitting similar prefixes, the effective input cost drops meaningfully below that twenty-five cent headline figure. The weighted average input price observed over that same window was about fourteen cents per million, which reflects that cache effect in practice.
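The weighted-average figure checks out as straightforward arithmetic on the numbers cited in the episode:

```python
# Back-of-envelope check of the weighted input price cited above.
INPUT_PRICE = 0.25      # $/M tokens, standard input
CACHE_PRICE = 0.025     # $/M tokens, cache read (a tenth of standard)
CACHE_HIT_RATE = 0.481  # observed OpenRouter cache hit rate

weighted = CACHE_HIT_RATE * CACHE_PRICE + (1 - CACHE_HIT_RATE) * INPUT_PRICE
print(round(weighted, 3))  # 0.142 -- about fourteen cents per million
```

So the roughly fourteen-cent effective figure follows directly from the hit rate and the two price points, which is why the cache rate matters more than the headline number.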
How does that stack up against the models Inception is explicitly comparing themselves to?
The supplementary research puts Claude 4.5 Haiku and GPT-5 Mini somewhere in the range of two and a half to six and a half times more expensive on input. We do not have an exact figure from Inception themselves, they use the phrase "fraction of the cost" without quantifying it, but the directional claim holds up against the numbers we do have.
No tiered pricing, no batch discounts mentioned?
One provider, one price tier, as of the date we pulled this.
Let's get into what the benchmarks actually show, because the speed story is where this model leads, and it is also where the numbers need some unpacking.
Right, so the headline claim from Inception is greater than one thousand tokens per second on standard GPUs, and they frame that as five times or more faster than Claude 4.5 Haiku and GPT-5 Mini. Independent testing from reviewers at Awesome Agents and Artificial Analysis puts the observed throughput somewhere between six hundred and sixty and twelve hundred tokens per second in direct hardware testing, which broadly supports the lab's claim under controlled conditions.
The OpenRouter telemetry we pulled tells a different story.
The observed API throughput through OpenRouter is around one hundred and forty-five tokens per second. That is roughly seven times below the claimed figure. Now, I want to be careful here because this is not necessarily the lab being misleading. There are a few plausible explanations. API overhead, request batching, network round-trips, the measurement methodology for raw hardware throughput versus what you actually see at the API layer. These are different things. But if you are an engineer evaluating this for a latency-sensitive production system, one hundred and forty-five tokens per second at the API level is the number you need to plan around, not the headline figure.
What about time to first token? Because for some workloads that matters as much as throughput.
This is the honest trade-off in the diffusion architecture. Because the model generates and refines a full draft before streaming begins, time to first token is around three and a half seconds in independent testing. That is two to three times slower than Claude 4.5 Haiku or GPT-5 Mini. The OpenRouter telemetry shows zero point two eight seconds, which seems inconsistent with those independent measurements, so I would treat that figure with some caution until we have more data points. The practical upshot is that once the model starts streaming, it is very fast, but the wait before streaming begins is longer than comparable autoregressive models.
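One way to see the trade-off is to compute total wall-clock time for a response of a given length under each profile. Mercury's numbers below are the episode's API-level figures; the autoregressive comparison point (1.2 seconds to first token, 80 tokens per second) is a hypothetical stand-in, not a measured competitor.

```python
# Wall-clock comparison of the two latency profiles discussed.
# Mercury figures are the episode's API-level numbers; the autoregressive
# comparison point is a hypothetical stand-in, not a measured model.

def total_seconds(ttft, tokens_per_sec, n_tokens):
    """Seconds until the full response has finished streaming."""
    return ttft + n_tokens / tokens_per_sec

# Long output: high throughput wins despite the slower start.
long_mercury = total_seconds(ttft=3.5, tokens_per_sec=145, n_tokens=2000)
long_autoreg = total_seconds(ttft=1.2, tokens_per_sec=80, n_tokens=2000)

# Short output: time to first token dominates, so the profile flips.
short_mercury = total_seconds(ttft=3.5, tokens_per_sec=145, n_tokens=100)
short_autoreg = total_seconds(ttft=1.2, tokens_per_sec=80, n_tokens=100)

print(long_mercury < long_autoreg)    # Mercury finishes long outputs first
print(short_mercury < short_autoreg)  # but loses on short, chat-length ones
```

Under these assumptions the crossover is the whole story: long generations favor the diffusion profile, short interactive turns favor the fast-start autoregressive one.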
It is a different latency profile, not a uniformly better one.
Exactly the right framing. Now on quality benchmarks, the picture is competitive in the mid-tier. AIME 2025, which is a competitive mathematics benchmark, comes in at ninety-one point one percent. That is above Claude 4.5 Haiku at roughly eighty-five percent and GPT-5 Mini at roughly eighty percent according to the same independent testing. LiveCodeBench for coding is sixty-seven point three percent, again ahead of both comparables. GPQA Diamond, graduate-level science reasoning, lands between seventy-three and seventy-seven percent depending on the run, which is solid for this tier.
Where does it fall short?
The upper end of reasoning is where you see the ceiling. Humanity's Last Exam, which is designed to be extremely hard, comes in at fifteen point five percent. CritPt, which is research-level physics, is zero point eight percent. These are not failures exactly, most models in this tier score similarly, but they confirm that Mercury 2 is not competing with frontier reasoning models like o3 or Gemini 3. The Artificial Analysis Intelligence Index puts it at thirty-two point eight, which is better than seventy-two percent of compared models, and the coding index at thirty point six beats seventy-seven percent. Those are solid mid-tier numbers, not top-of-leaderboard numbers.

The verbosity issue that reviewers flagged?
One reviewer observed that Mercury 2 generated around sixty-nine million tokens on an evaluation suite where the median was twenty-six million. That is a significant verbosity gap. Whether that affects your use case depends on what you are building, but if you are paying per output token and the model is generating two to three times more tokens than necessary, that changes the cost calculus we talked about in the last segment.
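The cost impact of that gap is easy to quantify against the output price quoted earlier, treating the reviewer's token counts as given:

```python
# What the verbosity gap does to output spend at the quoted rate.
OUTPUT_PRICE = 0.75   # $/M output tokens
mercury_tokens = 69   # millions generated on the evaluation suite
median_tokens = 26    # millions, the median for comparable models

mercury_cost = mercury_tokens * OUTPUT_PRICE
median_cost = median_tokens * OUTPUT_PRICE
print(mercury_cost, median_cost)  # 51.75 19.5
```

Roughly fifty-two dollars of output versus nineteen and a half on the same suite, which is the sense in which verbosity is a direct multiplier on the per-token price advantage.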
Let us talk about where you would actually reach for this. Given everything we have covered on the architecture and the latency profile, who is the natural customer here?
The clearest fit is agent loops. When you have a pipeline where the model is being called repeatedly, sometimes dozens of times per task, the throughput advantage compounds in a way that matters. The verbosity issue we flagged is a real cost consideration, but if the model is completing steps faster and you are chaining those steps, the wall-clock time on the overall task can still come out ahead. OpenClaw and Hermes Agent are the two highest-volume users on the OpenRouter telemetry, and both are explicitly agentic workloads. OpenClaw is an AI agent for messaging apps handling commands, web browsing, file management, email. Hermes Agent from Nous Research is a persistent self-improving agent with over forty tools. These are not chat interfaces. They are systems where throughput is the bottleneck, not the first-token wait.
What about the coding use case? The lab lists that prominently.
Coding workflows where latency compounds is how they phrase it, and that framing is doing some work. The argument is that if you are running a code generation loop, a test, a fix cycle, the time savings per iteration stack up. ZimmWriter is the second-highest token consumer in the usage data, which is a content and writing tool rather than pure code, but the pattern is similar. High-volume, repeated generation where you want throughput over interactivity. LiveCodeBench at sixty-seven point three percent and the Coding Index placing it better than seventy-seven percent of compared models gives you some confidence that the quality is there to support those workflows, not just the speed.
Where would you steer people away from it?
Interactive chat is the obvious one. If your application depends on a response starting to appear quickly, the three and a half second time to first token is a meaningful user experience problem. A human sitting at a chat interface waiting three and a half seconds before anything appears is going to feel slow regardless of how fast the tokens arrive after that. Voice interfaces have the same issue. The model is not designed for that latency profile.
There is no multimodal path here at all.
No vision input, no audio, no image understanding. This is a text-in, text-out model. If your workload involves processing documents with embedded images, screenshots, or any kind of visual context, Mercury 2 is not in the conversation. That is not a criticism, it is just a scope boundary you need to know before you start evaluating it.
Clean lines on both sides then. Fast pipelines yes, interactive and multimodal no.
How has the industry actually landed on this one? You have been tracking the coverage since launch.
The reception splits pretty cleanly along a fault line. On one side you have the production engineering crowd, people building pipelines and agent systems, and they are excited. The framing that keeps coming up in reviews and coverage is something like diffusion revolution for production workflows, and that is not just marketing language being repeated back. The throughput numbers are real enough that engineers are taking them seriously. Awesome Agents ran a hands-on review and clocked output throughput between roughly six hundred and twelve hundred tokens per second, and described the experience of watching it stream as uncanny fast once it gets going. That matches what the lab claims directionally, even if the API-level numbers we cited from OpenRouter are considerably lower.
The speed story is holding up under independent testing, at least in some conditions.
In raw throughput terms, yes. The caveat that keeps appearing alongside those numbers is the time to first token. The Leave It to AI review put it at three point four six to three point four eight seconds, and explicitly flagged that as two to three times slower than Claude Haiku. That is the honest benchmark caveat that reviewers are surfacing, and it is the right one to surface. The model is fast in the way a freight train is fast. Once it is moving, nothing touches it. But the departure from the station takes longer than a sports car.
Verbosity came up in the brief as well.
It did, and Artificial Analysis flagged it in their analysis. Mercury 2 generated around sixty-nine million tokens on their evaluation suite, against a median of roughly twenty-six million for comparable models. That is not a trivial difference. If you are paying per output token, and you are, verbosity is a direct cost multiplier. It is worth stress-testing on your specific workload before you assume the price advantage holds.
On the enterprise side, the partnerships are notable.
AWS Bedrock, Azure AI Foundry, SageMaker JumpStart. Those are not partnerships you get announced without some level of enterprise validation. The fifty million dollar raise with Menlo Ventures leading and NVIDIA's venture arm, Microsoft's M12, and Snowflake Ventures participating also signals that institutional money has looked at the technology and decided it is credible. That is not a guarantee of anything, but it is a meaningful signal.
Any red flags in the coverage worth naming?
The intelligence ceiling is the honest one. Artificial Analysis puts the Intelligence Index at thirty-three out of one hundred, which places it well above average for its price tier but well below the frontier models. Reviewers are consistent on this. Mercury 2 is not competing with o3 or Gemini at the high end. It is competing with Haiku and Mini class models, and in that comparison it looks strong on speed and price and roughly comparable on quality. That framing matters. If someone evaluates it expecting frontier-level reasoning and finds mid-tier results, they will be disappointed. If they evaluate it as a fast, cost-efficient workhorse for high-throughput pipelines, the evidence supports that case.
Let us land this. You have spent the last twenty minutes walking through the architecture, the benchmarks, the speed story, the caveats. If someone is listening to this and they are deciding whether Mercury 2 goes on their shortlist, what is the honest version of when you reach for it?
The honest version is that it is a workload-specific tool, not a general-purpose upgrade. The profile it fits is high-throughput, cost-sensitive, and latency-tolerant on the front end. Agent loops are the clearest case. If you are running a pipeline where the model is called repeatedly, where output volume compounds across many turns, and where the bottleneck is tokens-per-second rather than time to first token, Mercury 2 is interesting. The throughput numbers hold up under independent testing. The pricing undercuts comparable models by a meaningful margin. And the observed forty-eight percent cache hit rate on OpenRouter suggests real-world usage patterns are already benefiting from the cache read pricing, which at two and a half cents per million tokens is very low.
The cases where you would not reach for it?
Interactive chat is the obvious one. If your user is watching a cursor and expecting a response to start appearing in under a second, the three and a half second time to first token is going to feel slow regardless of what happens after. That is not a flaw in the model, it is a consequence of the architecture. The diffusion approach drafts and refines before it streams, and that process takes time. The freight train analogy holds. Also, if your task requires frontier-level reasoning, the intelligence index score is honest about where this sits. It is a strong mid-tier model. It is not o3. It is not Gemini at the high end. Expecting it to perform at that level will lead to disappointment.
The verbosity issue is worth naming one more time before we close.
Sixty-nine million tokens on the evaluation suite against a median of twenty-six million. If your cost calculation is built on the assumption that output volume will be typical, test that assumption before you commit. The price advantage is real, but verbosity can erode it faster than the headline numbers suggest.
The short version: agent pipelines, real-time RAG, high-volume code generation, cost-sensitive throughput work. Not interactive chat, not frontier reasoning tasks, and stress-test output volume before you price the project.
That is the verdict. It is a narrow fit, but within that fit, the case is solid.