#2348: AI Model Spotlight: Mercury 2

Explore Inception Labs’ Mercury 2, a groundbreaking diffusion-based language model that rethinks text generation and reasoning.

Episode Details
Episode ID
MWP-2506
Published
Duration
19:37
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Claude Sonnet 4.6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Introducing Mercury 2: A New Approach to Text Generation
Mercury 2, developed by Inception Labs, represents a significant departure from traditional autoregressive language models. Instead of generating text sequentially, Mercury 2 employs a diffusion-based architecture, inspired by techniques used in image generation. This allows it to generate and refine multiple tokens in parallel, unlocking potential speed advantages over conventional models.

Architectural Innovation
At its core, Mercury 2 is a diffusion large language model (dLLM). While diffusion models are well-established in image generation, applying them to text at commercial scale is groundbreaking. The model’s parallelism reduces reliance on sequential dependencies, enabling faster processing on existing GPU infrastructure. This architectural shift is supported by strategic investments from NVIDIA and Microsoft, signaling confidence in its viability.

Reasoning and Capabilities
Mercury 2 introduces tunable reasoning, allowing developers to adjust the level of reasoning applied to each API call. This flexibility is particularly useful for applications that require varying degrees of computational effort. The model also supports native tool use, structured outputs, and an OpenAI-compatible API, making it accessible to developers already familiar with existing ecosystems.
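Because the API is OpenAI-compatible, a call with per-request reasoning control looks like any chat-completions request plus one extra field. The sketch below is illustrative only: the model slug and the name and shape of the reasoning field are assumptions, not documented values, so check Inception's or your provider's documentation for the real parameter.

```python
# Illustrative payload for an OpenAI-compatible chat-completions call to
# Mercury 2. The model slug and the "reasoning" field's name and values
# are HYPOTHETICAL; consult the provider documentation for the real ones.

def build_request(prompt: str, reasoning_level: str = "low") -> dict:
    """Assemble a chat-completions payload with a per-call reasoning knob."""
    return {
        "model": "inception/mercury-2",            # hypothetical slug
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"effort": reasoning_level},  # hypothetical field
    }

payload = build_request("Summarize this changelog.", reasoning_level="high")
# With the official openai client, a non-standard field like "reasoning"
# would typically be passed via extra_body:
#   client.chat.completions.create(
#       model=payload["model"], messages=payload["messages"],
#       extra_body={"reasoning": payload["reasoning"]})
```

Since reasoning tokens are tracked separately in the API response, dialing this per call lets cost scale with how much deliberation a request actually needs.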

Performance and Benchmarks
Inception claims Mercury 2 generates over 1,000 tokens per second on standard GPUs, though observed throughput in single-request scenarios is substantially lower. Benchmarks show strong performance on structured reasoning tasks such as mathematics and code generation, but weaker results on broad knowledge retrieval and frontier research-level problems.
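A rough latency model helps reconcile headline and observed figures. The sketch below simply combines the first-token latency and steady-state throughput observed on the provider card; it ignores batching, network variance, and load, so treat it as a back-of-envelope estimate rather than a performance guarantee.

```python
def est_response_seconds(n_tokens: int,
                         ttft: float = 0.28,            # observed first-token latency (s)
                         tokens_per_sec: float = 140.0  # observed single-request throughput
                         ) -> float:
    """Back-of-envelope end-to-end latency: time to first token plus
    generation time at steady throughput."""
    return ttft + n_tokens / tokens_per_sec

# A 140-token reply lands in roughly 1.3 seconds under these figures.
print(round(est_response_seconds(140), 2))  # → 1.28
```

Under the lab's claimed 1,000+ tokens per second, the same reply would take well under half a second, which is why measurement conditions matter so much when comparing the two numbers.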

Use Cases
The model excels in latency-sensitive applications, such as coding workflows and real-time voice interfaces, where its fast token generation provides a notable advantage. However, its limitations in knowledge breadth make it less suitable for tasks requiring deep world understanding.

Pricing and Efficiency
Mercury 2’s pricing is competitive, with significant cost savings enabled by high cache hit rates. While its headline performance claims may not align with observed data, its architectural innovations and tunable reasoning make it a compelling option for specific use cases.
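The effective input price follows directly from the cache economics discussed in the episode: with roughly 48.1% of input tokens served from cache at the $0.025-per-million cache-read rate and the remainder at the $0.25-per-million standard rate, the blended cost lands near the observed fourteen-cent average.

```python
STANDARD_INPUT = 0.25    # $ per million input tokens (standard rate)
CACHE_READ = 0.025       # $ per million input tokens (cache read)
CACHE_HIT_RATE = 0.481   # observed share of input tokens served from cache

# Blend the two rates by the observed hit rate to get the effective price.
effective = CACHE_HIT_RATE * CACHE_READ + (1 - CACHE_HIT_RATE) * STANDARD_INPUT
print(f"${effective:.3f} per million input tokens")  # → $0.142 per million input tokens
```

Workloads with less prompt repetition will see a lower hit rate and an effective price correspondingly closer to the $0.25 headline figure.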

Conclusion
Mercury 2 is a bold experiment in rethinking text generation, offering speed and flexibility for targeted applications. While it may not replace autoregressive models entirely, it represents a promising alternative for developers seeking efficiency and parallelism in their workflows.


#2348: AI Model Spotlight: Mercury 2

Corn
Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model is Mercury 2, built by a lab called Inception, also known as Inception Labs. Herman, you brought this one to the table. Give us the lab first.
Herman
Inception Labs is a Palo Alto-based AI startup, and the thing that makes them worth paying attention to is that they are not building another autoregressive transformer. They came out of stealth in early 2025 with a specific thesis: that diffusion models, the same broad family of techniques behind image generation, could be applied to text at commercial scale. That is their whole identity as a lab.
Corn
They have been funded to actually pursue that thesis, not just blog about it.
Herman
They raised fifty million dollars in a seed round in November 2025, led by Menlo Ventures. The notable co-investors are NVentures, which is NVIDIA's venture arm, and M12, which is Microsoft's. There was also an earlier Mayfield-led investment before that round. So the backing is serious and the strategic investors are exactly who you would want if you are betting on a new hardware-adjacent inference paradigm.
Corn
The NVIDIA piece is interesting. That is not just financial validation.
Herman
It signals that the approach works on existing GPU infrastructure, which matters a lot for adoption. The CEO is Stefano Ermon, who is a Stanford professor and has published extensively on diffusion models. So the lab has genuine research lineage, not just a product team that licensed someone else's work.
Corn
Mercury 2 sits in a family of models, not just a one-off release.
Herman
The Mercury family currently has three members: the original Mercury, Mercury Coder which is their code-focused variant, and now Mercury 2, which is the flagship reasoning model. Mercury 2 was released on March 4, 2026. It is the one we are looking at today.
Corn
Let's get into what this model actually is, because the architecture is the whole story here. This is not another fine-tune on a transformer base.
Herman
No, it is genuinely different at the architectural level. Mercury 2 is what Inception calls a diffusion large language model, or dLLM. The reasoning part is new with this release, which is why they are calling it the first reasoning dLLM. But to understand why that matters, you have to understand what diffusion means in this context.
Corn
Walk us through that.
Herman
Standard autoregressive language models generate text one token at a time, left to right, each token conditioned on everything before it. That is the fundamental loop. Diffusion models work differently. The conceptual origin is in image generation, where you start with noise and iteratively refine it toward a coherent output. Inception has applied that same principle to text. Mercury 2 generates and refines multiple tokens in parallel rather than producing them sequentially.
Corn
The claim is that parallelism is where the speed comes from.
Herman
That is the core thesis, yes. If you are not bottlenecked by a sequential dependency chain, you can do a lot more work per unit of time on the same hardware. We will get into what the observed numbers actually look like when we hit the benchmarks segment, but architecturally that is the mechanism.
Corn
What about the reasoning capability specifically? Because diffusion for text generation is not brand new, but reasoning on top of it apparently is.
Herman
Right, the original Mercury family demonstrated the diffusion approach for general text and for code. Mercury 2 adds a reasoning layer, and importantly it exposes that as a tunable parameter through the API. So developers can dial the reasoning level rather than getting a fixed amount of chain-of-thought compute on every call. The model card mentions that reasoning tokens are tracked separately in the API response, which matters for cost accounting if you are building something that only needs heavy reasoning some of the time.
Corn
Though we should say we do not have a lot of detail on what those levels actually are in practice.
Herman
The API exposes a reasoning parameter, but the page does not specify how many levels there are, what the latency or cost delta looks like between them, or what the practical difference in output quality is. That is a gap. If you are evaluating this for a production system, you would need to test that yourself.
Corn
What else is notable on the capability side?
Herman
Native tool use is supported, which is table stakes for agentic work but worth confirming. Schema-aligned JSON and structured outputs are supported natively, not bolted on. The context window is one hundred and twenty-eight thousand tokens with a maximum output of fifty thousand tokens. It is OpenAI API compatible, so the integration lift is low if you are already in that ecosystem. And cache read is supported, which we will see reflected in the pricing numbers.
The parameter count is another thing: the model card does not list it, so we cannot do a direct apples-to-apples size comparison with other models in this tier.
Corn
What does Mercury 2 actually cost to run?
Herman
Before I get into the numbers, I should flag the caveat we always put on this segment. All pricing we are about to cite is as of April 20, 2026. These numbers shift, sometimes weekly, so check the current rates on the OpenRouter pricing page before you build anything around them.
Standard input is twenty-five cents per million tokens. Output is seventy-five cents per million tokens. Those are the headline rates. Cache read drops to two and a half cents per million tokens, which is a tenth of the standard input price.
Corn
There is a weighted average figure in there too.
Herman
Right, and this is worth paying attention to. The observed weighted average for input over the last hour of data we have is about fourteen cents per million tokens, not twenty-five. The reason is a forty-eight point one percent cache hit rate. Nearly half of all input tokens being served are coming from cache, so the effective cost is substantially lower than the rack rate.
Corn
That is a meaningful gap.
Herman
If your workload has significant prompt repetition, shared system prompts, or you are running a lot of similar queries, you could see effective input costs closer to that fourteen cent figure than the twenty-five cent headline. Output weighted average is essentially flat against the standard rate, seventy-four point nine cents versus seventy-five, so caching is not moving the needle on the output side.
Corn
What about tiered or batch pricing?
Herman
On the hosting side, OpenRouter is the sole listed provider at this point. There is no direct API pricing from Inception shown on this page, and no self-hosting option mentioned, so we cannot compare those alternatives.
Corn
One provider, no batch discounts, but that cache hit rate is doing some real work on the effective input cost.
Let us get into the performance numbers. What is Inception claiming on speed?
Herman
The headline claim is over one thousand tokens per second on standard GPUs. They also claim Mercury 2 is at least five times faster than Claude 4.5 Haiku and GPT-5 Mini. One review we found put the real-world figure even higher, citing a range of roughly six hundred and sixty to nearly twelve hundred tokens per second depending on conditions, and end-to-end latency of about one point seven seconds compared to fourteen to twenty-three seconds for autoregressive peers.
Corn
What does the observed data on OpenRouter actually show?
Herman
The provider card shows average throughput of one hundred and thirty-eight to one hundred and forty-five tokens per second, with an average end-to-end latency of zero point six two seconds and a first-token latency of zero point two eight seconds. So there is a significant gap between the lab claim of over one thousand tokens per second and what the OpenRouter performance tab is recording.
Corn
That is a big discrepancy. What explains it?
Herman
The page does not explain it directly, and we should be honest about that. The most likely explanation is that the lab's benchmark figure is measured under high-throughput batch conditions on specific hardware, while the OpenRouter observed figure reflects real-world single-request or low-concurrency traffic. Different measurement conditions produce very different numbers. Neither figure is necessarily wrong, but they are measuring different things, and if you are designing a system around throughput expectations, you need to test under your own load profile, not take either number at face value.
Corn
What about quality benchmarks?
Herman
The picture is mixed in an interesting way. GPQA Diamond, which tests graduate-level scientific reasoning, comes in at seventy-seven percent. That is a strong result for a model in this price tier. AIME 2025, the competitive mathematics benchmark, scored ninety-one point one percent according to one review, which is competitive with the Haiku and Mini class. Instruction following on IFBench is sixty-nine point eight percent, and the agentic index sits at thirty-nine point seven on Artificial Analysis, placing it above seventy-three percent of compared models.
Corn
Reasoning and math are holding up. Where does it fall down?
Herman
CritPt is the one that stands out. That is a research-level physics reasoning benchmark, and Mercury 2 scored zero point eight percent. That is not a rounding error, that is a genuine floor. HLE, Humanity's Last Exam, came in at fifteen point five percent, and the AA-Omniscience Accuracy figure is twenty point five percent, which suggests real limitations in knowledge breadth. GDPval-AA, which tries to measure performance on economically valuable tasks, is twenty-three percent. So the pattern is: strong on structured reasoning within a defined domain, noticeably weaker on broad knowledge retrieval and frontier research-level problems.
Corn
The tool use error rates?
Herman
The observed tool call error rate is four point eight nine percent, and structured output errors are coming in at two point five seven percent. For a model being positioned heavily at agentic and tool-use workloads, those are numbers you would want to pressure-test before committing to a production pipeline. Not disqualifying, but not something to wave past either.
Corn
Let us talk about where you would actually reach for this. Given everything we have just covered, the speed profile, the benchmark pattern, the error rates, what does the use case map look like?
Herman
The clearest fit is anywhere latency compounds. Coding workflows are the obvious one. If you are running a loop where a developer is waiting on model output before they can take the next action, shaving that latency has a multiplier effect on the whole experience. Mercury 2's first-token latency of under three hundred milliseconds is useful there, even if the throughput figure is closer to the observed one hundred and forty-five tokens per second than the lab's headline number.
Corn
The page itself calls out real-time voice and search interfaces.
Herman
Yes, and that makes architectural sense. Diffusion-based parallel generation means you are not waiting for a sequential chain to resolve before you get output. For a voice assistant or a search-augmented retrieval pipeline where you need to surface something fast and the query is reasonably bounded, that latency profile is a real advantage. The question is always whether the quality is sufficient for the task, and for search and retrieval augmented generation, where the model is largely synthesising retrieved content rather than drawing on deep world knowledge, the knowledge breadth limitations we flagged are less of a problem.
Corn
What about the agentic use case? The top apps by token volume are interesting here.
Herman
OpenClaw, which is described as an AI agent for messaging, file, and email automation, is the top consumer by volume at roughly one point four five billion tokens this month. Agent Zero, which is positioned as autonomous AI agents, is third at around six hundred and thirty-one million tokens. Hermes Agent, which apparently has memory and over forty tools, is fourth. So the actual usage pattern is heavily agentic, which tracks with the model's native tool use support and the low per-token cost. In a long-running agent loop, cost and latency both accumulate, and Mercury 2's pricing at twenty-five cents per million input tokens and seventy-five cents per million output tokens makes it economical to run at volume.
Corn
We should note there are two other top apps, ZimmWriter and Wire Pyramid Engine, that we cannot characterise because the page does not describe what they do.
Herman
They are significant by token volume, but we are not going to speculate about their use cases.
Corn
The hard limits on what it does not do.
Herman
No vision input, no audio, no embeddings, no reranking. If your application needs any of those, this is not your model. It is a text-in, text-out reasoning system, and the page gives no indication that is changing.
Corn
What is the broader reception looking like? Engineers, press, anyone who has actually put it through its paces.
Herman
Broadly positive, with some useful nuance. The clearest signal is from a detailed review on Awesome Agents, which rated it seven point four out of ten. The headline finding there was speed described as "uncanny," and they put some specific numbers on it: ten times faster than Claude four point five Haiku, fourteen times faster than GPT-5 Mini in their testing. They also flagged the cost comparison, calling it two and a half to six and a half times cheaper than peers at those speed tiers.
Corn
That tracks with what we have been saying about the pricing, though I want to be careful about the speed figures because we have already flagged the gap between the lab's headline number and what OpenRouter is actually observing.
Herman
Right, and the review numbers sit somewhere in between. Artificial Analysis has benchmarked throughput in the range of roughly six hundred and sixty to nearly twelve hundred tokens per second depending on conditions, which is meaningfully higher than the one hundred and thirty-eight to one hundred and forty-five tokens per second we see on the OpenRouter provider card. The honest read is that real-world throughput is hardware and load dependent, and each of these figures was measured under different conditions. What they agree on is that it is fast relative to autoregressive alternatives.
Corn
Is there a consensus view on where the quality sits?
Herman
Yes, and it is consistent across sources. Artificial Analysis places it at thirty-three on their Intelligence Index, which they note is well above the average of around nineteen to twenty for models in its price tier, and above seventy-two percent of compared models. The framing from multiple reviewers is mid-tier intelligence, competitive with the Haiku and Mini class, not competing with frontier models. Nobody is claiming otherwise, and to be fair, Inception is not claiming otherwise either.
Corn
Any criticisms worth noting?
Herman
The most substantive one is the verbosity flag. The Artificial Analysis data shows Mercury 2 generating around sixty-nine million tokens on the Intelligence Index, against a median of around twenty-six million for compared models. That is a lot of output to get to an answer, which has cost and latency implications depending on how you are using it. It is not a disqualifying issue, but if you are running high-volume inference, you want to account for it.
Corn
No major controversies, no red flags from the engineering community?
Herman
Nothing surfaced in the coverage we reviewed. The reception is positive without being credulous. The consistent note is that the speed advantage is real, the quality is appropriate for the price tier, and the knowledge breadth limitations we discussed are acknowledged rather than contested. For a model that has been out since March of this year, that is a reasonably clean early record.
Corn
Alright, let's land this. Herman, if someone is listening to this and they are trying to decide whether Mercury 2 belongs in their stack, what is the short version?
Herman
The short version is: if latency is a first-class constraint in your system, Mercury 2 is worth serious consideration. Real-time voice assistants, agentic loops where you are chaining multiple calls and the delays compound, high-throughput coding workflows where you are waiting on the model constantly. Those are the cases where the speed advantage translates directly into a better product or meaningfully lower infrastructure cost.
Corn
The structured output and tool use support makes it a practical choice for those agent pipelines, not just a theoretical one.
Herman
The tool call error rate of just under five percent and the structured output error rate of around two and a half percent are not perfect, but they are workable numbers for production agentic use, and the native support means you are not bolting something on. The real-world adoption data points the same direction. The top token consumers on OpenRouter right now are agent frameworks and automation tools. The market is already voting with its usage.
Corn
When do you not reach for it?
Herman
If your task requires deep research-level reasoning, the benchmarks are honest about the ceiling. The CritPt score of under one percent, the HLE score of fifteen and a half percent, the Omniscience accuracy of twenty and a half percent. Those are not numbers you paper over. If you are building something where the model needs broad, reliable factual knowledge or graduate-level scientific reasoning across domains, Mercury 2 is not the right tool. You want a frontier model and you are going to pay for one.
Corn
No vision, no audio, no open weights, licence terms we do not know. If any of those are requirements, the answer is also no.
Herman
Exactly the right framing. It is a focused tool. It does a specific set of things very well and it is honest, or at least the evidence is honest, about what it does not do.
Corn
For AI professionals building latency-sensitive text-based systems, it is a legitimate option at a competitive price point. For everything else, the gaps are real and the alternatives exist. That is Mercury 2 from Inception Labs. Thanks for listening to My Weird Prompts.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.