#2160: Claude's Latency Profile and SLA Guarantees

Claude is measurably slower than competitors—and Anthropic's SLA promises are even thinner than the latency numbers suggest. What enterprises actua...

Episode Details
Episode ID
MWP-2318
Published
Duration
24:57
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
claude-sonnet-4-6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Understanding Claude's Latency Problem

When people complain that Claude is slow, they're usually vague about it. But "slow" means different things depending on what you're measuring. Understanding the distinction between these metrics is essential for anyone building production systems on Claude's API.

The Five Latency Metrics That Matter

The inference community has standardized on five core measurements:

TTFT (Time to First Token) is the delay between sending a request and seeing the first character appear. For interactive chat applications, this dominates the user experience—it's what determines whether an interface feels responsive or frozen.

ITL (Inter-Token Latency) measures the time between successive tokens once generation has started. This affects how quickly text streams in after the first token arrives.

End-to-end latency is the total time from request to final token. This matters most for batch pipelines and non-interactive systems.

Tokens per second measures generation speed—how fast the model produces output once it's started.

Requests per second measures concurrency capacity—how many simultaneous requests the system can handle.

A model could have terrible TTFT but decent generation speed once it starts, or vice versa. The metrics capture different dimensions of performance.
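The metrics above can all be recovered from a single streamed response by recording when each chunk arrives. Below is a minimal sketch: `token_iterator` is a placeholder for whatever streaming client you use (any iterator that yields text chunks as they arrive), and the chunk count is treated as a rough proxy for token count.

```python
import time

def measure_stream(token_iterator):
    """Compute TTFT, mean ITL, and end-to-end latency for one streamed
    response. Assumes `token_iterator` yields chunks as they arrive."""
    start = time.monotonic()
    arrivals = []
    for _ in token_iterator:
        arrivals.append(time.monotonic())
    if not arrivals:
        raise ValueError("stream produced no tokens")

    ttft = arrivals[0] - start          # time to first token
    end_to_end = arrivals[-1] - start   # request to final token
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    tokens_per_sec = len(arrivals) / end_to_end if end_to_end > 0 else 0.0
    return {"ttft": ttft, "itl": mean_itl,
            "e2e": end_to_end, "tps": tokens_per_sec}
```

Run this against real production traffic rather than synthetic prompts: TTFT in particular depends heavily on prompt length and time of day.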

Why the Median is Misleading

Here's the critical insight most people miss: the median is almost meaningless for production systems. What actually matters is the p95—the ninety-fifth percentile. If your median TTFT is one second but your p95 is four seconds, roughly one in twenty requests feels dramatically slower than what you tested in your demo.

In a system handling a thousand requests per hour, that's fifty bad experiences per hour. High p95 variance is what generates user complaints, not the median. You can have an acceptable average and still have a product that feels broken to a meaningful percentage of your users.
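The median-versus-p95 gap is easy to demonstrate with the standard library. The sketch below uses `statistics.quantiles` with 5% steps, so the 19th cut point is the 95th percentile; the sample data is illustrative only.

```python
import statistics

def summarize(latencies_ms):
    """Median and p95 for a batch of TTFT samples in milliseconds."""
    cuts = statistics.quantiles(latencies_ms, n=20)  # 5% steps
    return {"median": statistics.median(latencies_ms),
            "p95": cuts[18]}  # 19th cut point = 95th percentile

# Illustrative distribution: 95% of requests fast, 5% slow tail
latencies = [1000] * 95 + [4300] * 5
stats = summarize(latencies)
# The median looks fine; the p95 exposes the tail the median hides.
```

This is why dashboards that chart only the mean or median systematically understate what your slowest users experience.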

The Benchmark Reality

In a March benchmark, engineer Kunal Ganglani ran a rigorous head-to-head comparison across five models with multiple prompt sizes and multiple runs each. The findings for Claude Sonnet 4 on long prompts were striking:

  • Median TTFT: 1,216 milliseconds
  • P95 TTFT: 4,288 milliseconds

That's 3.5 times the median—a massive spread.

Compare this to competitors:

GPT-4.1 on long prompts showed median TTFT of 1,670ms and p95 of 1,833ms—only about 10% higher than the median. Gemini 2.5 Flash had median 1,885ms and p95 2,014ms—basically flat. The variance was dramatically tighter.

Interestingly, Claude Haiku 4.5 was the fastest model in the entire benchmark, with median TTFT of 610ms on long prompts and p95 of 843ms. On short prompts, it hit 597ms—faster than OpenAI and Google. The problem is that Haiku isn't the model most enterprises default to for serious work.

Looking at provider-level speed rankings from BenchLM, Anthropic averages 52 tokens per second and a 3.3-second average TTFT. NVIDIA inference delivers 260 tokens per second. Mistral hits 126. Even DeepSeek, at 48 tokens per second, is in Anthropic's neighborhood. Anthropic ranks second-slowest by tokens per second among major providers.
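Tokens-per-second differences compound on long outputs. A back-of-the-envelope sketch using the provider averages quoted above (illustrative figures, not guarantees):

```python
def generation_time_s(output_tokens, tokens_per_sec, ttft_s=0.0):
    """Rough wall-clock time for one response: TTFT plus steady-state decode."""
    return ttft_s + output_tokens / tokens_per_sec

# Provider averages quoted above, for a 1,000-token response
claude_like = generation_time_s(1000, 52, ttft_s=3.3)   # roughly 22.5 s
nvidia_like = generation_time_s(1000, 260)              # roughly 3.8 s
```

For chat-length outputs the gap is tolerable; for long-form generation or agentic loops that chain many calls, it dominates total wall-clock time.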

Infrastructure Signals

In late March, an Anthropic engineer working on Claude Code posted on X that Anthropic was adjusting session limits during peak hours (5 AM to 11 AM Pacific, US business hours). About seven percent of users were hitting caps they wouldn't have hit before. Notably, this change affecting paying customers—Pro and Max subscribers—was announced informally on social media, not through official status pages or changelogs.

Two weeks earlier, Anthropic had run a promotion doubling usage limits during off-peak hours. This is a classic load-balancing move: incentivizing users to shift usage to times when GPU clusters have idle capacity. But the promotion applied only to Claude's app surfaces (web, desktop, mobile, Claude Code)—not the API. The developers actually building production systems got nothing.

This signals that Anthropic's priority in that moment was consumer ecosystem lock-in, not API customer relief.

What Anthropic Actually Guarantees

Standard Tier (default for all API users): Nothing. No uptime guarantee, no latency guarantee, no throughput guarantee. The API docs are explicit: all rate limits represent maximum allowed usage, not guaranteed minimums. Anthropic does not guarantee uninterrupted service. You're on best-effort infrastructure with no contractual recourse.

Priority Tier: Anthropic's committed-use product. You contact sales and commit to a specific number of input and output tokens per minute for 1, 3, 6, or 12 months. In exchange, you get one contractual commitment: a target of 99.5% uptime with prioritized computational resources. That's the only performance commitment. There's no latency guarantee, no TTFT guarantee, no minimum tokens per second. What Priority Tier actually does is prioritize your requests over standard-tier requests during peak periods. You're buying a place at the front of the queue. If Anthropic's infrastructure is running slow, you're slow too—just less likely to get a 529 server overloaded error.

This is queue priority, not performance. The documentation mentions "enhanced service levels," but what you're actually getting is reduced probability of being throttled, not guaranteed speed.
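If your failure mode is overload errors rather than slow-but-successful responses, the standard mitigation is jittered exponential backoff. A minimal sketch, assuming a hypothetical `send_request` callable and a stand-in exception type (substitute your client's real 529/overloaded error):

```python
import random
import time

class OverloadedError(Exception):
    """Stand-in for a 529 'server overloaded' error from your API client."""

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry `send_request` on overload with jittered exponential backoff.
    `send_request` is a placeholder for whatever callable issues the request."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except OverloadedError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error
            # 1s, 2s, 4s, ... with +/-50% jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Note that backoff only converts hard failures into delays; it does nothing for requests that succeed slowly, which is exactly the gap no current tier covers.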

Enterprise Tier: Custom-negotiated SLAs that are not publicly disclosed. Pricing analysis suggests five hundred to fifteen thousand dollars per month depending on deployment size. You get dedicated account management, priority support, audit logging, and compliance APIs. Presumably some negotiated performance commitments exist, but they're not public.

The Fast Mode Revelation

For Opus 4.6, Anthropic offers a fast mode in beta. Standard Opus 4.6 pricing: $5 per million input tokens, $25 per million output tokens. Fast mode: $30 input, $150 output. That's a six-times premium specifically for lower latency.

The existence of a six-times premium latency tier is an implicit admission that faster is possible. It's a resource allocation choice, not a technical ceiling. Standard-tier latency isn't the fastest Anthropic can go—it's the fastest they're willing to go at standard-tier margins.
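The premium is easy to quantify per request. A small cost sketch using the published per-million-token rates quoted above (the request sizes are arbitrary examples):

```python
def request_cost_usd(input_tokens, output_tokens, in_rate, out_rate):
    """Cost of one request given per-million-token rates in USD."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Opus 4.6 rates quoted above, for a 10k-in / 2k-out request
standard = request_cost_usd(10_000, 2_000, 5, 25)     # $0.10
fast     = request_cost_usd(10_000, 2_000, 30, 150)   # $0.60
```

At these rates the premium is a flat 6x regardless of request shape, which makes the tradeoff easy to model: pay six times more per call, or accept standard-tier latency.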

The Bottom Line

Claude is measurably slower than GPT-4 and Gemini at the p95 percentile—the metric that actually matters for production systems. Anthropic offers no latency guarantees even at Priority Tier. The six-times Fast Mode premium reveals that faster responses are possible; standard latency is a deliberate product decision, not an infrastructure limitation. For enterprises building production systems, the SLA situation is thinner than most cloud infrastructure would allow.


#2160: Claude's Latency Profile and SLA Guarantees

Corn
Alright, here's what Daniel sent us this week. He's asking about Claude's latency problem — specifically, how you'd actually measure it objectively, whether Anthropic offers any real SLA guarantees on enterprise tiers, and how those compare to standard compute SLAs across the rest of the industry. And the subtext here is pretty pointed: are developers being asked to build production systems on infrastructure that would get laughed out of any traditional cloud procurement meeting?
Herman
Oh, this one's been sitting in my head for weeks. Because the answer to that last question is basically yes, and the reasons why are genuinely fascinating from a systems design perspective.
Corn
I should mention — today's script is courtesy of Claude Sonnet 4.6, which is a delightful level of recursion given we're about to spend twenty-five minutes complaining about Claude's latency.
Herman
The model generating the script about its own slowness. There's something almost philosophical there.
Corn
Or just ironic. Let's start with the measurement question, because I think this is underappreciated. When people say Claude is slow, they're usually gesturing at a vague feeling. What does "slow" actually mean in a way you can put a number on?
Herman
So the inference community has converged on five core metrics, and they each capture a different dimension of the problem. The first is TTFT — Time to First Token — which is the delay between sending your request and seeing the first character appear. For interactive chat apps, this is the dominant metric. It's what determines whether the interface feels alive or frozen.
Corn
And that's separate from how fast it actually generates the response?
Herman
Completely separate. The second metric is ITL, Inter-Token Latency — the time between successive tokens once generation has started. Then you have end-to-end latency, which is the total time from request to final token. That's what matters for batch pipelines. And then throughput metrics: tokens per second for generation speed, and requests per second for concurrency capacity.
Corn
So a model could have terrible TTFT but decent generation speed once it starts, or vice versa.
Herman
Which is exactly the pattern you see in the benchmarks. And here's the thing most people miss when they're evaluating model speed: the median is almost meaningless for production systems. What you actually care about is the p95 — the ninety-fifth percentile. If your median TTFT is one second but your p95 is four seconds, roughly one in twenty requests feels dramatically slower than what you tested in your demo.
Corn
And in a system handling a thousand requests an hour, that's fifty bad experiences per hour.
Herman
Fifty requests per hour where your user is staring at a frozen screen for four seconds instead of one. There's a benchmark from March this year — a software engineer named Kunal Ganglani ran a pretty rigorous head-to-head across five models, three prompt sizes, three runs each. And the finding on Claude Sonnet 4 was striking. Median TTFT on long prompts: one thousand two hundred sixteen milliseconds. P95 TTFT on the same prompts: four thousand two hundred eighty-eight milliseconds. That's three and a half times the median.
Corn
Three and a half times. At p95.
Herman
And Ganglani's framing of it was exactly right — he wrote that high p95 variance is what generates user complaints, not the median. You can have an acceptable average and still have a product that feels broken to a meaningful percentage of your users.
Corn
How does that compare to the other models in that benchmark?
Herman
Dramatically different. GPT-4.1 on long prompts: median TTFT one thousand six hundred seventy milliseconds, p95 one thousand eight hundred thirty-three. So the p95 was about ten percent higher than the median. Gemini 2.5 Flash: median one thousand eight hundred eighty-five, p95 two thousand fourteen — basically flat. Claude Haiku 4.5 was actually the best performer in the whole test — median six hundred ten milliseconds, p95 eight hundred forty-three. Tight spread, fast absolute numbers.
Corn
Wait, so Claude Haiku is the fastest model in the benchmark? Faster than OpenAI and Google?
Herman
On TTFT, yes, in that specific test. Five hundred ninety-seven milliseconds on short prompts. That's the fastest of any model tested. The latency conversation almost always focuses on Sonnet and Opus, but Haiku 4.5 is genuinely fast. The problem is it's not the model most enterprise customers are defaulting to.
Corn
So Anthropic has a fast model, it's just not the one people are actually using for serious work.
Herman
That's a real tension in their lineup. And then at the other end, there's a BenchLM speed leaderboard — they pull data from Artificial Analysis — and as of earlier this month, Anthropic as a provider averages fifty-two tokens per second and a three point three second average TTFT. For context, NVIDIA inference is at two hundred sixty tokens per second. Mistral is at one hundred twenty-six. Even DeepSeek at forty-eight tokens per second is in Anthropic's neighborhood.
Corn
So Anthropic is near the bottom of the provider speed rankings.
Herman
Second slowest by tokens per second. And there's an important caveat on OpenAI's numbers — their average TTFT looks terrible at forty-two seconds, but that's heavily skewed by o3 and o4, which are reasoning models that literally think before responding. Strip those out and the comparison looks different. But for Anthropic, there's no comparable asterisk. The flagship models are just slow.
Corn
Okay so we've established that the slowness is real and measurable. Now — why is it happening? Because I want to understand the infrastructure reality before we get into the SLA question.
Herman
The peak-hour problem is the most visible symptom. In late March, an Anthropic engineer — Thariq Shihipar, who works on Claude Code — posted on X that Anthropic was adjusting session limits during peak hours. Five AM to eleven AM Pacific, which is US business hours. The limits were being consumed faster during that window. About seven percent of users were hitting caps they wouldn't have hit before.
Corn
And this was announced by an individual engineer on X, not through official Anthropic channels.
Herman
That's the detail that should make enterprise procurement teams uncomfortable. A change affecting paying customers — Pro and Max subscribers, not just free tier — announced informally on social media. Not a status page update, not an email, not a changelog. An engineer's personal post.
Corn
Meanwhile, two weeks earlier, Anthropic had run a promotion doubling usage limits during off-peak hours.
Herman
Which is a classic load-balancing move. You're incentivizing users to shift their usage to times when your GPU clusters have idle capacity. GPU infrastructure runs twenty-four seven whether you're using it or not, so filling off-peak hours is basically free capacity. The New Stack analyzed this as a dual-purpose play: smooth out the traffic curve and generate goodwill with a "thank you" framing. But here's the telling detail — that promotion applied to Claude's app surfaces. Web, desktop, mobile, Claude Code. Not the API.
Corn
So the developers actually building production systems got nothing.
Herman
The people most affected by latency and throughput constraints got excluded from the capacity relief. That's a deliberate product decision. And what it signals is that Anthropic's priority in that moment was consumer ecosystem lock-in, not API customer relief.
Corn
Let's get into the SLA question then, because this is where it gets really interesting. What does Anthropic actually promise?
Herman
For standard tier — which is every API user by default — the answer is nothing. No uptime guarantee, no latency guarantee, no throughput guarantee. The API docs are explicit: all rate limits represent maximum allowed usage, not guaranteed minimums. The standard terms say Anthropic does not guarantee uninterrupted service. You're on best-effort infrastructure with no contractual recourse.
Corn
Which is fine for a side project but is a genuinely alarming basis for a production system.
Herman
And the Reddit consensus in the enterprise communities reflects exactly that. The framing you see repeatedly is: from Anthropic, there are no SLAs, no guarantees of service. Full stop.
Corn
So what's the Priority Tier?
Herman
Priority Tier is Anthropic's committed-use product. You contact sales, you commit to a specific number of input tokens per minute and output tokens per minute, for one, three, six, or twelve months. The pricing isn't public. In exchange, you get one thing contractually: a target of ninety-nine point five percent uptime with prioritized computational resources. That's the only performance commitment.
Corn
What about latency? What about TTFT?
Herman
Nothing. No latency guarantee. No TTFT guarantee. No minimum tokens per second. What Priority Tier actually does is prioritize your requests over standard-tier requests during peak periods. You're buying a place at the front of the queue. If Anthropic's infrastructure is running slow, you're slow too — just less likely to get a 529 server overloaded error.
Corn
So it's queue priority, not performance.
Herman
That's the critical distinction that I think gets obscured in how it's marketed. The phrase "enhanced service levels" appears in the documentation. But what you're actually getting is reduced probability of being throttled, not guaranteed speed. There's also a mechanical quirk worth understanding: Priority Tier requests pull from both your committed capacity and your regular rate limits simultaneously. If either is exhausted, the request is declined. So you have two independent ways to get blocked.
Corn
The "capacity reservation illusion" is a good way to frame it. Because compare that to AWS Reserved Instances — which also don't guarantee performance — but at least the underlying hardware is dedicated to you. Priority Tier is shared infrastructure with priority queuing.
Herman
And that's a meaningful distinction for enterprises who think they're buying something closer to dedicated capacity. They're not.
Corn
What about enterprise tier? Custom pricing, account managers, that kind of thing?
Herman
Enterprise tier does include SLAs — but they're custom-negotiated and not publicly disclosed. Third-party analysis puts pricing somewhere around five hundred to a thousand dollars per month for ten to twenty-five users, scaling to five thousand to fifteen thousand or more for larger deployments. You get dedicated account management, priority support, audit logging, compliance APIs. And presumably some negotiated performance commitments. But we don't know what those look like because no one publishes them.
Corn
And then there's Fast Mode for Opus.
Herman
This one is genuinely revealing. For Opus 4.6, Anthropic offers a fast mode in beta. Standard Opus 4.6 pricing: five dollars per million input tokens, twenty-five dollars per million output tokens. Fast mode: thirty dollars input, one hundred fifty dollars output. That's a six times premium specifically for lower latency.
Corn
Six times.
Herman
And the implication of that pricing is something Anthropic would never say explicitly but the math says for them: they can deliver faster responses. It's a resource allocation choice, not a technical ceiling. Standard-tier latency isn't the fastest they can go — it's the fastest they're willing to go at standard-tier margins.
Corn
So when your Claude response takes three seconds, that's a product decision, not an infrastructure limitation.
Herman
At least partially, yes. The existence of a six-times premium latency tier is an implicit admission that faster is possible. They're choosing to allocate the capacity differently at different price points.
Corn
Okay, let's zoom out and do the industry comparison. Because I want to know whether this is an Anthropic problem or an industry-wide problem.
Herman
Both, honestly, but in different ways. Start with traditional cloud SLAs. AWS EC2 in a multi-availability-zone configuration: ninety-nine point nine nine percent uptime SLA. Single instance: ninety-nine point five percent. Azure VMs, GCP Compute — similar numbers. These are the gold standard of enterprise cloud procurement.
Corn
And their latency SLAs?
Herman
Don't exist. AWS does not guarantee latency. Azure does not guarantee latency. GCP does not guarantee latency. None of them.
Corn
So the comparison to Anthropic's Priority Tier at ninety-nine point five percent uptime is actually... roughly comparable to a single AWS instance?
Herman
On uptime, yes. Which is interesting because a single EC2 instance is not considered enterprise-grade — you'd run multi-AZ for anything production-critical. But here's the deeper point: the AWS SLA defines "unavailable" as having no external connectivity. If your EC2 instance is reachable but responding in forty seconds, that's technically available under the SLA. You have no contractual recourse for a slow response, only for a completely unreachable one.
Corn
So the SLA concept as it exists in traditional cloud is really just an availability guarantee, not a quality-of-service guarantee.
Herman
That's the structural gap that nobody has solved yet. You can have ninety-nine point nine nine percent uptime and still have a model that's effectively unusable during peak hours because responses take forty-five seconds. Technically available. Practically broken. And no SLA framework — from AWS, Azure, GCP, OpenAI, or Anthropic — covers that gap.
Corn
What's OpenAI's situation?
Herman
There's a help center article that has apparently been unchanged for years. The exact text is: "We will be publishing SLAs and are working hard to get there." That's the entire answer to whether OpenAI offers latency SLAs. They do have a Priority service tier for enterprise customers with SLA-backed latency guarantees, but the specific terms aren't publicly disclosed. And there's a community thread that captures the practical problem perfectly — the title is essentially "OpenAI Priority Tier SLA, No Way to Measure Latency" — because even if they promise you something, there's no transparent, observable metric to validate compliance. How do you enforce a latency SLA you can't independently verify?
Corn
Which is a genuinely important point. An SLA that you can't audit is basically a marketing document.
Herman
The measurement problem and the SLA problem are actually connected. For an SLA to have teeth, you need agreed-upon measurement methodology. P95 TTFT over what window? Measured from client or server? Under what load conditions? None of the current offerings define this. Even if Anthropic or OpenAI committed to a latency number, the industry hasn't standardized on how to measure it in a way that both parties would accept.
Corn
There's an interesting angle here on Amazon Bedrock. Because if you're accessing Claude through Bedrock rather than the direct Anthropic API, does that change your SLA situation?
Herman
It potentially does, and this is underappreciated. Amazon Bedrock has its own SLA listed on AWS's SLA page. So if you're an enterprise customer accessing Claude through Bedrock, you may have different contractual protections than someone using Anthropic's API directly. The same model, two different front doors, potentially different legal frameworks. For a procurement team, that's actually a meaningful consideration.
Corn
The Bedrock arbitrage.
Herman
It's not widely discussed. Most of the developer conversation assumes you're hitting the Anthropic API directly, but for large enterprises with existing AWS relationships and established procurement frameworks, Bedrock might be the smarter path purely from a contractual standpoint.
Corn
Let's talk about what the industry actually needs here. Because we've established that traditional cloud SLAs cover availability but not quality of service, and AI API providers are basically doing the same thing. What would a proper LLM SLA actually look like?
Herman
The framing I keep coming back to is "Quality of Service SLA" as a distinct category from availability SLA. An availability SLA says: the service will be reachable this percentage of the time. A QoS SLA would say: when the service is reachable, here is the minimum quality of that service. Concretely, that means: p95 TTFT will not exceed X milliseconds. Minimum output throughput will not fall below Y tokens per second. And critically — these guarantees hold during peak hours, not just off-peak.
Corn
Which is exactly the period when current guarantees break down.
Herman
Because the whole value proposition of a peak-hour guarantee is that peak hours are when it matters. An off-peak latency guarantee is almost useless — infrastructure is never stressed at three AM.
Corn
Does anyone offer this?
Herman
Not publicly. There are rumors that some of the very large enterprise contracts — think Fortune 50 companies spending tens of millions annually on AI infrastructure — have negotiated custom terms that include something like this. But nothing is publicly documented, nothing is standardized, and there's no way for a mid-market company to access this kind of protection.
Corn
So the practical situation for a developer team building something serious on Claude is... what? What do you actually do with this information?
Herman
A few things. First, instrument your application properly. Don't measure from the Anthropic console alone — use something like SigNoz or even Python's time module to capture TTFT and p95 latency from your actual production requests. Build your own performance baseline. If you don't have your own data, you're flying blind.
Corn
And that baseline will tell you whether you have a median problem or a variance problem.
Herman
Which changes your mitigation strategy entirely. If you have consistently slow performance, that's a capacity or model selection issue — maybe you switch from Sonnet to Haiku for latency-sensitive paths. If you have high variance, that's a peak-hour infrastructure problem — maybe you implement request queuing with retry logic, or you shift batch workloads to off-peak hours.
Corn
The model selection point is actually underrated. Because if Claude Haiku 4.5 is genuinely the fastest model in the benchmark — faster than GPT-4.1, faster than Gemini Flash — then for a lot of use cases you're paying a latency penalty for capability you might not need.
Herman
The latency arbitrage between Haiku and Sonnet is three-point-two times on TTFT in that benchmark. For classification tasks, short-form generation, routing logic, anything that doesn't need Sonnet's reasoning depth — Haiku is both faster and dramatically cheaper. The capability-latency tradeoff is almost never discussed explicitly, but it should be part of every architecture conversation.
Corn
What about the Priority Tier question — is it worth it?
Herman
The honest answer is: it depends on your failure mode. If your primary pain point is 529 server overloaded errors during peak hours — requests failing entirely — then Priority Tier is probably worth it. You're buying queue priority, and that's valuable if you're currently getting rejected requests. But if your pain point is slow responses that still complete, Priority Tier does nothing for you. You'll be at the front of the slow queue instead of the back.
Corn
Which is a distinction that should be front and center in how it's marketed, and currently isn't.
Herman
The "enhanced service levels" framing implies performance improvement. The actual mechanism is failure-rate reduction. Those are different products.
Corn
From a procurement standpoint, what's the ask here? What should enterprise customers actually be pushing Anthropic for?
Herman
Three things. First: transparent, standardized latency reporting. Publish p95 TTFT by model by hour. Make it observable. Second: contractual QoS SLAs at the Priority and Enterprise tiers — not just uptime targets, but minimum throughput commitments that hold during peak hours. Third: a measurement framework that both parties agree on before signing a contract, so that if performance degrades, there's an objective way to determine whether an SLA was breached.
Corn
None of which Anthropic currently offers.
Herman
And to be fair, none of which AWS, Azure, GCP, or OpenAI currently offer either. This is a gap in the entire industry's maturity. We're in a period where AI APIs are being treated as enterprise infrastructure — woven into production systems, customer-facing products, financial workflows — but the contractual frameworks governing them are still at the "we'll try our best" stage.
Corn
The "available but useless" problem. The service is technically up. Your responses are just taking forty-five seconds.
Herman
And every existing SLA framework in the industry would say: that's fine, nothing to see here. That's the fundamental mismatch between how these products are being sold and how they're being governed.
Corn
I think the March throttling announcement is actually the clearest illustration of the governance gap. Not even the throttling itself — that's a legitimate infrastructure management decision. It's the communication channel. An engineer's personal X post, not an official announcement. For enterprise customers, that's the kind of thing that triggers a procurement review.
Herman
The opacity is a risk vector that doesn't show up in the technical benchmarks. You can have great p95 numbers and still have a governance problem if the company's communication strategy for service changes is "hope someone sees the tweet."
Corn
Alright, let's land this. Where does this leave us?
Herman
The core finding is that Claude's latency problem is real, measurable, and concentrated in two places: Sonnet and Opus during peak hours, and specifically in p95 variance rather than median performance. The tools to measure it properly exist — TTFT, p95, ITL, tokens per second — and the benchmarks paint a clear picture. Anthropic's flagship models are in the slow tier by industry standards. Haiku is a genuinely fast outlier that gets underutilized.
Corn
And on the SLA side — the short version is that Anthropic's Priority Tier is a queue-priority product, not a performance guarantee. Ninety-nine point five percent uptime target is the only commitment. No latency. No throughput. And that's actually comparable to the rest of the industry, because nobody offers QoS SLAs. AWS doesn't. OpenAI doesn't. The whole industry is operating on availability guarantees while selling performance-dependent products.
Herman
The six-times premium for Fast Mode is the detail I keep coming back to. That's not a benchmark number or a policy document — that's Anthropic's own pricing revealing that faster is possible. Standard-tier latency is a product decision. And until customers start demanding contractual performance commitments — not just uptime, but p95 latency and minimum throughput under load — there's no market pressure to change that decision.
Corn
Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. And a genuine thank you to Modal for providing the GPU credits that power the pipeline behind this show.
Herman
This has been My Weird Prompts. If you want to follow along, search for us on Telegram to get notified when new episodes drop.
Corn
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.