
LLM Latency Comparator

Estimate TTFT, tokens/sec, and total response time across 9 frontier LLM APIs. Tune prompt size, output length, and streaming mode.

Step 1 — Your request shape

Prompt size: 10 to 10,000 tokens · Output length: 1 to 20k tokens (use input for more) · Streaming: On

The streaming toggle controls which column the table highlights as the latency users actually feel.

Step 2 — Estimated latency & cost per request

Model               Provider   TTFT (ms)  Tokens/sec  Total (s)  Felt (streaming)  Cost/request
Gemini 2.5 Flash    Google     220        200         2.72       220ms (TTFT)      $0.0015
Claude Haiku 4.5    Anthropic  200        180         2.98       200ms (TTFT)      $0.0035
GPT-5 mini          OpenAI     280        150         3.61       280ms (TTFT)      $0.0045
Qwen3 Coder Next    Alibaba    350        110         4.90       350ms (TTFT)      $0.0015
DeepSeek V3         DeepSeek   400        90          5.96       400ms (TTFT)      $0.0013
Claude Sonnet 4.6   Anthropic  450        80          6.70       450ms (TTFT)      $0.0105
GPT-5               OpenAI     600        60          8.93       600ms (TTFT)      $0.0300
Gemini 2.5 Pro      Google     720        55          9.81       720ms (TTFT)      $0.0175
Claude Opus 4.7     Anthropic  850        45          11.96      850ms (TTFT)      $0.0525

Why this matters — streaming vs. non-streaming

In streaming mode, users feel TTFT — how long until the first token appears. Right now that's lowest for Claude Haiku 4.5 at ~200ms. In non-streaming mode, the whole response must finish before anything renders, so tokens/sec and output length both matter more than TTFT. Rule of thumb: chatbots and autocomplete should optimise for TTFT; batch pipelines and eval jobs should optimise for total time and cost. Overall fastest for this workload: Gemini 2.5 Flash.
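
To make that concrete, here is a minimal sketch of the estimator behind the table above (the names and structure are illustrative, not the tool's actual source): total time is TTFT plus decode time, and felt latency is TTFT when streaming, the full total when not.

```typescript
// Latency model: total = TTFT + outputTokens / tokensPerSec.
// Felt latency is what the user perceives: TTFT when streaming,
// the full total when non-streaming.
type ModelStats = { name: string; ttftMs: number; tokensPerSec: number };

function estimate(m: ModelStats, outputTokens: number, streaming: boolean) {
  const totalSec = m.ttftMs / 1000 + outputTokens / m.tokensPerSec;
  const feltSec = streaming ? m.ttftMs / 1000 : totalSec;
  return { totalSec, feltSec };
}

const flash: ModelStats = { name: "Gemini 2.5 Flash", ttftMs: 220, tokensPerSec: 200 };

// A 500-token response reproduces the table row above:
estimate(flash, 500, true);  // { totalSec: 2.72, feltSec: 0.22 }
estimate(flash, 500, false); // { totalSec: 2.72, feltSec: 2.72 }
```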

Why LLM Latency Matters More Than You Think

Latency is two numbers pretending to be one. The first is TTFT — Time To First Token, the gap between sending a request and the first character streaming back. The second is tokens-per-second, which governs how fast the rest of the response arrives once streaming starts. Both are latency, but they feel completely different to a human sitting at a keyboard. A model with 200ms TTFT and 45 tok/s feels snappy because the stream starts instantly; a model with 900ms TTFT and 180 tok/s can finish the same response sooner but feel laggy, because the user stared at a blinking cursor for nearly a second first.
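
Run the arithmetic on those two hypothetical models for a 500-token response and the tension is obvious:

```typescript
// Total time in seconds: TTFT plus decode time for the output.
const totalSec = (ttftMs: number, tps: number, outputTokens: number) =>
  ttftMs / 1000 + outputTokens / tps;

totalSec(200, 45, 500);  // ≈ 11.31s total, first token at 0.20s
totalSec(900, 180, 500); // ≈ 3.68s total, first token at 0.90s
// The second model finishes about 7.6s sooner, yet the user stared at
// a blank screen 4.5x longer before anything appeared.
```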

Knowing which one to optimise for comes down to the shape of your product. For any user-facing surface — chat, autocomplete, live coding assistants, voice — TTFT dominates perceived speed. Users judge the app in the first 500ms, long before output length matters. That's why Haiku 4.5, GPT-5 mini, and Gemini 2.5 Flash are so popular for conversational products: even when the larger siblings finish sooner in total time, the smaller models feel better because the response appears instantly. For batch processing, long-form document generation, evaluation pipelines, and anything where a human is not actively watching the stream, total time and throughput matter far more. Cost per token also starts to dominate at batch scale, which is why frontier labs keep shipping cheap fast models alongside their flagship reasoners.
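
To see why cost takes over at batch scale, here is the per-request arithmetic (the per-million-token rates below are illustrative placeholders, not published prices):

```typescript
// Request cost: input and output tokens priced per million tokens.
const costUsd = (inTok: number, outTok: number, inPerM: number, outPerM: number) =>
  (inTok * inPerM + outTok * outPerM) / 1_000_000;

costUsd(1_000, 500, 0.3, 2.4); // ≈ $0.0015: negligible for one chat request
// Across a million-request batch, the same rate is ~$1,500, so a cheaper
// model matters far more than a faster one when no human is watching.
```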

A few realistic caveats about the numbers above. They are April 2026 snapshots from typical US-region deployments under normal load. Cold starts on serverless tiers can add 1-3 seconds. Very long prompts (50k+ tokens) can push TTFT up meaningfully because the model still has to process the prefix before producing anything. Heavy tool use, vision inputs, and structured output modes all add their own overhead. Regional routing also matters — EU and APAC users routinely see 50-150ms of extra network latency on top of compute time.
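
A rough way to budget for the long-prompt effect is to treat TTFT as fixed overhead plus prefill time (a back-of-envelope sketch; the network, base, and prefill figures are assumptions, not measured values for any provider):

```typescript
// TTFT ≈ network round-trip + fixed serving overhead + prompt prefill.
function ttftBudgetMs(
  promptTokens: number,
  networkMs = 80, // assumed round-trip latency
  baseMs = 150, // assumed queueing and serving overhead
  prefillTokensPerSec = 10_000 // assumed prefill throughput
): number {
  return networkMs + baseMs + (promptTokens / prefillTokensPerSec) * 1000;
}

ttftBudgetMs(500);    // ≈ 280ms: short prompts are mostly fixed overhead
ttftBudgetMs(50_000); // ≈ 5,230ms: at 50k+ tokens, prefill dominates TTFT
```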

The fastest win for most teams is prompt caching. Anthropic and OpenAI both support it, and Google's Gemini API offers context caching as its equivalent. If your requests share a long system prompt or a repeated document context, caching typically drops TTFT by 50-85% on repeat hits and cuts input cost by 50-90% on the cached portions. For most production apps, turning on prompt caching is a bigger latency and cost win than switching models. Pair it with streaming by default and you get the fewest noticeable delays for the lowest spend.
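
As one example, with Anthropic's TypeScript SDK the cacheable prefix is marked with a single extra field (a minimal sketch; the model name and prompt are placeholders, and OpenAI's caching applies automatically to repeated prefixes rather than via an opt-in flag):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const LONG_SYSTEM_PROMPT = "..."; // your reused multi-thousand-token prefix

// Mark the shared prefix as cacheable: repeat requests that reuse it
// verbatim hit the cache, cutting TTFT and input cost on that portion.
const stream = client.messages.stream({
  model: "claude-sonnet-4-5", // placeholder: use whichever model you deploy
  max_tokens: 500,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarise the attached context." }],
});

// Stream by default so users feel TTFT, not total time.
stream.on("text", (delta) => process.stdout.write(delta));
```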

Disclaimer: These benchmarks are a snapshot from typical production deployments in April 2026, not peer-reviewed measurements. Your real latency will vary by region, load, prompt complexity, output length, tool-use overhead, and rate-limit backoff. Always benchmark against your own traffic before committing architecture to any specific model.

Frequently asked questions

What is TTFT and why does it matter for LLM apps?
TTFT (Time To First Token) is how long you wait from sending a request to seeing the first token stream back. For chat, autocomplete, and any user-facing stream, TTFT dominates perceived speed — a model with high tokens/sec but 2s TTFT still feels slow.
Why do Haiku and Flash models have lower TTFT than Opus or Pro?
Smaller models have fewer parameters per forward pass, cheaper prompt prefill, and usually run on lower-latency serving tiers. Each decode step is also cheaper, which is why they tend to have higher tokens/sec.
Are these numbers guaranteed in production?
No. These are typical April 2026 snapshots from US-region deployments under normal load. Cold starts, long prompts, region, tool-use overhead, and rate-limit backoff can add 1-5s. Always benchmark against your own traffic pattern.
How much does prompt caching help?
Anthropic and OpenAI both offer prompt caching. For repeat prefixes (system prompt, long context), TTFT typically drops by 50-85% and input cost drops by 50-90% on cached portions. Worth it for any app that reuses context.

This tool runs entirely in your browser. No data is sent to any server.
