AI API Latency Tracker
Time to first token, throughput, and tail latency across every major model. Built for the moment your API call hangs and you need to know whether it's you or them.
[Interactive charts: time to first token and output throughput per model. Full numbers in the table below.]
Full benchmark table
| Model | Provider | TTFT | Throughput (tok/s) | p50 total | p95 total | p99 total | Trend |
|---|---|---|---|---|---|---|---|
| Llama 4 70B | Cerebras | 70 ms | 680 | 0.36 s | 0.72 s | 1.15 s | ▼ Improving |
| Llama 4 70B | Groq | 95 ms | 520 | 0.48 s | 0.88 s | 1.40 s | → Stable |
| Gemini 3 Flash Lite | Google | 110 ms | 280 | 0.82 s | 1.50 s | 2.40 s | ▼ Improving |
| Gemini 3 Flash | Google | 180 ms | 215 | 1.10 s | 2.00 s | 3.30 s | ▼ Improving |
| Claude 4.5 Haiku | Anthropic | 240 ms | 145 | 1.60 s | 2.90 s | 4.40 s | → Stable |
| GPT-5 Mini | OpenAI | 320 ms | 118 | 2.00 s | 3.50 s | 5.50 s | ▼ Improving |
| Claude 4.6 Sonnet | Anthropic | 420 ms | 95 | 2.60 s | 4.40 s | 6.80 s | ▼ Improving |
| Gemini 3 Pro | Google | 580 ms | 84 | 2.90 s | 4.80 s | 7.40 s | → Stable |
| Grok 4 | xAI | 720 ms | 89 | 2.90 s | 5.20 s | 8.40 s | → Stable |
| GPT-5 | OpenAI | 760 ms | 71 | 3.50 s | 6.30 s | 9.80 s | → Stable |
| DeepSeek V4 | DeepSeek | 850 ms | 58 | 4.10 s | 7.40 s | 11.60 s | → Stable |
| Claude 4.7 Opus | Anthropic | 980 ms | 62 | 4.20 s | 7.10 s | 11.20 s | → Stable |
| o3 Mini (reasoning) | OpenAI | 4200 ms | 38 | 18.00 s | 32.00 s | 48.00 s | → Stable |
Measurements are medians of 20 runs from a us-east-1 vantage point against each provider's official API. Workload: 500-token input, 200-token output, no streaming optimisations. Numbers will differ from your real-world experience depending on geography, payload size, provider load, and whether you're hitting a cached endpoint. We re-run this benchmark weekly.
Methodology
For each model, we run 20 sequential API calls with a 500-token input and a target 200-token output. We record TTFT, total wall-clock time, and derive tokens-per-second from output length and post-TTFT duration. All measurements happen from us-east-1 (Virginia) to remove geographic variance.
We do not optimise for the benchmark. No streaming tricks, no cached-prompt shortcuts, no batch APIs. The goal is to answer "what does a normal API call from a normal application feel like?"
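For reference, here is a minimal sketch of what one measurement run looks like, assuming an OpenAI-style streaming chat endpoint. The URL, key, model id, and prompt are placeholders, not any specific provider's API; only the timing logic mirrors the description above.

```python
import statistics
import time

import requests

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = "..."                                          # your provider key
PROMPT = "..."                                           # the fixed 500-token input

def measure_once() -> tuple[float, float]:
    """Run one streaming call; return (ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "some-model",  # placeholder model id
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 200,
            "stream": True,
        },
        stream=True,
        timeout=120,
    )
    for line in resp.iter_lines():
        if line and ttft is None:
            ttft = time.perf_counter() - start  # first chunk of output
    return ttft, time.perf_counter() - start

runs = [measure_once() for _ in range(20)]
ttft_p50 = statistics.median(r[0] for r in runs)
total_p50 = statistics.median(r[1] for r in runs)
# Throughput is derived from output length and post-TTFT duration:
print(f"TTFT {ttft_p50 * 1000:.0f} ms, {200 / (total_p50 - ttft_p50):.0f} tok/s")
```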
How we update this
Benchmarks re-run weekly. We'll add a note next to any model whose median latency moves more than 25% week-on-week.
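For what it's worth, that check reduces to a relative comparison of consecutive weekly medians. A minimal sketch (the function name and sample numbers are ours, not part of the pipeline):

```python
def moved_week_on_week(last_ms: float, this_ms: float, threshold: float = 0.25) -> bool:
    """True if the median latency moved more than `threshold` week-on-week."""
    return abs(this_ms - last_ms) / last_ms > threshold

print(moved_week_on_week(420, 530))  # True: +26% would get a note
print(moved_week_on_week(420, 440))  # False: +4.8% is within normal jitter
```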
Frequently asked questions
What is TTFT and why does it matter?
Time to first token (TTFT) is the latency between sending a request and receiving the first token of output. For chat UIs and streaming applications, TTFT is what users actually feel as 'speed'. A model that produces tokens fast but takes 5 seconds to start feels slower than one that starts in 200 ms and streams steadily.
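Concretely: time to the last token is roughly TTFT plus output length divided by throughput, but perceived speed is dominated by when text starts appearing. A quick sketch with two hypothetical models:

```python
def total_time(ttft_s: float, tokens: int, tok_per_s: float) -> float:
    """Approximate time to the last token: TTFT plus streaming time."""
    return ttft_s + tokens / tok_per_s

# Hypothetical model A: 5 s TTFT, 400 tok/s. Model B: 0.2 s TTFT, 100 tok/s.
print(total_time(5.0, 200, 400))   # A: 5.5 s, and nothing visible for 5 s
print(total_time(0.2, 200, 100))   # B: 2.2 s, text appears after 200 ms
# Even on a 2000-token answer, where A finishes first (10 s vs 20.2 s),
# B still feels faster because the user is reading from 200 ms onward.
print(total_time(5.0, 2000, 400))  # 10.0 s
print(total_time(0.2, 2000, 100))  # 20.2 s
```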
Why is reasoning so much slower?
Models like o3 and Claude Deep Thinking spend extra compute on internal deliberation before emitting their final answer. A typical reasoning model trades 10 to 30 seconds of latency for measurably better answers on hard math, code, and logic tasks. For chat or summarisation, this trade-off is rarely worth it.
How do Groq and Cerebras beat the frontier models on speed?
Both run open-weight models (Llama, Qwen) on specialised inference hardware: Groq's LPU and Cerebras's wafer-scale chips. They can't match GPT-5 or Claude on reasoning quality, but they emit 5-10x more tokens per second. For workloads that don't need frontier-model quality (translation, classification, simple Q&A), they're often the right choice.
How does latency from Pakistan compare to the US?
Add roughly 250-400 ms to every number in this table; Pakistani requests typically route through Singapore or Frankfurt POPs. The good news: TTFT-dominated workloads (chat) feel similar, while throughput-dominated workloads (long-form generation) finish slightly later in absolute terms but stream at the same tokens/s once output starts.
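To see why only the offset moves, here's the arithmetic using the Claude 4.6 Sonnet medians from the table and an assumed +300 ms of routing overhead (the midpoint of the range above):

```python
route_overhead_s = 0.30  # assumed extra round trip for the Singapore/Frankfurt hop

ttft_us, total_us = 0.42, 2.60          # Sonnet medians from the table (seconds)
ttft_pk = ttft_us + route_overhead_s    # 0.72 s: first token arrives later
total_pk = total_us + route_overhead_s  # 2.90 s: the whole response shifts equally

# Streaming rate is set by the provider's generation speed, not the route:
print(200 / (total_us - ttft_us))  # ~91.7 tok/s from us-east-1
print(200 / (total_pk - ttft_pk))  # ~91.7 tok/s from Pakistan, identical
```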
Why do my real-world numbers differ?
Latency varies with:

- Input length: long contexts add seconds.
- Output length: more tokens, more time.
- Provider load: peaks slow everything.
- Region: vantage point matters.
- Streaming vs blocking: streaming feels faster.
- Prompt caching: a cache hit can be 10x faster.

Our numbers are a fixed benchmark for comparison; your real numbers are what matter for your app.