Live benchmarks · Last run 2026-05-14

AI API Latency Tracker

Time to first token, throughput, and tail latency across every major model. Built for the moment your API call hangs and you need to know whether it's you or them.

Fastest first token: 70 ms · Llama 4 70B on Cerebras
Highest throughput: 680 tok/s · Llama 4 70B on Cerebras
Lowest P50 total: 0.36 s · Llama 4 70B on Cerebras

Time to first token

How quickly the model starts responding. Lower is better.

Llama 4 70B on Cerebras · Cerebras · 70 ms
Llama 4 70B on Groq · Groq · 95 ms
Gemini 3 Flash Lite · Google · 110 ms
Gemini 3 Flash · Google · 180 ms
Claude 4.5 Haiku · Anthropic · 240 ms
GPT-5 Mini · OpenAI · 320 ms
Claude 4.6 Sonnet · Anthropic · 420 ms
Gemini 3 Pro · Google · 580 ms
Grok 4 · xAI · 720 ms
GPT-5 · OpenAI · 760 ms
DeepSeek V4 · DeepSeek · 850 ms
Claude 4.7 Opus · Anthropic · 980 ms
o3 Mini (reasoning) · OpenAI · 4200 ms

Output throughput

How fast the model emits tokens once it's started. Higher is better.

Llama 4 70B on Cerebras · Cerebras · 680 tok/s
Llama 4 70B on Groq · Groq · 520 tok/s
Gemini 3 Flash Lite · Google · 280 tok/s
Gemini 3 Flash · Google · 215 tok/s
Claude 4.5 Haiku · Anthropic · 145 tok/s
GPT-5 Mini · OpenAI · 118 tok/s
Claude 4.6 Sonnet · Anthropic · 95 tok/s
Grok 4 · xAI · 89 tok/s
Gemini 3 Pro · Google · 84 tok/s
GPT-5 · OpenAI · 71 tok/s
Claude 4.7 Opus · Anthropic · 62 tok/s
DeepSeek V4 · DeepSeek · 58 tok/s
o3 Mini (reasoning) · OpenAI · 38 tok/s

Full benchmark table

Model | Provider | TTFT | Throughput (tok/s) | P50 total | P90 total | P99 total | Trend
Llama 4 70B on Cerebras | Cerebras | 70 ms | 680 | 0.36 s | 0.72 s | 1.15 s | Improving
Llama 4 70B on Groq | Groq | 95 ms | 520 | 0.48 s | 0.88 s | 1.40 s | Stable
Gemini 3 Flash Lite | Google | 110 ms | 280 | 0.82 s | 1.50 s | 2.40 s | Improving
Gemini 3 Flash | Google | 180 ms | 215 | 1.10 s | 2.00 s | 3.30 s | Improving
Claude 4.5 Haiku | Anthropic | 240 ms | 145 | 1.60 s | 2.90 s | 4.40 s | Stable
GPT-5 Mini | OpenAI | 320 ms | 118 | 2.00 s | 3.50 s | 5.50 s | Improving
Claude 4.6 Sonnet | Anthropic | 420 ms | 95 | 2.60 s | 4.40 s | 6.80 s | Improving
Gemini 3 Pro | Google | 580 ms | 84 | 2.90 s | 4.80 s | 7.40 s | Stable
Grok 4 | xAI | 720 ms | 89 | 2.90 s | 5.20 s | 8.40 s | Stable
GPT-5 | OpenAI | 760 ms | 71 | 3.50 s | 6.30 s | 9.80 s | Stable
DeepSeek V4 | DeepSeek | 850 ms | 58 | 4.10 s | 7.40 s | 11.60 s | Stable
Claude 4.7 Opus | Anthropic | 980 ms | 62 | 4.20 s | 7.10 s | 11.20 s | Stable
o3 Mini (reasoning) | OpenAI | 4200 ms | 38 | 18.00 s | 32.00 s | 48.00 s | Stable

Measurements are medians of 20 runs from a us-east-1 vantage point against each provider's official API. Workload: 500-token input, 200-token output, no streaming optimisations. Numbers will differ from your real-world experience depending on geography, payload size, provider load, and whether you're hitting a cached endpoint. We re-run this benchmark weekly.

Methodology

For each model, we run 20 sequential API calls with a 500-token input and a target 200-token output. We record TTFT and total wall-clock time, and derive tokens per second from the output length and the post-TTFT duration. All measurements run from us-east-1 (Virginia), so the vantage point is held constant across models.
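
As a rough illustration, a single measurement pass looks something like the sketch below. It assumes an OpenAI-compatible Python SDK and streams the response, because the first token can only be timed by watching the stream arrive; the model name, prompt, and whitespace-based token count are placeholders, not our exact harness.

    # One measurement pass: record TTFT, total wall-clock time, and
    # derive output tokens/second from the post-TTFT duration.
    # Assumes an OpenAI-compatible streaming API; names are illustrative.
    import time
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    def measure_once(model: str, prompt: str, max_output_tokens: int = 200):
        start = time.perf_counter()
        first_token_at = None
        pieces = []

        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_output_tokens,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.perf_counter()
            pieces.append(delta)
        end = time.perf_counter()

        ttft = first_token_at - start
        total = end - start
        # Crude token estimate; a real harness would use the provider's tokenizer.
        n_tokens = len("".join(pieces).split())
        throughput = n_tokens / (end - first_token_at)
        return ttft, total, throughput

Running that 20 times per model and taking the median of each column (statistics.median in Python) gives figures comparable to the table above.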

We do not optimise for the benchmark: no streaming tricks, no cached-prompt shortcuts, no batch APIs. The goal is to answer "what does a normal API call from a normal application feel like?"

How we update this

Benchmarks re-run weekly. We'll add a note next to any model whose median latency moves more than 25% week-on-week.
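
The note is just a relative-change check. The helper below is a hypothetical sketch of it; the 25% threshold comes from this section, everything else is illustrative.

    # Illustrative week-on-week check: flag a model if its median latency
    # moved by more than 25% in either direction since the last run.
    def moved_significantly(last_week_ms: float, this_week_ms: float,
                            threshold: float = 0.25) -> bool:
        return abs(this_week_ms - last_week_ms) / last_week_ms > threshold

    # Example: a TTFT going from 420 ms to 300 ms (about -29%) earns a note;
    # 420 ms to 380 ms (about -10%) does not.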

Frequently asked questions

What is TTFT and why does it matter?

Time to first token (TTFT) is the latency between sending a request and receiving the first token of the response. For chat UIs and streaming applications, TTFT is what users actually feel as "speed". A model that produces tokens fast but takes 5 seconds to start feels slower than one that starts in 200 ms and streams steadily.
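
To make that concrete, here is the back-of-the-envelope arithmetic for a 200-token reply, using two made-up models rather than anything in the table above:

    # Perceived completion time is roughly TTFT + output_tokens / throughput.
    # Two hypothetical models illustrating why TTFT dominates perceived speed.
    def total_seconds(ttft_s: float, output_tokens: int, tok_per_s: float) -> float:
        return ttft_s + output_tokens / tok_per_s

    snappy = total_seconds(0.2, 200, 80)   # 0.2 s + 2.5 s  = 2.7 s
    laggy  = total_seconds(5.0, 200, 680)  # 5.0 s + ~0.3 s = ~5.3 s

The second model streams far faster once it starts, but the first five seconds are silence, and it still finishes later in absolute terms.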

Why are reasoning models so much slower?

Models like o3 and Claude Deep Thinking spend extra compute on internal deliberation before emitting their final answer. A typical reasoning model trades 10 to 30 seconds of latency for measurably better answers on hard math, code, and logic tasks. For chat or summarisation, this trade-off is rarely worth it.

How do Groq and Cerebras beat the frontier models on speed?

Both run open-weight models (Llama, Qwen) on specialised inference hardware: Groq's LPU and Cerebras's wafer-scale chips. They cannot match GPT-5 or Claude on reasoning quality, but they emit 5-10x more tokens per second. For workloads where frontier-model quality isn't required (translation, classification, simple Q&A), they're often the right choice.

How does latency from Pakistan compare to the US?

Add roughly 250-400 ms to every latency figure in this table; Pakistani requests typically route through Singapore or Frankfurt POPs. The good news: TTFT-dominated workloads (chat) still feel similar, while throughput-dominated workloads (long-form generation) finish slightly later in absolute terms but stream at the same tokens/s once output starts.

Why do my real-world numbers differ?

Latency varies with: input length (long contexts add seconds), output length (more tokens = more time), provider load (peaks slow everything), region (vantage point matters), streaming vs blocking (streaming feels faster), and whether you're hitting cached prompts (10x faster). Our numbers are a fixed benchmark for comparison; your real numbers are what matter for your app.
