Live benchmarks · Last run 2026-05-14

AI API Latency Tracker

Time to first token, throughput, and tail latency across every major model. Built for the moment your API call hangs and you need to know whether it's you or them.

Fastest first token: 70 ms · Llama 4 70B on Cerebras
Highest throughput: 680 tok/s · Llama 4 70B on Cerebras
Lowest P50 total: 0.36 s · Llama 4 70B on Cerebras

Time to first token

How quickly the model starts responding. Lower is better.

Llama 4 70B on Cerebras · Cerebras · 70 ms
Llama 4 70B on Groq · Groq · 95 ms
Gemini 3 Flash Lite · Google · 110 ms
Gemini 3 Flash · Google · 180 ms
Claude 4.5 Haiku · Anthropic · 240 ms
GPT-5 Mini · OpenAI · 320 ms
Claude 4.6 Sonnet · Anthropic · 420 ms
Gemini 3 Pro · Google · 580 ms
Grok 4 · xAI · 720 ms
GPT-5 · OpenAI · 760 ms
DeepSeek V4 · DeepSeek · 850 ms
Claude 4.7 Opus · Anthropic · 980 ms
o3 Mini (reasoning) · OpenAI · 4200 ms

Output throughput

How fast the model emits tokens once it's started. Higher is better.

Llama 4 70B on Cerebras · Cerebras · 680 tok/s
Llama 4 70B on Groq · Groq · 520 tok/s
Gemini 3 Flash Lite · Google · 280 tok/s
Gemini 3 Flash · Google · 215 tok/s
Claude 4.5 Haiku · Anthropic · 145 tok/s
GPT-5 Mini · OpenAI · 118 tok/s
Claude 4.6 Sonnet · Anthropic · 95 tok/s
Grok 4 · xAI · 89 tok/s
Gemini 3 Pro · Google · 84 tok/s
GPT-5 · OpenAI · 71 tok/s
Claude 4.7 Opus · Anthropic · 62 tok/s
DeepSeek V4 · DeepSeek · 58 tok/s
o3 Mini (reasoning) · OpenAI · 38 tok/s

Full benchmark table

Model | Provider | TTFT | Throughput (tok/s) | P50 total | P90 total | P99 total | Trend
Llama 4 70B on Cerebras | Cerebras | 70 ms | 680 | 0.36 s | 0.72 s | 1.15 s | Improving
Llama 4 70B on Groq | Groq | 95 ms | 520 | 0.48 s | 0.88 s | 1.40 s | Stable
Gemini 3 Flash Lite | Google | 110 ms | 280 | 0.82 s | 1.50 s | 2.40 s | Improving
Gemini 3 Flash | Google | 180 ms | 215 | 1.10 s | 2.00 s | 3.30 s | Improving
Claude 4.5 Haiku | Anthropic | 240 ms | 145 | 1.60 s | 2.90 s | 4.40 s | Stable
GPT-5 Mini | OpenAI | 320 ms | 118 | 2.00 s | 3.50 s | 5.50 s | Improving
Claude 4.6 Sonnet | Anthropic | 420 ms | 95 | 2.60 s | 4.40 s | 6.80 s | Improving
Gemini 3 Pro | Google | 580 ms | 84 | 2.90 s | 4.80 s | 7.40 s | Stable
Grok 4 | xAI | 720 ms | 89 | 2.90 s | 5.20 s | 8.40 s | Stable
GPT-5 | OpenAI | 760 ms | 71 | 3.50 s | 6.30 s | 9.80 s | Stable
DeepSeek V4 | DeepSeek | 850 ms | 58 | 4.10 s | 7.40 s | 11.60 s | Stable
Claude 4.7 Opus | Anthropic | 980 ms | 62 | 4.20 s | 7.10 s | 11.20 s | Stable
o3 Mini (reasoning) | OpenAI | 4200 ms | 38 | 18.00 s | 32.00 s | 48.00 s | Stable

Measurements are medians of 20 runs from a us-east-1 vantage point against each provider's official API. Workload: 500-token input, 200-token output, no streaming optimisations. Numbers will differ from your real-world experience depending on geography, payload size, provider load, and whether you're hitting a cached endpoint. We re-run this benchmark weekly.

Methodology

For each model, we run 20 sequential API calls with a 500-token input and a target 200-token output. We record TTFT and total wall-clock time, and derive tokens per second from the output length and the post-TTFT duration. All measurements run from us-east-1 (Virginia), so the vantage point is held constant across models.
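
As a rough illustration, a single measurement pass looks something like the sketch below. It assumes an OpenAI-compatible Python SDK and streams the response, because the first token can only be timed by watching the stream arrive; the model name, prompt, and whitespace-based token count are placeholders, not our exact harness.

    # One measurement pass: record TTFT, total wall-clock time, and
    # derive output tokens/second from the post-TTFT duration.
    # Assumes an OpenAI-compatible streaming API; names are illustrative.
    import time
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    def measure_once(model: str, prompt: str, max_output_tokens: int = 200):
        start = time.perf_counter()
        first_token_at = None
        pieces = []

        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_output_tokens,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.perf_counter()
            pieces.append(delta)
        end = time.perf_counter()

        ttft = first_token_at - start
        total = end - start
        # Crude token estimate; a real harness would use the provider's tokenizer.
        n_tokens = len("".join(pieces).split())
        throughput = n_tokens / (end - first_token_at)
        return ttft, total, throughput

Running that 20 times per model and taking the median of each column (statistics.median in Python) gives figures comparable to the table above.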

We do not optimise for the benchmark: no streaming tricks, no cached-prompt shortcuts, no batch APIs. The goal is to answer "what does a normal API call from a normal application feel like?"

How we update this

Benchmarks re-run weekly. We'll add a note next to any model whose median latency moves more than 25% week-on-week.
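
The note is just a relative-change check. The helper below is a hypothetical sketch of it; the 25% threshold comes from this section, everything else is illustrative.

    # Illustrative week-on-week check: flag a model if its median latency
    # moved by more than 25% in either direction since the last run.
    def moved_significantly(last_week_ms: float, this_week_ms: float,
                            threshold: float = 0.25) -> bool:
        return abs(this_week_ms - last_week_ms) / last_week_ms > threshold

    # Example: a TTFT going from 420 ms to 300 ms (about -29%) earns a note;
    # 420 ms to 380 ms (about -10%) does not.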

Frequently asked questions

What is TTFT and why does it matter?

Time to first token (TTFT) is the latency between sending a request and receiving the first token of the response. For chat UIs and streaming applications, TTFT is what users actually feel as "speed". A model that produces tokens fast but takes 5 seconds to start feels slower than one that starts in 200 ms and streams steadily.
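
To make that concrete, here is the back-of-the-envelope arithmetic for a 200-token reply, using two made-up models rather than anything in the table above:

    # Perceived completion time is roughly TTFT + output_tokens / throughput.
    # Two hypothetical models illustrating why TTFT dominates perceived speed.
    def total_seconds(ttft_s: float, output_tokens: int, tok_per_s: float) -> float:
        return ttft_s + output_tokens / tok_per_s

    snappy = total_seconds(0.2, 200, 80)   # 0.2 s + 2.5 s  = 2.7 s
    laggy  = total_seconds(5.0, 200, 680)  # 5.0 s + ~0.3 s = ~5.3 s

The second model streams far faster once it starts, but the first five seconds are silence, and it still finishes later in absolute terms.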

Why are reasoning models so much slower?

Models like o3 and Claude Deep Thinking spend extra compute on internal deliberation before emitting their final answer. A typical reasoning model trades 10 to 30 seconds of latency for measurably better answers on hard math, code, and logic tasks. For chat or summarisation, this trade-off is rarely worth it.

How do Groq and Cerebras beat the frontier models on speed?

Both run open-weight models (Llama, Qwen) on specialised inference hardware: Groq's LPU and Cerebras's wafer-scale chips. They cannot match GPT-5 or Claude on reasoning quality, but they emit 5-10x more tokens per second. For workloads where frontier-model quality isn't required (translation, classification, simple Q&A), they're often the right choice.

How does latency from Pakistan compare to the US?

Add roughly 250-400 ms to every latency figure in this table; Pakistani requests typically route through Singapore or Frankfurt POPs. The good news: TTFT-dominated workloads (chat) still feel similar, while throughput-dominated workloads (long-form generation) finish slightly later in absolute terms but stream at the same tokens/s once output starts.

Why do my real-world numbers differ?

Latency varies with: input length (long contexts add seconds), output length (more tokens = more time), provider load (peaks slow everything), region (vantage point matters), streaming vs blocking (streaming feels faster), and whether you're hitting cached prompts (10x faster). Our numbers are a fixed benchmark for comparison; your real numbers are what matter for your app.
