I tested every major AI to write Urdu — here's which one actually understands it

Five frontier models, three honest Urdu tests: translation, original prose, and Roman-Urdu-to-Nastaliq transliteration. One model won decisively. One model failed in a way that surprised me. The full results, with samples.

Faizan Khan · Founder & Editor · Meridian48 · June 17, 2026 · 8 min read

A detail photograph of Nastaliq calligraphy on textured paper with strong directional lighting. — Photograph by Tasnim News Agency / Unsplash

The short version. I ran Claude 4.7 Opus, GPT-5, Gemini 3 Pro, Grok 4, and DeepSeek V4 through three different Urdu tasks. Claude 4.7 Opus won decisively: it produced the cleanest Nastaliq and the most fluent original prose, and it was the only model that handled Roman-Urdu transliteration without inventing words. GPT-5 was a strong second. Gemini 3 Pro is workable for translation but unnatural in original writing. Grok 4 writes surprisingly good Urdu but is tonally peculiar, like a model trying to sound Lahori while clearly being Californian. DeepSeek V4 failed badly enough to be unusable for Urdu, despite being competitive on English at a third the cost. Here's the full test.

Why this test exists

There are 230 million Urdu speakers globally. Approximately none of them are well served by the current frontier AI tools out of the box.

Pakistani builders default to using AI in English for everything, then manually translate the output, then manually fix the translation. The amount of Pakistani productivity wasted on this is hard to overstate. If one of these models genuinely handled Urdu at native quality, an enormous category of Pakistani-built products becomes possible.

Most public "Urdu AI" benchmarks are unsatisfying — they test machine-translation accuracy on news headlines, which the models have all seen many times during training. I wanted three tests that were closer to what a Pakistani builder would actually use these tools for.

The three tests

Test 1: Translation under nuance pressure. A 200-word excerpt from a 2026 Pakistani court ruling on data privacy. Legal-register English. Goal: produce the Urdu translation a Karachi lawyer would accept.

Test 2: Original prose. Write a 300-word Urdu story opening set in Saddar, Karachi, in the style of Intezar Hussain's short fiction. No prompt-translation crutch.

Test 3: Roman-Urdu to Nastaliq transliteration. Convert a 150-word Roman-Urdu WhatsApp message (a code-switching mix of Urdu, English, and Punjabi-isms, like real Pakistani texts) into properly rendered Nastaliq Urdu.

Each model got the same prompt, fresh chat. I scored: fluency (does this read like real Urdu?), accuracy (is the underlying meaning preserved?), and tone (does it match the register?).

Test 1: Translation under nuance pressure

Model	Fluency	Accuracy	Tone	Overall
Claude 4.7 Opus	9/10	9/10	9/10	9.0
GPT-5	8/10	9/10	7/10	8.0
Gemini 3 Pro	7/10	9/10	6/10	7.3
Grok 4	7/10	8/10	8/10	7.7
DeepSeek V4	4/10	7/10	3/10	4.7
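
The Overall column is consistent with an unweighted mean of the three criteria, rounded to one decimal place. That formula is inferred from the numbers rather than stated in the rubric, so treat this sketch as an assumption; the function name is illustrative.

```python
def overall(scores):
    """Unweighted mean of the per-criterion scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Test 1 scores from the table above: (fluency, accuracy, tone).
test1 = {
    "Claude 4.7 Opus": (9, 9, 9),
    "GPT-5": (8, 9, 7),
    "Gemini 3 Pro": (7, 9, 6),
    "Grok 4": (7, 8, 8),
    "DeepSeek V4": (4, 7, 3),
}

for model, scores in test1.items():
    print(f"{model}: {overall(scores)}")
```

If you weight the criteria differently for your own product (accuracy matters more than tone in a legal pipeline, for example), swap the mean for a weighted sum and the ranking may shift.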

What Claude did right: preserved the formal legal register, used correct Urdu legal terminology (نجی معلومات کا تحفظ, قانونی استحقاق), maintained the sentence rhythm.

What GPT-5 did right: the translation was technically accurate, but it slipped into journalese rather than legal register. A real lawyer would use it as a draft, not a final.

What Gemini got wrong: used English loan words where pure Urdu equivalents exist (Privacy instead of رازداری, Court instead of عدالت). Reads as "Urdu written by a non-Urdu speaker."

What Grok got right (surprisingly): captured a colloquial accuracy that the formal models missed. Slightly off-register but a Pakistani would read it as natural.

What DeepSeek got wrong: it mistranslated key legal terms, inserted invented Arabic compounds, and at one point switched from Nastaliq to Devanagari script mid-sentence (?).
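
A script switch like the one DeepSeek produced can be caught mechanically before a human reviewer ever sees the output. Here is a minimal sketch, assuming a simple check on Unicode block ranges is good enough for a first pass. The function name is hypothetical, and the ranges cover only the main Arabic-script and Devanagari blocks.

```python
# Unicode block ranges (inclusive). Urdu is written with Arabic-script code points.
ARABIC_RANGES = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]
DEVANAGARI_RANGE = (0x0900, 0x097F)


def count_scripts(text):
    """Count characters per script so stray Devanagari or Latin can be flagged."""
    counts = {"arabic": 0, "devanagari": 0, "latin": 0}
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ARABIC_RANGES):
            counts["arabic"] += 1
        elif DEVANAGARI_RANGE[0] <= cp <= DEVANAGARI_RANGE[1]:
            counts["devanagari"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    return counts


# An Urdu phrase with a Devanagari word dropped into the middle.
mixed = "عدالت का فیصلہ"
print(count_scripts(mixed))
```

Any non-zero Devanagari count in output that was supposed to be Urdu is a reason to reject the response automatically. Latin characters need a softer rule, since English words are legitimate in some registers.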

Test 2: Original prose

The prompt: write a 300-word opening of an Urdu short story in the style of Intezar Hussain, set in 1990s Saddar Karachi, opening with a single sensory detail.

This is the hardest test because the model has to do three things at once: produce native-quality Urdu prose, evoke a specific literary tradition, and ground itself in a specific Pakistani place.

Model	Fluency	Style match	Place authenticity	Overall
Claude 4.7 Opus	9/10	7/10	8/10	8.0
GPT-5	8/10	6/10	6/10	6.7
Gemini 3 Pro	6/10	4/10	5/10	5.0
Grok 4	7/10	4/10	7/10	6.0
DeepSeek V4	3/10	2/10	3/10	2.7

Claude's opening began with the smell of jasmine through an open shutter at 3 AM — the kind of small Pakistani-place detail that's hard to fake. The prose was clean enough that I had to think twice about whether a human had written it.

GPT-5 produced competent Urdu but the "Pakistani-ness" was generic — could have been set in Lahore, Hyderabad, or Bombay; it didn't reach Karachi specifically.

Gemini's output read as "academic Urdu" — grammatically clean but rhythmless. The kind of prose a CSS exam paper would have.

Grok produced punchy, modern, and surprisingly idiomatic Urdu, but it missed the Intezar Hussain style completely. It felt like Mohammed Hanif written in Urdu, which is impressive but not the assignment.

DeepSeek produced fragments of Urdu interspersed with English phrases, in a structure that suggested the model didn't understand the literary brief at all.

Test 3: Roman-Urdu to Nastaliq transliteration

This is what most Pakistani builders actually want — convert a WhatsApp message in Roman Urdu (Latin alphabet) to proper Nastaliq.

The input message: 150 words of mixed Roman Urdu with embedded English words and Punjabi-isms (bhabhi ji aaj kuch ho sakta hai? Kal nahin aa sakte. Office mai kaafi rush hai).

Model	Phonetic accuracy	Spelling correctness	Code-switch handling	Overall
Claude 4.7 Opus	9/10	9/10	9/10	9.0
GPT-5	8/10	7/10	7/10	7.3
Gemini 3 Pro	7/10	6/10	6/10	6.3
Grok 4	7/10	6/10	8/10	7.0
DeepSeek V4	3/10	2/10	1/10	2.0

Claude was the only model that consistently distinguished ج from ز from ذ in transliteration — a basic test that most Urdu transliteration apps still fail. It also correctly preserved English words in Roman script within the Urdu sentence (the way actual Pakistanis text), rather than awkwardly transliterating them.
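
Both behaviours are hard to get from letter-by-letter rules, because a single Roman letter can stand for several Urdu letters (a Roman "z" may be ز, ذ, ض, or ظ depending on the word). The sketch below shows the word-level approach instead. It is an illustration under stated assumptions, not how any of the tested models work: the tiny lexicon, the English keep-list, and the function name are all hypothetical, and punctuation handling is left out.

```python
# Hypothetical mini-lexicon: Roman-Urdu word -> Urdu spelling.
# The first three entries all start with Roman "z" but use different letters.
LEXICON = {
    "zindagi": "زندگی",
    "zara": "ذرا",
    "zaroor": "ضرور",
    "mai": "میں",
    "kaafi": "کافی",
    "hai": "ہے",
}

# Hypothetical list of English words that stay in Roman script,
# the way people actually text.
ENGLISH_KEEP = {"office", "rush"}


def transliterate(message):
    """Word-level transliteration that never guesses at unknown words."""
    out = []
    for token in message.split():
        key = token.lower()
        if key in ENGLISH_KEEP:
            out.append(token)            # preserve English as typed
        elif key in LEXICON:
            out.append(LEXICON[key])     # known Urdu word
        else:
            out.append(f"[{token}]")     # flag for review instead of inventing
    return " ".join(out)


print(transliterate("Office mai kaafi rush hai"))
print(transliterate("zara zaroor zindagi"))
```

Flagging unknown tokens in brackets is a deliberate choice here: in a product, an obviously unconverted word is easier to catch than a plausible-looking invented one.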

GPT-5 got tripped up on Punjabi-flavoured words; it tried to force them into formal Urdu equivalents that no one uses.

Gemini transliterated some words to Hindi rather than Urdu when ambiguous — defaulting to the Indian rather than Pakistani convention.

Grok handled code-switching well but its spelling was unreliable on uncommon names.

DeepSeek essentially failed at this task.

What this means for builders

If you're building a Pakistani-market product that uses AI to handle Urdu:

Use Claude 4.7 Opus or Claude 4.6 Sonnet. This is the only clear "ship it" answer. Sonnet handles 90% of what Opus does at one-fifth the cost.
GPT-5 is a viable fallback when you need OpenAI-specific features (better function calling, structured outputs).
Avoid DeepSeek for any Urdu workload. It's great for English code generation but should not touch your Urdu pipeline in 2026.
Don't use Gemini for original Urdu writing — translation only.
Grok is interesting for conversational Pakistani-flavoured English-to-Urdu but I wouldn't bet a product on it.
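
In a product, these recommendations usually end up as a small routing table. The sketch below is one way to encode them. The task labels, the preference order within each list, and the function name are assumptions, and the model names are plain labels from this article rather than real API identifiers.

```python
# Preferred models per task, best first, based on the recommendations above.
ROUTES = {
    "translation": ["Claude 4.7 Opus", "Claude 4.6 Sonnet", "GPT-5", "Gemini 3 Pro"],
    "original_writing": ["Claude 4.7 Opus", "Claude 4.6 Sonnet", "GPT-5"],
    "transliteration": ["Claude 4.7 Opus", "Claude 4.6 Sonnet", "GPT-5"],
}

# Models that should never receive Urdu workloads.
BLOCKED = {"DeepSeek V4"}


def pick_model(task, available):
    """Return the first preferred model for the task that is currently available."""
    for model in ROUTES[task]:
        if model in available and model not in BLOCKED:
            return model
    raise LookupError(f"no acceptable model available for task: {task}")


# Example: the Claude models are unavailable, so translation falls back to GPT-5.
print(pick_model("translation", {"GPT-5", "Gemini 3 Pro", "DeepSeek V4"}))
```

Raising an error when nothing acceptable is available is intentional: silently falling through to a blocked model is how bad Urdu reaches users.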

For Pakistani consumers using AI personally:

For writing in Urdu (essays, posts, business correspondence): Claude is dramatically better.
For translating English news / documents into Urdu: GPT-5 or Claude both work.
For Roman-Urdu WhatsApp transliteration: Claude, no contest.

Why Claude is so much better

This isn't a guess. Anthropic has been notably explicit about pre-training on a higher proportion of non-English content than OpenAI or Google. Their data card mentions South Asian languages specifically. They also publicly hire native speakers as reviewers for non-English red-teaming and quality evaluation.

The result is what we just saw: Claude's Urdu reads as actual Urdu, while GPT's Urdu reads like "Urdu produced by translating English internally first."

This advantage probably doesn't hold forever. As multilingual training becomes a competitive frontier, OpenAI and Google will catch up. For now, Claude wins.

The samples (for those who want to judge for themselves)

We'll publish the full test outputs from all five models in a follow-up piece. If you want to be notified when that drops, subscribe to The 48° Brief.

Frequently asked questions

Is this test reproducible?

Yes. The three prompts and scoring rubric are documented above. We'd expect a careful Urdu speaker testing fresh chats to get within ±1 point of our scores per model.
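
If you rerun the tests, a quick way to compare your numbers with ours is to flag any model that lands outside that ±1 band. This is a minimal sketch that assumes the tolerance applies to the Overall score; the function name and the sample scores in `mine` are hypothetical.

```python
# Published Overall scores for Test 1 (translation).
PUBLISHED = {
    "Claude 4.7 Opus": 9.0,
    "GPT-5": 8.0,
    "Gemini 3 Pro": 7.3,
    "Grok 4": 7.7,
    "DeepSeek V4": 4.7,
}


def outside_tolerance(mine, published, tol=1.0):
    """Return {model: difference} for scores more than `tol` away from published."""
    flagged = {}
    for model, ours in published.items():
        if model not in mine:
            continue  # skip models the replicator did not test
        diff = mine[model] - ours
        if abs(diff) > tol + 1e-9:  # small epsilon guards against float noise
            flagged[model] = round(diff, 1)
    return flagged


# Hypothetical replication scores, for illustration only.
mine = {"Claude 4.7 Opus": 8.5, "GPT-5": 6.0}
print(outside_tolerance(mine, PUBLISHED))
```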

Did you control for prompt order?

Yes. Each model got a fresh chat. Each test was run twice with the order of models alternated. The rankings were consistent across runs.

Why didn't you include open-source models?

Llama 4 and Qwen 3 produced lower-quality Urdu than even DeepSeek; they would have just expanded the "poor" tier. We may include them in a follow-up benchmark.

Why didn't you include Pakistani-specific models?

There aren't any frontier-quality Pakistani-trained models in 2026. Various research projects exist; none are competitive with the foundation models tested here.

Will this change in 2027?

Almost certainly. Multilingual training has become a competitive frontier — Anthropic, OpenAI, and Google all have aggressive roadmaps. Expect this benchmark to look very different in 12 months.

About the author

Faizan Khan

Founder & Editor

Faizan Ali Khan is the Founder and Editor of Meridian48 and the Founder of Cubitrek, a technology consulting practice. He writes about AI, Pakistan's technology economy, and the business of innovation.