
What is RAG? A builder's field guide to retrieval-augmented generation in 2026

Retrieval-augmented generation is the most useful AI technique most builders still don't understand. Here's what it is, when to use it, when not to, and the architecture choices that actually matter.

Faizan Ali Khan · Founder & Editor, Meridian48 · May 15, 2026 · 7 min read
Photograph by Alexandre Debiève / Unsplash

The short version. Retrieval-augmented generation (RAG) is a pattern for making AI systems answer questions about data the model was never trained on. You retrieve relevant chunks from a knowledge source, stuff them into the prompt, and ask the model to answer using only that context. It's simple, it's the right answer for 70% of business AI use cases, and the way it's taught online is mostly wrong.

What RAG actually is

A retrieval-augmented system has three components:

  1. A knowledge source. Documents, database rows, web pages, internal wiki, customer-support tickets — anything textual.
  2. A retriever. Given a user query, returns the most relevant chunks from the knowledge source.
  3. A generator. A large language model (Claude, GPT, Gemini) that takes the user query and the retrieved chunks and produces an answer.

The whole thing is "ask a question → find relevant text → answer using the relevant text." That's the entire concept.

Why RAG matters

Large language models have two problems that block them from most real business uses out of the box. RAG solves both.

The first problem: the model doesn't know about your data. Claude 4.7 has never seen your company's internal wiki. It has never seen your customer database. It has never seen yesterday's sales numbers. Without retrieval, the model will either refuse to answer or make up plausible-sounding nonsense.

The second problem: the model's knowledge has a cutoff date. Even the most recent frontier models stopped reading the internet weeks or months ago. Without retrieval, the model can't answer questions about anything that happened after its training cutoff. Our AI Pricing Tracker shows the cutoff dates per model.

RAG sidesteps both by reading the relevant content at query time and feeding it to the model fresh. The model becomes essentially a reasoning engine on top of your data, not a closed database.

The simplest possible RAG system

In code, the entire concept is about 50 lines; a sketch follows the steps below. Conceptually:

  1. Take your knowledge source. Split it into chunks of ~500 tokens each (a few paragraphs of prose).
  2. For each chunk, compute an embedding — a vector of numbers that represents the chunk's meaning. Models like OpenAI's text-embedding-3 or Cohere's embed-english-v4 do this for you in one API call.
  3. Store the chunks and their embeddings in a database that supports vector search (Pinecone, Weaviate, pgvector, or just a flat file for small datasets).
  4. When a user asks a question, embed the question with the same model, find the chunks with the closest embeddings (usually using cosine similarity), and return the top 5–10.
  5. Put the question and the retrieved chunks into a prompt template and send it to a generator model. Ask it to answer using only the provided context.

That's RAG. Everything else is optimisation.
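
Here is that pipeline as a minimal sketch in Python, assuming the OpenAI SDK for both embedding and generation (any provider works), character-based chunking for brevity, and numpy for the similarity math. The model names and prompt wording are illustrative, not prescriptive:

```python
# Minimal RAG: chunk, embed, keep vectors in memory, retrieve, generate.
# Assumes `pip install openai numpy` and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text, size=2000, overlap=400):
    """Naive character-based chunking: ~2,000 chars is roughly 500 tokens."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(texts):
    """One API call per batch; returns an (n, dim) matrix."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(documents):
    chunks = [c for doc in documents for c in chunk(doc)]
    return chunks, embed(chunks)

def retrieve(query, chunks, vectors, k=5):
    """Cosine similarity between the query vector and every chunk vector."""
    q = embed([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query, chunks, vectors):
    context = "\n---\n".join(retrieve(query, chunks, vectors))
    resp = client.chat.completions.create(
        model="gpt-4o",  # swap in any generator model
        messages=[{"role": "user", "content":
            "Answer using only the context below. If the answer is not "
            f"there, say so.\n\nContext:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content
```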

When RAG is the right answer

RAG is the right architecture when:

  • The knowledge source is too big to fit in the context window. Even with Gemini 3 Pro's 2M token window, putting your entire 50,000-document corpus in every prompt is wasteful and slow.
  • The knowledge source changes frequently. RAG lets you update one document without retraining anything. Fine-tuning the model bakes the data in; RAG keeps it queryable.
  • You need citations. RAG naturally surfaces which document a piece of information came from. Fine-tuning hides this.
  • Compliance requires you to know what was retrieved. Audited industries (legal, medical, financial) need to be able to reconstruct exactly what data was used to produce an answer.

When RAG is the wrong answer

RAG is the wrong architecture when:

  • The task is reasoning over a fixed small dataset. If your entire knowledge source fits in 100K tokens, just paste it into the prompt. Modern context windows are large enough.
  • The task requires deep stylistic alignment. Fine-tuning teaches the model your tone and style; RAG teaches it your facts. If you need both, you usually need fine-tuning.
  • You need very low latency. A RAG pipeline adds retrieval round-trip time (typically 100-400 ms) to every query. Real-time conversation use cases sometimes can't tolerate this; cache the retrieval if so.
  • The questions are mostly about reasoning, not lookup. "Solve this math problem" doesn't benefit from RAG. "Find me everything we've written about X" does.

The architecture choices that actually matter

Most RAG tutorials cover the wrong things. Here are the choices that move the needle in production.

Chunk size

The biggest single hyperparameter. Too small (50 tokens), and chunks lack context; the embedding can't represent meaning well. Too large (4,000 tokens), and a chunk contains too many ideas; retrieval becomes imprecise.

The right answer for most prose data is 500-1,000 tokens with 100-200 token overlap between adjacent chunks. Overlap matters because important context often spans the boundary between two arbitrary cuts.
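
A token-accurate version of that chunker, as a sketch using the tiktoken package; the encoding name is an assumption, so pick the one that matches your embedding model:

```python
import tiktoken

def chunk_tokens(text, size=800, overlap=150):
    """Overlapping token windows: adjacent chunks share `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```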

Embedding model

Three categories. OpenAI's text-embedding-3-large is the default; it's good, and most tutorials use it. Cohere's embed-english-v4 is competitive and sometimes better for specific domains. Open-source models like BGE-large are free, run locally, and land within 5% of the paid options on most benchmarks. For Pakistan-based builders worried about API costs, BGE-large is genuinely viable.
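
For the local route, a sketch using the sentence-transformers package; the BAAI/bge-large-en-v1.5 model ID is an assumption (check the current BGE release):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # downloads once, runs locally
# Normalised embeddings let a plain dot product serve as cosine similarity.
vectors = model.encode(["chunk one", "chunk two"], normalize_embeddings=True)
```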

Reranking

A second-pass model that takes the top 20 retrieved chunks and reranks them by actual relevance to the query. This single step typically improves answer quality more than any other RAG tweak. Cohere Rerank 3 and Jina Reranker are the popular options. Add this once you've confirmed your basic RAG pipeline works.
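
A sketch of that second pass using the Cohere Python SDK; the model name and the API-key handling are assumptions (check your account's current rerank model):

```python
import cohere

co = cohere.Client()  # assumes an API key in the environment

def rerank(query, candidates, top_n=5):
    """Re-order the top retrieved chunks by cross-encoder relevance."""
    resp = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=candidates, top_n=top_n)
    return [candidates[r.index] for r in resp.results]
```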

Hybrid retrieval

Pure vector search is good for semantic similarity ("find documents that mean similar things") but loses to keyword search for exact matches ("find documents mentioning 'Section 154A'"). Combining the two, typically BM25 keyword scoring blended with vector similarity at around a 30/70 weighting in favour of the vector score, adds another quality bump for free.
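
A sketch of that blend using the rank_bm25 package, with the 30/70 weighting from above; the score normalisation here is deliberately crude:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query_tokens, query_vec, doc_tokens, doc_vecs, kw_weight=0.3):
    """Blend normalised BM25 keyword scores with cosine similarity.

    doc_tokens is a list of token lists, e.g. [chunk.split() for chunk in chunks].
    """
    kw = BM25Okapi(doc_tokens).get_scores(query_tokens)
    kw = kw / max(kw.max(), 1e-9)  # squash keyword scores into [0, 1]
    sem = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return kw_weight * kw + (1 - kw_weight) * sem
```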

Query rewriting

Users ask vague questions. A pre-processing step that rewrites "Tell me about that thing we discussed yesterday" into a richer query (using the prior conversation as context) improves retrieval dramatically. This is one place where a small LLM call before retrieval is worth the latency.
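
A sketch of that pre-processing step, again via the OpenAI SDK; the model choice is an assumption, and any small, fast model will do:

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(history, question):
    """One cheap LLM call turns a vague follow-up into a standalone search query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            "Rewrite the final question as a self-contained search query, "
            "using the conversation for context. Return only the query.\n\n"
            f"Conversation:\n{history}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content.strip()
```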

Cost economics

A real RAG system at scale costs less than most builders fear. The headline numbers, using current 2026 prices:

  • Embedding generation: ~$0.13 per 1M input tokens (OpenAI). A 50,000-document corpus at ~1KB per document is ~12M tokens, so the initial embedding run costs ~$1.60. Re-embedding the whole thing weekly costs ~$7/month.
  • Vector storage: Pinecone's free tier covers 100K vectors. Beyond that, $0.40-$0.60 per 1M vectors per month. A million-document corpus runs ~$5/month.
  • Generation: This is the bulk of cost. Each query consumes the chunks (typically 3,000-5,000 retrieved tokens) plus the user query (200 tokens) plus the answer (300 tokens). At Claude Sonnet pricing (~$3/M input, $15/M output), one query costs ~$0.02. At Gemini 3 Flash, it's ~$0.002.
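
The per-query arithmetic, as a sketch; the prices are the figures quoted above, so treat them as assumptions and check current provider pricing:

```python
def query_cost(input_tokens=4_200, output_tokens=300,
               in_price_per_m=3.00, out_price_per_m=15.00):
    """Per-query cost in dollars: tokens times price per million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

print(f"${query_cost():.4f}")  # ~$0.0171 at Claude Sonnet prices, i.e. the ~$0.02 above
```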

Estimate your specific workload with our AI Cost Calculator.

What goes wrong in production

Three failure modes account for most unsatisfying RAG systems:

  1. Retrieval returns the wrong chunks. The query "what is our refund policy" gets the wrong policy because there are six similar-sounding policies in the knowledge base. Fix: better chunking strategy + reranking + metadata filtering.
  2. The model ignores the retrieved context. It answers from its own knowledge anyway, often confidently and wrong. Fix: prompt engineering with explicit "answer only using the provided context" instructions; consider switching to Claude (which respects context instructions better than most).
  3. The user's question can't actually be answered from the data. This is a product design failure, not a technical one. Fix: tell the user. "I don't see anything about that in our knowledge base" is a better answer than a confident hallucination.
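
A prompt template addressing failure modes 2 and 3, as a sketch; the exact wording is an assumption, and you should evaluate variants on your own data:

```python
# Explicit context-only instructions plus a sanctioned "not found" answer.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't see anything about that in our knowledge base."

Context:
{context}

Question: {question}"""

# Usage: GROUNDED_PROMPT.format(context=retrieved_text, question=user_query)
```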

Frequently asked questions

Do I need a vector database?

For small corpora (under 10,000 documents), no. A flat numpy file with cosine similarity in Python works fine. For larger or production workloads, Pinecone (managed), Weaviate (open source), or pgvector (Postgres extension) are the popular choices.
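
What that looks like in practice, as a sketch with placeholder data standing in for real embeddings:

```python
import numpy as np

vectors = np.random.rand(10_000, 1024).astype(np.float32)  # placeholder embeddings
q = np.random.rand(1024).astype(np.float32)                # placeholder query vector

np.save("vectors.npy", vectors)   # persist once after embedding
vectors = np.load("vectors.npy")  # reload at query time

sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
top_5 = np.argsort(sims)[::-1][:5]  # indices of the five closest chunks
```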

Is RAG dead now that context windows are huge?

No, despite some hot takes. Large context windows make "stuff everything in" viable for small datasets, but it's wasteful and slow for anything bigger. RAG remains the right answer for most real corpora.

How does this compare to fine-tuning?

RAG teaches the model facts at query time. Fine-tuning teaches the model patterns at training time. Use RAG for facts that change. Use fine-tuning for style, tone, or specific output formats. They are complementary, not competing.

Can I run this entirely from Pakistan?

Yes. Every component has a self-hostable option. The cheapest production-grade Pakistan-friendly stack: BGE-large embeddings (local), pgvector (Postgres self-hosted), DeepSeek V4 for generation (very cheap API). Total monthly cost for a small business workload: under $50.

About the author
Faizan Ali Khan, Founder & Editor

Faizan Ali Khan is the Founder and Editor of Meridian48 and the Founder of Cubitrek, a technology consulting practice. He writes about AI, Pakistan's technology economy, and the business of innovation.

Tags: RAG · AI · vector database · LLM · embeddings
