Every AI acronym you will see in 2026, explained in one sentence each
The full reference. 70+ terms, organised by category, one sentence each. Bookmark this and stop pretending you know what RAG, RLHF, MoE, FLOPs, or KV cache actually mean.

The short version. Most AI conversations in 2026 are full of three-letter shorthand. This page explains every term you will hear, in plain English, one sentence each. Bookmark it. Reference it. Stop nodding when someone says "MoE" without knowing what it means.
We grouped the 70+ terms into eight categories. Skip to whichever section matters.
Model architecture
| Term | What it means |
|---|---|
| LLM | Large Language Model — an AI trained on huge text corpora that predicts the next word. |
| MLM | Masked Language Model — predicts missing words in a sentence rather than the next word. |
| VLM | Vision Language Model — handles both text and images in the same model (e.g. GPT-5, Claude). |
| MMM | Multimodal Model — handles text, images, audio, video, sometimes all at once. |
| MoE | Mixture of Experts — a model with many specialised sub-networks; a router activates only a few of them for each token (e.g. Mixtral, DeepSeek). |
| SLM | Small Language Model — smaller, faster, cheaper variant (e.g. Claude Haiku, GPT-5 Mini). |
| SSM | State Space Model — alternative architecture to Transformer that scales better on long sequences (Mamba is the famous example). |
| MoR | Mixture of Recursions — newer architecture that adapts compute per token. |
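
The routing idea behind MoE is easier to see in code. What follows is a minimal sketch, assuming a router that scores every expert and keeps the top two for a token. The scores, the four-expert setup, and the function names are illustrative assumptions, not taken from any particular model.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalise their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}

# Hypothetical router scores for one token across four experts.
weights = route_token([2.0, 0.5, 1.0, -1.0], top_k=2)
print(weights)  # only experts 0 and 2 receive a weight; the others stay idle
```

Only the chosen experts run for that token, which is why an MoE model can hold many parameters while spending far less compute per token than its total size suggests.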
Training stages
- Pre-training: the initial massive training on internet-scale text.
- Fine-tuning: smaller secondary training on specific data.
- SFT (Supervised Fine-Tuning): fine-tuning on labelled human examples.
- RLHF (Reinforcement Learning from Human Feedback): training the model to prefer outputs humans rate highly.
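- DPO (Direct Preference Optimization): cheaper alternative to RLHF with the same goal; it trains directly on pairs of preferred and rejected answers, with no separate reward model.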
- Constitutional AI: Anthropic's method that uses an AI to critique itself against written rules.
- Distillation: training a smaller model to mimic a larger one's outputs.
- LoRA (Low-Rank Adaptation): cheap fine-tuning that freezes the original weights and trains only small low-rank add-on matrices.
- QLoRA: LoRA but on quantised models, even cheaper.
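
To see why LoRA is cheap, it helps to count trainable parameters. This is a rough sketch, assuming a single 4096 x 4096 weight matrix and a rank of 8; both numbers are illustrative assumptions rather than settings from any specific model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

d = 4096  # assumed hidden size
full = full_params(d, d)
lora = lora_params(d, d, rank=8)
print(full, lora, f"{lora / full:.2%}")
```

Under these assumptions the adapter trains well under one percent of the parameters that full fine-tuning of the same matrix would touch.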
Performance and capability
| Term | What it means |
|---|---|
| FLOPs | Floating Point Operations — a count of the arithmetic a model performs; more training FLOPs ≈ more capability. |
| TFLOPS | Trillions of floating-point operations per second — chip throughput metric (note the capital S, for "per second"). |
| Token | Smallest unit of text the model processes (~3 to 5 characters of English). |
| Context window | How much input the model can read at once (200K, 1M, 2M tokens). |
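| KV cache | Key-Value cache — stored attention keys and values for past tokens, kept so they are not recomputed; it speeds up generation, but its memory grows with every token, which is what makes long conversations expensive. |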
| Latency | Wall-clock time from prompt to response. |
| TTFT | Time To First Token — how fast the model starts replying. |
| Throughput | Tokens per second once it's replying. |
| TPS | Same thing — tokens per second. |
| Perplexity | Measure of how surprised the model is by text; lower = better at predicting it. |
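
Perplexity has a compact definition: the exponential of the average negative log-probability the model assigned to the tokens that actually appeared. Here is a minimal sketch, assuming you already have those per-token probabilities; the probability lists are made-up examples.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model gave to each correct next token.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.25])
print(confident < unsure)  # the confident model has the lower perplexity
```

A handy sanity check: a model that spreads its guess evenly over four options, giving each a probability of 0.25, has a perplexity of 4.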
Retrieval and memory
- RAG (Retrieval-Augmented Generation): fetching relevant data and stuffing it into the prompt before generation. See our field guide to RAG.
- Vector database: stores text as mathematical embeddings for similarity search.
- Embedding: a vector of numbers representing the meaning of text.
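- Cosine similarity: the standard way to compare two embeddings; it measures the angle between the vectors, so a score near 1 means similar meaning and a score near 0 means unrelated.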
- Hybrid search: mixing vector similarity with keyword search for better retrieval.
- Reranker: a second model that re-orders retrieval results by relevance.
- Chunking: splitting documents into small pieces for embedding.
- Cache: reusing past computations to save cost (provider-specific term, not just RAG).
- Long context: loading the entire document into the context window instead of using RAG.
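
Several of the terms above fit together into one small pipeline: chunk, embed, compare, then stuff the best chunk into the prompt. This sketch is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and the document, the query, and every function name are hypothetical.

```python
import math
from collections import Counter

def chunk(text, size=8):
    """Split text into chunks of `size` words (real systems often overlap chunks)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_k=1):
    """Rank chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

document = (
    "The refund window is thirty days from delivery. "
    "Shipping takes five working days inside the country."
)
query = "How long is the refund window"
chunks = chunk(document)
context = "\n".join(retrieve(query, chunks))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

A production setup would swap `embed` for a real embedding model and the sorted list for a vector database, but the shape of the flow stays the same.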
Reasoning and agents
| Term | What it means |
|---|---|
| CoT | Chain of Thought — the model writes out its reasoning step-by-step. |
| ToT | Tree of Thoughts — the model explores multiple reasoning paths. |
| ReAct | Reasoning + Acting — the model alternates between thinking and using tools. |
| Agent | A program that uses an LLM to autonomously plan and execute tasks. |
| Multi-agent | Multiple LLM-powered agents working together. |
| Tool use | The model calling external functions (web search, calculator, code execution). |
| Function calling | Structured tool use with JSON schemas. |
| Reflection | The model critiques its own output before finalising. |
| Deep thinking | High-compute reasoning mode (e.g. Claude 4.7 Deep Thinking, OpenAI o-series). |
| Inference-time compute | Spending more compute at query time rather than training time. |
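
Function calling is easier to picture with a concrete round trip. In this sketch the tool definition, the `get_weather` stub, and the JSON shape of the model's call are all hypothetical; real providers use similar but not identical formats, so treat it as an illustration of the pattern rather than any vendor's API.

```python
import json

# Hypothetical tool definition: a name, a description, and a JSON-schema-style
# description of the arguments the model is allowed to pass.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city):
    """Stub implementation; a real tool would call a weather service."""
    return {"city": city, "temp_c": 31}

# Registry mapping each tool name to its schema and its Python function.
TOOLS = {"get_weather": (WEATHER_TOOL, get_weather)}

def run_tool_call(raw_call):
    """Parse a model-emitted JSON tool call, check required arguments, run it."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    schema, func = TOOLS[name]
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    return func(**args)

# Pretend the model replied with this structured call instead of prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Karachi"}}'
result = run_tool_call(model_output)
print(result)
```

In an agent loop, the result would be sent back to the model as a new message so it can continue reasoning with the tool's answer.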
Evaluation and safety
- Benchmark: a standard test (MMLU, HumanEval, MATH).
- MMLU: Massive Multitask Language Understanding — general knowledge test.
- HumanEval: coding benchmark from OpenAI.
- MATH: mathematics benchmark.
- HELM: Stanford's holistic evaluation framework.
- Eval: any evaluation; engineers use this constantly.
- Hallucination: model producing confident, fluent, wrong output.
- Jailbreak: prompt designed to get the model to bypass its safety training.
- Red team: people whose job is to find vulnerabilities in models.
- Alignment: getting the model to do what people actually intend, in line with human values, rather than merely what the prompt literally says.
- AGI (Artificial General Intelligence): AI matching human cognitive ability across all domains; ill-defined.
- ASI (Artificial Super Intelligence): AI exceeding human ability across all domains.
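
At its simplest, an eval is a loop: ask questions, compare answers to references, report a score. The sketch below assumes exact-match scoring and uses a canned `stub_model` in place of a real model call; the dataset and all names are hypothetical.

```python
def stub_model(question):
    """Hypothetical stand-in for a real model call, with one wrong answer baked in."""
    canned = {"2 + 2": "4", "Capital of France": "Paris", "7 * 6": "41"}
    return canned.get(question, "")

def exact_match_accuracy(model, dataset):
    """Fraction of questions where the answer matches the reference, ignoring case."""
    correct = sum(
        1 for question, reference in dataset
        if model(question).strip().lower() == reference.strip().lower()
    )
    return correct / len(dataset)

DATASET = [("2 + 2", "4"), ("Capital of France", "paris"), ("7 * 6", "42")]
score = exact_match_accuracy(stub_model, DATASET)
print(f"{score:.0%}")
```

Real benchmarks differ mainly in the dataset and the scoring rule (multiple choice, unit tests for code, graded maths answers), not in this basic structure.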
Inference and deployment
| Term | What it means |
|---|---|
| Inference | Running a trained model to get outputs. |
| Batching | Processing many requests together for cheaper compute. |
| Quantisation | Reducing model precision (FP16 → INT8) to make it faster and smaller. |
| Pruning | Removing model weights that contribute little to outputs. |
| Speculative decoding | A smaller model drafts tokens that the larger model verifies; faster. |
| Continuous batching | Allowing new requests to join an in-flight batch. |
| Streaming | Tokens delivered as they generate, not all at once. |
| Cold start | Extra latency on the first request, while the model is still being loaded into memory. |
| GPU / TPU / LPU | Different chip types for ML inference (Nvidia / Google / Groq). |
| VRAM | GPU memory; the limiting factor for running large models locally. |
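
Quantisation can be sketched in a few lines. This example assumes the simplest scheme, symmetric "absmax" scaling of floats onto 8-bit integers; production libraries use more refined variants, and the weights here are made up.

```python
def quantise_int8(weights):
    """Symmetric absmax quantisation: map floats onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantised]

weights = [0.42, -1.27, 0.05, 0.9]  # hypothetical FP weights
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
print(quantised)
print(restored)
```

Each weight now needs one byte instead of two (FP16) or four (FP32), at the cost of a small rounding error of at most half the scale per weight.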
Pricing and business
- Per-token pricing: charged per million tokens of input and output separately.
- API: Application Programming Interface — the developer-facing way to call models.
- SDK: Software Development Kit — official client library for an API.
- Rate limit: maximum calls per minute or per day for a given API key.
- Cached input pricing: discount on input tokens when your prompt starts with a prefix the provider has recently processed and cached.
- Batch API: cheaper async processing for large workloads (50% discount typical).
- Reserved capacity: committing to spend in exchange for guaranteed throughput.
Compare current per-token pricing across providers on our AI Pricing Tracker.
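
The pricing terms above combine into simple arithmetic. This is a rough sketch with hypothetical numbers: $3 per million input tokens, $15 per million output tokens, and a 90% cached-input discount are assumptions for illustration, not any provider's actual rates. Only the 50% batch discount comes from the list above.

```python
def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_tokens=0, cached_discount=0.0, batch_discount=0.0):
    """Cost in dollars for one request; prices are per million tokens."""
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * in_price
        + cached_tokens * in_price * (1 - cached_discount)
        + output_tokens * out_price
    ) / 1_000_000
    return cost * (1 - batch_discount)

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
base = request_cost(10_000, 1_000, 3.0, 15.0)
batched = request_cost(10_000, 1_000, 3.0, 15.0, batch_discount=0.5)
cached = request_cost(10_000, 1_000, 3.0, 15.0,
                      cached_tokens=8_000, cached_discount=0.9)
print(f"base ${base:.4f}, batched ${batched:.4f}, cached ${cached:.4f}")
```

Note how output tokens dominate at these assumed rates: 1,000 output tokens cost half as much as 10,000 input tokens.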
Terms that became commonplace in 2026
These weren't mainstream in 2024. They are now.
- Agentic: anything to do with AI agents rather than chat-only models.
- Inference-time scaling: the realisation that spending compute at query time often beats training a bigger model.
- Test-time compute: the same idea, different phrase.
- GenAI: Generative AI; used by enterprise buyers more than builders.
- AI Overview: Google's AI-generated summary at the top of search results. See our playbook for ranking in AI Overviews.
- AI search: Perplexity-style query interfaces.
- Codegen: code generation via AI (Cursor, Copilot, Windsurf).
- Vibe coding: loosely-specified development where the AI handles the details; sometimes used pejoratively.
How to memorise these
You will not memorise them by reading this page once. Three things that actually work:
- Open this page when you hear a new term. Builders do this constantly; it's not cheating.
- Use the term yourself. Explain RAG to a colleague today; you'll never forget it.
- Read one paper a month. Pick one term from this list each month and find the paper that introduced it.
Frequently asked questions
Which acronym matters most for a non-technical reader?
LLM (Large Language Model), RAG (Retrieval-Augmented Generation), and tokens. Those three carry 70% of all AI conversations in 2026.
What is the difference between fine-tuning and RAG?
Fine-tuning bakes new knowledge into the model's weights at training time. RAG fetches knowledge at query time. They solve different problems and are often used together.
Why are there so many acronyms?
The field moves fast and uses dense academic naming conventions. Every paper introduces a new acronym; most don't stick.
Will this page stay current?
Yes. We update this glossary monthly. The version of any AI model and any specific benchmark cited here may date, but the underlying concepts do not.
Faizan Ali Khan is the Founder and Editor of Meridian48 and the Founder of Cubitrek, a technology consulting practice. He writes about AI, the technology business, and the policy shaping both.