Every AI acronym you will see in 2026, explained in one sentence each

The full reference. 70+ terms, organised by category, one sentence each. Bookmark this and stop pretending you know what RAG, RLHF, MoE, FLOPs, or KV cache actually mean.

Faizan KhanFounder & Editor · Meridian48June 21, 2026 · 7 min read

A glowing visualisation of neural network connections in cobalt blue against a dark background. — Photograph by Steve Johnson / Unsplash

The short version. Most AI conversations in 2026 are full of three-letter shorthand. This page explains every term you will hear, in plain English, one sentence each. Bookmark it. Reference it. Stop nodding when someone says "MoE" without knowing what it means.

We grouped the 70+ terms into eight categories. Skip to whichever section matters.

Model architecture

Term	What it means
LLM	Large Language Model — an AI trained on huge text corpora that predicts the next word.
MLM	Masked Language Model — predicts missing words in a sentence rather than the next word.
VLM	Vision Language Model — handles both text and images in the same model (e.g. GPT-5, Claude).
MMM	Multimodal Model — handles text, images, audio, video, sometimes all at once.
MoE	Mixture of Experts — a model with many specialised sub-networks; only the relevant ones activate per query (e.g. Mixtral, DeepSeek).
SLM	Small Language Model — smaller, faster, cheaper variant (e.g. Claude Haiku, GPT-5 Mini).
SSM	State Space Model — alternative architecture to Transformer that scales better on long sequences (Mamba is the famous example).
MoR	Mixture of Recursions — newer architecture that adapts compute per token.

Training stages

Pre-training: the initial massive training on internet-scale text.
Fine-tuning: smaller secondary training on specific data.
SFT (Supervised Fine-Tuning): fine-tuning on labelled human examples.
RLHF (Reinforcement Learning from Human Feedback): training the model to prefer outputs humans rate highly.
DPO (Direct Preference Optimization): cheaper alternative to RLHF, same goal.
Constitutional AI: Anthropic's method that uses an AI to critique itself against written rules.
Distillation: training a smaller model to mimic a larger one's outputs.
LoRA (Low-Rank Adaptation): cheap fine-tuning that updates only a tiny part of the model.
QLoRA: LoRA but on quantised models, even cheaper.

Performance and capability

Term	What it means
FLOPs	Floating Point Operations — measures how much compute a model uses; more FLOPs ≈ more capability.
TFLOPs	Trillion FLOPs per second — chip throughput metric.
Token	Smallest unit of text the model processes (~3 to 5 characters of English).
Context window	How much input the model can read at once (200K, 1M, 2M tokens).
KV cache	Memory of past tokens during inference — what makes long conversations slow and expensive.
Latency	Wall-clock time from prompt to response.
TTFT	Time To First Token — how fast the model starts replying.
Throughput	Tokens per second once it's replying.
TPS	Same thing — tokens per second.
Perplexity	Measure of how surprised the model is by text; lower = better at predicting it.

Retrieval and memory

RAG (Retrieval-Augmented Generation): fetching relevant data and stuffing it into the prompt before generation. See our field guide to RAG.
Vector database: stores text as mathematical embeddings for similarity search.
Embedding: a vector of numbers representing the meaning of text.
Cosine similarity: the standard way to compare two embeddings.
Hybrid search: mixing vector similarity with keyword search for better retrieval.
Reranker: a second model that re-orders retrieval results by relevance.
Chunking: splitting documents into small pieces for embedding.
Cache: reusing past computations to save cost (provider-specific term, not just RAG).
Long context: loading the entire document into the context window instead of using RAG.

Reasoning and agents

Term	What it means
CoT	Chain of Thought — the model writes out its reasoning step-by-step.
ToT	Tree of Thoughts — the model explores multiple reasoning paths.
ReAct	Reasoning + Acting — the model alternates between thinking and using tools.
Agent	A program that uses an LLM to autonomously plan and execute tasks.
Multi-agent	Multiple LLM-powered agents working together.
Tool use	The model calling external functions (web search, calculator, code execution).
Function calling	Structured tool use with JSON schemas.
Reflection	The model critiques its own output before finalising.
Deep thinking	High-compute reasoning mode (e.g. Claude 4.7 Deep Thinking, OpenAI o-series).
Inference-time compute	Spending more compute at query time rather than training time.

Evaluation and safety

Benchmark: a standard test (MMLU, HumanEval, MATH).
MMLU: Massive Multitask Language Understanding — general knowledge test.
HumanEval: coding benchmark from OpenAI.
MATH: mathematics benchmark.
HELM: Stanford's holistic evaluation framework.
Eval: any evaluation; engineers use this constantly.
Hallucination: model producing confident, fluent, wrong output.
Jailbreak: prompt designed to get the model to bypass its safety training.
Red team: people whose job is to find vulnerabilities in models.
Alignment: training the model to do what the user actually wants.
AGI (Artificial General Intelligence): AI matching human cognitive ability across all domains; ill-defined.
ASI (Artificial Super Intelligence): AI exceeding human ability across all domains.

Inference and deployment

Term	What it means
Inference	Running a trained model to get outputs.
Batching	Processing many requests together for cheaper compute.
Quantisation	Reducing model precision (FP16 → INT8) to make it faster and smaller.
Pruning	Removing model weights that contribute little to outputs.
Speculative decoding	A smaller model drafts tokens that the larger model verifies; faster.
Continuous batching	Allowing new requests to join an in-flight batch.
Streaming	Tokens delivered as they generate, not all at once.
Cold start	Latency on the first request after the model loaded.
GPU / TPU / LPU	Different chip types for ML inference (Nvidia / Google / Groq).
VRAM	GPU memory; the limiting factor for running large models locally.

Pricing and business

Per-token pricing: charged per million tokens of input and output separately.
API: Application Programming Interface — the developer-facing way to call models.
SDK: Software Development Kit — official client library for an API.
Rate limit: maximum calls per minute or per day for a given API key.
Cached input pricing: discount when you re-send a prompt the provider has seen recently.
Batch API: cheaper async processing for large workloads (50% discount typical).
Reserved capacity: committing to spend in exchange for guaranteed throughput.

Compare current per-token pricing across providers on our AI Pricing Tracker.

Terms that became commonplace in 2026

These weren't mainstream in 2024. They are now.

Agentic: anything to do with AI agents rather than chat-only models.
Inference-time scaling: the realisation that spending compute at query time often beats training a bigger model.
Test-time compute: the same idea, different phrase.
GenAI: Generative AI; used by enterprise buyers more than builders.
AI Overview: Google's AI-generated summary at the top of search results. See our playbook for ranking in AI Overviews.
AI search: Perplexity-style query interfaces.
Codegen: code generation via AI (Cursor, Copilot, Windsurf).
Vibe coding: loosely-specified development where the AI handles the details; sometimes used pejoratively.

How to memorise these

You will not memorise them by reading this page once. Three things that actually work:

Open this page when you hear a new term. Builders do this constantly; it's not cheating.
Use the term yourself. Explain RAG to a colleague today; you'll never forget it.
Read one paper a month. Pick one term from this list each month and find the paper that introduced it.

Frequently asked questions

Which acronym matters most for a non-technical reader?

LLM (Large Language Model), RAG (Retrieval-Augmented Generation), and tokens. Those three carry 70% of all AI conversations in 2026.

What is the difference between fine-tuning and RAG?

Fine-tuning bakes new knowledge into the model's weights at training time. RAG fetches knowledge at query time. They solve different problems and are often used together.

Why are there so many acronyms?

The field moves fast and uses dense academic naming conventions. Every paper introduces a new acronym; most don't stick.

Will this page stay current?

Yes. We update this glossary monthly. The version of any AI model and any specific benchmark cited here may date, but the underlying concepts do not.

Related on Meridian48

The 48° Brief

One email. The week in AI, Pakistan tech, and global business.

Curated by Faizan Khan. No filler. Unsubscribe in one click.

About the author

Faizan Khan

Founder & Editor

Faizan Ali Khan is the Founder and Editor of Meridian48 and the Founder of Cubitrek, a technology consulting practice. He writes about AI, the technology business, and the policy shaping both.