Tuesday, June 23, 2026Subscribe
Est. 2026 · A Faizan Khan Publication
Meridian48
Tech news, summarised. AI, business, devices, policy — what you actually need to know.

Every AI acronym you will see in 2026, explained in one sentence each

The full reference. 70+ terms, organised by category, one sentence each. Bookmark this and stop pretending you know what RAG, RLHF, MoE, FLOPs, or KV cache actually mean.

Faizan Ali Khan
Faizan KhanFounder & Editor · Meridian48 · 7 min read
A glowing visualisation of neural network connections in cobalt blue against a dark background.
Photograph by Steve Johnson / Unsplash

The short version. Most AI conversations in 2026 are full of three-letter shorthand. This page explains every term you will hear, in plain English, one sentence each. Bookmark it. Reference it. Stop nodding when someone says "MoE" without knowing what it means.

We grouped the 70+ terms into eight categories. Skip to whichever section matters.

Model architecture

TermWhat it means
LLMLarge Language Model — an AI trained on huge text corpora that predicts the next word.
MLMMasked Language Model — predicts missing words in a sentence rather than the next word.
VLMVision Language Model — handles both text and images in the same model (e.g. GPT-5, Claude).
MMMMultimodal Model — handles text, images, audio, video, sometimes all at once.
MoEMixture of Experts — a model with many specialised sub-networks; only the relevant ones activate per query (e.g. Mixtral, DeepSeek).
SLMSmall Language Model — smaller, faster, cheaper variant (e.g. Claude Haiku, GPT-5 Mini).
SSMState Space Model — alternative architecture to Transformer that scales better on long sequences (Mamba is the famous example).
MoRMixture of Recursions — newer architecture that adapts compute per token.

Training stages

  • Pre-training: the initial massive training on internet-scale text.
  • Fine-tuning: smaller secondary training on specific data.
  • SFT (Supervised Fine-Tuning): fine-tuning on labelled human examples.
  • RLHF (Reinforcement Learning from Human Feedback): training the model to prefer outputs humans rate highly.
  • DPO (Direct Preference Optimization): cheaper alternative to RLHF, same goal.
  • Constitutional AI: Anthropic's method that uses an AI to critique itself against written rules.
  • Distillation: training a smaller model to mimic a larger one's outputs.
  • LoRA (Low-Rank Adaptation): cheap fine-tuning that updates only a tiny part of the model.
  • QLoRA: LoRA but on quantised models, even cheaper.

Performance and capability

TermWhat it means
FLOPsFloating Point Operations — measures how much compute a model uses; more FLOPs ≈ more capability.
TFLOPsTrillion FLOPs per second — chip throughput metric.
TokenSmallest unit of text the model processes (~3 to 5 characters of English).
Context windowHow much input the model can read at once (200K, 1M, 2M tokens).
KV cacheMemory of past tokens during inference — what makes long conversations slow and expensive.
LatencyWall-clock time from prompt to response.
TTFTTime To First Token — how fast the model starts replying.
ThroughputTokens per second once it's replying.
TPSSame thing — tokens per second.
PerplexityMeasure of how surprised the model is by text; lower = better at predicting it.

Retrieval and memory

  • RAG (Retrieval-Augmented Generation): fetching relevant data and stuffing it into the prompt before generation. See our field guide to RAG.
  • Vector database: stores text as mathematical embeddings for similarity search.
  • Embedding: a vector of numbers representing the meaning of text.
  • Cosine similarity: the standard way to compare two embeddings.
  • Hybrid search: mixing vector similarity with keyword search for better retrieval.
  • Reranker: a second model that re-orders retrieval results by relevance.
  • Chunking: splitting documents into small pieces for embedding.
  • Cache: reusing past computations to save cost (provider-specific term, not just RAG).
  • Long context: loading the entire document into the context window instead of using RAG.

Reasoning and agents

TermWhat it means
CoTChain of Thought — the model writes out its reasoning step-by-step.
ToTTree of Thoughts — the model explores multiple reasoning paths.
ReActReasoning + Acting — the model alternates between thinking and using tools.
AgentA program that uses an LLM to autonomously plan and execute tasks.
Multi-agentMultiple LLM-powered agents working together.
Tool useThe model calling external functions (web search, calculator, code execution).
Function callingStructured tool use with JSON schemas.
ReflectionThe model critiques its own output before finalising.
Deep thinkingHigh-compute reasoning mode (e.g. Claude 4.7 Deep Thinking, OpenAI o-series).
Inference-time computeSpending more compute at query time rather than training time.

Evaluation and safety

  • Benchmark: a standard test (MMLU, HumanEval, MATH).
  • MMLU: Massive Multitask Language Understanding — general knowledge test.
  • HumanEval: coding benchmark from OpenAI.
  • MATH: mathematics benchmark.
  • HELM: Stanford's holistic evaluation framework.
  • Eval: any evaluation; engineers use this constantly.
  • Hallucination: model producing confident, fluent, wrong output.
  • Jailbreak: prompt designed to get the model to bypass its safety training.
  • Red team: people whose job is to find vulnerabilities in models.
  • Alignment: training the model to do what the user actually wants.
  • AGI (Artificial General Intelligence): AI matching human cognitive ability across all domains; ill-defined.
  • ASI (Artificial Super Intelligence): AI exceeding human ability across all domains.

Inference and deployment

TermWhat it means
InferenceRunning a trained model to get outputs.
BatchingProcessing many requests together for cheaper compute.
QuantisationReducing model precision (FP16 → INT8) to make it faster and smaller.
PruningRemoving model weights that contribute little to outputs.
Speculative decodingA smaller model drafts tokens that the larger model verifies; faster.
Continuous batchingAllowing new requests to join an in-flight batch.
StreamingTokens delivered as they generate, not all at once.
Cold startLatency on the first request after the model loaded.
GPU / TPU / LPUDifferent chip types for ML inference (Nvidia / Google / Groq).
VRAMGPU memory; the limiting factor for running large models locally.

Pricing and business

  • Per-token pricing: charged per million tokens of input and output separately.
  • API: Application Programming Interface — the developer-facing way to call models.
  • SDK: Software Development Kit — official client library for an API.
  • Rate limit: maximum calls per minute or per day for a given API key.
  • Cached input pricing: discount when you re-send a prompt the provider has seen recently.
  • Batch API: cheaper async processing for large workloads (50% discount typical).
  • Reserved capacity: committing to spend in exchange for guaranteed throughput.

Compare current per-token pricing across providers on our AI Pricing Tracker.

Terms that became commonplace in 2026

These weren't mainstream in 2024. They are now.

  • Agentic: anything to do with AI agents rather than chat-only models.
  • Inference-time scaling: the realisation that spending compute at query time often beats training a bigger model.
  • Test-time compute: the same idea, different phrase.
  • GenAI: Generative AI; used by enterprise buyers more than builders.
  • AI Overview: Google's AI-generated summary at the top of search results. See our playbook for ranking in AI Overviews.
  • AI search: Perplexity-style query interfaces.
  • Codegen: code generation via AI (Cursor, Copilot, Windsurf).
  • Vibe coding: loosely-specified development where the AI handles the details; sometimes used pejoratively.

How to memorise these

You will not memorise them by reading this page once. Three things that actually work:

  1. Open this page when you hear a new term. Builders do this constantly; it's not cheating.
  2. Use the term yourself. Explain RAG to a colleague today; you'll never forget it.
  3. Read one paper a month. Pick one term from this list each month and find the paper that introduced it.

Frequently asked questions

Which acronym matters most for a non-technical reader?

LLM (Large Language Model), RAG (Retrieval-Augmented Generation), and tokens. Those three carry 70% of all AI conversations in 2026.

What is the difference between fine-tuning and RAG?

Fine-tuning bakes new knowledge into the model's weights at training time. RAG fetches knowledge at query time. They solve different problems and are often used together.

Why are there so many acronyms?

The field moves fast and uses dense academic naming conventions. Every paper introduces a new acronym; most don't stick.

Will this page stay current?

Yes. We update this glossary monthly. The version of any AI model and any specific benchmark cited here may date, but the underlying concepts do not.

Related on Meridian48

The 48° Brief

One email. The week in AI, Pakistan tech, and global business.

Curated by Faizan Khan. No filler. Unsubscribe in one click.

About the author
Faizan Ali Khan
Faizan Khan
Founder & Editor

Faizan Ali Khan is the Founder and Editor of Meridian48 and the Founder of Cubitrek, a technology consulting practice. He writes about AI, the technology business, and the policy shaping both.

More from this author →
AI glossaryAI acronymsLLMRAGML termswhat is

More from Meridian48