SUNDAY, JULY 5, 2026 48° E  /  GLOBAL TECH · SUMMARISED
Dev Tools · 2h ago

vLLM dominates in-VRAM inference but crashes on models that exceed VRAM

By Meridian48 News Desk · Summarised from DEV Community

A benchmark on a single RTX 3090 (24GB) compared llama.cpp, Ollama, and vLLM across five models ranging from 1B to 116.8B parameters. vLLM's throughput scaled 3.9x to 5.4x as concurrency rose from 1 to 8, and it beat llama.cpp by up to 3.7x. However, vLLM crashed on models too large to fit in VRAM, while llama.cpp and Ollama slowed to single-digit tok/s but kept running.

Meridian48 take
The benchmark highlights a key trade-off: vLLM's aggressive VRAM optimization delivers speed but no graceful degradation, making llama.cpp or Ollama better for setups that occasionally exceed GPU memory.
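
As a rough illustration of that trade-off, the sketch below estimates whether a model's weights fit in 24GB and suggests an engine accordingly. It is not taken from the benchmark: the 4-bit quantization, the 2 GiB overhead allowance, the treatment of 24GB as 24 GiB, and the function names are all assumptions chosen for illustration.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gib: float = 24.0, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in VRAM.

    overhead_gib is a hypothetical allowance for KV cache and activations;
    real usage depends on context length, batch size, and the engine.
    """
    return weights_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


def suggest_engine(params_billion: float, bits_per_weight: float = 4.0) -> str:
    """Rule of thumb drawn from the brief: vLLM when the model fits,
    llama.cpp or Ollama when it does not."""
    if fits_in_vram(params_billion, bits_per_weight):
        return "vLLM"
    return "llama.cpp or Ollama"


if __name__ == "__main__":
    # The smallest and largest model sizes mentioned in the brief.
    for size in (1.0, 116.8):
        print(f"{size}B: ~{weights_gib(size, 4.0):.1f} GiB of weights at 4-bit "
              f"-> {suggest_engine(size)}")
```

Under these assumptions, the 1B model fits comfortably, while the 116.8B model needs more than twice the card's memory for weights alone, which is the regime where the brief reports vLLM crashing.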
Read the full reporting
vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM →
DEV Community
llm-inference · benchmark