Dev Tools · 2h ago
vLLM dominates in-VRAM inference but crashes when models exceed VRAM
A benchmark on a single RTX 3090 (24GB) tested llama.cpp, Ollama, and vLLM across five models ranging from 1B to 116.8B parameters. vLLM's throughput scaled 3.9x to 5.4x as concurrency rose from 1 to 8, and it beat llama.cpp by up to 3.7x. However, vLLM crashed on models that exceeded VRAM, while llama.cpp and Ollama slowed to single-digit tok/s but kept running.
Meridian48 take
The benchmark highlights a key trade-off: vLLM's aggressive VRAM optimization delivers speed but no graceful degradation, making llama.cpp or Ollama better for setups that occasionally exceed GPU memory.
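One way to act on that trade-off is a rough pre-flight check of whether a model's weights are likely to fit in VRAM before picking an engine. The sketch below is illustrative only: the function names, the `bits_per_weight` values, and the 2 GB headroom are assumptions rather than figures from the benchmark, and the estimate covers weights alone, ignoring KV cache and runtime overhead.

```python
def estimated_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weight-only footprint in GB (10^9 bytes). Ignores KV cache and overhead."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float = 24.0, headroom_gb: float = 2.0) -> bool:
    """True if the estimated weights plus an assumed headroom fit in VRAM."""
    needed_gb = estimated_weight_gb(params_billions, bits_per_weight) + headroom_gb
    return needed_gb <= vram_gb


def suggest_engine(params_billions: float, bits_per_weight: float,
                   vram_gb: float = 24.0) -> str:
    """Hypothetical routing rule based on the card's takeaway, not on the benchmark's code."""
    if fits_in_vram(params_billions, bits_per_weight, vram_gb):
        return "vLLM"
    return "llama.cpp or Ollama"


if __name__ == "__main__":
    # Assumed quantization levels; the summary does not state which were used.
    for params, bits in [(1.0, 16), (116.8, 4)]:
        size = estimated_weight_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{size:.1f} GB -> {suggest_engine(params, bits)}")
```

Under these assumptions, even a 4-bit quantization of the 116.8B model needs well over twice the 3090's 24GB for weights alone, which is the regime where the summary reports vLLM crashing.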
Read the full reporting
vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM →
DEV Community
llm-inference · benchmark