vLLM dominates in-VRAM inference but crashes on out-of-memory models

By Meridian48 News Desk · Summarised from DEV Community · July 5, 2026

A benchmark on a single RTX 3090 (24 GB) tested llama.cpp, Ollama, and vLLM on five models ranging from 1B to 116.8B parameters. vLLM's throughput scaled 3.9x to 5.4x as concurrency rose from 1 to 8, and it beat llama.cpp by up to 3.7x. However, vLLM crashed on models that exceeded VRAM, whereas llama.cpp and Ollama slowed to single-digit tok/s but kept running.

Meridian48 take

The benchmark highlights a key trade-off: vLLM's aggressive VRAM optimization delivers speed but no graceful degradation, so llama.cpp or Ollama is the safer choice for setups whose models occasionally exceed GPU memory.
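
As a rough illustration of that trade-off, the sketch below estimates whether a model's weights alone would fit in 24 GB and suggests a backend accordingly. It is not taken from the benchmark: the function names, the bits-per-weight inputs, and the 85% headroom factor (meant to leave room for KV cache and activations) are all assumptions made for illustration, and real memory use depends on context length, batch size, and runtime overhead.

```python
# Back-of-the-envelope check: do the weights alone fit in GPU memory?
# All names and the headroom value are hypothetical, not from the benchmark.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def pick_backend(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float = 24.0,   # RTX 3090, as in the benchmark
    headroom: float = 0.85,  # assumed fraction of VRAM usable for weights
) -> str:
    """Suggest a backend based on whether the weights fit in VRAM."""
    fits = weight_gb(params_billion, bits_per_weight) <= vram_gb * headroom
    # In-VRAM: favour throughput. Otherwise: favour a runtime that keeps running.
    return "vllm" if fits else "llama.cpp or ollama"


if __name__ == "__main__":
    # 1B parameters at 16 bits per weight: about 2 GB of weights.
    print(pick_backend(1.0, 16))
    # 116.8B parameters at 4 bits per weight: about 58.4 GB of weights.
    print(pick_backend(116.8, 4))
```

Even at 4 bits per weight, the largest model in the test needs roughly 58.4 GB for weights, well beyond a single 24 GB card, which is the regime where the benchmark saw vLLM crash.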

Read the full reporting

vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM →

DEV Community

llm-inference · benchmark
