AI · 2h ago
Local LLM vs Claude: Qwen3-Coder Scores 22.8 vs 89.4 in Real Agent Test
A developer benchmarked qwen3-coder:30b against Claude on 27 real tasks from a production LangGraph agent with ~90 tools. Claude scored 89.4/100 while qwen3-coder scored 22.8/100, though the local model was ~5,150x cheaper per task ($0.00015 vs $0.763). The local model also leaked malformed tool calls into 26% of its answers, and the tools it called overlapped with the tools a task actually needed only 14.8% of the time.
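As a rough check on the cost figures, the short Python sketch below recomputes the per-task cost ratio and the total spend across the 27 tasks from the rounded numbers quoted above. The constant and function names are illustrative and do not come from the original benchmark. With these rounded inputs the ratio comes out near 5,087x, slightly below the quoted ~5,150x; the difference presumably reflects rounding in the published per-task costs, which is an assumption rather than something the source states.

```python
# Per-task costs in USD as quoted in the summary (rounded figures).
CLAUDE_COST_PER_TASK = 0.763
QWEN_COST_PER_TASK = 0.00015
NUM_TASKS = 27  # number of benchmark tasks quoted in the summary


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more expensive the first cost is than the second."""
    return expensive / cheap


def total_cost(per_task: float, tasks: int) -> float:
    """Return the total spend for running the given number of tasks."""
    return per_task * tasks


ratio = cost_ratio(CLAUDE_COST_PER_TASK, QWEN_COST_PER_TASK)
claude_total = total_cost(CLAUDE_COST_PER_TASK, NUM_TASKS)
qwen_total = total_cost(QWEN_COST_PER_TASK, NUM_TASKS)

print(f"Cost ratio (Claude / qwen3-coder): {ratio:,.0f}x")
print(f"Total for {NUM_TASKS} tasks: Claude ${claude_total:.2f}, qwen3-coder ${qwen_total:.5f}")
```

Either way the gap is more than three orders of magnitude, so the exact multiplier matters less than the fact that the cheaper model's answers scored 22.8/100.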
Meridian48 take
The 66.6-point quality gap shows that local models still struggle with large, complex tool-use surfaces, but the cost difference keeps affordable local agents plausible for simpler tasks.
Read the full reporting
Local LLM vs Claude: Benchmarking qwen3-coder:30b as a Production Agent Backend →
DEV Community
local-llm · agent-benchmarking