Dev Tools · 1h ago
LLM-as-Judge Harness for Agent Eval: Don't Fool Yourself
FamNest built an LLM-as-judge harness to evaluate its coach agent, whose responses are non-deterministic. The harness grades outputs against a rubric, but the judge model is itself an imperfect instrument: it is prone to position bias and verbosity bias, and its scoring can drift. Mitigations include shuffling the order in which responses are presented, pinning the judge model version, and checking the judge against a human-labeled anchor set.
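The three mitigations named above can be sketched roughly as follows. This is a minimal illustration, not FamNest's actual harness: the `Judge` callable signature, the `JUDGE_VERSION` string, and the function names `compare` and `anchor_agreement` are all assumptions made for the example, and the judge is passed in as a plain function so that no particular model API is implied.

```python
import random
from typing import Callable, Sequence

# Hypothetical pinned identifier: pin an exact version, not a floating alias,
# so scores stay comparable across runs.
JUDGE_VERSION = "example-judge-2025-01-01"

# Assumed judge interface: (version, task, first, second) -> "first" | "second".
Judge = Callable[[str, str, str, str], str]


def compare(judge: Judge, task: str, a: str, b: str, rng: random.Random) -> str:
    """Return "a" or "b", presenting the two responses in random order."""
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    verdict = judge(JUDGE_VERSION, task, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def anchor_agreement(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of anchor-set items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

With shuffling in place, a judge that always prefers whichever response it sees first ends up picking "a" and "b" about equally often, so position bias shows up as noise rather than as a systematic win for one candidate. Re-running `anchor_agreement` on the same human-labeled set after any judge change is one way to notice drift.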
Meridian48 take
The piece is a practical reality check for anyone relying on LLM judges for evaluation, but the mitigations are well known; the real challenge is operationalizing them at scale.
Read the full reporting
Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About It) →
DEV Community
llm-evaluation · agent-testing