AI Judges Are Consistent but Wrong, Major Audit Finds
A large-scale audit of over half a million AI judgments reveals that AI judges are reliable (consistent) but not valid (correct). The study shows that consistency, often mistaken for trustworthiness, can be trivially faked. Researchers provide a checklist to sanity-test AI judges before relying on them.
Meridian48 take
The finding that AI judges can be consistent without being correct undermines a key assumption in AI evaluation: that agreement across repeated judgments signals trustworthiness. The paper's actionable checklist makes this more than just a warning.
ai-evaluation · benchmark-validity