AI Evals for Engineers & PMs
Authors: Shreya Shankar (UC Berkeley, DocETL) & Hamel Husain
URL: https://hamel.dev/blog/posts/evals-faq/
Summary
The definitive practical guide to evaluating AI products. Shankar and Husain train teams at OpenAI, Anthropic, Google, and Meta on how to build AI products that actually work. Their course has reached 4,500+ professionals from 500+ companies. Core thesis: evals are the new skill that separates reliable AI products from broken ones. Simon Willison: “A robust approach to evals is the single most important distinguishing factor between well-engineered, reliable AI systems and risky development.”
Core Principles
- Evals are a development practice, not a line item — like debugging, always be doing error analysis
- Start with error analysis, not infrastructure — manually review 20-50 LLM outputs whenever you make significant changes
- 30 minutes of manual review beats hours of automated metrics — humans spot patterns machines miss
- Use ONE domain expert as the quality decision maker — not a committee
- Error analysis reveals what to fix — your evals should emerge from observed failures, not hypothetical ones
- Build evals before scaling — you can’t improve what you don’t measure
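The manual-review step above can be made concrete with a small script. This is a hedged sketch, not the authors' tooling: `sample_for_review` and `write_annotation_sheet` are hypothetical helper names, and the trace fields (`id`, `input`, `output`) are assumed. The point is to randomly draw the recommended 20-50 outputs and give the reviewer a free-text notes column for open-ended failure annotations.

```python
import csv
import random

def sample_for_review(traces, n=30, seed=0):
    """Draw a random sample of LLM traces for manual error analysis.

    The guide recommends reviewing 20-50 outputs whenever you make a
    significant change; random sampling avoids cherry-picking easy cases.
    A fixed seed keeps the sample reproducible across reruns.
    """
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))

def write_annotation_sheet(sample, path):
    """Write one row per trace with an empty notes column, so a single
    domain expert can record observed failures in their own words."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "input", "output", "notes"])
        writer.writeheader()
        for t in sample:
            writer.writerow({"id": t["id"], "input": t["input"],
                             "output": t["output"], "notes": ""})
```

A spreadsheet with a notes column is deliberately low-tech: the goal at this stage is pattern-spotting by a human, not metrics.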
Why This Matters
Every AI product faces the same problem: LLMs are non-deterministic and sometimes fail in weird ways. Without systematic evaluation, you're flying blind. Teams that invest in evals ship AI features faster, with fewer regressions, and with genuine confidence in output quality. Teams that skip them end up in "vibes-based development," where nobody knows if the product is getting better or worse.
Key Claims
- Evals are the hottest new skill for AI product builders
- Manual review of 20-50 outputs > any automated metric
- Error analysis → test cases → eval suite (in that order, not the reverse)
- Domain experts are more valuable than generalist ML engineers for eval quality
- The absence of evals is why most AI products feel broken
- Ship AI features behind feature flags + evals, not just flags
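The "error analysis → test cases → eval suite" ordering can be sketched as code. This is an illustrative sketch under assumptions, not the guide's implementation: `run_model` is a stand-in for a real LLM call, and the two failure cases are hypothetical examples of the kind of checks that emerge from manual review.

```python
def run_model(prompt: str) -> str:
    # Placeholder for the actual LLM call; returns a canned answer here
    # so the sketch is runnable without an API key.
    return "Refunds are processed within 5 business days."

# Each case encodes a failure observed during manual error analysis,
# turned into a cheap programmatic check. Evals emerge from observed
# failures, not hypothetical ones.
FAILURE_CASES = [
    {"prompt": "When will I get my refund?",
     "check": lambda out: "business days" in out},  # must state a timeframe
    {"prompt": "Cancel my subscription",
     "check": lambda out: "http" not in out},       # observed: hallucinated links
]

def run_evals():
    """Run every check against the model and return (passed, total)."""
    results = [case["check"](run_model(case["prompt"])) for case in FAILURE_CASES]
    return sum(results), len(results)
```

A suite like this can gate a feature-flag rollout: ship only when the pass count meets a threshold, so the flag and the evals work together as the last claim suggests.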