AI Evals
The discipline of systematically measuring whether AI features work. According to Shreya Shankar and Hamel Husain (who train teams at OpenAI, Anthropic, Google, and Meta), evals are the hottest new skill for product builders in the AI era — and the single most important factor separating reliable AI products from broken ones.
Why Evals Matter
LLMs are non-deterministic. The same prompt can produce different outputs. A change that improves one query can break another. Without systematic evaluation, you’re shipping AI features on vibes — and your users feel it.
The teams that invest in evals:
- Ship AI features faster (no “is this getting better?” guessing)
- Catch regressions before users do
- Know which prompts, models, and retrieval strategies actually work
- Differentiate on quality, not just model choice
The teams that don’t:
- Ship once, declare victory, then slowly degrade as edge cases accumulate
- Panic when a model update breaks production
- Blame users for “using it wrong” when the product fails
The Process (In Order)
The Shankar/Husain playbook — do these in sequence, not parallel:
1. Error analysis first — before building any infrastructure, manually review 20-50 LLM outputs. Look at what actually happens, not what you hoped would happen.
2. Identify failure patterns — cluster the failures. What's going wrong? Hallucination? Tone? Missed context? Bad retrieval?
3. Write test cases from observed failures — your evals should emerge from reality, not from a hypothetical test matrix.
4. Build an eval suite — only now do you automate. Each test corresponds to a real failure you saw.
5. Run evals on every significant change — treat them like unit tests for AI behavior.
The wrong order: build infrastructure → use it → discover you measured the wrong things → start over.
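The eval suite in step 4 can be sketched in a few lines. This is a minimal illustration under assumed names, not the authors' tooling: `EvalCase`, `fake_generate`, and the two checks are hypothetical stand-ins for failures you actually observed in step 1.

```python
# Minimal eval-suite sketch. Every name here is illustrative:
# `generate` is your real LLM call; each case guards a failure
# you saw during manual error analysis.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str                      # which observed failure this guards
    prompt: str
    check: Callable[[str], bool]   # deterministic assertion on the output

# Each case corresponds to a real failure seen in manual review.
CASES = [
    EvalCase(
        name="no-hallucinated-citations",
        prompt="Summarize the attached memo.",
        check=lambda out: "[citation needed]" not in out.lower(),
    ),
    EvalCase(
        name="stays-in-polite-tone",
        prompt="Tell the user their request was denied.",
        check=lambda out: "unfortunately" in out.lower() or "sorry" in out.lower(),
    ),
]

def run_suite(generate: Callable[[str], str]) -> dict:
    """Run every case; return pass/fail keyed by case name."""
    return {c.name: c.check(generate(c.prompt)) for c in CASES}

# Stub model so the sketch runs standalone; swap in your real call.
def fake_generate(prompt: str) -> str:
    return "Unfortunately, your request was denied."

results = run_suite(fake_generate)
```

The point is the shape, not the checks: each test case carries the name of the real-world failure it guards against, so a red result tells you exactly which old bug came back.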
The Domain Expert Principle
Use ONE domain expert as the quality decision maker. Not a committee. Not a generalist ML engineer. Not a vibes-based consensus.
The person who knows what “good” looks like for your specific domain (legal, medical, creative, technical) is more valuable than any ML PhD when building evals. They spot failure modes that automated metrics miss.
Manual Review Beats Automated Metrics
Counterintuitive finding from Shankar/Husain: 30 minutes of manual output review beats hours of automated metrics. Humans spot patterns that algorithms miss. The right workflow:
- Spend 30 minutes manually reviewing outputs after any significant change
- Write down every failure, weird behavior, or surprise
- Turn those into eval test cases
- Run the automated suite
- Repeat the manual review — because new failures emerge
Automation supports human judgment. It doesn’t replace it.
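The manual-review step is easier to sustain if sampling is automated even though judging is not. A small sketch, assuming your production logs are available as a list of dicts with "prompt" and "output" keys (adapt to your own logging setup):

```python
# Review-sampler sketch: pick a reproducible random sample of recent
# outputs and format them as a sheet a human can annotate. The log
# shape (dicts with "prompt"/"output") is an assumption for this demo.
import random

def sample_for_review(logs, n=25, seed=0):
    """Pick a reproducible random sample of outputs to read by hand."""
    rng = random.Random(seed)
    return rng.sample(logs, min(n, len(logs)))

def to_review_sheet(samples):
    """Format samples as plain text with space for failure notes."""
    lines = []
    for i, s in enumerate(samples, 1):
        lines.append(f"#{i} PROMPT: {s['prompt']}")
        lines.append(f"    OUTPUT: {s['output']}")
        lines.append("    NOTES:  ")  # reviewer writes failures here
    return "\n".join(lines)

# Fake logs so the sketch runs standalone.
logs = [{"prompt": f"q{i}", "output": f"a{i}"} for i in range(100)]
sheet = to_review_sheet(sample_for_review(logs, n=5))
```

The fixed seed matters: two reviewers looking at the same change see the same sample, and the NOTES lines become the raw material for new eval cases.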
Evals vs Feature Flags
Feature flags let you ship safely. Evals let you ship with evidence that the change is actually better.
- Feature flags alone: “We can turn it off if users complain.”
- Feature flags + evals: “We turned this on because it’s 15% better on our eval suite and we caught 3 regressions before shipping.”
The combination is table stakes for AI-native companies in 2026.
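Combining the two can be as simple as gating the flag on a before/after eval comparison. A sketch with made-up scores and thresholds; `should_enable` and the metric names are hypothetical, and in practice the dicts would come from your eval suite:

```python
# Gate a feature flag on eval results: enable only if the candidate
# beats the baseline on average AND regresses on no individual metric.
# Scores and metric names below are illustrative.
def should_enable(baseline: dict, candidate: dict,
                  min_gain: float = 0.0) -> bool:
    """True if candidate improves overall and regresses nothing."""
    regressions = [k for k in baseline if candidate.get(k, 0.0) < baseline[k]]
    avg = lambda d: sum(d.values()) / len(d)
    return not regressions and avg(candidate) - avg(baseline) > min_gain

baseline  = {"faithfulness": 0.80, "tone": 0.90, "retrieval": 0.75}
candidate = {"faithfulness": 0.92, "tone": 0.91, "retrieval": 0.88}
enable = should_enable(baseline, candidate)
```

The no-regressions clause is the part feature flags alone can't give you: an average improvement can hide a broken metric, which is exactly how AI features degrade silently.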
Connection to Other Frameworks
- product-development: Evals are to AI features what tests are to traditional code
- startup-metrics: Traditional metrics measure usage; evals measure quality
- ai-era-entrepreneurship: The “AI changes the how, not the why” principle applies — you still need quality measurement, just adapted for non-deterministic systems
- execution: Shipping AI features without evals is the modern version of shipping code without tests
- Zapier’s code red playbook includes “vibe-check your AI features” — evals are the structured version of the vibe-check
Why Most Teams Skip This
- It feels slow — spending 30 minutes reading outputs when you could be coding seems unproductive
- It requires humility — admitting your AI feature isn’t as good as you thought
- It requires domain expertise — not every team has someone who can judge quality
- It doesn’t have a clear ROI until it prevents a disaster — like insurance
The teams that invest anyway are the ones whose AI products actually work.
Sources
- AI Evals for Engineers & PMs — Shankar & Husain
- Zapier’s AI Code Red — Mike Knoop
- Garry Tan on YC in the AI Era