AI Evals
The discipline of systematically measuring whether AI features work. According to Shreya Shankar and Hamel Husain (who train teams at OpenAI, Anthropic, Google, and Meta), evals are the hottest new skill for product builders in the AI era — and the single most important factor separating reliable AI products from broken ones.
Why Evals Matter
LLMs are non-deterministic. The same prompt can produce different outputs. A change that improves one query can break another. Without systematic evaluation, you’re shipping AI features on vibes — and your users feel it.
The teams that invest in evals:
- Ship AI features faster (no “is this getting better?” guessing)
- Catch regressions before users do
- Know which prompts, models, and retrieval strategies actually work
- Differentiate on quality, not just model choice
The teams that don’t:
- Ship once, declare victory, then slowly degrade as edge cases accumulate
- Panic when a model update breaks production
- Blame users for “using it wrong” when the product fails
The Process (In Order)
The Shankar/Husain playbook — do these in sequence, not parallel:
1. Error analysis first — before building any infrastructure, manually review 20-50 LLM outputs. Look at what actually happens, not what you hoped would happen.
2. Identify failure patterns — cluster the failures. What's going wrong? Hallucination? Tone? Missed context? Bad retrieval?
3. Write test cases from observed failures — your evals should emerge from reality, not from a hypothetical test matrix.
4. Build an eval suite — only now do you automate. Each test corresponds to a real failure you saw.
5. Run evals on every significant change — treat them like unit tests for AI behavior.
The wrong order: build infrastructure → use it → discover you measured the wrong things → start over.
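The eval suite in step 4 can be sketched in a few lines. This is a minimal illustration under assumed names, not the authors' tooling: `EvalCase`, `fake_generate`, and the two checks are hypothetical stand-ins for failures you actually observed in step 1.

```python
# Minimal eval-suite sketch. Every name here is illustrative:
# `generate` is your real LLM call; each case guards a failure
# you saw during manual error analysis.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str                      # which observed failure this guards
    prompt: str
    check: Callable[[str], bool]   # deterministic assertion on the output

# Each case corresponds to a real failure seen in manual review.
CASES = [
    EvalCase(
        name="no-hallucinated-citations",
        prompt="Summarize the attached memo.",
        check=lambda out: "[citation needed]" not in out.lower(),
    ),
    EvalCase(
        name="stays-in-polite-tone",
        prompt="Tell the user their request was denied.",
        check=lambda out: "unfortunately" in out.lower() or "sorry" in out.lower(),
    ),
]

def run_suite(generate: Callable[[str], str]) -> dict:
    """Run every case; return pass/fail keyed by case name."""
    return {c.name: c.check(generate(c.prompt)) for c in CASES}

# Stub model so the sketch runs standalone; swap in your real call.
def fake_generate(prompt: str) -> str:
    return "Unfortunately, your request was denied."

results = run_suite(fake_generate)
```

The point is the shape, not the checks: each test case carries the name of the real-world failure it guards against, so a red result tells you exactly which old bug came back.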
The Domain Expert Principle
Use ONE domain expert as the quality decision maker. Not a committee. Not a generalist ML engineer. Not a vibes-based consensus.
The person who knows what “good” looks like for your specific domain (legal, medical, creative, technical) is more valuable than any ML PhD when building evals. They spot failure modes that automated metrics miss.
Manual Review Beats Automated Metrics
Counterintuitive finding from Shankar/Husain: 30 minutes of manual output review beats hours of automated metrics. Humans spot patterns that algorithms miss. The right workflow:
- Spend 30 minutes manually reviewing outputs after any significant change
- Write down every failure, weird behavior, or surprise
- Turn those into eval test cases
- Run the automated suite
- Repeat the manual review — because new failures emerge
Automation supports human judgment. It doesn’t replace it.
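The manual-review step is easier to sustain if sampling is automated even though judging is not. A small sketch, assuming your production logs are available as a list of dicts with "prompt" and "output" keys (adapt to your own logging setup):

```python
# Review-sampler sketch: pick a reproducible random sample of recent
# outputs and format them as a sheet a human can annotate. The log
# shape (dicts with "prompt"/"output") is an assumption for this demo.
import random

def sample_for_review(logs, n=25, seed=0):
    """Pick a reproducible random sample of outputs to read by hand."""
    rng = random.Random(seed)
    return rng.sample(logs, min(n, len(logs)))

def to_review_sheet(samples):
    """Format samples as plain text with space for failure notes."""
    lines = []
    for i, s in enumerate(samples, 1):
        lines.append(f"#{i} PROMPT: {s['prompt']}")
        lines.append(f"    OUTPUT: {s['output']}")
        lines.append("    NOTES:  ")  # reviewer writes failures here
    return "\n".join(lines)

# Fake logs so the sketch runs standalone.
logs = [{"prompt": f"q{i}", "output": f"a{i}"} for i in range(100)]
sheet = to_review_sheet(sample_for_review(logs, n=5))
```

The fixed seed matters: two reviewers looking at the same change see the same sample, and the NOTES lines become the raw material for new eval cases.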
Evals vs Feature Flags
Feature flags let you ship safely. Evals let you ship with evidence that the change is actually better.
- Feature flags alone: “We can turn it off if users complain.”
- Feature flags + evals: “We turned this on because it’s 15% better on our eval suite and we caught 3 regressions before shipping.”
The combination is table stakes for AI-native companies in 2026.
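Combining the two can be as simple as gating the flag on a before/after eval comparison. A sketch with made-up scores and thresholds; `should_enable` and the metric names are hypothetical, and in practice the dicts would come from your eval suite:

```python
# Gate a feature flag on eval results: enable only if the candidate
# beats the baseline on average AND regresses on no individual metric.
# Scores and metric names below are illustrative.
def should_enable(baseline: dict, candidate: dict,
                  min_gain: float = 0.0) -> bool:
    """True if candidate improves overall and regresses nothing."""
    regressions = [k for k in baseline if candidate.get(k, 0.0) < baseline[k]]
    avg = lambda d: sum(d.values()) / len(d)
    return not regressions and avg(candidate) - avg(baseline) > min_gain

baseline  = {"faithfulness": 0.80, "tone": 0.90, "retrieval": 0.75}
candidate = {"faithfulness": 0.92, "tone": 0.91, "retrieval": 0.88}
enable = should_enable(baseline, candidate)
```

The no-regressions clause is the part feature flags alone can't give you: an average improvement can hide a broken metric, which is exactly how AI features degrade silently.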
Connection to Other Frameworks
- product-development: Evals are to AI features what tests are to traditional code
- startup-metrics: Traditional metrics measure usage; evals measure quality
- ai-era-entrepreneurship: The “AI changes the how, not the why” principle applies — you still need quality measurement, just adapted for non-deterministic systems
- execution: Shipping AI features without evals is the modern version of shipping code without tests
- Zapier’s code red playbook includes “vibe-check your AI features” — evals are the structured version of the vibe-check
Why Most Teams Skip This
- It feels slow — spending 30 minutes reading outputs when you could be coding seems unproductive
- It requires humility — admitting your AI feature isn’t as good as you thought
- It requires domain expertise — not every team has someone who can judge quality
- It doesn’t have a clear ROI until it prevents a disaster — like insurance
The teams that invest anyway are the ones whose AI products actually work.
Sources
- AI Evals for Engineers & PMs — Shankar & Husain
- Zapier’s AI Code Red — Mike Knoop
- Garry Tan on YC in the AI Era