Evaluation-driven development: how we test LLM apps

"It seems better" is not a test result. Yet that's how most teams ship prompt changes: someone tweaks the system prompt, asks five questions, nods, and merges. Three weeks later a user reports that the assistant now refuses a question it used to answer perfectly — and nobody can say which change broke it, because nothing was ever measured. We've inherited enough of these systems to state our position plainly: LLM quality is test coverage, not vibes. If you can't show a number that moved, you don't know whether your change helped.

Evaluation-driven development is the discipline we apply to every LLM application we take to production. It borrows almost everything from software testing — fixtures, regression suites, CI gates — and adapts the parts that don't transfer, because LLM outputs are probabilistic and "correct" is often a judgment call. Here's how the pieces fit together.

Start with a golden dataset, not a benchmark

Public benchmarks tell you how a model performs on someone else's problem. A golden dataset tells you how your system performs on yours. We build one for every engagement before we tune a single prompt, and the recipe is consistent:

Source from real users. Pull actual questions from support tickets, search logs, chat transcripts and pilot sessions — not questions the build team invents. Invented questions are systematically easier.
100–300 examples. Below 100, scores are too noisy to gate a release on. Above 300, the marginal signal rarely justifies the labeling cost for a single workflow.
Verified answers, not plausible ones. Each question gets a reference answer signed off by a domain expert, plus the source documents that support it.
Include the ugly cases. Ambiguous questions, questions with no answer in the corpus, questions that should be refused. A golden set of only happy paths certifies nothing.
Refresh quarterly. Products change, policies change, users ask new things. A stale golden set drifts into measuring last year's problem.
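
To make the recipe concrete, here is a minimal sketch of a golden-set record and a check that enforces the rules above. The field names (`origin`, `expected_behavior`, `verified_by`) and the helper functions are illustrative assumptions, not a schema from any particular tool. The last function gives one way to see why small sets are noisy: it computes the binomial standard error of a pass rate, assuming the examples are independent.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    question: str                      # taken verbatim from a real user channel
    reference_answer: str              # signed off by a domain expert
    source_docs: list[str] = field(default_factory=list)  # documents supporting the answer
    origin: str = "support_ticket"     # hypothetical tag: where the question came from
    expected_behavior: str = "answer"  # "answer", "refuse", or "no_answer_in_corpus"
    verified_by: str = ""              # expert who approved the reference answer


def validate_golden_set(examples, min_size=100, max_size=300):
    """Return a list of human-readable problems; an empty list means the set passes."""
    problems = []
    if not min_size <= len(examples) <= max_size:
        problems.append(f"size {len(examples)} outside {min_size}-{max_size}")
    if not any(e.expected_behavior != "answer" for e in examples):
        problems.append("no ugly cases (refusals or unanswerable questions)")
    for i, e in enumerate(examples):
        if not e.verified_by:
            problems.append(f"example {i} has no expert sign-off")
        if e.expected_behavior == "answer" and not e.source_docs:
            problems.append(f"example {i} has no supporting source documents")
    return problems


def pass_rate_standard_error(pass_rate, n):
    """Binomial standard error of a pass rate measured on n independent examples."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n)
```

At an 80% pass rate, 100 examples give a standard error of 4 percentage points; 25 examples give 8. A two-point improvement is indistinguishable from noise at either size, which is why the gate thresholds need to be set with the set size in mind.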

Wire the regression suite into CI

A prompt change is a code change. It alters system behavior for every user, so it gets the same treatment: a pull request, a review, and an automated run against the golden set before merge. Our pipelines score every candidate change — prompt edits, retrieval parameter tweaks, chunking changes, model upgrades — and block the merge if any tracked metric drops beyond a set threshold. Engineers grumble for about a week, then discover the freedom this buys: you can refactor a 900-token prompt aggressively because the suite will catch what breaks. Model upgrades stop being a leap of faith and become a diff you can read. On one financial-services copilot, the CI gate caught a "minor wording improvement" that silently doubled the refusal rate on legitimate compliance questions — before a single user saw it.

If a prompt change can break production, it deserves the same gate as a code change.
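
The gate itself can be very small. The sketch below assumes every tracked metric is a higher-is-better score between 0 and 1, and that the baseline (main branch) and candidate (pull request) scores have already been computed against the golden set. The function names and the per-metric `max_drop` dictionary are hypothetical, chosen for illustration.

```python
def find_regressions(baseline, candidate, max_drop):
    """Return {metric: (baseline, candidate)} for metrics that fell more than allowed."""
    regressions = {}
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            regressions[metric] = (baseline[metric], candidate[metric])
    return regressions


def gate(baseline, candidate, max_drop):
    """Print any regressions and return a process exit code (0 = merge allowed)."""
    regressions = find_regressions(baseline, candidate, max_drop)
    for metric, (before, after) in sorted(regressions.items()):
        print(f"REGRESSION {metric}: {before:.3f} -> {after:.3f}")
    return 1 if regressions else 0
```

A CI step would call `sys.exit(gate(baseline, candidate, max_drop))` so that a non-zero exit code blocks the merge. Keeping the threshold per metric lets a team tolerate a small dip in one score while holding another, such as refusal correctness, to a much tighter limit.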

LLM-as-judge — with humans keeping the judge honest

You can't hand-grade 300 outputs on every commit, so a second model does the scoring. But "LLM-as-judge" fails quietly when the judge gets a vague instruction like "rate this answer 1–10." We write rubrics instead: explicit, criterion-by-criterion definitions of what a pass looks like. Does the answer assert anything the retrieved sources don't support? Does it address the question actually asked? Does it cite the right document? Each criterion gets a binary or three-level scale, because judges are far more reliable on "yes/no per criterion" than on holistic scores.
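
As a rough illustration, a rubric can be stored as data and turned into a judge prompt with a strict reply format. Everything here is an assumption made for the sketch: the `Criterion` class, the criterion keys, the prompt wording, and the `key: yes` reply format. Each criterion is phrased so that "yes" always means pass, and the call to the judge model is deliberately left out because it depends on the provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str
    question: str  # phrased so that "yes" always means pass


RUBRIC = [
    Criterion("grounded", "Is every claim in the answer supported by the retrieved sources?"),
    Criterion("on_topic", "Does the answer address the question that was actually asked?"),
    Criterion("citation", "Does the answer cite the correct document?"),
]


def build_judge_prompt(question, answer, sources, rubric=RUBRIC):
    """Assemble one prompt asking for a yes/no verdict per criterion."""
    lines = [
        "Grade the answer against each criterion. Reply with one line per",
        "criterion in the form `key: yes` or `key: no`, and nothing else.",
        "",
        f"Question: {question}",
        f"Answer: {answer}",
        "Sources:",
    ]
    lines += [f"- {s}" for s in sources]
    lines.append("Criteria:")
    lines += [f"{c.key}: {c.question}" for c in rubric]
    return "\n".join(lines)


def parse_verdicts(judge_reply, rubric=RUBRIC):
    """Parse the judge's reply; a missing or malformed criterion counts as a fail."""
    verdicts = {c.key: False for c in rubric}
    for line in judge_reply.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in verdicts:
            verdicts[key] = value.strip().lower() == "yes"
    return verdicts
```

Defaulting every criterion to fail means a judge that rambles or skips a criterion lowers the score instead of silently inflating it.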

Then we calibrate. Every two weeks, domain experts re-grade a random sample of 30–50 judge-scored outputs blind. We measure agreement, and when the judge and the humans diverge, we fix the rubric — or accept that this criterion needs human review permanently. A judge nobody audits is just vibes with extra steps. The calibration loop is also where stakeholders learn to trust the dashboard: when the head of operations has personally graded samples and watched the judge agree with her team, the weekly quality number stops being "AI grading AI" and becomes evidence.
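
The text above does not say which agreement statistic is used, so the sketch below shows two common choices as an assumption: raw percent agreement, and Cohen's kappa, which corrects for the agreement two graders would reach by chance. Both functions take parallel lists of labels for the same outputs, one from the judge and one from the human graders.

```python
from collections import Counter


def percent_agreement(judge, human):
    """Fraction of items on which the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = percent_agreement(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Chance agreement: probability both graders pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa matters most on lopsided criteria. If 95% of answers pass a criterion, a judge that always says "pass" scores 95% raw agreement while telling you nothing, and kappa exposes that.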

Track metrics that map to failure modes

We keep the metric set small and tie each one to a way the system can actually hurt you:

Faithfulness. Is every claim in the answer grounded in the retrieved context? This is your hallucination alarm.
Answer relevance. Did it answer the question asked, or a nearby easier one?
Retrieval precision. Of the chunks retrieved, how many were actually useful? When faithfulness drops, this is usually where the rot started.
Refusal correctness. Does the system decline what it should decline, and only that? Over-refusal kills adoption as surely as hallucination kills trust.

Four numbers, each diagnosable, each ownable. Teams that track fifteen metrics tend to act on none; these four are the ones our AI engineering practice wires into every production dashboard.
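
Two of the four metrics can be computed directly from labels, without a judge model. The sketch below assumes each golden example records which chunk IDs a labeler marked as useful and whether the system should have refused. The function names and the split into over- and under-refusal rates are illustrative choices, not taken from a specific library.

```python
def retrieval_precision(retrieved_ids, useful_ids):
    """Fraction of retrieved chunks that a labeler marked as useful."""
    if not retrieved_ids:
        return 0.0
    useful = set(useful_ids)
    return sum(1 for chunk in retrieved_ids if chunk in useful) / len(retrieved_ids)


def refusal_correctness(cases):
    """Score (should_refuse, did_refuse) boolean pairs, one pair per example."""
    cases = list(cases)
    if not cases:
        raise ValueError("no cases to score")
    n = len(cases)
    correct = sum(1 for should, did in cases if should == did)
    over = sum(1 for should, did in cases if did and not should)   # refused a legitimate question
    under = sum(1 for should, did in cases if should and not did)  # answered what it should decline
    return {
        "correctness": correct / n,
        "over_refusal_rate": over / n,
        "under_refusal_rate": under / n,
    }
```

Reporting over-refusal and under-refusal separately keeps the metric diagnosable: the two failures have different causes and usually different owners.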

Key takeaways

Build a golden dataset of 100–300 real user questions with expert-verified answers, and refresh it quarterly.
Treat prompt changes like code changes: PR, review, and an automated regression run that can block the merge.
Use LLM-as-judge with explicit rubrics, and calibrate it against blind human grading every two weeks.
Track four metrics — faithfulness, answer relevance, retrieval precision, refusal correctness — and tie each to an owner.
Red-team continuously for jailbreaks and PII leaks; adversarial prompts belong in the regression suite, not in a one-off audit.

Red-team like an attacker, regress like an engineer

The golden set measures whether the system serves honest users. Red-teaming measures whether it resists dishonest ones. We maintain a separate adversarial suite — jailbreak attempts, prompt injections embedded in retrieved documents, social-engineering phrasings designed to extract PII or system prompts — and we run it with the same CI discipline as everything else. Every successful attack becomes a permanent regression case, so the system can never re-open a hole it already closed. On a recent deployment handling customer records, the adversarial suite grew from 40 to over 160 cases in six months, and the leak rate in testing fell to zero before launch — a number we could show the security review board instead of asking them to take our word.
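
A minimal sketch of such a suite follows, assuming leaks are detected by planting known strings (canary tokens, seeded fake PII, fragments of the system prompt) and checking whether they appear in the output. The `AttackCase` fields and helper names are hypothetical, and `generate` stands in for whatever function calls the system under test. Substring matching is a crude detector that misses paraphrased leaks, so treat it as a floor, not a complete check.

```python
from dataclasses import dataclass


@dataclass
class AttackCase:
    case_id: str
    prompt: str           # the adversarial input, or a document carrying an injection
    forbidden: list[str]  # strings that must never appear in the output


def leaked(output, case):
    """True if any forbidden string shows up in the output (case-insensitive)."""
    text = output.lower()
    return any(s.lower() in text for s in case.forbidden)


def run_suite(cases, generate):
    """Run every case through `generate` and return the IDs of cases that leaked."""
    return [c.case_id for c in cases if leaked(generate(c.prompt), c)]


def add_regression_case(cases, new_case):
    """Append a newly discovered attack unless its ID is already in the suite."""
    if all(c.case_id != new_case.case_id for c in cases):
        cases.append(new_case)
    return cases
```

In CI, a non-empty result from `run_suite` would fail the build, and each attack found by a red-teamer goes through `add_regression_case` so it is replayed on every later change.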

None of this is exotic. It's the same engineering hygiene your software teams already practice, pointed at a new kind of system. If your LLM app is shipping on vibes today, talk to us — standing up the evaluation harness is a weeks-long job, not a quarters-long one, and it changes everything that comes after.
