EVALSRAGPRODUCTION

How I evaluate GenAI — the metrics that actually matter

5 June 2026 · 5 min read · Daniyal Malik

"How do you know it's working?" If a team can't answer that, they don't have a product — they have a demo that happened to work the day they showed it. Evaluation is what turns "it seemed fine" into "it's measurably good, and we'll know the moment it isn't."

Vibes don't survive production

Here are the metrics that actually matter for a RAG or agent system:

Faithfulness — is the answer grounded in the retrieved context, or did the model invent it? This is your hallucination meter.
Context precision / recall — did retrieval surface the right chunks (precision) and all the needed ones (recall)? Most "the model is dumb" complaints are recall failures in disguise.
Answer relevancy — does the response actually address the question that was asked?

Notice three of those measure retrieval, not the model. That's the point: in production RAG, the model is rarely the bottleneck.

Ground truth is the unglamorous prerequisite

You can't measure any of this without a golden set — real questions with known-good answers and known source passages. Building it is tedious, and it's the highest-leverage thing most teams skip. At Preamble, the eval set is the spec.

Evals as a gate, not a report

Run the golden set on every change — prompt, chunking, model, retrieval config. A tweak that improves one query and quietly breaks five gets caught before it ships. This is eval-driven development: the same discipline as TDD, applied to a non-deterministic system.

Human-in-the-loop where it counts

Automated metrics catch regressions; humans catch what the metrics miss. For high-stakes outputs, a review step isn't a failure of automation — it's the design. The craft is putting the human exactly where the cost of being wrong is highest, and nowhere else.

The bar

"Good" is not a feeling. It's a number on a golden set that you watch on every change. Get that in place and everything else — model choice, chunking, whether to fine-tune — becomes an experiment you can actually run instead of an argument you keep having.

If you're trying to put a real quality bar around an AI feature, I can help you build the harness.

← all field notes Ask my AI twin about this →