Production RAG — what the tutorials don't show you

8 June 2026 · 7 min read · Daniyal Malik

Every RAG tutorial ends at the same place: chunk the docs, embed them, stuff the top-k into a prompt, ship. It demos beautifully. Then you put it in front of a real user with a real document and a real consequence, and it falls apart — quietly, by being confidently wrong.

I found that gap while building Preamble, where an agentic pipeline reads 50–150-page Australian construction tenders and produces a priced, cited quote draft. When a number in that draft is wrong, someone bids wrong and loses money. "Mostly right" is a failure. Here's what the tutorials skip.

Fixed-size chunking shreds meaning

Splitting every 500 tokens is the default, and it's the first thing I throw out. It cuts tables in half, separates a clause from its heading, and orphans the one sentence that gave a number its unit.

I parse layout first — structure-aware extraction that keeps tables, sections, and headings intact — then chunk along those semantic boundaries with a little overlap. A chunk should be a thing a human would point at, not an arbitrary window. Every chunk carries metadata: page number, section, table id. That metadata is what makes the next two steps possible.
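Here is a minimal sketch of chunking along structural boundaries, assuming a layout parser has already turned the document into a list of typed blocks. The `Block` and `Chunk` types, the `max_chars` budget, and the one-paragraph overlap are hypothetical names and choices for illustration, not Preamble's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """One layout element from a (hypothetical) structure-aware parser."""
    kind: str                      # "heading", "paragraph", or "table"
    text: str
    page: int
    table_id: Optional[str] = None


@dataclass
class Chunk:
    text: str
    section: str                   # heading the chunk sits under
    pages: list                    # every page the chunk touches
    table_ids: list = field(default_factory=list)


def chunk_blocks(blocks, max_chars=1200):
    """Group blocks into chunks without splitting a table or crossing a heading."""
    chunks, current, section = [], [], ""

    def flush():
        # Skip empty buffers and buffers that hold only a heading.
        if any(b.kind != "heading" for b in current):
            chunks.append(Chunk(
                text="\n".join(b.text for b in current),
                section=section,
                pages=sorted({b.page for b in current}),
                table_ids=[b.table_id for b in current if b.table_id],
            ))

    for block in blocks:
        if block.kind == "heading":
            flush()                # a new section always starts a new chunk
            section = block.text
            current = [block]      # keep the heading attached to its clause
            continue
        size = sum(len(b.text) for b in current) + len(block.text)
        if size > max_chars and any(b.kind != "heading" for b in current):
            flush()
            last = current[-1]
            # Overlap: carry the previous paragraph forward, never a table.
            current = [last] if last.kind == "paragraph" else []
        current.append(block)
    flush()
    return chunks


blocks = [
    Block("heading", "1. Scope", 1),
    Block("paragraph", "Supply and install.", 1),
    Block("table", "Item | Qty\nPipe | 40 m", 2, table_id="t1"),
    Block("heading", "2. Rates", 3),
    Block("paragraph", "Rates exclude GST.", 3),
]
chunks = chunk_blocks(blocks)
```

A table is always one block, so it lands in a chunk whole, and the page, section, and table id travel with the text into retrieval and verification.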

Retrieval is hybrid, then reranked

Pure vector search finds meaning but misses exact terms — a part code, a standard number, a clause reference. Pure keyword search finds the term but misses the paraphrase. Production needs both: dense vectors (pgvector) for semantics, sparse/keyword for precision, fused and then reranked so the model sees the five chunks that actually matter, not the fifty that are vaguely related.

Top-k is a knob, not a constant. Tune it against your evals, not your vibes.
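One common way to fuse a dense ranking with a keyword ranking is reciprocal rank fusion. The sketch below assumes each retriever returns an ordered list of chunk ids. RRF is my choice for the example, not a claim about which fusion method any particular system uses, and the constant `k=60` and the chunk ids are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of chunk ids into one.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties break on chunk id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_k]


dense = ["c7", "c2", "c9", "c4"]    # e.g. from a vector similarity query
sparse = ["c2", "c5", "c7", "c1"]   # e.g. from a keyword index
fused = reciprocal_rank_fusion([dense, sparse], top_k=3)
print(fused)  # ['c2', 'c7', 'c5']
```

Chunks that both retrievers agree on rise to the top. The fused list is what you would hand to a reranker, and `top_k` is the knob to sweep against the eval set.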

The step everyone skips: verification

This is the difference between a demo and a system. Retrieval gives you candidate context. It does not give you a correct answer. So after generation, I run a verification pass that checks each claim in the output against the spans it was supposed to come from. If a quote line can't be traced to a source page, it doesn't ship — it gets flagged, not guessed.
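As a deliberately narrow sketch of that pass, the code below checks only one kind of claim: that a quoted dollar amount literally appears on the page it cites. The `QuoteLine` shape, the `pages` dict, and the exact-match rule are assumptions for illustration. A real verifier would cover units, quantities, and paraphrased claims too.

```python
import re
from dataclasses import dataclass


@dataclass
class QuoteLine:
    description: str
    amount: str        # as written in the draft, e.g. "12,500.00"
    source_page: int   # page the model cited for this line


def normalise_number(text):
    """Drop currency symbols, commas and spaces: '$12,500.00' -> '12500.00'."""
    return re.sub(r"[^\d.]", "", text)


def verify_line(line, pages):
    """True only if the cited page exists and contains the exact amount."""
    page_text = pages.get(line.source_page)
    if page_text is None:
        return False
    found = {
        normalise_number(match)
        for match in re.findall(r"\$?\d[\d,]*(?:\.\d+)?", page_text)
    }
    return normalise_number(line.amount) in found


def verify_draft(lines, pages):
    """Split a draft into lines that can ship and lines that need a human."""
    shipped, flagged = [], []
    for line in lines:
        (shipped if verify_line(line, pages) else flagged).append(line)
    return shipped, flagged


pages = {
    12: "Supply of 40 m DN150 pipe at $12,500.00 lump sum.",
    13: "Excavation allowance 3,200.00",
}
draft = [
    QuoteLine("Pipe supply", "12,500.00", 12),      # traceable
    QuoteLine("Excavation", "3,400.00", 13),        # amount not on the page
    QuoteLine("Traffic control", "900.00", 99),     # cited page does not exist
]
shipped, flagged = verify_draft(draft, pages)
```

The check is strict on purpose: a line that fails goes to `flagged` for review instead of being repaired by another model call.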

Hallucination is a systems problem, not a prompt problem. You don't prompt your way out of it. You build a layer that makes it structurally impossible for "confidently wrong" to reach the user.

Evals are the product, not the afterthought

You cannot improve what you cannot measure, and "it looks good" is not a measurement. I keep a golden set — real documents with known-correct answers — and every prompt change, model swap, or chunking tweak runs against it before it merges. A "small improvement" that quietly regresses citation accuracy is the most expensive bug you'll ship, because you won't see it until a customer does.
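A golden-set gate can be very small. The sketch below assumes the pipeline can be called as a function that returns an answer and a cited page. `GOLDEN`, `candidate`, and the `baseline` threshold are made-up stand-ins, and the values in them are illustrative, not real tender data.

```python
GOLDEN = [
    {"question": "pipe supply price", "answer": "12500.00", "page": 12},
    {"question": "excavation allowance", "answer": "3200.00", "page": 13},
    {"question": "defects liability period", "answer": "12 months", "page": 31},
    {"question": "retention percentage", "answer": "5%", "page": 8},
]


def citation_accuracy(golden, pipeline):
    """Fraction of cases where the answer AND the cited page both match."""
    correct = 0
    for case in golden:
        answer, page = pipeline(case["question"])
        if answer == case["answer"] and page == case["page"]:
            correct += 1
    return correct / len(golden)


def gate(golden, pipeline, baseline):
    """Return (passed, score); fail when the score drops below the baseline."""
    score = citation_accuracy(golden, pipeline)
    return score >= baseline, score


def candidate(question):
    """Stand-in for a changed pipeline: right answers, one wrong citation."""
    canned = {
        "pipe supply price": ("12500.00", 12),
        "excavation allowance": ("3200.00", 14),   # correct value, wrong page
        "defects liability period": ("12 months", 31),
        "retention percentage": ("5%", 8),
    }
    return canned[question]


passed, score = gate(GOLDEN, candidate, baseline=1.0)
print(passed, score)  # False 0.75
```

The answer-only score for `candidate` would be perfect. Scoring the citation alongside the answer is what exposes the regression before it merges.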

Observability, because production drifts

Models change. Documents change. Traffic changes. Every answer at Preamble is traced — inputs, retrieved context, tokens, latency, cost — so any output a user got can be replayed and explained. When something looks off, "I can't reproduce it" is not an answer I accept from a system I built.
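A trace only helps if it holds enough to replay the call. The sketch below assumes one JSON line per answer written to any file-like sink. The `Trace` fields mirror the list above, but the field names, the model name, and the JSONL format are my assumptions, not a description of Preamble's tracing stack.

```python
import io
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    question: str
    model: str
    chunk_ids: list          # retrieved context, by chunk id
    answer: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def record(trace, sink):
    """Append one trace as a JSON line and return its id."""
    sink.write(json.dumps(asdict(trace)) + "\n")
    return trace.trace_id


def find(sink, trace_id):
    """Scan the log for a trace id; return the Trace or None."""
    sink.seek(0)
    for line in sink:
        row = json.loads(line)
        if row["trace_id"] == trace_id:
            return Trace(**row)
    return None


log = io.StringIO()          # stands in for a file or log store
tid = record(
    Trace(
        question="pipe supply price",
        model="example-model",        # hypothetical model name
        chunk_ids=["c2", "c7"],
        answer="12500.00",
        prompt_tokens=1840,
        completion_tokens=96,
        latency_ms=2310.5,
        cost_usd=0.0124,
    ),
    log,
)
replayed = find(log, tid)
```

With the chunk ids and model name stored next to the answer, the same context can be fed back through the pipeline to reproduce what the user saw.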

The shape of it

Tutorials teach the happy path: retrieve and generate. Production is the unglamorous scaffolding around it — structure-aware chunking, hybrid retrieval, verification, evals, observability — that turns a clever demo into something you'd stake a customer's bid on.

That scaffolding is the work. It's also most of why "we added RAG" projects stall at the prototype. If yours has hit that ceiling, that's exactly the conversation I have on a scoping call.
