Here's the most important thing we've learned shipping retrieval-augmented generation into production: RAG fails when retrieval is bad, not when the model is bad. Teams spend weeks tweaking prompts and swapping models when the real problem is that the right context never made it into the window. Fix retrieval first, and the model usually takes care of itself.
This is a tour of the patterns that have actually held up for us, and the failure modes they prevent.
Invest in retrieval before generation
If the model is hallucinating or giving vague answers, the first suspect is almost always retrieval. The fixes, in rough order of impact:
- Chunking. Too large and you bury the answer in noise; too small and you lose context. Chunk on natural boundaries (headings, sections), not arbitrary token counts.
- Metadata. Tag chunks with source, section, date, and permissions so you can filter before you rank. This single step kills a huge class of "confidently wrong" answers.
- Re-ranking. Retrieve broadly, then re-rank the top candidates by relevance before handing them to the model. Cheap to add, big quality jump.
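To make the first two fixes concrete, here is a minimal sketch of chunking on headings and filtering on metadata before ranking. It assumes the source docs are Markdown with `#`-style headings; the `Chunk` class and the `chunk_by_headings` and `filter_chunks` helpers are hypothetical names for illustration, not part of any particular library.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_headings(doc: str, source: str) -> list[Chunk]:
    """Split a Markdown document at heading lines so each chunk is one section."""
    chunks: list[Chunk] = []
    section = "(preamble)"  # label for text that appears before the first heading
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:  # skip empty sections
            chunks.append(Chunk(text=body, metadata={"source": source, "section": section}))

    for line in doc.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section
            buffer.clear()
            section = match.group(1).strip()
        else:
            buffer.append(line)
    flush()  # close the final section
    return chunks


def filter_chunks(chunks: list[Chunk], **required) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]
```

In a real pipeline the metadata would also carry date and permissions, and the filter would usually run inside the search engine rather than in application code.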
Use hybrid search
Pure vector search is great at meaning and bad at exact terms: product names, error codes, SKUs, acronyms. Pure keyword search is the opposite. In any domain with specific terminology, combine them.
- Keyword (BM25) catches the exact-match terms users actually type.
- Vector search catches paraphrases and conceptual matches.
- Fused and re-ranked, the two together beat either one alone in almost every eval we've run.
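One common way to fuse the two result lists is reciprocal rank fusion (RRF). The sketch below is an illustration under that assumption, not a description of a specific production setup. The function name is hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword index and a vector index.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first step before a heavier re-ranker runs over the fused candidates.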
Require citations
Make the model cite the chunks it used. This does two things at once: users can verify answers themselves, and you get a debugging trail. When an answer is wrong, citations tell you instantly whether retrieval pulled the wrong context or the model misread the right context. Those are two completely different fixes.
If you can't see which chunks produced an answer, you can't debug your RAG system; you can only guess at it.
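A minimal sketch of what this can look like, assuming each chunk has a short id and the model is asked to cite ids in square brackets. The prompt wording, the bracket format, and the helper names are all assumptions made for illustration.

```python
import re


def build_prompt(question: str, chunks: dict[str, str]) -> str:
    """Label every chunk with its id so the model can cite it."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks.items())
    return (
        "Answer using only the context below. "
        "After each claim, cite the supporting chunk id in square brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def check_citations(answer: str, chunks: dict[str, str]) -> dict:
    """Compare the ids cited in an answer against the ids that were supplied."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    supplied = set(chunks)
    return {
        "valid": sorted(cited & supplied),    # citations that point at real context
        "unknown": sorted(cited - supplied),  # ids the model made up
        "uncited": not cited,                 # answer carries no citations at all
    }
```

Logging the `valid` ids alongside each answer gives you the debugging trail: a wrong answer with the right chunk cited points at generation, while a wrong or missing chunk points at retrieval.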
Monitor retrieval and generation separately
These are different problems and they need different metrics. Conflating them is how teams end up "improving the prompt" for weeks with no movement.
| Measure | What it tells you |
|---|---|
| Retrieval quality | Did the right context make it into the window? (recall, hit rate, rank of the gold chunk) |
| Generation quality | Given good context, did the model answer correctly and stay grounded? (faithfulness, correctness) |
Build a small eval set of real questions with known-good answers and run it on every change. It's the difference between engineering and vibes.
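The retrieval half of that eval can be very small. The sketch below assumes each eval case pairs the retrieved chunk ids (in rank order) with the set of gold chunk ids; the function names and the case format are hypothetical.

```python
def rank_of_gold(retrieved: list[str], gold: set[str]):
    """Return the 1-based rank of the first gold chunk, or None if it was missed."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return rank
    return None


def evaluate_retrieval(cases: list[tuple[list[str], set[str]]], k: int = 5) -> dict:
    """Compute hit rate@k, mean recall@k, and mean reciprocal rank over eval cases."""
    hits, recalls, reciprocal_ranks = [], [], []
    for retrieved, gold in cases:
        top_k = retrieved[:k]
        found = gold.intersection(top_k)
        hits.append(1.0 if found else 0.0)
        recalls.append(len(found) / len(gold) if gold else 0.0)
        rank = rank_of_gold(retrieved, gold)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(cases) or 1  # avoid dividing by zero on an empty eval set
    return {
        "hit_rate": sum(hits) / n,
        "recall": sum(recalls) / n,
        "mrr": sum(reciprocal_ranks) / n,
    }
```

Generation metrics such as faithfulness usually need a human or model grader, so they are left out here; the point is that the retrieval numbers can be tracked on their own.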
Keep the index fresh
Stale data is the silent killer of RAG systems. The demo works because the docs were current that week; six weeks later the copilot is confidently quoting a deprecated policy. Treat the index pipeline as a first-class system: re-embed when source docs change, track freshness, and alert when a source goes stale.
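As a rough sketch of the two checks, the code below detects changed docs by comparing content hashes and flags sources that have not synced recently. It assumes you store a hash per indexed doc and a last-sync timestamp per source; the function names and storage shape are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reembed(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of docs that are new or whose content changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


def stale_sources(last_synced: dict[str, datetime], max_age: timedelta, now=None) -> list[str]:
    """Return sources whose last successful sync is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(src for src, synced_at in last_synced.items() if now - synced_at > max_age)
```

Anything returned by `stale_sources` is a candidate for an alert, and anything returned by `docs_to_reembed` goes back through the embedding step. Deleted docs need a separate pass that removes their chunks from the index.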
A pragmatic build order
- Start with hybrid retrieval + metadata filtering and a small eval set.
- Add re-ranking once retrieval recall is solid.
- Require citations from day one.
- Wire up a refresh pipeline before you scale the doc set.
- Only then spend time on prompt and model tuning.
How we build RAG at Appflare
Every RAG system we ship is grounded in the client's real data, instrumented with retrieval and generation evals, and kept honest with citations and a refresh pipeline. For customer-facing use, we keep a human in the loop until the evals earn the system more autonomy. That's exactly the approach behind our support automation case study, where handle time dropped from 12 minutes to 4.
If you want a copilot or search experience that's actually trustworthy in production, talk to us.