Overview
- Retrieval-augmented generation (RAG) is a widely used way to give large language models access to private and recent documents so that they can answer with cited facts.
- The single biggest cause of wrong answers is bad retrieval, especially naive chunking that splits meaning across pieces and returns irrelevant or partial context to the model.
- Even when retrieval is correct, models can still invent facts, so teams must enforce grounding by requiring citations, explicit "I don't know" answers, and traceable chunk IDs.
- Indexes go stale without disciplined ingestion: teams need incremental re-indexing, deduplication, freshness signals, and metadata to avoid serving outdated or deleted content.
- Production fixes are mostly engineering: hybrid dense+sparse retrieval, with a bi-encoder for recall and a cross-encoder for reranking; held-out evaluation sets and metrics to catch regressions; and caching or model routing to cut latency and cost.
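
As an illustration of the chunking and traceability points above, here is a minimal sketch of overlapping, sentence-based chunking that gives every chunk a stable ID a citation can point back to. The `Chunk` type, the `chunk_sentences` function, the ID format, and the default window sizes are all hypothetical choices for this example, not something the overview prescribes; a real pipeline would also need a proper sentence splitter.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str  # hypothetical format: "<doc_id>:<start sentence index>:<content hash>"
    doc_id: str
    text: str


def chunk_sentences(doc_id, sentences, size=3, overlap=1):
    """Group sentences into windows of `size`, sharing `overlap` sentences
    between neighbours so a thought is less likely to be cut in half."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        text = " ".join(sentences[start:start + size])
        # A content hash in the ID changes whenever the text changes, which
        # also helps with deduplication and detecting stale chunks.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        chunks.append(Chunk(f"{doc_id}:{start}:{digest}", doc_id, text))
        if start + size >= len(sentences):
            break  # the last window already reached the end of the document
    return chunks


if __name__ == "__main__":
    demo = ["RAG retrieves context.", "Chunks carry IDs.", "IDs enable citations.",
            "Overlap preserves meaning.", "Stale chunks get replaced."]
    for chunk in chunk_sentences("doc-1", demo):
        print(chunk.chunk_id, "|", chunk.text)
```

Because the ID is derived from the document ID, position, and content, an answer can cite `chunk_id` values directly, and re-ingesting an unchanged document produces the same IDs.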
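
The hybrid retrieval step can be sketched in a similar spirit. The overview does not say how dense and sparse results should be merged; reciprocal rank fusion (RRF) is one common option and is used here purely as an assumption. The function name, the `k=60` constant, and the chunk IDs are illustrative, and the cross-encoder reranking stage that would normally follow is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) per chunk (rank starts at 1), so
    chunks ranked highly by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    dense_hits = ["c1", "c2", "c3"]   # e.g. from a bi-encoder vector search
    sparse_hits = ["c3", "c1", "c4"]  # e.g. from a keyword (BM25-style) search
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
    # prints ['c1', 'c3', 'c2', 'c4']
```

RRF only needs ranks, not raw scores, which sidesteps the problem that dense similarity scores and sparse keyword scores live on different scales. The fused list would then typically be truncated and passed to a reranker.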