Overview
- Microsoft published guidance on May 22 urging engineers to treat RAG as an information‑retrieval and distributed‑systems problem and to restore document structure through semantic chunking, deduplication, hierarchical indexes, partitioned indexes, precomputed embeddings, and caching.
- Practitioners say the root scaling fault is token‑based chunking, which multiplies fragments and creates a noisy high‑dimensional vector space that degrades nearest‑neighbor search and makes latency, cost, and accuracy loss grow nonlinearly.
- InferX reported on May 23 that it replaced the retrieval pipeline with a persistent KV cache that saves the LLM’s attention state after a full‑document prefill. The company says the change improved answer quality, cut operational complexity, and gave near‑instant warm response times.
- The KV‑cache approach has firm limits: current context ceilings of roughly 120,000 tokens, higher upfront prefill cost, and cold‑restore latency, so it fits high‑query, low‑update, privacy‑sensitive documents while RAG still suits very large or highly dynamic corpora.
- The likely near‑term outcome is hybrid architectures that combine structure‑aware RAG for broad, changing collections with KV caches for heavy‑use, focused documents. That shift will change operational roles, costs, and SLAs for enterprise AI teams.
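
The trade-off in the last two points can be sketched as a simple routing rule. This is a minimal illustration, not anything Microsoft or InferX has published: the `DocumentProfile` fields, the `choose_backend` function, the 50-queries-per-prefill threshold, and the once-a-day minimum prefill rate are all hypothetical. Only the ceiling of roughly 120,000 tokens comes from the summary above. Privacy sensitivity is left out because it is a policy decision rather than something to compute.

```python
from dataclasses import dataclass

# Approximate context ceiling quoted in the overview; treat it as a tunable value.
KV_CONTEXT_CEILING_TOKENS = 120_000


@dataclass
class DocumentProfile:
    """Hypothetical per-document statistics an operator might track."""
    token_count: int        # size of the full document in tokens
    queries_per_day: float  # how often the document is queried
    updates_per_day: float  # how often its content changes


def choose_backend(doc: DocumentProfile,
                   min_queries_per_prefill: float = 50.0) -> str:
    """Return "kv_cache" or "rag" for one document.

    Assumed rule of thumb: every update forces a fresh prefill, so a
    KV cache only pays off when many queries amortize each prefill.
    """
    # A document that does not fit in the context window cannot be prefilled.
    if doc.token_count > KV_CONTEXT_CEILING_TOKENS:
        return "rag"

    # Assumption: budget at least one prefill per day even for static
    # documents, to account for cold restores and cache eviction.
    prefills_per_day = max(doc.updates_per_day, 1.0)

    # Too few queries per prefill means the upfront cost is never recovered.
    if doc.queries_per_day / prefills_per_day < min_queries_per_prefill:
        return "rag"

    return "kv_cache"


if __name__ == "__main__":
    handbook = DocumentProfile(token_count=80_000, queries_per_day=2_000, updates_per_day=0.1)
    wiki = DocumentProfile(token_count=5_000_000, queries_per_day=2_000, updates_per_day=0.1)
    print(choose_backend(handbook))
    print(choose_backend(wiki))
```

In practice the threshold would be derived from measured prefill cost and per-query retrieval cost rather than fixed at an arbitrary constant.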