RAG in production: 7 pitfalls nobody warns you about
Building a RAG proof of concept is straightforward. Deploying it in a regulated banking or industrial environment is another story. Here are the 7 problems I've encountered — and how to solve them.
Everyone is talking about RAG (Retrieval-Augmented Generation). The demos are impressive. The PoC works in two days. And then the moment comes to push to production — and that's when the real questions emerge.
I have industrialized several RAG systems in constrained environments (Defense, Banking, Industry). Here is what I wish I had known beforehand.
1. Indexing quality determines everything
The LLM cannot compensate for poor indexing. If your chunks are too large, too small, or semantically poorly delimited, your responses will be mediocre, regardless of the model you use.
What works:
- Semantic chunking (by logical section, not fixed size)
- 10–15% overlap between chunks to preserve context
- Rich metadata on each chunk (source, date, document type)
- Post-retrieval reranking with a cross-encoder
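The chunking rules above can be sketched in a few lines. This is a minimal illustration, not a library API: `split_into_sections` here naively splits on blank lines as a stand-in for real semantic segmentation, and the metadata fields are examples to adapt to your documents.

```python
import re

def split_into_sections(text: str) -> list[str]:
    """Crude stand-in for semantic segmentation: split on blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def chunk_with_overlap(text: str, source: str, max_words: int = 200,
                       overlap_ratio: float = 0.12) -> list[dict]:
    """Chunk each section with ~12% word overlap and attach metadata."""
    chunks = []
    for section in split_into_sections(text):
        words = section.split()
        overlap = int(max_words * overlap_ratio)
        step = max_words - overlap
        for start in range(0, max(len(words), 1), step):
            piece = words[start:start + max_words]
            if not piece:
                break
            chunks.append({
                "text": " ".join(piece),
                "source": source,              # metadata travels with the chunk
                "section_start_word": start,
            })
            if start + max_words >= len(words):
                break
    return chunks
```

In a real pipeline you would split on headings or logical sections rather than blank lines, and carry date and document type alongside the source.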
2. Embedding choice is not trivial
text-embedding-ada-002 is not always the best choice. For technical documents in French or other languages, I have achieved better results with multilingual models like intfloat/multilingual-e5-large.
Test your embeddings on your data, not on generic benchmarks.
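A recall@k harness over your own question/document pairs is enough to compare candidate models. In this sketch, `embed` is a toy bag-of-words stand-in so the example is self-contained; in practice you would plug in each embedding model you want to compare.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding; replace with the model under evaluation."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(golden: list[tuple[str, str]],
                corpus: dict[str, str], k: int = 3) -> float:
    """golden: (query, relevant_doc_id) pairs drawn from YOUR data."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in golden:
        qv = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(qv, doc_vecs[d]),
                        reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(golden)
```

Run this with each candidate model's `embed` and keep the one that wins on your corpus, not on a leaderboard.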
3. Latency is your #1 enemy
A complete RAG system chains:
- Query embedding
- Vector search
- Reranking
- LLM call
In production, each of these steps can become a bottleneck. Aggressive caching of frequent query embeddings, async pipeline, and streaming of the LLM response are non-negotiable.
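The caching and streaming pattern can be sketched as follows. `fake_embed`, `vector_search`, and `stream_llm` are placeholders for your real embedding model, vector DB client, and LLM client; only the structure (cached embeddings, async calls, streamed tokens) is the point.

```python
import asyncio
from functools import lru_cache

@lru_cache(maxsize=10_000)           # frequent queries embed only once
def fake_embed(query: str) -> tuple[float, ...]:
    return tuple(float(ord(c)) for c in query[:8])

async def vector_search(vec: tuple[float, ...]) -> list[str]:
    await asyncio.sleep(0.01)        # simulated network round-trip
    return ["chunk-1", "chunk-2"]

async def stream_llm(prompt: str):
    for token in ["Grounded", "answer", "here."]:
        await asyncio.sleep(0.005)   # tokens surface as they are generated
        yield token

async def answer(query: str) -> str:
    vec = fake_embed(query)          # cache hit after the first call
    chunks = await vector_search(vec)
    prompt = f"Context: {chunks}\nQuestion: {query}"
    return " ".join([tok async for tok in stream_llm(prompt)])
```

Streaming matters most for perceived latency: the user sees the first token while the rest is still generating.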
4. Data sovereignty imposes architectural constraints
In regulated environments, you cannot send your data to OpenAI. This implies:
- Local LLM (Llama 3, Mistral) via Ollama or vLLM
- On-premise Vector DB (self-hosted Qdrant)
- Local embeddings
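A self-hosted stack of this kind can be sketched as a compose file. The images and ports below are the public defaults at the time of writing; in a genuinely air-gapped deployment you would pull everything from an internal registry and verify versions yourself.

```yaml
services:
  llm:
    image: ollama/ollama
    ports:
      - "11434:11434"              # Ollama HTTP API (default port)
    volumes:
      - ./models:/root/.ollama     # model weights pulled once, then offline
  vectordb:
    image: qdrant/qdrant
    ports:
      - "6333:6333"                # Qdrant REST API (default port)
    volumes:
      - ./qdrant_storage:/qdrant/storage
```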
I deployed this "air-gapped" stack at Sandvik. Performance trails the big cloud models, but in these environments compliance is not optional.
5. Guardrails are a feature, not an option
Without guardrails, your RAG will:
- Hallucinate when context is insufficient (instead of admitting ignorance)
- Be vulnerable to prompt injection attacks in indexed documents
- Potentially leak information from other users if the corpus is not properly tenant-isolated
Implement:
- Confidence scoring on retrieved chunk relevance
- Security filter on incoming queries
- Per-user/organization corpus isolation
6. Continuous evaluation is essential
A RAG that works today can degrade tomorrow if:
- New documents shift the corpus distribution
- The embedding model is updated
- Query patterns evolve
Set up an automated evaluation pipeline with golden datasets (questions + expected answers) and regularly measure faithfulness and relevance.
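The skeleton of such a pipeline is simple. Here `retrieve` is a stub for your real pipeline; faithfulness and relevance are normally scored by an LLM judge or a framework such as RAGAS, so this sketch only checks the cheaper signal, namely that the expected source chunk is still retrieved for each golden question.

```python
def evaluate(golden: list[dict], retrieve) -> float:
    """golden items: {'question': ..., 'expected_chunk_id': ...}."""
    hits = sum(
        item["expected_chunk_id"] in [c["id"] for c in retrieve(item["question"])]
        for item in golden
    )
    return hits / len(golden)
```

Run it in CI on every index rebuild and every embedding-model change, and alert when the score drops below your baseline.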
7. LLMOps monitoring differs from classical application monitoring
Classic metrics (latency, errors) are insufficient. You need:
- Faithfulness: is the response grounded in the retrieved chunks?
- Answer relevance: does it actually answer the question asked?
- Context precision: are the retrieved chunks relevant?
- Cost tracking: how many tokens are consumed per query?
Tools like LangSmith, Phoenix, or Helicone facilitate this monitoring.
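Of the four, cost tracking is the easiest to start with in-house. A minimal sketch, where the per-1k-token prices are placeholder values to replace with your provider's rates (or your GPU amortization):

```python
from dataclasses import dataclass

@dataclass
class CostTracker:
    input_price_per_1k: float = 0.0005    # placeholder rates
    output_price_per_1k: float = 0.0015
    total_cost: float = 0.0
    queries: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Record one query's token usage; return its cost."""
        cost = (input_tokens / 1000) * self.input_price_per_1k \
             + (output_tokens / 1000) * self.output_price_per_1k
        self.total_cost += cost
        self.queries += 1
        return cost

    @property
    def avg_cost_per_query(self) -> float:
        return self.total_cost / self.queries if self.queries else 0.0
```

Log the per-query cost alongside faithfulness and relevance scores so that quality regressions and cost spikes show up in the same dashboard.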
RAG is a powerful technology, but its industrialization requires an engineering rigor that goes far beyond the PoC. If you are navigating this transition and need an external perspective, get in touch.