RAG in production: 7 pitfalls nobody warns you about
Building a RAG proof of concept is straightforward. Deploying it in a regulated banking or industrial environment is another story. Here are the 7 problems I've encountered — and how to solve them.
Everyone is talking about RAG (Retrieval-Augmented Generation). The demos are impressive. The PoC works in two days. And then the moment comes to push to production — and that's when the real questions emerge.
I have industrialized several RAG systems in constrained environments (Defense, Banking, Industry). Here is what I wish I had known beforehand.
1. Indexing quality determines everything
The LLM cannot compensate for poor indexing. If your chunks are too large, too small, or semantically poorly delimited, your responses will be mediocre, regardless of the model you use.
What works:
- Semantic chunking (by logical section, not fixed size)
- 10–15% overlap between chunks to preserve context
- Rich metadata on each chunk (source, date, document type)
- Post-retrieval reranking with a cross-encoder
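The chunking rules above can be sketched in a few lines. This is a minimal illustration, not a library API: `split_into_sections` here naively splits on blank lines as a stand-in for real semantic segmentation, and the metadata fields are examples to adapt to your documents.

```python
import re

def split_into_sections(text: str) -> list[str]:
    """Crude stand-in for semantic segmentation: split on blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def chunk_with_overlap(text: str, source: str, max_words: int = 200,
                       overlap_ratio: float = 0.12) -> list[dict]:
    """Chunk each section with ~12% word overlap and attach metadata."""
    chunks = []
    for section in split_into_sections(text):
        words = section.split()
        overlap = int(max_words * overlap_ratio)
        step = max_words - overlap
        for start in range(0, max(len(words), 1), step):
            piece = words[start:start + max_words]
            if not piece:
                break
            chunks.append({
                "text": " ".join(piece),
                "source": source,              # metadata travels with the chunk
                "section_start_word": start,
            })
            if start + max_words >= len(words):
                break
    return chunks
```

In a real pipeline you would split on headings or logical sections rather than blank lines, and carry date and document type alongside the source.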
2. Embedding choice is not trivial
text-embedding-ada-002 is not always the best choice. For technical documents in French or other languages, I have achieved better results with multilingual models like intfloat/multilingual-e5-large.
Test your embeddings on your data, not on generic benchmarks.
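A recall@k harness over your own question/document pairs is enough to compare candidate models. In this sketch, `embed` is a toy bag-of-words stand-in so the example is self-contained; in practice you would plug in each embedding model you want to compare.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding; replace with the model under evaluation."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(golden: list[tuple[str, str]],
                corpus: dict[str, str], k: int = 3) -> float:
    """golden: (query, relevant_doc_id) pairs drawn from YOUR data."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, relevant_id in golden:
        qv = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(qv, doc_vecs[d]),
                        reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(golden)
```

Run this with each candidate model's `embed` and keep the one that wins on your corpus, not on a leaderboard.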
3. Latency is your #1 enemy
A complete RAG system chains:
- Query embedding
- Vector search
- Reranking
- LLM call
In production, each of these steps can become a bottleneck. Aggressive caching of frequent query embeddings, async pipeline, and streaming of the LLM response are non-negotiable.
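The caching and streaming pattern can be sketched as follows. `fake_embed`, `vector_search`, and `stream_llm` are placeholders for your real embedding model, vector DB client, and LLM client; only the structure (cached embeddings, async calls, streamed tokens) is the point.

```python
import asyncio
from functools import lru_cache

@lru_cache(maxsize=10_000)           # frequent queries embed only once
def fake_embed(query: str) -> tuple[float, ...]:
    return tuple(float(ord(c)) for c in query[:8])

async def vector_search(vec: tuple[float, ...]) -> list[str]:
    await asyncio.sleep(0.01)        # simulated network round-trip
    return ["chunk-1", "chunk-2"]

async def stream_llm(prompt: str):
    for token in ["Grounded", "answer", "here."]:
        await asyncio.sleep(0.005)   # tokens surface as they are generated
        yield token

async def answer(query: str) -> str:
    vec = fake_embed(query)          # cache hit after the first call
    chunks = await vector_search(vec)
    prompt = f"Context: {chunks}\nQuestion: {query}"
    return " ".join([tok async for tok in stream_llm(prompt)])
```

Streaming matters most for perceived latency: the user sees the first token while the rest is still generating.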
4. Data sovereignty imposes architectural constraints
In regulated environments, you cannot send your data to OpenAI. This implies:
- Local LLM (Llama 3, Mistral) via Ollama or vLLM
- On-premise Vector DB (self-hosted Qdrant)
- Local embeddings
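A self-hosted stack of this kind can be sketched as a compose file. The images and ports below are the public defaults at the time of writing; in a genuinely air-gapped deployment you would pull everything from an internal registry and verify versions yourself.

```yaml
services:
  llm:
    image: ollama/ollama
    ports:
      - "11434:11434"              # Ollama HTTP API (default port)
    volumes:
      - ./models:/root/.ollama     # model weights pulled once, then offline
  vectordb:
    image: qdrant/qdrant
    ports:
      - "6333:6333"                # Qdrant REST API (default port)
    volumes:
      - ./qdrant_storage:/qdrant/storage
```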
I deployed this "air-gapped" stack at Sandvik. Performance trails the big cloud models, but in these environments compliance is not optional.
5. Guardrails are a feature, not an option
Without guardrails, your RAG will:
- Hallucinate when context is insufficient (instead of admitting ignorance)
- Be vulnerable to prompt injection attacks in indexed documents
- Potentially leak information from other users if the corpus is not properly tenant-isolated
Implement:
- Confidence scoring on retrieved chunk relevance
- Security filter on incoming queries
- Per-user/organization corpus isolation
6. Continuous evaluation is essential
A RAG that works today can degrade tomorrow if:
- New documents shift the corpus distribution
- The embedding model is updated
- Query patterns evolve
Set up an automated evaluation pipeline with golden datasets (questions + expected answers) and regularly measure faithfulness and relevance.
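The skeleton of such a pipeline is simple. Here `retrieve` is a stub for your real pipeline; faithfulness and relevance are normally scored by an LLM judge or a framework such as RAGAS, so this sketch only checks the cheaper signal, namely that the expected source chunk is still retrieved for each golden question.

```python
def evaluate(golden: list[dict], retrieve) -> float:
    """golden items: {'question': ..., 'expected_chunk_id': ...}."""
    hits = sum(
        item["expected_chunk_id"] in [c["id"] for c in retrieve(item["question"])]
        for item in golden
    )
    return hits / len(golden)
```

Run it in CI on every index rebuild and every embedding-model change, and alert when the score drops below your baseline.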
7. LLMOps monitoring differs from classical application monitoring
Classic metrics (latency, errors) are insufficient. You need:
- Faithfulness: is the response grounded in the retrieved chunks?
- Answer relevance: does it actually answer the question asked?
- Context precision: are the retrieved chunks relevant?
- Cost tracking: how many tokens are consumed per query?
Tools like LangSmith, Phoenix, or Helicone facilitate this monitoring.
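Of the four, cost tracking is the easiest to start with in-house. A minimal sketch, where the per-1k-token prices are placeholder values to replace with your provider's rates (or your GPU amortization):

```python
from dataclasses import dataclass

@dataclass
class CostTracker:
    input_price_per_1k: float = 0.0005    # placeholder rates
    output_price_per_1k: float = 0.0015
    total_cost: float = 0.0
    queries: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Record one query's token usage; return its cost."""
        cost = (input_tokens / 1000) * self.input_price_per_1k \
             + (output_tokens / 1000) * self.output_price_per_1k
        self.total_cost += cost
        self.queries += 1
        return cost

    @property
    def avg_cost_per_query(self) -> float:
        return self.total_cost / self.queries if self.queries else 0.0
```

Log the per-query cost alongside faithfulness and relevance scores so that quality regressions and cost spikes show up in the same dashboard.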
RAG is a powerful technology, but its industrialization requires an engineering rigor that goes far beyond the PoC. If you are navigating this transition and need an external perspective, get in touch.