RAG in Production: The Gap Between Theory and Reality

Most RAG tutorials are lies. Not intentionally. But they show you a demo that works in a Jupyter notebook, and reality is very different.

I've built 7 production RAG systems. Here's what actually matters:

The Metrics That Count

1. Precision vs Recall Trade-off

Precision: % of retrieved documents that are relevant
Recall: % of relevant documents you actually retrieve

Theory says you want both high. Reality: Pick one.

In my lending underwriting system:

High precision setup: 91% precision, 42% recall
- Users trust results
- Miss important edge cases
High recall setup: 64% precision, 88% recall
- Catch edge cases
- Users distrust noisy results

We went with 76% precision / 72% recall. Balanced. Imperfect. Works.
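To make the trade-off concrete, here is a minimal sketch of how a similarity cutoff moves precision and recall in opposite directions. The scores, document IDs, and thresholds are made up for illustration; they are not data from the underwriting system, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored_docs, relevant_ids, threshold):
    """Return (precision, recall) when keeping docs with score >= threshold."""
    retrieved = {doc_id for doc_id, score in scored_docs if score >= threshold}
    hits = len(retrieved & relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical similarity scores for a single query (illustrative only).
scored = [("a", 0.92), ("b", 0.85), ("c", 0.71), ("d", 0.64), ("e", 0.40)]
# Documents a human judged relevant; "f" was never retrieved at all.
relevant = {"a", "b", "d", "f"}

for threshold in (0.8, 0.6, 0.3):
    p, r = precision_recall(scored, relevant, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

A strict cutoff keeps only the sure hits (high precision, low recall). Loosening it pulls in more relevant documents along with more noise. Where you stop depends on which failure your users tolerate better.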

2. Latency Under Load

Your 200ms response time in testing becomes 2000ms in production with 100 concurrent users.

Latency compounds across the whole pipeline, not just the vector DB:

Query embedding: 50ms
Vector search: 120ms
Retrieval: 30ms
LLM reranking: 80ms
LLM generation: 800ms

Total: 1080ms. Your 200ms target assumed no users.
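The arithmetic is worth keeping in code so the budget gets rechecked whenever a stage changes. A small sketch using the stage figures above (the dictionary keys are my own labels):

```python
# Per-stage latencies in milliseconds, taken from the breakdown above.
stage_latency_ms = {
    "query_embedding": 50,
    "vector_search": 120,
    "retrieval": 30,
    "llm_reranking": 80,
    "llm_generation": 800,
}

total_ms = sum(stage_latency_ms.values())
slowest = max(stage_latency_ms, key=stage_latency_ms.get)
share = stage_latency_ms[slowest] / total_ms

print(f"total: {total_ms}ms")
print(f"slowest stage: {slowest} ({share:.0%} of total)")
```

Generation dominates the total, so shaving the vector search alone will not get you anywhere near a 200ms target.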

3. The Staleness Problem (Nobody Talks About This)

Your RAG system uses documents from yesterday. A client updates their policy today. Your system gives outdated information.

I've seen companies lose users because of this.

Solution: Update vectors when source documents change. But this gets complicated with 500K+ documents.

We built an async vector update system that updates changed documents within 5 minutes. Cost: $2,000/month in infrastructure. Saves maybe $50K/month in support burden.
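The core of any update system is deciding which documents actually changed. One common approach, sketched below, is to store a content hash alongside each embedding and re-embed only when the hash differs. This is an assumption about how such a system could work, not a description of the one we built; `content_hash`, `find_stale`, and the document IDs are hypothetical, and the async queueing and re-embedding steps are left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose text changed since embedding."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hashes recorded when the documents were last embedded (illustrative data).
indexed = {
    "policy-1": content_hash("Returns accepted within 30 days."),
    "policy-2": content_hash("Free shipping over $50."),
}

# Current source of truth: policy-1 was edited, policy-3 is new.
current = {
    "policy-1": "Returns accepted within 14 days.",
    "policy-2": "Free shipping over $50.",
    "policy-3": "Gift cards never expire.",
}

print(find_stale(current, indexed))
```

Only the IDs returned here need to go to the embedding queue, which is what keeps the approach affordable at 500K+ documents.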

The Real Cost Breakdown

Building RAG:

Embedding model (fine-tuning): $5-20K
Vector DB (initial): $0-5K
Integration/testing: 80-120 hours
Total: $15-30K

Running RAG (monthly):

Embeddings API: $500-2K
Vector DB: $500-5K
LLM calls: $2K-10K
Monitoring/updates: $1K-3K
Total: $4K-20K/month
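For budgeting, it helps to keep the ranges in one place and derive the totals rather than retyping them. A quick sketch using the monthly figures above (the line-item keys are my own labels):

```python
# (low, high) monthly cost in USD for each line item listed above.
monthly_costs_usd = {
    "embeddings_api": (500, 2_000),
    "vector_db": (500, 5_000),
    "llm_calls": (2_000, 10_000),
    "monitoring_updates": (1_000, 3_000),
}

low = sum(lo for lo, _ in monthly_costs_usd.values())
high = sum(hi for _, hi in monthly_costs_usd.values())

print(f"Monthly: ${low:,}-${high:,}")
print(f"Annual:  ${low * 12:,}-${high * 12:,}")
```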

Most companies underestimate operational costs by 5x.

What Surprised Me

Reranking > Raw retrieval quality
- Retrieving 50 documents and reranking them with an LLM beats retrieving 5 perfect documents
- Reranking costs $0.001-0.01 per query
- Massive ROI
Query expansion matters more than I thought
- User asks: "Can I return items?"
- Expanded: "Can I return items? What's the return policy? Return procedures? Refunds?"
- Gets 40% better retrieval
- Costs: 20ms per query
Small custom embedding beats large generic embedding
- 768-dim domain-specific > 1536-dim OpenAI generic
- 90% of the quality at 30% of the cost
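The first two points combine naturally: expand the query, retrieve a wide candidate pool, then rerank against the original question. Below is a rough sketch of that shape. The token-overlap scorer is a toy stand-in for both the vector search and the LLM reranker, and all function names and the tiny corpus are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of shared tokens (placeholder for a real scorer)."""
    return len(tokens(query) & tokens(doc))


def expand_query(query: str, related: list[str]) -> str:
    """Append related phrasings so the first pass casts a wider net."""
    return " ".join([query, *related])


def retrieve_then_rerank(query, expanded, corpus, first_pass, rerank,
                         candidates=50, final=5):
    """Wide first pass on the expanded query, then rerank on the original one."""
    pool = sorted(corpus, key=lambda d: first_pass(expanded, d), reverse=True)[:candidates]
    return sorted(pool, key=lambda d: rerank(query, d), reverse=True)[:final]


corpus = [
    "Our return policy allows returns within 30 days.",
    "Refunds go back to the original payment method.",
    "Shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

query = "Can I return items?"
expanded = expand_query(
    query, ["What's the return policy?", "Return procedures?", "Refunds?"]
)

# The refund document shares no words with the raw query, but the expansion finds it.
print(overlap_score(query, corpus[1]), overlap_score(expanded, corpus[1]))

results = retrieve_then_rerank(query, expanded, corpus,
                               overlap_score, overlap_score,
                               candidates=3, final=2)
print(results)
```

In a real system you would swap `first_pass` for the vector search and `rerank` for the LLM or cross-encoder call. The structure stays the same.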

The Production Lessons

Monitor embedding staleness
- Track: "How old are the embeddings for this document?"
- Alert: If embeddings > 7 days old
- Regenerate weekly
Set up A/B testing from day one
- Test precision vs recall
- Test reranking on/off
- Test different embedding models
- You'll change your strategy
Plan for failure modes
- Vector DB down? Fall back to keyword search
- Embedding service down? Cache embeddings
- LLM down? Return raw documents
- Have a graceful degradation strategy
User feedback is gold
- Track: "Was this retrieval helpful?"
- Build dataset from "No" answers
- Fine-tune on that data every month
- Your system gets better over time
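Two of these lessons are easy to sketch in code: the 7-day staleness alert and the fallback chain. This is a simplified illustration, not our production code. The function names are hypothetical, the backends are passed in as plain callables, and real code should catch specific exceptions rather than a bare `Exception`.

```python
from datetime import datetime, timedelta, timezone

MAX_EMBEDDING_AGE = timedelta(days=7)  # alert threshold from the list above


def stale_documents(embedded_at: dict[str, datetime], now: datetime) -> list[str]:
    """IDs of documents whose embeddings are older than the alert threshold."""
    return sorted(
        doc_id for doc_id, ts in embedded_at.items() if now - ts > MAX_EMBEDDING_AGE
    )


def answer_with_fallbacks(query, vector_search, keyword_search, generate):
    """Degrade step by step instead of failing the whole request."""
    try:
        docs, source = vector_search(query), "vector"
    except Exception:  # vector DB down: fall back to keyword search
        docs, source = keyword_search(query), "keyword"
    try:
        answer = generate(query, docs)
    except Exception:  # LLM down: return the raw documents with no answer
        answer = None
    return {"answer": answer, "documents": docs, "retrieval": source}


def broken(*args):
    """Stub that simulates an outage."""
    raise RuntimeError("service unavailable")


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
ages = {"fresh": now - timedelta(days=2), "old": now - timedelta(days=10)}
print(stale_documents(ages, now))

degraded = answer_with_fallbacks(
    "Can I return items?",
    vector_search=broken,
    keyword_search=lambda q: ["Our return policy allows returns within 30 days."],
    generate=broken,
)
print(degraded)
```

With both the vector DB and the LLM "down", the caller still gets keyword-matched documents back, which beats an error page.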

The Numbers

Success metric: User satisfaction with retrieved documents

First deployment: 58% satisfied
After embedding tuning: 71% satisfied
After reranking: 82% satisfied
After user feedback loop (3 months): 87% satisfied

Took 6 months to get good. Still improving.

Reality Check

RAG is not "load documents, embed them, done."

It's: monitoring, updating, A/B testing, reranking, feedback loops, and constant iteration.

The systems that work treat RAG as a living, evolving product. Not a static pipeline.

Do that, and you'll build RAG that actually serves users well.
