RAG Systems That Actually Work: Real Metrics From Production
RAG in Production: The Gap Between Theory and Reality
Most RAG tutorials are lies. Not intentional ones, but they show you a demo that works in a Jupyter notebook, and production is very different.
I've built 7 production RAG systems. Here's what actually matters:
The Metrics That Count
1. Precision vs Recall Trade-off
Precision: % of retrieved documents that are relevant
Recall: % of relevant documents you actually retrieve
Theory says you want both high. Reality: Pick one.
In my lending underwriting system:
High precision setup: 91% precision, 42% recall. Users trust the results, but the system misses important edge cases.
High recall setup: 64% precision, 88% recall. The system catches edge cases, but users distrust the noisy results.
We went with 76% precision / 72% recall. Balanced. Imperfect. Works.
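To make the two definitions concrete, here is a minimal sketch of how precision and recall could be computed for a single query. The function name, document IDs, and relevance labels are hypothetical; in practice the "relevant" set comes from human-labeled ground truth.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the system returned
    relevant:  document IDs judged relevant (ground truth labels)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so a query with no results does not crash.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole corpus.
p, r = precision_recall(
    ["d1", "d2", "d3", "d9"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

Averaging these per-query values over a labeled query set gives the kind of system-level numbers quoted above.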
2. Latency Under Load
Your 200ms response time in testing becomes 2000ms in production with 100 concurrent users.
Latency compounds across every stage of the pipeline, not just the vector DB:
Query embedding: 50ms
Vector search: 120ms
Retrieval: 30ms
LLM reranking: 80ms
LLM generation: 800ms
Total: 1080ms, and that is before concurrent users add any queueing delay. Your 200ms target assumed no users.
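The stage breakdown above can be kept as a simple latency budget. This is only a sketch: the stage names are my own labels for the numbers listed, and the load-test samples are made up to show why a tail percentile tells you more than an average.

```python
import math

# Per-stage latencies in milliseconds, copied from the breakdown above.
STAGE_MS = {
    "query_embedding": 50,
    "vector_search": 120,
    "retrieval": 30,
    "llm_reranking": 80,
    "llm_generation": 800,
}


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


total = sum(STAGE_MS.values())
slowest = max(STAGE_MS, key=STAGE_MS.get)
print(f"total={total}ms, slowest stage={slowest} "
      f"({STAGE_MS[slowest] / total:.0%} of the budget)")

# Hypothetical end-to-end samples from a load test (ms).
samples = [200, 210, 220, 250, 2000]
print(f"p50={percentile(samples, 50)}ms p95={percentile(samples, 95)}ms")
```

Tracking p95 or p99 per stage under realistic concurrency is what exposes the gap between a notebook measurement and production.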
3. The Staleness Problem (Nobody Talks About This)
Your RAG system uses documents from yesterday. A client updates their policy today. Your system gives outdated information.
I've seen companies lose users because of this.
Solution: Update vectors when source documents change. But this gets complicated with 500K+ documents.
We built an async vector update system that updates changed documents within 5 minutes. Cost: $2,000/month in infrastructure. Saves maybe $50K/month in support burden.
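One common way to find which documents need re-embedding is to compare a content hash against the hash stored at indexing time. The sketch below illustrates that idea only; it is not the async system described above, and the function names and sample policies are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents, indexed_hashes):
    """Return IDs of documents whose text changed since they were embedded.

    documents:      {doc_id: current text}
    indexed_hashes: {doc_id: hash recorded when the embedding was created}
    Documents missing from indexed_hashes are treated as stale too.
    """
    return [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]


# Hypothetical data: policy-1 was edited after it was embedded.
docs = {
    "policy-1": "Returns accepted within 30 days.",
    "policy-2": "Shipping is free over $50.",
}
indexed = {
    "policy-1": content_hash("Returns accepted within 14 days."),
    "policy-2": content_hash(docs["policy-2"]),
}
stale = find_stale(docs, indexed)
print(stale)  # ['policy-1']
```

Only the IDs returned here need to go onto the re-embedding queue, which is what keeps the update cost manageable at 500K+ documents.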
The Real Cost Breakdown
Building RAG:
Embedding model (fine-tuning): $5-20K
Vector DB (initial): $0-5K
Integration/testing: 80-120 hours
Total: $15-30K
Running RAG (monthly):
Embeddings API: $500-2K
Vector DB: $500-5K
LLM calls: $2K-10K
Monitoring/updates: $1K-3K
Total: $4K-20K/month
Most companies underestimate operational costs by 5x.
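The monthly figures above are easy to sanity-check, and annualizing them makes the operational commitment clearer. A small sketch, with dictionary keys that are just my labels for the line items listed:

```python
# (low, high) monthly cost in USD, copied from the running-cost list above.
MONTHLY_USD = {
    "embeddings_api": (500, 2_000),
    "vector_db": (500, 5_000),
    "llm_calls": (2_000, 10_000),
    "monitoring_updates": (1_000, 3_000),
}

low = sum(lo for lo, _ in MONTHLY_USD.values())
high = sum(hi for _, hi in MONTHLY_USD.values())

print(f"monthly: ${low:,} to ${high:,}")
print(f"annual:  ${low * 12:,} to ${high * 12:,}")
```

Within a year, even the low end of the running cost exceeds the upper end of the one-time build estimate.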
What Surprised Me
Reranking > Raw retrieval quality
Retrieving 50 candidate documents and reranking them with an LLM beats trying to retrieve the 5 best documents directly
Reranking costs $0.001-0.01 per query
Massive ROI
Query expansion matters more than I thought
User asks: "Can I return items?"
Expanded: "Can I return items? What's the return policy? Return procedures? Refunds?"
Gets 40% better retrieval
Cost: 20ms of added latency per query
Small custom embeddings beat large generic embeddings on cost-effectiveness
768-dim domain-specific > 1536-dim OpenAI generic
90% of the quality at 30% of the cost
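The first two points above, query expansion and retrieve-wide-then-rerank, can be sketched together. Everything here is a toy under stated assumptions: `overlap_score` is a word-overlap stand-in for both the vector search and the LLM reranker, the expansion phrases are hand-written rather than generated, and the corpus is made up.

```python
def expand_query(query, related_phrases):
    """Append related phrasings so retrieval can match more vocabulary."""
    return " ".join([query, *related_phrases])


def tokenize(text):
    return {w.strip("?.,!'\"").lower() for w in text.split()} - {""}


def overlap_score(query, doc):
    """Toy relevance score: fraction of query terms found in the document."""
    q = tokenize(query)
    return len(q & tokenize(doc)) / len(q) if q else 0.0


def retrieve_then_rerank(user_query, search_query, corpus,
                         first_stage, reranker, fetch_k=50, top_k=5):
    """Cast a wide net with search_query, then rerank against user_query."""
    candidates = sorted(
        corpus, key=lambda d: first_stage(search_query, d), reverse=True
    )[:fetch_k]
    return sorted(
        candidates, key=lambda d: reranker(user_query, d), reverse=True
    )[:top_k]


corpus = [
    "Refunds are issued within 14 days of purchase.",
    "Our return policy lets you return items within 30 days.",
    "Shipping takes 3 to 5 business days.",
]
query = "Can I return items?"
expanded = expand_query(query, ["return policy", "refunds"])

# Without expansion the refunds document shares no terms with the query.
print(overlap_score(query, corpus[0]), overlap_score(expanded, corpus[0]) > 0)

best = retrieve_then_rerank(query, expanded, corpus,
                            overlap_score, overlap_score, fetch_k=2, top_k=1)
print(best)
```

In a real system `first_stage` would be the vector search and `reranker` an LLM or cross-encoder call; the shape of the pipeline stays the same.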
The Production Lessons
Monitor embedding staleness
Track: "How old are the embeddings for this document?"
Alert: if embeddings are more than 7 days old
Regenerate weekly
Set up A/B testing from day one
Test precision vs recall
Test reranking on/off
Test different embedding models
You'll change your strategy
Plan for failure modes
Vector DB down? Fall back to keyword search
Embedding service down? Serve cached embeddings
LLM down? Return raw documents
Have a graceful degradation strategy
User feedback is gold
Track: "Was this retrieval helpful?"
Build dataset from "No" answers
Fine-tune on that data every month
Your system gets better over time
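The failure-mode fallbacks in the list above can be expressed as a small degradation chain. This is a sketch, assuming the three components are passed in as plain callables; the stub functions and the `mode` labels are hypothetical, and real code would catch narrower exception types than `Exception`.

```python
def answer_with_fallbacks(query, vector_search, keyword_search, generate):
    """Run the pipeline, degrading one step at a time when a stage fails."""
    try:
        docs, mode = vector_search(query), "vector"
    except Exception:
        # Vector DB down: fall back to keyword search.
        docs, mode = keyword_search(query), "keyword"
    try:
        return {"answer": generate(query, docs), "documents": docs, "mode": mode}
    except Exception:
        # LLM down: return the raw documents instead of a generated answer.
        return {"answer": None, "documents": docs, "mode": mode + "+raw_documents"}


# Hypothetical stubs simulating an outage of both the vector DB and the LLM.
def broken_vector_search(query):
    raise ConnectionError("vector DB unavailable")


def keyword_search(query):
    return ["Return policy: 30 days."]


def broken_llm(query, docs):
    raise TimeoutError("LLM unavailable")


result = answer_with_fallbacks(
    "Can I return items?", broken_vector_search, keyword_search, broken_llm
)
print(result["mode"], result["documents"])
```

Recording the `mode` with each response also gives you a cheap metric for how often the system is running degraded.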
The Numbers
Success metric: User satisfaction with retrieved documents
First deployment: 58% satisfied
After embedding tuning: 71% satisfied
After reranking: 82% satisfied
After user feedback loop (3 months): 87% satisfied
Took 6 months to get good. Still improving.
Reality Check
RAG is not "load documents, embed them, done."
It's: monitoring, updating, A/B testing, reranking, feedback loops, and constant iteration.
The systems that work treat RAG as a living, evolving product. Not a static pipeline.
Do that, and you'll build RAG that actually serves users well.
