RAG Systems That Actually Work: Real Metrics From Production

Full-Stack AI Engineer based in Turku, Finland. I helped scale Quran.com to 50M+ daily users and have shipped 40+ applications across web and mobile. I write about production RAG pipelines, LLM integrations, multi-agent systems, and building AI-powered products that work at scale. My stack includes LangChain, Next.js, TypeScript, Python, and vector databases. Open to EU & remote opportunities. Portfolio: zunain.com

RAG in Production: The Gap Between Theory and Reality

Most RAG tutorials are lies. Not intentionally. But they show you a demo that works in a Jupyter notebook, and reality is very different.

I've built 7 production RAG systems. Here's what actually matters:

The Metrics That Count

1. Precision vs Recall Trade-off

Precision: % of retrieved documents that are relevant

Recall: % of relevant documents you actually retrieve

Theory says you want both high. Reality: you trade one against the other.

In my lending underwriting system:

  • High precision setup: 91% precision, 42% recall

    • Users trust results

    • Miss important edge cases

  • High recall setup: 64% precision, 88% recall

    • Catch edge cases

    • Users distrust noisy results

We went with 76% precision / 72% recall. Balanced. Imperfect. Works.
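To make the two definitions concrete, here is a minimal sketch of how precision and recall can be computed for a single query. The function name and the document IDs are made up for illustration; in practice the `relevant` set comes from a labeled evaluation set.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: share of what we returned that was relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of what was relevant that we actually returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the corpus.
retrieved = ["d1", "d2", "d3", "d9"]
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

Retrieving more documents per query usually pushes recall up and precision down, which is the trade-off described above.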

2. Latency Under Load

Your 200ms response time in testing becomes 2000ms in production with 100 concurrent users.

Latency compounds across the whole pipeline, not just the vector DB:

  • Query embedding: 50ms

  • Vector search: 120ms

  • Retrieval: 30ms

  • LLM reranking: 80ms

  • LLM generation: 800ms

Total: 1080ms for a single request, before any queueing under load. Your 200ms target never covered the whole pipeline.
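The arithmetic is worth automating so the budget is checked against every stage, not just the vector search. A minimal sketch using the stage numbers listed above; the dictionary keys and the `latency_report` helper are illustrative names, not part of any library.

```python
# Per-stage latencies in milliseconds, taken from the list above.
stage_ms = {
    "query_embedding": 50,
    "vector_search": 120,
    "retrieval": 30,
    "llm_reranking": 80,
    "llm_generation": 800,
}


def latency_report(stages: dict[str, int], target_ms: int) -> dict:
    """Sum sequential stage latencies and compare them with a target."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return {
        "total_ms": total,
        "over_budget_ms": max(0, total - target_ms),
        "slowest_stage": slowest,
        "slowest_share": stages[slowest] / total,
    }


report = latency_report(stage_ms, target_ms=200)
print(report["total_ms"], report["over_budget_ms"], report["slowest_stage"])
# 1080 880 llm_generation
```

With these numbers, generation alone is roughly three quarters of the total, so that is where optimization effort pays off first.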

3. The Staleness Problem (Nobody Talks About This)

Your RAG system uses documents from yesterday. A client updates their policy today. Your system gives outdated information.

I've seen companies lose users because of this.

Solution: Update vectors when source documents change. But this gets complicated with 500K+ documents.

We built an async vector update system that updates changed documents within 5 minutes. Cost: $2,000/month in infrastructure. Saves maybe $50K/month in support burden.
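One common way to find which documents need re-embedding is to compare content hashes. The sketch below is a simplified, synchronous illustration of that idea, not the async system described above; `embed_and_upsert` is a hypothetical callback standing in for whatever embeds a document and writes it to your vector DB, and deleted documents are not handled.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or changed since they were last indexed."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]


def refresh(
    documents: dict[str, str],
    indexed_hashes: dict[str, str],
    embed_and_upsert: Callable[[str, str], None],  # hypothetical: embeds and stores one document
) -> list[str]:
    """Re-embed only the stale documents and record their new hashes."""
    stale = find_stale(documents, indexed_hashes)
    for doc_id in stale:
        embed_and_upsert(doc_id, documents[doc_id])
        indexed_hashes[doc_id] = content_hash(documents[doc_id])
    return stale


# Made-up example: two indexed policies, then a client edits one of them.
documents = {
    "policy-1": "Returns accepted within 30 days.",
    "policy-2": "Free shipping on large orders.",
}
indexed_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
documents["policy-1"] = "Returns accepted within 14 days."

reembedded: list[str] = []
stale = refresh(documents, indexed_hashes, lambda doc_id, text: reembedded.append(doc_id))
print(stale)  # ['policy-1']
```

Because only changed documents are re-embedded, the work per refresh scales with the number of edits rather than with the size of the corpus.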

The Real Cost Breakdown

Building RAG:

  • Embedding model (fine-tuning): $5-20K

  • Vector DB (initial): $0-5K

  • Integration/testing: 80-120 hours

  • Total: $15-30K

Running RAG (monthly):

  • Embeddings API: $500-2K

  • Vector DB: $500-5K

  • LLM calls: $2K-10K

  • Monitoring/updates: $1K-3K

  • Total: $4K-20K/month

Most companies underestimate operational costs by 5x.
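The monthly total is just the sum of the low and high ends of each line item. A small sketch using the ranges listed above (the dictionary keys are illustrative) makes it easy to re-run the estimate with your own numbers:

```python
# (low, high) monthly cost in USD for each line item listed above.
monthly_costs_usd = {
    "embeddings_api": (500, 2_000),
    "vector_db": (500, 5_000),
    "llm_calls": (2_000, 10_000),
    "monitoring_updates": (1_000, 3_000),
}

low = sum(lo for lo, _ in monthly_costs_usd.values())
high = sum(hi for _, hi in monthly_costs_usd.values())

print(f"monthly: ${low:,} to ${high:,}")            # monthly: $4,000 to $20,000
print(f"yearly:  ${low * 12:,} to ${high * 12:,}")  # yearly:  $48,000 to $240,000
```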

What Surprised Me

  1. Reranking > Raw retrieval quality

    • Retrieving 50 documents and reranking them with an LLM beats trying to retrieve the 5 best documents directly

    • Reranking costs $0.001-0.01 per query

    • Massive ROI

  2. Query expansion matters more than I thought

    • User asks: "Can I return items?"

    • Expanded: "Can I return items? What's the return policy? Return procedures? Refunds?"

    • Gets 40% better retrieval

    • Costs: 20ms per query

  3. Small custom embedding beats large generic embedding

    • 768-dim domain-specific > 1536-dim OpenAI generic

    • 90% of the quality at 30% of the cost
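The "retrieve wide, then rerank" pattern from the first point can be sketched as follows. This is a minimal illustration with stand-ins: `toy_retrieve` takes the place of a vector search and `toy_score` takes the place of an LLM or cross-encoder reranker. Both are assumptions made for the example, as is the tiny corpus.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    candidates: int = 50,
    keep: int = 5,
) -> list[str]:
    """Fetch a wide candidate set cheaply, then keep the best few by a stronger score."""
    docs = retrieve(query, candidates)
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:keep]


# Toy stand-ins so the sketch runs without external services.
corpus = [
    "Shipping takes 3 to 5 days",
    "Return policy: items can be returned within 30 days",
    "Refunds are issued after a return is received",
    "Gift cards never expire",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    # Stand-in for a vector search: returns the first k documents.
    return corpus[:k]


def toy_score(query: str, doc: str) -> float:
    # Stand-in for a reranker: counts words shared by query and document.
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().replace(":", " ").split())
    return float(len(query_terms & doc_terms))


top = retrieve_then_rerank("return items refunds", toy_retrieve, toy_score, keep=2)
print(top)
```

The first stage only needs to be good enough to get the right documents into the candidate set; the second stage decides the final order.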

The Production Lessons

  1. Monitor embedding staleness

    • Track: "How old are the embeddings for this document?"

    • Alert: If embeddings > 7 days old

    • Regenerate weekly

  2. Set up A/B testing from day one

    • Test precision vs recall

    • Test reranking on/off

    • Test different embedding models

    • You'll change your strategy

  3. Plan for failure modes

    • Vector DB down? Fall back to keyword search

    • Embedding service down? Cache embeddings

    • LLM down? Return raw documents

    • Have a graceful degradation strategy

  4. User feedback is gold

    • Track: "Was this retrieval helpful?"

    • Build dataset from "No" answers

    • Fine-tune on that data every month

    • Your system gets better over time
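The failure modes in the third lesson can be handled with a simple fallback chain. Below is a minimal sketch of the "vector DB down, fall back to keyword search" case; `search_with_fallback`, `keyword_search`, and the sample corpus are hypothetical names for illustration, and a real system would catch narrower exception types and log the failure.

```python
from typing import Callable


def keyword_search(query: str, corpus: list[str]) -> list[str]:
    """Naive keyword fallback: keep documents containing any query term."""
    terms = query.lower().split()
    return [doc for doc in corpus if any(term in doc.lower() for term in terms)]


def search_with_fallback(
    query: str,
    primary: Callable[[str], list[str]],
    fallback: Callable[[str], list[str]],
) -> tuple[list[str], str]:
    """Try the primary (vector) search; degrade to the fallback if it fails."""
    try:
        return primary(query), "vector"
    except Exception:  # broad on purpose for the sketch; narrow this in real code
        return fallback(query), "keyword"


corpus = ["Return policy: 30 days", "Shipping takes 3 to 5 days"]


def broken_vector_search(query: str) -> list[str]:
    # Simulates the vector DB being unreachable.
    raise ConnectionError("vector DB unavailable")


docs, source = search_with_fallback(
    "return policy",
    broken_vector_search,
    lambda q: keyword_search(q, corpus),
)
print(source, docs)  # keyword ['Return policy: 30 days']
```

Returning the source alongside the results lets the caller tell users, or your monitoring, that they are seeing degraded results.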

The Numbers

Success metric: User satisfaction with retrieved documents

  • First deployment: 58% satisfied

  • After embedding tuning: 71% satisfied

  • After reranking: 82% satisfied

  • After user feedback loop (3 months): 87% satisfied

Took 6 months to get good. Still improving.

Reality Check

RAG is not "load documents, embed them, done."

It's: monitoring, updating, A/B testing, reranking, feedback loops, and constant iteration.

The systems that work treat RAG as a living, evolving product. Not a static pipeline.

Do that, and you'll build RAG that actually serves users well.


Muhammad Zulqarnain | Full Stack AI Engineer & Geospatial Developer
