Building a Production RAG Pipeline: Lessons from Real-World AI Apps
If you've built more than a toy RAG prototype, you already know that the hard part isn't connecting an LLM to a vector database. The hard part is everything that comes after: degraded retrieval quality on edge cases, latency spikes at scale, context windows filled with irrelevant chunks, and evaluation that tells you nothing useful.
This post covers the production patterns we actually use — chunking strategies, FAISS setup, cross-encoder reranking, and offline evaluation metrics that give you real signal.
The Core Problem with Naive RAG
Most tutorials show you this pipeline:
- Split document into chunks
- Embed chunks
- Store in vector DB
- At query time: embed query → find nearest chunks → stuff into LLM prompt
It works in demos. It fails in production. Here's why:
Fixed-size chunking splits sentences mid-thought. A chunk ending with "The model performs well when" and the next chunk starting "the temperature is above 0.7" means neither chunk retrieves correctly for a query about model behavior.
No score threshold means garbage in, garbage out. If you always retrieve the top-k regardless of score, queries with no good match in the corpus still get k chunks of noise, and genuinely relevant chunks get diluted by near-misses. The LLM will either hallucinate from the noise or say "I don't know" even when the answer exists in your corpus.
Bi-encoder similarity is approximate. Fast for retrieval, but the dot product between independently encoded embeddings is a coarse approximation of relevance. A cross-encoder that jointly processes the query and document is far more accurate.
Chunking: Get This Right First
Four strategies worth knowing:
Fixed-size chunking — Simple. Chunk every N tokens with M token overlap. Fast, predictable. Bad for prose, acceptable for structured data.
Sentence-boundary chunking — Split on sentence endings, accumulate until you hit the size limit, then start a new chunk with overlap. This preserves semantic units. It's what we use for most text at Quran.com.
Recursive chunking — Try paragraph splits first. If still too large, try sentence splits. If still too large, try word splits. LangChain's RecursiveCharacterTextSplitter is the standard implementation.
Semantic chunking — Embed each sentence, detect where cosine similarity drops (meaning shift), split there. Highest quality, ~5x slower. Worth it for high-stakes retrieval.
Key parameter: overlap matters more than chunk size. We found 10–15% overlap (e.g., 50 tokens of overlap on 400-token chunks) gives the best recall without wasting context budget. Too little and you lose context at boundaries. Too much and you fill the LLM prompt with duplicate content.
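The sentence-boundary strategy with overlap can be sketched in a few lines. This is a toy version: the `sentence_chunks` name is mine, and it approximates tokens with whitespace-split words rather than a real tokenizer.

```python
import re

def sentence_chunks(text, max_tokens=400, overlap_tokens=50):
    """Accumulate whole sentences into chunks of at most ~max_tokens words,
    carrying the last overlap_tokens words of each chunk into the next."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], []
    for sent in sentences:
        words = sent.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(' '.join(current))
            current = current[-overlap_tokens:]  # overlap carried forward
        current.extend(words)
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Note that a sentence is never split across chunks: a new chunk starts with the overlap words plus the sentence that would have overflowed.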
FAISS Setup for Production
```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
dim = model.get_sentence_embedding_dimension()  # 384 for this model

# For small-to-medium corpora (<500k docs): exact inner-product search
index = faiss.IndexFlatIP(dim)

# For large corpora (>500k docs): approximate HNSW graph search
# index = faiss.IndexHNSWFlat(dim, 32)  # 32 = neighbors per graph node
```
Always normalize your embeddings before adding them to an IP index; with unit-length vectors, inner product is equivalent to cosine similarity. With sentence-transformers, passing normalize_embeddings=True to encode() handles this for you.
```python
def embed_and_add(texts, index, model, batch_size=64):
    """Embed texts in batches and add the normalized vectors to the index."""
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = model.encode(batch, normalize_embeddings=True)
        index.add(embeddings.astype('float32'))
```
Score threshold is your most important hyperparameter. Set it empirically on a validation set, not by gut.
```python
def retrieve(query, index, chunks, model, top_k=10, threshold=0.5):
    """Return (chunk, score) pairs whose similarity clears the threshold."""
    query_vec = model.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = index.search(query_vec, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx != -1 and score >= threshold:  # FAISS pads with -1 if short
            results.append((chunks[idx], float(score)))
    return results
```
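One way to set that threshold empirically: sweep candidate values over a validation set and keep the one with the best mean F1. A sketch, assuming each validation query comes with pre-scored hits and a set of relevant source ids; `tune_threshold` is a hypothetical helper, not part of any library.

```python
def tune_threshold(scored_runs, candidate_thresholds):
    """scored_runs: list of (hits, relevant) per validation query, where
    hits is a list of (source_id, score) and relevant is a set of source
    ids. Returns (best_threshold, best_mean_f1)."""
    best_t, best_f1 = None, -1.0
    for t in candidate_thresholds:
        f1s = []
        for hits, relevant in scored_runs:
            kept = [src for src, score in hits if score >= t]
            tp = sum(1 for src in kept if src in relevant)
            precision = tp / len(kept) if kept else 0.0
            recall = tp / len(relevant) if relevant else 1.0
            f1s.append(2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
        mean_f1 = sum(f1s) / len(f1s)
        if mean_f1 > best_f1:  # keep the threshold with the best mean F1
            best_t, best_f1 = t, mean_f1
    return best_t, best_f1
```

F1 is one reasonable objective; if your application punishes hallucination harder than missed answers, weight precision more heavily.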
Cross-Encoder Reranking
The pattern: retrieve top-20 with a fast bi-encoder, then rerank to top-5 with a cross-encoder.
Cross-encoders jointly process the query and document through the full attention mechanism. They're 10–100x slower than bi-encoders but far more accurate at judging relevance.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, top_k=5):
    """Rescore (chunk, bi-encoder score) candidates with the cross-encoder."""
    pairs = [(query, chunk.text) for chunk, _ in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip([c for c, _ in candidates], scores),
                    key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```
At Quran.com, adding a cross-encoder reranker improved our semantic search precision from ~72% to ~89% on our eval set.
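Gluing the two stages together is mostly plumbing. A sketch with injected stage functions; `retrieve_then_rerank` is a hypothetical wrapper, not Quran.com's code.

```python
def retrieve_then_rerank(query, retrieve_fn, rerank_fn, fetch_k=20, final_k=5):
    """Two-stage retrieval: over-fetch a candidate pool with the cheap
    bi-encoder stage, then let the slower cross-encoder pick the final set.
    fetch_k trades recall against rerank latency."""
    candidates = retrieve_fn(query, top_k=fetch_k)
    if not candidates:
        return []  # nothing cleared the threshold; caller should fall back
    return rerank_fn(query, candidates, top_k=final_k)
```

The fetch_k/final_k split is the knob to watch: too small a pool and the reranker never sees the right chunk; too large and cross-encoder latency dominates the request.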
Evaluation: Context Precision and Recall
Don't evaluate RAG by running queries through the LLM and judging the answers. Evaluate the retrieval layer directly.
Context Precision — Of the chunks I retrieved, what fraction are actually relevant?
Context Recall — Of all the relevant chunks in my corpus, what fraction did I retrieve?
```python
def context_precision(retrieved_chunks, relevant_sources):
    """Fraction of retrieved chunks that come from a relevant source."""
    if not retrieved_chunks:
        return 0.0
    relevant = sum(1 for c in retrieved_chunks if c.source in relevant_sources)
    return relevant / len(retrieved_chunks)

def context_recall(retrieved_chunks, relevant_sources, all_chunks):
    """Fraction of the corpus's relevant chunks that were retrieved."""
    total_relevant = sum(1 for c in all_chunks if c.source in relevant_sources)
    if total_relevant == 0:
        return 1.0  # nothing relevant to find; vacuously perfect
    retrieved_relevant = sum(1 for c in retrieved_chunks
                             if c.source in relevant_sources)
    return retrieved_relevant / total_relevant
```
Build a test set of (query, relevant_source_ids) pairs. Run your pipeline. Track precision and recall over time.
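A minimal harness for that loop might look like the following. For brevity the retriever here returns source ids directly rather than chunk objects, and recall is computed per relevant source (a simplification of the corpus-count version, fine when each source contributes one chunk); `evaluate_retrieval` is an illustrative name.

```python
def evaluate_retrieval(test_set, retrieve_fn):
    """test_set: list of (query, relevant_source_ids) pairs.
    retrieve_fn(query) -> list of retrieved source ids.
    Returns mean context precision and recall over the test set."""
    precisions, recalls = [], []
    for query, relevant in test_set:
        retrieved = retrieve_fn(query)
        hits = sum(1 for src in retrieved if src in relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant) if relevant else 1.0)
    n = len(test_set)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}
```

Run this on every change to chunking, embedding model, or threshold, and log the numbers; regressions show up here long before users complain.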
Production Patterns That Actually Matter
Cache at the right layer. Cache (query_embedding → chunk_ids), not (query_string → chunk_ids). Semantically similar queries hit the same cache entry. We use Redis with a 1-hour TTL and cut embedding inference load by ~60%.
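Because embeddings are continuous, "cache by query embedding" in practice means a nearest-neighbor lookup over cached query vectors with a similarity cutoff, not an exact key match. A toy in-memory sketch; a real deployment would back this with Redis and a TTL as described above, and the class name and threshold are illustrative.

```python
import numpy as np

class EmbeddingCache:
    """Semantic cache: a query hits if its (unit-normalized) embedding is
    within sim_threshold cosine similarity of a previously cached query."""
    def __init__(self, sim_threshold=0.95):
        self.sim_threshold = sim_threshold
        self.vectors = []   # normalized query embeddings
        self.results = []   # cached chunk-id lists, parallel to vectors

    def get(self, query_vec):
        if not self.vectors:
            return None
        sims = np.array(self.vectors) @ query_vec  # cosine, since normalized
        best = int(np.argmax(sims))
        return self.results[best] if sims[best] >= self.sim_threshold else None

    def put(self, query_vec, chunk_ids):
        self.vectors.append(query_vec)
        self.results.append(chunk_ids)
```

The cutoff matters: set it too low and paraphrases that deserve different chunks share a stale entry; too high and the cache never hits.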
Groundedness monitoring. Log every LLM response along with the context chunks. Periodically sample and check whether the answer is grounded in the context or hallucinated.
Fallback on low confidence. If no chunk clears your score threshold, don't pass empty context to the LLM. Return an explicit "I don't have enough information to answer this" response. Users trust a system that knows what it doesn't know.
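The fallback pattern is a single branch in front of generation. A sketch with injected retrieval and generation functions; all names here are illustrative.

```python
NO_ANSWER = "I don't have enough information to answer this."

def answer(query, retrieve_fn, generate_fn):
    """Short-circuit to an honest refusal when retrieval comes back empty,
    instead of letting the LLM improvise on no context."""
    chunks = retrieve_fn(query)  # assumed already threshold-filtered
    if not chunks:
        return NO_ANSWER
    return generate_fn(query, chunks)
```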
Semi-supervised eval improvement. Every time a user clicks on a search result, that's a weak positive label. Accumulate these signals to periodically retune your threshold and swap embedding models.
Conclusion
The gap between a RAG demo and production RAG is mostly about the decisions around the retrieval layer — chunking strategy, score thresholds, reranking, and systematic evaluation. Get these right and the LLM part largely takes care of itself.
The patterns here are what we've battle-tested at Quran.com serving 50M+ monthly users. They're not the only way, but they work.
If you want to see the full implementation as runnable code, check out the companion Kaggle notebook: rag_pipeline_from_scratch.
