
Building Production-Ready RAG: A Complete Architecture Guide



Most RAG tutorials show you how to stuff documents into a vector database and query them with an LLM. That gets you a demo. It does not get you a production system.

I've built RAG pipelines that serve real users. The gap between "it works in a notebook" and "it works at 2 AM when the on-call gets paged" is massive. This guide covers what actually matters when you ship RAG to production.

Why Most RAG Demos Fail

The standard RAG tutorial looks like this: load PDFs, chunk them, embed them, store in Pinecone, query with OpenAI. Five lines of LangChain. Done.

Then you deploy it and discover:

The chunking strategy matters more than the model. Bad chunks produce bad retrieval. Bad retrieval produces hallucinated answers. Users lose trust fast.

Latency kills user experience. A 6-second response time is fine in a demo. Real users expect sub-2-second responses. You need caching, streaming, and careful orchestration.

Cost scales faster than you expect. Every query hits an embedding model and an LLM. At 10K queries per day, your monthly bill will surprise you.

Evaluation is almost impossible without infrastructure. "Does this answer look right?" is not a test strategy.

Let's fix each of these.

The Ingestion Pipeline

Ingestion is where most RAG systems quietly break. Bad data in, bad answers out.

Document Processing

Before you chunk anything, clean your documents. Strip headers, footers, navigation elements, and boilerplate. Extract tables separately because they chunk terribly as plain text.

interface ProcessedDocument {
  id: string;
  content: string;
  metadata: {
    source: string;
    title: string;
    section: string;
    lastUpdated: Date;
    documentType: 'article' | 'api-doc' | 'tutorial' | 'faq';
  };
}

async function processDocument(raw: RawDocument): Promise<ProcessedDocument[]> {
  // Split by semantic sections, not arbitrary character counts
  const sections = extractSections(raw.html);

  return sections.map((section, index) => ({
    id: `${raw.id}-section-${index}`,
    content: cleanMarkdown(section.content),
    metadata: {
      source: raw.url,
      title: raw.title,
      section: section.heading || `Section ${index + 1}`,
      lastUpdated: raw.updatedAt,
      documentType: classifyDocument(raw),
    },
  }));
}
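
extractSections, cleanMarkdown, and classifyDocument are left as helpers here. As one example, cleanMarkdown can be a few cheap passes over the text; a minimal sketch (the boilerplate patterns are assumptions to be tuned against your own sources):

// Minimal cleanup pass: drop obvious boilerplate lines and collapse whitespace.
// The patterns below are illustrative; adjust them to your document sources.
function cleanMarkdown(content: string): string {
  const boilerplate = [
    /^skip to (main )?content$/i,
    /^table of contents$/i,
    /^copyright ©/i,
  ];

  return content
    .split('\n')
    .filter(line => !boilerplate.some(re => re.test(line.trim())))
    .join('\n')
    .replace(/[ \t]+$/gm, '')   // strip trailing whitespace per line
    .replace(/\n{3,}/g, '\n\n') // collapse runs of blank lines
    .trim();
}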

Chunking Strategy

Forget fixed-size chunking. It splits text mid-sentence and produces chunks that lack context.

Use semantic chunking: split on paragraph boundaries, heading boundaries, and logical breaks. Each chunk should be a self-contained thought that makes sense in isolation.

function semanticChunk(
  document: ProcessedDocument,
  options: { maxTokens: number; overlap: number }
): Chunk[] {
  const paragraphs = document.content.split(/\n\n+/);
  const chunks: Chunk[] = [];
  let currentChunk: string[] = [];
  let currentTokens = 0;

  for (const paragraph of paragraphs) {
    const paragraphTokens = estimateTokens(paragraph);

    if (currentTokens + paragraphTokens > options.maxTokens && currentChunk.length > 0) {
      chunks.push({
        content: currentChunk.join('\n\n'),
        tokenCount: currentTokens,
        metadata: document.metadata,
      });

      // Keep the trailing options.overlap paragraphs for context continuity
      const overlapParagraphs = currentChunk.slice(-options.overlap);
      currentChunk = [...overlapParagraphs, paragraph];
      currentTokens = estimateTokens(overlapParagraphs.join('\n\n')) + paragraphTokens;
    } else {
      currentChunk.push(paragraph);
      currentTokens += paragraphTokens;
    }
  }

  if (currentChunk.length > 0) {
    chunks.push({
      content: currentChunk.join('\n\n'),
      tokenCount: currentTokens,
      metadata: document.metadata,
    });
  }

  return chunks;
}

Target chunk size: 200-500 tokens. Smaller chunks give more precise retrieval. Larger chunks give more context. I've found 300-400 tokens to be the sweet spot for most documentation use cases.
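
estimateTokens in the chunker above doesn't need to be exact; a rough characters-per-token heuristic is fine for sizing, and you can swap in a real tokenizer when you need precise budgets. A minimal sketch:

// Rough token estimate: ~4 characters per token for English text.
// Good enough for chunk sizing; use a real tokenizer for exact budgets.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}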

Embedding Models

For production, I use OpenAI's text-embedding-3-small for most cases. It's cheap ($0.02 per 1M tokens), fast, and the quality is good enough for 90% of use cases.

If you need better retrieval quality and can afford the latency, text-embedding-3-large with dimension reduction to 1024 gives excellent results at a reasonable cost.

async function embedChunks(chunks: Chunk[]): Promise<EmbeddedChunk[]> {
  // Batch embeddings for efficiency (max 2048 per request)
  const batchSize = 2048;
  const results: EmbeddedChunk[] = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map(c => c.content),
    });

    for (let j = 0; j < batch.length; j++) {
      results.push({
        ...batch[j],
        embedding: response.data[j].embedding,
      });
    }
  }

  return results;
}
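
If you do use text-embedding-3-large, the dimension reduction mentioned above is one extra parameter on the same call; a minimal sketch:

// Variant of the batched call above: larger model, reduced to 1024 dimensions.
// The dimensions parameter shortens the returned vectors server-side.
async function embedBatchLarge(batch: Chunk[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: batch.map(c => c.content),
    dimensions: 1024,
  });

  return response.data.map(d => d.embedding);
}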

The Retrieval Layer

Vector similarity search alone is not enough. You need hybrid search and re-ranking.

Hybrid Search with pgvector

If you're already running PostgreSQL, use pgvector. One less service to manage. Combine vector similarity with full-text search for hybrid retrieval.
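
A minimal setup sketch, assuming a chunks table with a stored tsvector column, an HNSW index for the vector side, and a GIN index for the full-text side (the column names match the query below but are otherwise illustrative):

// One-time setup (run as a migration): pgvector extension, a generated
// tsvector column for full-text search, and indexes for both search paths.
await db.query(`
  CREATE EXTENSION IF NOT EXISTS vector;

  CREATE TABLE IF NOT EXISTS chunks (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB NOT NULL,
    embedding vector(1536),
    search_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
  );

  CREATE INDEX IF NOT EXISTS chunks_embedding_idx ON chunks USING hnsw (embedding vector_cosine_ops);
  CREATE INDEX IF NOT EXISTS chunks_search_idx ON chunks USING gin (search_vector);
`);

With that in place, the hybrid query combines both scores: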

async function hybridSearch(
  query: string,
  queryEmbedding: number[],
  options: { limit: number; minScore: number }
): Promise<SearchResult[]> {
  const results = await db.query(`
    WITH vector_results AS (
      SELECT id, content, metadata,
             1 - (embedding <=> $1::vector) AS vector_score
      FROM chunks
      WHERE 1 - (embedding <=> $1::vector) > $3
      ORDER BY embedding <=> $1::vector
      LIMIT $2 * 2
    ),
    text_results AS (
      SELECT id, content, metadata,
             ts_rank(search_vector, plainto_tsquery('english', $4)) AS text_score
      FROM chunks
      WHERE search_vector @@ plainto_tsquery('english', $4)
      ORDER BY text_score DESC
      LIMIT $2 * 2
    )
    SELECT COALESCE(v.id, t.id) AS id,
           COALESCE(v.content, t.content) AS content,
           COALESCE(v.metadata, t.metadata) AS metadata,
           COALESCE(v.vector_score, 0) * 0.7 + COALESCE(t.text_score, 0) * 0.3 AS combined_score
    FROM vector_results v
    FULL OUTER JOIN text_results t ON v.id = t.id
    ORDER BY combined_score DESC
    LIMIT $2
  `, [queryEmbedding, options.limit, options.minScore, query]);

  return results.rows;
}

The 0.7/0.3 weighting between vector and text scores works well as a starting point. Tune it based on your evaluation results.

Re-ranking

After retrieval, re-rank the results. A cross-encoder model gives much better relevance scores than cosine similarity alone. Cohere's re-rank API is the easiest option.

async function rerankResults(
  query: string,
  results: SearchResult[],
  topK: number = 5
): Promise<SearchResult[]> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: results.map(r => r.content),
    topN: topK,
  });

  return response.results.map(r => ({
    ...results[r.index],
    relevanceScore: r.relevanceScore,
  }));
}

This step alone improved our answer quality by about 15% in our evaluations.

The Generation Layer

Prompt Design

Your system prompt should be specific about the task, the constraints, and the expected format. Vague prompts produce vague answers.

function buildPrompt(query: string, context: SearchResult[]): string {
  const contextBlock = context
    .map((c, i) => `[Source ${i + 1}: ${c.metadata.title}]\n${c.content}`)
    .join('\n\n---\n\n');

  return `You are a technical documentation assistant. Answer the user's question using ONLY the provided context. If the context doesn't contain enough information to answer fully, say so clearly.

Rules:
- Cite sources using [Source N] format
- If you're unsure, say "I'm not confident about this" rather than guessing
- Keep answers concise and technical
- Include code examples when relevant

Context:
${contextBlock}

Question: ${query}`;
}

Streaming Responses

Users should see the answer forming in real time. Stream tokens as they arrive.

async function* generateAnswer(
  query: string,
  context: SearchResult[]
): AsyncGenerator<string> {
  const prompt = buildPrompt(query, context);

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
    temperature: 0.1, // Low temperature for factual answers
    max_tokens: 1000,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}
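
How you expose that stream depends on your framework. With a Next.js route handler, piping the generator into a ReadableStream is enough; a minimal sketch (the route path, request shape, and retrieval parameters are assumptions):

// app/api/ask/route.ts (hypothetical path): stream tokens to the client as they arrive.
export async function POST(req: Request): Promise<Response> {
  const { query } = await req.json();

  const queryEmbedding = await embed(query);
  const retrieved = await hybridSearch(query, queryEmbedding, { limit: 20, minScore: 0.3 });
  const context = await rerankResults(query, retrieved);

  const encoder = new TextEncoder();
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const token of generateAnswer(query, context)) {
        controller.enqueue(encoder.encode(token));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}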

Hallucination Prevention

Three layers of defense:

  1. Confidence scoring: If no retrieved chunk scores above 0.75 relevance, don't generate an answer. Tell the user you couldn't find relevant information (a minimal gate is sketched after this list).

  2. Source grounding: The prompt explicitly limits answers to the provided context. This doesn't eliminate hallucinations but reduces them significantly.

  3. Citation verification: Post-process the response. If the model cites [Source 3] but Source 3 doesn't contain the claimed information, flag it.
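
The confidence gate from point 1 is just a threshold check before you call the LLM; a minimal sketch, assuming the relevanceScore field produced by the re-ranking step:

// Refuse to generate when nothing retrieved clears the relevance bar; the caller
// can then return an honest "no relevant information found" message instead.
function passesConfidenceGate(results: SearchResult[], threshold = 0.75): boolean {
  return results.some(r => (r.relevanceScore ?? 0) >= threshold);
}

The citation verification from point 3 runs as a post-processing step over the generated answer: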

function validateCitations(
  answer: string,
  sources: SearchResult[]
): ValidationResult {
  const citationPattern = /\[Source (\d+)\]/g;
  const citations = [...answer.matchAll(citationPattern)];
  const issues: string[] = [];

  for (const citation of citations) {
    const sourceIndex = parseInt(citation[1]) - 1;
    if (sourceIndex >= sources.length) {
      issues.push(`Citation [Source ${citation[1]}] references non-existent source`);
    }
  }

  // Check if answer contains claims without any citations
  const sentences = answer.split(/[.!?]+/).filter(s => s.trim().length > 20);
  const uncitedClaims = sentences.filter(
    s => !s.includes('[Source') && looksLikeFactualClaim(s)
  );

  if (uncitedClaims.length > sentences.length * 0.5) {
    issues.push('More than half of factual claims lack citations');
  }

  return {
    valid: issues.length === 0,
    issues,
    citationCount: citations.length,
  };
}

Evaluation

You cannot improve what you cannot measure. Build an evaluation pipeline from day one.

Test Set Construction

Create a test set of 50-100 question/answer pairs. Include edge cases: questions with no answer in the docs, ambiguous questions, questions requiring information from multiple sources.

interface EvalCase {
  query: string;
  expectedAnswer: string;
  relevantDocIds: string[];
  difficulty: 'easy' | 'medium' | 'hard';
  category: 'factual' | 'procedural' | 'conceptual' | 'unanswerable';
}

async function evaluateRAG(testCases: EvalCase[]): Promise<EvalReport> {
  const results = await Promise.all(
    testCases.map(async (tc) => {
      const retrieved = await hybridSearch(tc.query, await embed(tc.query), { limit: 5, minScore: 0.3 });
      const answer = await generateFullAnswer(tc.query, retrieved);

      return {
        query: tc.query,
        retrievalRecall: calculateRecall(retrieved, tc.relevantDocIds),
        answerRelevance: await scoreRelevance(answer, tc.expectedAnswer),
        latencyMs: answer.latencyMs,
        tokenCost: answer.totalTokens,
      };
    })
  );

  return {
    avgRetrievalRecall: avg(results.map(r => r.retrievalRecall)),
    avgAnswerRelevance: avg(results.map(r => r.answerRelevance)),
    avgLatencyMs: avg(results.map(r => r.latencyMs)),
    totalCost: sum(results.map(r => r.tokenCost)),
    results,
  };
}
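
calculateRecall, scoreRelevance, and embed are left as helpers. Recall is the simplest of the three; a minimal sketch, assuming the SearchResult rows carry the chunk IDs produced in processDocument ("{docId}-section-{n}"):

// Fraction of the expected documents that show up in the retrieved set.
// Assumes chunk IDs are prefixed with the source document ID.
function calculateRecall(retrieved: SearchResult[], relevantDocIds: string[]): number {
  if (relevantDocIds.length === 0) return 1; // unanswerable cases: nothing to recall

  const found = relevantDocIds.filter(docId =>
    retrieved.some(r => r.id === docId || r.id.startsWith(`${docId}-section-`))
  );

  return found.length / relevantDocIds.length;
}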

Key Metrics

Track these in production:

  • Retrieval recall@5: What percentage of relevant documents appear in the top 5 results? Target: >85%.
  • Answer relevance: LLM-as-judge scoring against expected answers (sketched below). Target: >4.0/5.0.
  • Latency P95: Time from query to first streamed token. Target: <2 seconds.
  • User feedback rate: Thumbs up/down ratio. Target: >80% positive.
  • Hallucination rate: Flagged by citation validation. Target: <5%.
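
The LLM-as-judge scoring behind the answer-relevance metric can be a single low-temperature call that returns a 1-5 score; a minimal sketch (the prompt wording and scale are assumptions):

// Grade a generated answer against the expected answer on a 1-5 scale.
// Deliberately terse: ask for a bare number so parsing stays trivial.
async function scoreRelevance(answer: string, expectedAnswer: string): Promise<number> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0,
    max_tokens: 5,
    messages: [{
      role: 'user',
      content: `Rate from 1 to 5 how well the candidate answer conveys the same information as the expected answer. Reply with only the number.\n\nExpected answer:\n${expectedAnswer}\n\nCandidate answer:\n${answer}`,
    }],
  });

  const score = parseInt(response.choices[0].message.content?.trim() ?? '1', 10);
  return Number.isNaN(score) ? 1 : score;
}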

Cost Optimization

At scale, RAG costs add up fast. Three strategies that made the biggest difference for us:

Semantic Caching

Cache answers for semantically similar queries. If someone asks "how do I deploy to production?" and another person asks "what's the deployment process?", they should get the same cached answer.

async function getCachedOrGenerate(query: string): Promise<RAGResponse> {
  const queryEmbedding = await embed(query);

  // Check cache for semantically similar queries
  const cached = await db.query(`
    SELECT response, query
    FROM response_cache
    WHERE 1 - (query_embedding <=> $1::vector) > 0.95
    AND created_at > NOW() - INTERVAL '24 hours'
    ORDER BY query_embedding <=> $1::vector
    LIMIT 1
  `, [queryEmbedding]);

  if (cached.rows.length > 0) {
    return { ...cached.rows[0].response, fromCache: true };
  }

  // Generate fresh response
  const response = await generateRAGResponse(query, queryEmbedding);

  // Cache it
  await db.query(
    `INSERT INTO response_cache (query, query_embedding, response) VALUES ($1, $2, $3)`,
    [query, queryEmbedding, response]
  );

  return response;
}

This alone cut our LLM costs by about 40%.

Model Selection

Use gpt-4o-mini for most queries. It's 15x cheaper than gpt-4o and handles straightforward documentation questions just fine. Route complex queries (multi-hop reasoning, comparison questions) to gpt-4o.
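
The routing itself doesn't need to be clever to capture most of the savings. A keyword-and-length heuristic works as a first pass; a minimal sketch (the signals and threshold are assumptions you can later replace with a small classifier):

// Crude query router: send obviously complex queries to the stronger model.
// The patterns and length threshold are illustrative starting points.
function pickModel(query: string): 'gpt-4o' | 'gpt-4o-mini' {
  const complexSignals = [
    /\bcompare\b/i,
    /\bdifference between\b/i,
    /\b(vs\.?|versus)\b/i,
    /\btrade-?offs?\b/i,
  ];

  const isComplex =
    complexSignals.some(re => re.test(query)) || query.split(/\s+/).length > 40;

  return isComplex ? 'gpt-4o' : 'gpt-4o-mini';
}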

Embedding Cost

Pre-compute and store embeddings. Never re-embed a document that hasn't changed. Use a hash of the content to detect changes during re-indexing.
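
A minimal sketch of that change check, assuming a content_hash column on the chunks table:

import { createHash } from 'node:crypto';

// Hash chunk content so unchanged chunks can be skipped during re-indexing.
function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Keep only chunks whose content hash isn't already stored.
async function filterChangedChunks(chunks: Chunk[]): Promise<Chunk[]> {
  const changed: Chunk[] = [];

  for (const chunk of chunks) {
    const hash = contentHash(chunk.content);
    const existing = await db.query(
      `SELECT 1 FROM chunks WHERE content_hash = $1 LIMIT 1`,
      [hash]
    );
    if (existing.rows.length === 0) changed.push(chunk);
  }

  return changed;
}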

Real-World Performance Numbers

From our production system serving ~5K queries per day:

  • Average retrieval latency: 45ms (pgvector with HNSW index)
  • Average generation latency: 1.2s (gpt-4o-mini, streaming)
  • Time to first token: 380ms
  • Retrieval recall@5: 88%
  • User satisfaction (thumbs up): 84%
  • Monthly cost: ~$280 (embeddings + LLM + infrastructure)
  • Cache hit rate: 38%

These numbers are realistic for a mid-scale documentation search system. Your numbers will vary based on document corpus size, query complexity, and model choice.

Wrapping Up

Production RAG is not about picking the right model. It's about building a system with good data pipelines, smart retrieval, proper evaluation, and cost controls.

Start with the simplest version that works. Measure everything. Improve the pieces that matter most based on your metrics.

The code examples in this article are simplified but functional. Adapt them to your stack and your data.

If you're building RAG right now, I'd love to hear what challenges you're hitting. Drop a comment or reach out.