RAG in Production: Architecture That Actually Scales

RAG (Retrieval-Augmented Generation) breaks in predictable ways at production scale. Here's the architecture that works based on running RAG in multiple live products.

#rag #ai #pgvector #systems

RAG (Retrieval-Augmented Generation) is trivial to demo and hard to scale. The gap between a working prototype and a production system that handles thousands of queries reliably is where most RAG projects fail. Here's what production RAG actually looks like, based on running it in Context-Heavy and BikroyBuddy.

Why naive RAG breaks

The typical RAG tutorial:

# Tutorial RAG
query_embedding = embed(user_query)
chunks = vector_db.similarity_search(query_embedding, k=5)
context = "\n".join(chunks)
response = llm.complete(f"Context: {context}\n\nQuery: {user_query}")

This breaks at scale because:

  1. Retrieval quality degrades as the knowledge base grows (more candidates = more noise)
  2. Context window overflow — 5 chunks × 500 tokens = 2,500 tokens, plus system prompt, plus query, plus response = expensive and slow
  3. Latency stacking — embed query + vector search + LLM completion = 3 serial calls
  4. No freshness — embeddings are stale the moment documents update

Production RAG architecture

User query
  ↓
Query preprocessor
  ├── Expand abbreviations, fix typos
  ├── Decompose compound questions → sub-queries
  └── Extract structured filters (date range, entity type)
  ↓
Hybrid retrieval
  ├── Semantic search (pgvector cosine similarity)
  └── Keyword search (PostgreSQL full-text, tsvector)
  ↓
Reranker (cross-encoder)
  ↓
Context compiler (budget-aware)
  ↓
LLM generation (with citations)
  ↓
Response

Each stage is independently testable and optimizable. This is the key difference from naive RAG — you can measure and improve retrieval quality separately from generation quality.
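As one way to measure retrieval separately from generation, the sketch below computes recall@k and mean reciprocal rank over a small labeled set. The function names and chunk IDs are hypothetical, not taken from Context-Heavy or BikroyBuddy. It assumes that for each test query you have the chunk IDs the retriever returned (in rank order) and the IDs a human judged relevant.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average recall@k and reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Hypothetical labeled set: (IDs returned by the retriever, IDs judged relevant).
labeled_runs = [
    (["c1", "c7", "c3"], {"c3", "c9"}),
    (["c2"], {"c2"}),
]
report = evaluate(labeled_runs, k=3)
```

Running the same labeled set before and after a change to retrieval or reranking shows whether that stage improved, without involving the LLM at all.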

Hybrid retrieval: semantic + keyword

Pure semantic search misses exact matches. Pure keyword search misses semantic variants. Production RAG needs both:

import asyncio

async def hybrid_retrieve(query: str, tenant_id: str, k: int = 20) -> list[Chunk]:
    # Embed once up front. embed() is synchronous here, as in the tutorial snippet;
    # if it makes a network call, consider an async client or asyncio.to_thread.
    query_embedding = embed(query)

    # Run the semantic and keyword searches concurrently.
    semantic_results, keyword_results = await asyncio.gather(
        semantic_search(query_embedding, tenant_id, k=k),
        keyword_search(query, tenant_id, k=k),
    )

    # Merge the two ranked lists with reciprocal rank fusion (RRF).
    return rrf_merge(semantic_results, keyword_results, k=k)

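RRF (Reciprocal Rank Fusion) merges ranked lists using ranks rather than raw scores: each chunk's fused score is Σ 1/(rank + 60), summed over every list the chunk appears in, where rank starts at 1 and 60 is the conventional smoothing constant. Because cosine similarity and ts_rank live on different scales, ignoring the raw scores avoids normalization entirely. It consistently outperforms score-based merging without requiring tuning of weights.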
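The formula can be sketched in a few lines. This is a minimal version of what an `rrf_merge` helper might look like, not the implementation used in production: the `Chunk` dataclass is a hypothetical stand-in, the `rrf_constant` parameter name is an assumption, and chunks are assumed to be identified by an `id` field.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical minimal chunk type; a real one likely carries more fields.
    id: str
    text: str
    source: str = ""


def rrf_merge(*ranked_lists: list[Chunk], k: int = 20, rrf_constant: int = 60) -> list[Chunk]:
    """Fuse ranked lists with Reciprocal Rank Fusion and return the top k chunks."""
    scores: dict[str, float] = defaultdict(float)
    by_id: dict[str, Chunk] = {}
    for ranked in ranked_lists:
        # Ranks are 1-based, so the top result contributes 1 / (1 + rrf_constant).
        for rank, chunk in enumerate(ranked, start=1):
            scores[chunk.id] += 1.0 / (rank + rrf_constant)
            by_id.setdefault(chunk.id, chunk)
    ordered_ids = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return [by_id[chunk_id] for chunk_id in ordered_ids[:k]]


a, b, c, d = (Chunk(id=name, text=f"text {name}") for name in "abcd")
semantic_results = [a, b, c]
keyword_results = [b, d]
merged = rrf_merge(semantic_results, keyword_results, k=4)
merged_ids = [chunk.id for chunk in merged]
```

In this toy example `b` ranks first because it appears in both lists (1/62 + 1/61), ahead of `a`, which tops only the semantic list (1/61).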

Reranking: the quality multiplier

First-pass retrieval with k=20 returns candidates. A cross-encoder reranker then scores each candidate against the actual query; unlike embedding similarity, it reads the query and the chunk text together:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

Reranking is the single highest-ROI improvement to RAG quality. It typically improves retrieval precision by 20–40% over embedding-only retrieval. The cross-encoder is slower than embedding similarity but runs on top-20 candidates, not the full corpus.

Context compilation: budget-aware

Don't fill the context window. Count tokens as chunks are added and stop once the budget is reached:

def compile_context(chunks: list[Chunk], budget: int = 2000) -> str:
    context_parts = []
    used_tokens = 0
    
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk.text)
        if used_tokens + chunk_tokens > budget:
            break
        context_parts.append(f"[Source: {chunk.source}]\n{chunk.text}")
        used_tokens += chunk_tokens
    
    return "\n\n---\n\n".join(context_parts)

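A 2,000-token context budget leaves room for the system prompt (~500), query (~100), and response (~1,000+). The whole request then stays around 3,600 tokens, which keeps cost and latency predictable and stays well under the prompt length at which most models' answer quality starts to degrade.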
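`compile_context` calls `count_tokens` without defining it. One possible stand-in is sketched below. It assumes roughly four characters per token, a common rule of thumb for English text and not an exact count; for precise budgeting, swap in the tokenizer that matches your model. The `chars_per_token` parameter and the `total_prompt_tokens` helper are hypothetical names introduced for illustration.

```python
import math


def count_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token for English)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def total_prompt_tokens(system_prompt: str, query: str, context: str) -> int:
    """Estimated tokens sent to the model, excluding the response."""
    return sum(count_tokens(part) for part in (system_prompt, query, context))


# With the budgets from the text: ~500 (system) + ~100 (query) + 2,000 (context).
estimated = total_prompt_tokens("s" * 2000, "q" * 400, "c" * 8000)
```

Because the estimate is approximate, it is safer to leave some headroom in the budget than to fill it exactly.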

PostgreSQL as vector DB (pgvector)

For most production RAG systems, a dedicated vector database (Pinecone, Weaviate, Qdrant) is unnecessary and adds operational complexity. pgvector in PostgreSQL handles:

  • Vector similarity search (ivfflat or HNSW index)
  • Metadata filtering (standard SQL WHERE clauses)
  • Full-text search (tsvector, to_tsquery)
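  • Transactions (embeddings update atomically with document updates)

-- Hybrid index
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
CREATE INDEX ON chunks USING GIN (to_tsvector('english', content));

-- Hybrid query
-- $1 = query embedding, $2 = tenant id, $3 = raw query text
WITH semantic AS (
    SELECT id, content, source,
           1 - (embedding <=> $1) AS semantic_score
    FROM chunks
    WHERE tenant_id = $2
    ORDER BY embedding <=> $1
    LIMIT 20
),
keyword AS (
    -- Use the same 'english' config as the GIN index so the index is usable.
    SELECT id, content, source,
           ts_rank(to_tsvector('english', content), plainto_tsquery('english', $3)) AS kw_score
    FROM chunks
    WHERE tenant_id = $2
      AND to_tsvector('english', content) @@ plainto_tsquery('english', $3)
    ORDER BY kw_score DESC  -- without this, LIMIT keeps 20 arbitrary matches
    LIMIT 20
)
-- FULL OUTER JOIN keeps chunks found by only one search; a missing score counts as 0.
SELECT COALESCE(s.id, k.id) AS id,
       COALESCE(s.content, k.content) AS content,
       COALESCE(s.source, k.source) AS source,
       COALESCE(s.semantic_score, 0) + COALESCE(k.kw_score, 0) AS combined_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC;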

Performance numbers (Context-Heavy, production)

| Stage | P50 | P99 |
|-------|-----|-----|
| Query embedding | 25ms | 80ms |
| Hybrid retrieval (20 candidates) | 12ms | 35ms |
| Reranking (top-5 from 20) | 45ms | 120ms |
| LLM generation (Claude Haiku) | 380ms | 800ms |
| Total | 462ms | 1035ms |

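Sub-500ms P50 for a full RAG pipeline, including LLM generation, is achievable with pgvector, a local reranker, and Claude Haiku. The Total row is the sum of the per-stage figures; percentiles are not additive, so a P99 measured end to end may differ from the summed 1035ms.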
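Per-stage figures like these can be collected with a small timing helper. The sketch below is one possible approach, not the instrumentation used for the numbers above: `timed`, `stage_latencies_ms`, and `percentile` are hypothetical names, and the percentile uses the simple nearest-rank definition.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: stage name -> list of latencies in milliseconds.
stage_latencies_ms: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latencies_ms[stage].append((time.perf_counter() - start) * 1000.0)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


# Usage: wrap each stage, and wrap the whole request to get a true end-to-end figure.
with timed("total"):
    with timed("hybrid_retrieval"):
        pass  # placeholder for the real retrieval call

p50_retrieval = percentile(stage_latencies_ms["hybrid_retrieval"], 50)
```

Timing a "total" stage around the whole request gives end-to-end percentiles directly instead of relying on a sum of per-stage values.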

FAQ

What is RAG (Retrieval-Augmented Generation)? RAG is a technique that improves LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt context. Instead of relying on the model's training data, RAG allows the model to use current, specific, or private information.

Should I use a dedicated vector database or pgvector? For most applications, pgvector in PostgreSQL is sufficient and simpler to operate. Dedicated vector databases (Pinecone, Weaviate) add value at very large scale (100M+ vectors) or when you need specific features like multi-vector indexing.

What is hybrid retrieval in RAG? Hybrid retrieval combines semantic (vector) search with keyword (full-text) search. It consistently outperforms either approach alone because semantic search catches meaning variants while keyword search catches exact matches.

What is a cross-encoder reranker? A cross-encoder reranker is a model that takes a query and a document together as input and produces a relevance score. It is more accurate than vector similarity because it reads both texts jointly, and it is used to select the best k results from a larger candidate set.


Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: pgvector: Vector Search in PostgreSQL · Building Context-Heavy: Knowledge-Graph API for AI Agents.