RAG in Production: Architecture That Actually Scales
RAG (Retrieval-Augmented Generation) breaks in predictable ways at production scale. Here's the architecture that works based on running RAG in multiple live products.
RAG (Retrieval-Augmented Generation) is trivial to demo and hard to scale. The gap between a working prototype and a production system that handles thousands of queries reliably is where most RAG projects fail. Here's what production RAG actually looks like, based on running it in Context-Heavy and BikroyBuddy.
Why naive RAG breaks
The typical RAG tutorial:
# Tutorial RAG
query_embedding = embed(user_query)
chunks = vector_db.similarity_search(query_embedding, k=5)
context = "\n".join(chunks)
response = llm.complete(f"Context: {context}\n\nQuery: {user_query}")
This breaks at scale because:
- Retrieval quality degrades as the knowledge base grows: more candidates mean more near-miss chunks competing for the top k.
- Context window overflow: 5 chunks × 500 tokens is already 2,500 tokens before the system prompt, query, and response are added, so every call gets slower and more expensive.
- Latency stacking: embedding the query, vector search, and LLM completion are three serial calls.
- No freshness: embeddings go stale the moment a document is updated.
Production RAG architecture
User query
↓
Query preprocessor
├── Expand abbreviations, fix typos
├── Decompose compound questions → sub-queries
└── Extract structured filters (date range, entity type)
↓
Hybrid retrieval
├── Semantic search (pgvector cosine similarity)
└── Keyword search (PostgreSQL full-text, tsvector)
↓
Reranker (cross-encoder)
↓
Context compiler (budget-aware)
↓
LLM generation (with citations)
↓
Response
Each stage is independently testable and optimizable. This is the key difference from naive RAG — you can measure and improve retrieval quality separately from generation quality.
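As an illustration of how small and testable a single stage can be, here is a minimal sketch of the query preprocessor from the diagram. Everything in it is an assumption for illustration: the `ABBREVIATIONS` table, the `PreprocessedQuery` type, the year-only filter, and the naive split on "and" are hypothetical, and typo correction is left out entirely. A production preprocessor would likely use a domain dictionary or an LLM call for decomposition.

```python
import re
from dataclasses import dataclass, field

# Hypothetical abbreviation table; a real one would come from your domain.
ABBREVIATIONS = {"k8s": "kubernetes", "db": "database", "pg": "postgresql"}


@dataclass
class PreprocessedQuery:
    text: str                # query with abbreviations expanded
    sub_queries: list[str]   # one entry per sub-question
    filters: dict[str, str] = field(default_factory=dict)


def preprocess(query: str) -> PreprocessedQuery:
    # Expand known abbreviations word by word (exact, case-insensitive match).
    words = [ABBREVIATIONS.get(w.lower(), w) for w in query.split()]
    text = " ".join(words)

    # Extract one simple structured filter: a year after "in", "from" or "since".
    filters: dict[str, str] = {}
    year = re.search(r"\b(?:in|from|since)\s+((?:19|20)\d{2})\b", text)
    if year:
        filters["year"] = year.group(1)

    # Naive decomposition: split compound questions on "?" or the word "and".
    parts = re.split(r"\?\s*|\s+and\s+", text)
    sub_queries = [p.strip() for p in parts if p.strip()]
    return PreprocessedQuery(text=text, sub_queries=sub_queries, filters=filters)
```

Because the function is pure (string in, dataclass out), it can be unit-tested without a database or an LLM, which is the property the staged design is meant to give you.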
Hybrid retrieval: semantic + keyword
Pure semantic search misses exact matches. Pure keyword search misses semantic variants. Production RAG needs both:
import asyncio

async def hybrid_retrieve(query: str, tenant_id: str, k: int = 20) -> list[Chunk]:
    # Run semantic and keyword search concurrently.
    # Note: embed() is called synchronously here, so the embedding step
    # still completes before either search starts.
    semantic_task = asyncio.create_task(
        semantic_search(embed(query), tenant_id, k=k)
    )
    keyword_task = asyncio.create_task(
        keyword_search(query, tenant_id, k=k)
    )
    semantic_results, keyword_results = await asyncio.gather(
        semantic_task, keyword_task
    )
    # Merge the two ranked lists with reciprocal rank fusion (RRF)
    return rrf_merge(semantic_results, keyword_results, k=k)
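RRF (Reciprocal Rank Fusion) merges ranked lists by summing, for each chunk, a term from every list it appears in: score = Σ 1/(rank + 60), where rank is the chunk's 1-based position in that list and 60 is the conventional smoothing constant. Because it uses only ranks, the incomparable scales of cosine similarity and ts_rank never have to be weighted against each other, which is why it tends to outperform score-based merging without any weight tuning.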
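The `rrf_merge` helper is not shown in the article, so the following is a minimal sketch of what it could look like rather than the actual implementation. It assumes each input list holds chunk IDs (strings) ordered best-first; the parameter name `c` for the constant 60 and the alphabetical tie-break are choices made here for illustration.

```python
from collections import defaultdict


def rrf_merge(*ranked_lists: list[str], k: int = 20, c: int = 60) -> list[str]:
    """Merge best-first lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks are 1-based: the top result contributes 1 / (c + 1).
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (c + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    merged = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return merged[:k]


semantic = ["a", "b", "c"]
keyword = ["c", "a", "d"]
fused = rrf_merge(semantic, keyword)  # ["a", "c", "b", "d"]
```

Chunks `a` and `c` appear in both lists, so they collect two terms each and rise above `b` and `d`, which appear in only one.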
Reranking: the quality multiplier
First-pass retrieval with k=20 returns candidates. A cross-encoder reranker then scores each candidate against the query by reading both texts together, rather than comparing precomputed embeddings:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[Chunk], top_k: int = 5) -> list[Chunk]:
    # Score every (query, chunk) pair jointly, then keep the top_k highest
    pairs = [(query, chunk.text) for chunk in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
Reranking is the single highest-ROI improvement to RAG quality. It typically improves retrieval precision by 20–40% over embedding-only retrieval. The cross-encoder is slower than embedding similarity but runs on top-20 candidates, not the full corpus.
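A gain like this is worth verifying on your own data. One common way is to compute precision@k on a small labeled set before and after reranking. The sketch below assumes you have, for each test query, a set of chunk IDs judged relevant; the IDs and labels shown are made up for illustration and are not results from the article.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    return hits / k


# Hypothetical labels for a single test query.
relevant = {"c1", "c2", "c3", "c4"}
first_pass_top5 = ["c1", "c7", "c3", "c9", "c4"]  # top 5 by fused retrieval rank
reranked_top5 = ["c3", "c1", "c4", "c2", "c7"]    # top 5 after the cross-encoder

before = precision_at_k(first_pass_top5, relevant)
after = precision_at_k(reranked_top5, relevant)
```

Averaging `before` and `after` over a few dozen labeled queries gives a retrieval-only metric that does not depend on the LLM at all.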
Context compilation: budget-aware
Don't fill the context window. Calculate token counts and stop:
def compile_context(chunks: list[Chunk], budget: int = 2000) -> str:
    # chunks are expected in relevance order (best first), e.g. reranker output
    context_parts = []
    used_tokens = 0
    for chunk in chunks:
        part = f"[Source: {chunk.source}]\n{chunk.text}"
        # Count the source header too, so the budget reflects what is sent
        part_tokens = count_tokens(part)
        if used_tokens + part_tokens > budget:
            break
        context_parts.append(part)
        used_tokens += part_tokens
    return "\n\n---\n\n".join(context_parts)
2,000 tokens for context leaves room for system prompt (~500), query (~100), and response (~1,000+) without hitting most models' effective reasoning ceiling.
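The `count_tokens` helper used above is not defined in the article. A minimal stand-in, assuming the rough rule of thumb of about four characters per token for English text, is sketched below along with the budget arithmetic from the paragraph above. Both function bodies and the `remaining_context_budget` name are illustrative assumptions; for billing-accurate counts you would use the tokenizer that matches your model.

```python
import math


def count_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not exact)."""
    return math.ceil(len(text) / 4)


def remaining_context_budget(
    window: int,
    system_prompt: str,
    query: str,
    reserved_for_response: int = 1000,
) -> int:
    """Tokens left for retrieved context after the fixed parts of the prompt."""
    used = count_tokens(system_prompt) + count_tokens(query) + reserved_for_response
    return max(window - used, 0)


# A ~500-token system prompt and ~100-token query in a hypothetical 8,000-token window.
ceiling = remaining_context_budget(8000, "s" * 2000, "q" * 400)
context_budget = min(2000, ceiling)
```

The point of taking `min(2000, ceiling)` is that the 2,000-token budget is a deliberate cap, not the amount of space that happens to be left over.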
PostgreSQL as vector DB (pgvector)
For most production RAG systems, a dedicated vector database (Pinecone, Weaviate, Qdrant) is unnecessary and adds operational complexity. pgvector in PostgreSQL handles:
- Vector similarity search (ivfflat or HNSW index)
- Metadata filtering (standard SQL WHERE clauses)
- Full-text search (tsvector, to_tsquery)
- Transactions (embeddings update atomically with document updates)
-- Hybrid index
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
CREATE INDEX ON chunks USING GIN (to_tsvector('english', content));
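-- Hybrid query: $1 = query embedding, $2 = tenant id, $3 = query text
WITH semantic AS (
  SELECT id, content, source,
         1 - (embedding <=> $1) AS semantic_score
  FROM chunks
  WHERE tenant_id = $2
  ORDER BY embedding <=> $1
  LIMIT 20
),
keyword AS (
  SELECT id, content, source,
         ts_rank(to_tsvector('english', content),
                 plainto_tsquery('english', $3)) AS kw_score
  FROM chunks
  WHERE tenant_id = $2
    AND to_tsvector('english', content) @@ plainto_tsquery('english', $3)
  ORDER BY kw_score DESC
  LIMIT 20
)
-- FULL OUTER JOIN keeps chunks found by either search and yields one row per id;
-- a chunk missing from one side contributes 0 for that score.
SELECT id,
       COALESCE(s.content, k.content) AS content,
       COALESCE(s.source, k.source) AS source,
       COALESCE(s.semantic_score, 0) + COALESCE(k.kw_score, 0) AS score
FROM semantic s
FULL OUTER JOIN keyword k USING (id)
ORDER BY score DESC;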
Performance numbers (Context-Heavy, production)
| Stage | P50 | P99 |
|-------|-----|-----|
| Query embedding | 25ms | 80ms |
| Hybrid retrieval (20 candidates) | 12ms | 35ms |
| Reranking (top-5 from 20) | 45ms | 120ms |
| LLM generation (Claude Haiku) | 380ms | 800ms |
| Total | 462ms | 1035ms |
Sub-500ms P50 for a full RAG pipeline including LLM generation is achievable with pgvector + local reranker + Haiku.
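Numbers like these come from timing each stage separately. The sketch below shows one way to collect per-stage latencies and read off percentiles using only the standard library. The `timed` context manager, the `timings_ms` store, and the stage names are assumptions for illustration, not the instrumentation used in Context-Heavy.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

# Latency samples in milliseconds, keyed by stage name.
timings_ms: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage].append((time.perf_counter() - start) * 1000)


def percentile(samples: list[float], p: int) -> float:
    """Return the p-th percentile (1..99); needs at least two samples."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return cuts[p - 1]


# Usage: wrap each stage, then report P50/P99 per stage.
with timed("hybrid_retrieval"):
    time.sleep(0.001)  # stand-in for the real retrieval call
```

Timing the whole request as its own stage is also worthwhile, since per-stage percentiles do not in general add up to the end-to-end percentile.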
FAQ
What is RAG (Retrieval-Augmented Generation)? RAG is a technique that improves LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt context. Instead of relying on the model's training data, RAG allows the model to use current, specific, or private information.
Should I use a dedicated vector database or pgvector? For most applications, pgvector in PostgreSQL is sufficient and simpler to operate. Dedicated vector databases (Pinecone, Weaviate) add value at very large scale (100M+ vectors) or when you need specific features like multi-vector indexing.
What is hybrid retrieval in RAG? Hybrid retrieval combines semantic (vector) search with keyword (full-text) search. It consistently outperforms either approach alone because semantic search catches meaning variants while keyword search catches exact matches.
What is a cross-encoder reranker? A cross-encoder reranker is a model that takes a query and document pair as input and produces a relevance score. It is more accurate than vector similarity because it reads both texts together, and it is used to select the best k results from a larger candidate set.
Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: pgvector: Vector Search in PostgreSQL · Building Context-Heavy: Knowledge-Graph API for AI Agents.