Building RAG Systems: Lessons from Production
What I learned building production RAG systems — chunking strategies, embedding choices, and retrieval optimization.
February 18, 2026 · 5 min read
#AI #RAG #LLM #production #engineering
Retrieval-Augmented Generation (RAG) looks simple in tutorials. Reality? It's nuanced. Here's what I learned building production RAG systems.
What is RAG?
Quick refresher:
User Query
↓
Embed Query → Search Vector DB → Retrieve Relevant Chunks
↓
Query + Chunks → LLM → Response
Instead of relying solely on the LLM's training data, we augment it with our own documents.
The Naive Approach (Don't Do This)
# Tutorial-grade RAG
def naive_rag(query, documents):
    # Chunk everything at 1000 chars
    chunks = [doc[i:i+1000] for doc in documents for i in range(0, len(doc), 1000)]

    # Embed and store
    embeddings = embed(chunks)
    index.add(embeddings)

    # Retrieve top 5
    results = index.search(embed(query), k=5)

    # Send to LLM
    return llm(f"Context: {results}\n\nQuestion: {query}")
This works for demos. In production, everything breaks.
Lesson 1: Chunking is Everything
The Problem
- Too small: Lose context
- Too large: Dilute relevance
- Fixed size: Break sentences mid-thought
What Works
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                            # Smaller than you think
    chunk_overlap=50,                          # Overlap prevents context loss
    separators=["\n\n", "\n", ". ", " ", ""]   # Respect structure
)
Pro Tips
- Respect document structure — Headers, paragraphs matter
- Semantic chunking — Split on topic changes, not char count
- Chunk metadata — Store source, page, and section with each chunk (sketch below)
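Here's what attaching metadata can look like with the splitter above — a minimal sketch using LangChain's create_documents; the file name, page, and section values are placeholders:

# Attach provenance to every chunk so answers can cite their source
docs = splitter.create_documents(
    texts=[report_text],  # raw text of one document (placeholder)
    metadatas=[{"source": "q3_report.pdf", "page": 12, "section": "Revenue"}],
)

# Each chunk is a Document carrying the metadata of its parent text
for doc in docs:
    print(doc.metadata["source"], doc.page_content[:80])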
Lesson 2: Embedding Model Matters
Common Choices
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| OpenAI ada-002 | 1536 | Fast | Good |
| Cohere embed-v3 | 1024 | Fast | Great |
| BGE-large | 1024 | Medium | Excellent |
| E5-mistral | 4096 | Slow | Best |
My Recommendation
Start with Cohere embed-v3 or BGE-large. OpenAI's embedding is overpriced for what you get.
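If you go the open-weights route, BGE-large runs locally through sentence-transformers. A minimal sketch (the model name is the public BAAI checkpoint; chunks is the list from the chunking step):

from sentence_transformers import SentenceTransformer

# BGE-large served locally; no per-token API cost
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Normalize so cosine similarity is a plain dot product
doc_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(["How did revenue change in Q3?"], normalize_embeddings=True)

# Similarity of the query against every chunk
scores = query_vec @ doc_vecs.T

One caveat: the BGE model card recommends prepending a short retrieval instruction to queries; check the card for the exact string before you benchmark.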
The Embedding Mismatch Problem
# ❌ Bad: Query and documents embedded differently
query_embedding = model_a.embed(query)
doc_embeddings = model_b.embed(documents) # Different model!
# ✅ Good: Same model for both
embedding_model = CohereEmbeddings()
query_embedding = embedding_model.embed_query(query)
doc_embeddings = embedding_model.embed_documents(documents)
Lesson 3: Hybrid Search Wins
Vector search alone isn't enough. Combine with keyword search.
from rank_bm25 import BM25Okapi
class HybridRetriever:
    def __init__(self, chunks, embeddings):
        self.vector_index = FAISS.from_embeddings(embeddings)
        self.bm25 = BM25Okapi([chunk.split() for chunk in chunks])

    def search(self, query, k=5, alpha=0.5):
        # Vector search
        vector_results = self.vector_index.search(query, k=k*2)

        # BM25 keyword search
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_results = sorted(range(len(bm25_scores)),
                              key=lambda i: bm25_scores[i],
                              reverse=True)[:k*2]

        # Combine scores (RRF or weighted)
        return self.reciprocal_rank_fusion(vector_results, bm25_results, k)
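The reciprocal_rank_fusion call is the only piece not shown above. It's short; a minimal sketch, assuming both result lists arrive ordered best-first and identified by chunk index:

def reciprocal_rank_fusion(self, vector_results, bm25_results, k, c=60):
    # Each list contributes 1 / (c + rank) for every chunk it ranks
    scores = {}
    for results in (vector_results, bm25_results):
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)

    # Highest fused score wins
    return sorted(scores, key=scores.get, reverse=True)[:k]

The constant c=60 is the value from the original RRF paper and rarely needs tuning.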
Lesson 4: Reranking is Worth It
Retrieval gets you candidates. Reranking picks the best.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, k=5):
    # Get more candidates than needed
    candidates = retriever.search(query, k=20)

    # Rerank with cross-encoder
    pairs = [[query, c.text] for c in candidates]
    scores = reranker.predict(pairs)

    # Return top k after reranking
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in reranked[:k]]
Lesson 5: Prompt Engineering for RAG
The prompt template matters as much as retrieval.
RAG_PROMPT = """You are a helpful assistant answering questions based on the provided context.

CONTEXT:
{context}

RULES:
1. Only answer based on the context provided
2. If the context doesn't contain the answer, say "I don't have information about that"
3. Cite your sources using [Source: X] format
4. Be concise but complete

QUESTION: {question}

ANSWER:"""
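The {context} slot is where citation quality is won or lost: each chunk needs a label the model can echo back as [Source: X]. A minimal sketch, assuming each retrieved chunk exposes a text field plus the source metadata stored at chunking time:

def build_prompt(question, chunks):
    # Label every chunk so the model can cite it as [Source: X]
    context = "\n\n".join(
        f"[Source: {chunk.metadata['source']}]\n{chunk.text}"
        for chunk in chunks
    )
    return RAG_PROMPT.format(context=context, question=question)

# Usage: retrieve, rerank, then ask
question = "How did revenue change in Q3?"
answer = llm(build_prompt(question, retrieve_and_rerank(question, k=5)))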
Lesson 6: Evaluation is Hard but Necessary
Metrics to Track
# Retrieval quality
def hit_rate(queries, ground_truth, k=5):
    """Did we retrieve the correct document?"""
    hits = 0
    for query, expected_doc in zip(queries, ground_truth):
        results = retriever.search(query, k=k)
        if expected_doc in [r.id for r in results]:
            hits += 1
    return hits / len(queries)

# Answer quality (use LLM as judge)
def answer_relevance(question, answer, context):
    """Is the answer relevant and grounded?"""
    prompt = f"""Rate the answer on a scale of 1-5:
Question: {question}
Context: {context}
Answer: {answer}
Score:"""
    return llm(prompt)
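Wiring those two metrics into a tiny pipeline is enough to get started: a handful of hand-labeled (query, expected document) pairs, re-run on every change. A minimal sketch; the example queries and document IDs are placeholders:

# Tiny hand-labeled eval set: (query, id of the document that answers it)
eval_set = [
    ("How do I reset my API key?", "doc_security_012"),
    ("Which regions is the service available in?", "doc_regions_003"),
]
queries = [q for q, _ in eval_set]
ground_truth = [doc_id for _, doc_id in eval_set]

# Retrieval quality
print("hit_rate@5:", hit_rate(queries, ground_truth, k=5))

# Answer quality, judged per query
for query in queries:
    chunks = retrieve_and_rerank(query, k=5)
    context = "\n\n".join(c.text for c in chunks)
    answer = llm(RAG_PROMPT.format(context=context, question=query))
    print(query, "->", answer_relevance(query, answer, context))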
Production Checklist
Pre-launch:
├── [ ] Chunking strategy tested on real documents
├── [ ] Embedding model benchmarked
├── [ ] Hybrid search implemented
├── [ ] Reranking added for quality
├── [ ] Fallback for retrieval failures
├── [ ] Rate limiting on LLM calls
├── [ ] Caching for repeated queries (sketch below)
├── [ ] Monitoring and logging
└── [ ] Evaluation pipeline
Post-launch:
├── [ ] User feedback collection
├── [ ] Query analysis (what are users asking?)
├── [ ] Failure case review
└── [ ] Continuous improvement loop
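On the caching item: even a small in-memory cache keyed on the normalized query pays for itself, because users repeat questions more than you'd expect. A minimal sketch; answer_query stands in for the full embed, retrieve, rerank, LLM path, and the TTL is a placeholder:

import hashlib
import time

_cache = {}        # query hash -> (timestamp, answer)
CACHE_TTL = 3600   # seconds; tune to how often your documents change

def cached_answer(query):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]                 # repeated query: skip retrieval and the LLM call
    answer = answer_query(query)      # full RAG pipeline (placeholder)
    _cache[key] = (time.time(), answer)
    return answer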
Architecture for Scale
                     ┌──────────────┐
User Query ─────────▶│   Gateway    │
                     └──────┬───────┘
                            │
               ┌────────────┼────────────┐
               ▼            ▼            ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐
         │ Embedder │ │ Retriever│ │ Reranker │
         └──────────┘ └──────────┘ └──────────┘
               │            │            │
               └────────────┼────────────┘
                            ▼
                     ┌──────────────┐
                     │     LLM      │
                     └──────────────┘
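In code, that diagram is a thin orchestration layer on top of the pieces from earlier sections; each box scales independently behind the gateway. A minimal sketch of the request path, with rate limiting, caching, and logging left out:

def handle_query(query, k=5):
    # Embedder + Retriever + Reranker (hybrid search, then cross-encoder)
    chunks = retrieve_and_rerank(query, k=k)

    # Fallback when retrieval comes back empty
    if not chunks:
        return "I don't have information about that."

    # Last hop: the LLM answers from the retrieved context
    context = "\n\n".join(c.text for c in chunks)
    return llm(RAG_PROMPT.format(context=context, question=query))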
Key Takeaways
- Chunking determines ceiling — Bad chunks = bad retrieval
- Hybrid beats vector-only — Keywords still matter
- Reranking is a cheap win — 10% more latency, 30% better quality
- Measure everything — You can't improve what you don't measure
- Start simple, iterate — Don't over-engineer day 1
Building a RAG system? Let's chat on Twitter!