Building RAG Systems: Lessons from Production
What I learned building production RAG systems — chunking strategies, embedding choices, and retrieval optimization.
February 18, 2026 · 5 min read
#AI #RAG #LLM #production #engineering
Retrieval-Augmented Generation (RAG) looks simple in tutorials. Reality? It's nuanced. Here's what I learned building production RAG systems.
What is RAG?
Quick refresher:
User Query
↓
Embed Query → Search Vector DB → Retrieve Relevant Chunks
↓
Query + Chunks → LLM → Response
Instead of relying solely on the LLM's training data, we augment it with our own documents.
The Naive Approach (Don't Do This)
# Tutorial-grade RAG
def naive_rag(query, documents):
    # Chunk everything at 1000 chars
    chunks = [doc[i:i+1000] for doc in documents for i in range(0, len(doc), 1000)]

    # Embed and store
    embeddings = embed(chunks)
    index.add(embeddings)

    # Retrieve top 5
    results = index.search(embed(query), k=5)

    # Send to LLM
    return llm(f"Context: {results}\n\nQuestion: {query}")
This works for demos. In production, everything breaks.
Lesson 1: Chunking is Everything
The Problem
- Too small: Lose context
- Too large: Dilute relevance
- Fixed size: Break sentences mid-thought
What Works
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                            # Smaller than you think
    chunk_overlap=50,                          # Overlap prevents context loss
    separators=["\n\n", "\n", ". ", " ", ""]   # Respect structure
)
Pro Tips
- Respect document structure — Headers, paragraphs matter
- Semantic chunking — Split on topic changes, not char count
- Chunk metadata — Store source, page, and section with each chunk (sketch below)
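Here's what attaching metadata can look like with the splitter above — a minimal sketch using LangChain's create_documents; the file name, page, and section values are placeholders:

# Attach provenance to every chunk so answers can cite their source
docs = splitter.create_documents(
    texts=[report_text],  # raw text of one document (placeholder)
    metadatas=[{"source": "q3_report.pdf", "page": 12, "section": "Revenue"}],
)

# Each chunk is a Document carrying the metadata of its parent text
for doc in docs:
    print(doc.metadata["source"], doc.page_content[:80])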
Lesson 2: Embedding Model Matters
Common Choices
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| OpenAI ada-002 | 1536 | Fast | Good |
| Cohere embed-v3 | 1024 | Fast | Great |
| BGE-large | 1024 | Medium | Excellent |
| E5-mistral | 4096 | Slow | Best |
My Recommendation
Start with Cohere embed-v3 or BGE-large. OpenAI's embedding is overpriced for what you get.
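If you go the open-weights route, BGE-large runs locally through sentence-transformers. A minimal sketch (the model name is the public BAAI checkpoint; chunks is the list from the chunking step):

from sentence_transformers import SentenceTransformer

# BGE-large served locally; no per-token API cost
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Normalize so cosine similarity is a plain dot product
doc_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(["How did revenue change in Q3?"], normalize_embeddings=True)

# Similarity of the query against every chunk
scores = query_vec @ doc_vecs.T

One caveat: the BGE model card recommends prepending a short retrieval instruction to queries; check the card for the exact string before you benchmark.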
The Embedding Mismatch Problem
# ❌ Bad: Query and documents embedded differently
query_embedding = model_a.embed(query)
doc_embeddings = model_b.embed(documents) # Different model!
# ✅ Good: Same model for both
embedding_model = CohereEmbeddings()
query_embedding = embedding_model.embed_query(query)
doc_embeddings = embedding_model.embed_documents(documents)
Lesson 3: Hybrid Search Wins
Vector search alone isn't enough. Combine with keyword search.
from rank_bm25 import BM25Okapi
class HybridRetriever:
    def __init__(self, chunks, embeddings):
        self.vector_index = FAISS.from_embeddings(embeddings)
        self.bm25 = BM25Okapi([chunk.split() for chunk in chunks])

    def search(self, query, k=5, alpha=0.5):
        # Vector search
        vector_results = self.vector_index.search(query, k=k*2)

        # BM25 keyword search
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_results = sorted(range(len(bm25_scores)),
                              key=lambda i: bm25_scores[i],
                              reverse=True)[:k*2]

        # Combine scores (RRF or weighted)
        return self.reciprocal_rank_fusion(vector_results, bm25_results, k)
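The reciprocal_rank_fusion call is the only piece not shown above. It's short; a minimal sketch, assuming both result lists arrive ordered best-first and identified by chunk index:

def reciprocal_rank_fusion(self, vector_results, bm25_results, k, c=60):
    # Each list contributes 1 / (c + rank) for every chunk it ranks
    scores = {}
    for results in (vector_results, bm25_results):
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)

    # Highest fused score wins
    return sorted(scores, key=scores.get, reverse=True)[:k]

The constant c=60 is the value from the original RRF paper and rarely needs tuning.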
Lesson 4: Reranking is Worth It
Retrieval gets you candidates. Reranking picks the best.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, k=5):
    # Get more candidates than needed
    candidates = retriever.search(query, k=20)

    # Rerank with cross-encoder
    pairs = [[query, c.text] for c in candidates]
    scores = reranker.predict(pairs)

    # Return top k after reranking
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in reranked[:k]]
Lesson 5: Prompt Engineering for RAG
The prompt template matters as much as retrieval.
RAG_PROMPT = """You are a helpful assistant answering questions based on the provided context.

CONTEXT:
{context}

RULES:
1. Only answer based on the context provided
2. If the context doesn't contain the answer, say "I don't have information about that"
3. Cite your sources using [Source: X] format
4. Be concise but complete

QUESTION: {question}

ANSWER:"""
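The {context} slot is where citation quality is won or lost: each chunk needs a label the model can echo back as [Source: X]. A minimal sketch, assuming each retrieved chunk exposes a text field plus the source metadata stored at chunking time:

def build_prompt(question, chunks):
    # Label every chunk so the model can cite it as [Source: X]
    context = "\n\n".join(
        f"[Source: {chunk.metadata['source']}]\n{chunk.text}"
        for chunk in chunks
    )
    return RAG_PROMPT.format(context=context, question=question)

# Usage: retrieve, rerank, then ask
question = "How did revenue change in Q3?"
answer = llm(build_prompt(question, retrieve_and_rerank(question, k=5)))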
Lesson 6: Evaluation is Hard but Necessary
Metrics to Track
# Retrieval quality
def hit_rate(queries, ground_truth, k=5):
    """Did we retrieve the correct document?"""
    hits = 0
    for query, expected_doc in zip(queries, ground_truth):
        results = retriever.search(query, k=k)
        if expected_doc in [r.id for r in results]:
            hits += 1
    return hits / len(queries)

# Answer quality (use LLM as judge)
def answer_relevance(question, answer, context):
    """Is the answer relevant and grounded?"""
    prompt = f"""Rate the answer on a scale of 1-5:
Question: {question}
Context: {context}
Answer: {answer}
Score:"""
    return llm(prompt)
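Wiring those two metrics into a tiny pipeline is enough to get started: a handful of hand-labeled (query, expected document) pairs, re-run on every change. A minimal sketch; the example queries and document IDs are placeholders:

# Tiny hand-labeled eval set: (query, id of the document that answers it)
eval_set = [
    ("How do I reset my API key?", "doc_security_012"),
    ("Which regions is the service available in?", "doc_regions_003"),
]
queries = [q for q, _ in eval_set]
ground_truth = [doc_id for _, doc_id in eval_set]

# Retrieval quality
print("hit_rate@5:", hit_rate(queries, ground_truth, k=5))

# Answer quality, judged per query
for query in queries:
    chunks = retrieve_and_rerank(query, k=5)
    context = "\n\n".join(c.text for c in chunks)
    answer = llm(RAG_PROMPT.format(context=context, question=query))
    print(query, "->", answer_relevance(query, answer, context))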
Production Checklist
Pre-launch:
├── [ ] Chunking strategy tested on real documents
├── [ ] Embedding model benchmarked
├── [ ] Hybrid search implemented
├── [ ] Reranking added for quality
├── [ ] Fallback for retrieval failures
├── [ ] Rate limiting on LLM calls
├── [ ] Caching for repeated queries (sketch below)
├── [ ] Monitoring and logging
└── [ ] Evaluation pipeline
Post-launch:
├── [ ] User feedback collection
├── [ ] Query analysis (what are users asking?)
├── [ ] Failure case review
└── [ ] Continuous improvement loop
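On the caching item: even a small in-memory cache keyed on the normalized query pays for itself, because users repeat questions more than you'd expect. A minimal sketch; answer_query stands in for the full embed, retrieve, rerank, LLM path, and the TTL is a placeholder:

import hashlib
import time

_cache = {}        # query hash -> (timestamp, answer)
CACHE_TTL = 3600   # seconds; tune to how often your documents change

def cached_answer(query):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]                 # repeated query: skip retrieval and the LLM call
    answer = answer_query(query)      # full RAG pipeline (placeholder)
    _cache[key] = (time.time(), answer)
    return answer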
Architecture for Scale
                     ┌──────────────┐
User Query ─────────▶│   Gateway    │
                     └──────┬───────┘
                            │
               ┌────────────┼────────────┐
               ▼            ▼            ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐
         │ Embedder │ │ Retriever│ │ Reranker │
         └──────────┘ └──────────┘ └──────────┘
               │            │            │
               └────────────┼────────────┘
                            ▼
                     ┌──────────────┐
                     │     LLM      │
                     └──────────────┘
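In code, that diagram is a thin orchestration layer on top of the pieces from earlier sections; each box scales independently behind the gateway. A minimal sketch of the request path, with rate limiting, caching, and logging left out:

def handle_query(query, k=5):
    # Embedder + Retriever + Reranker (hybrid search, then cross-encoder)
    chunks = retrieve_and_rerank(query, k=k)

    # Fallback when retrieval comes back empty
    if not chunks:
        return "I don't have information about that."

    # Last hop: the LLM answers from the retrieved context
    context = "\n\n".join(c.text for c in chunks)
    return llm(RAG_PROMPT.format(context=context, question=query))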
Key Takeaways
- Chunking determines ceiling — Bad chunks = bad retrieval
- Hybrid beats vector-only — Keywords still matter
- Reranking is a cheap win — 10% more latency, 30% better quality
- Measure everything — You can't improve what you don't measure
- Start simple, iterate — Don't over-engineer day 1
Building a RAG system? Let's chat on Twitter!