Retrieval Augmented Generation (RAG) Explained
RAG became the standard architecture for production LLM applications in 2023. Here’s why and how to implement it properly.
The Problem
LLMs have limitations:
- Knowledge cutoff (training data ends at some date)
- No access to private data (your documents, databases)
- Hallucinations when asked about unknown topics
User: What's in our Q3 2023 financial report?
LLM: I don't have access to your internal documents.
The Solution: RAG
Retrieval-Augmented Generation adds a retrieval step:
User Query
│
▼
┌─────────────┐
│ Retrieval │ ──► Fetch relevant documents
└─────────────┘
│
▼ Documents
┌─────────────┐
│ LLM │ ──► Generate answer using documents
└─────────────┘
│
▼
Grounded Answer
The LLM answers using your data instead of making things up.
Basic RAG Pipeline
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# 5. Query
answer = qa_chain.run("What is our refund policy?")
```
The Components
Embeddings
Convert text to vectors that capture semantic meaning:
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding
```
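To make "semantic meaning" concrete: similarity between two embeddings is usually measured with cosine similarity, the cosine of the angle between the vectors. A minimal sketch with toy 3-dimensional vectors (real `text-embedding-ada-002` vectors have 1,536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- the values are made up for illustration.
king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.12], [0.1, 0.2, 0.9]
cosine_similarity(king, queen)   # close to 1.0
cosine_similarity(king, banana)  # much lower
```

Texts with related meaning land close together in this space, which is what makes nearest-neighbor search over embeddings work.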
Vector Store
Store and search embeddings efficiently:
| Store | Type | Best For |
|---|---|---|
| Chroma | Open source | Development |
| Pinecone | Managed | Production |
| Weaviate | Open source | Self-hosted production |
| pgvector | PostgreSQL extension | Existing Postgres users |
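Under the hood, every store in this table does the same job: hold (text, vector) pairs and return the entries nearest to a query vector. A brute-force in-memory sketch (illustrative only; production stores use approximate nearest-neighbor indexes such as HNSW to handle millions of vectors):

```python
import math

class TinyVectorStore:
    """Brute-force in-memory vector store: exact cosine search, no index."""

    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self.items.append((text, embedding))

    def search(self, query_embedding: list[float], k: int = 3) -> list[str]:
        def cosine(emb: list[float]) -> float:
            dot = sum(a * b for a, b in zip(query_embedding, emb))
            norms = (math.sqrt(sum(a * a for a in query_embedding))
                     * math.sqrt(sum(b * b for b in emb)))
            return dot / norms

        # Score every stored item, return the k best texts
        ranked = sorted(self.items, key=lambda it: cosine(it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("apple", [1.0, 0.0])
store.add("banana", [0.0, 1.0])
store.add("apricot", [0.9, 0.1])
results = store.search([1.0, 0.0], k=2)
```

Swapping in a real store mainly buys you persistence, metadata filtering, and sublinear search time.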
Chunking
Split documents into appropriate pieces:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,              # Characters per chunk
    chunk_overlap=200,            # Overlap to preserve context
    separators=["\n\n", "\n", " ", ""]  # Split hierarchy
)
```
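Stripped of the separator hierarchy, chunking is just sliding a fixed-size window with overlap. A simplified sketch (character windows only; `RecursiveCharacterTextSplitter` additionally prefers to cut at the listed separators so chunks end at natural boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Each chunk starts `step` characters after the previous one,
    # so consecutive chunks share `chunk_overlap` characters.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
# ["abcd", "cdef", "efgh", "ghij"]
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.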
Retrieval
Find the most relevant chunks:
```python
def retrieve(query: str, k: int = 5) -> list[str]:
    query_embedding = get_embedding(query)
    # Search by raw vector. Note: similarity_search() expects the query
    # *string*; for a precomputed embedding use similarity_search_by_vector().
    results = vectorstore.similarity_search_by_vector(
        query_embedding,
        k=k
    )
    return [doc.page_content for doc in results]
```
Generation
Combine context with the question:
```python
def generate(query: str, context: list[str]) -> str:
    context_block = "\n".join(context)
    prompt = f"""Answer based only on the following context:

{context_block}

Question: {query}

If the answer is not in the context, say "I don't know."
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Advanced Patterns
Hybrid Search
Combine semantic and keyword search:
```python
def hybrid_search(query: str, k: int = 5) -> list[str]:
    # Semantic search (embeddings)
    semantic_results = vector_search(query, k=k * 2)

    # Keyword search (BM25)
    keyword_results = bm25_search(query, k=k * 2)

    # Merge the two ranked lists with reciprocal rank fusion
    combined = reciprocal_rank_fusion([
        semantic_results,
        keyword_results,
    ])
    return combined[:k]
```
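The `reciprocal_rank_fusion` helper above is left undefined. A common formulation, assuming each result list contains hashable document IDs and using the conventional constant k=60 from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by summing 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
# Docs appearing high in both lists ("b", "c") outrank single-list docs.
```

Because only ranks matter, RRF needs no score normalization between the semantic and keyword retrievers, which is why it is a popular fusion choice.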
Reranking
Improve retrieval quality with a second pass:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_with_rerank(query: str, k: int = 5) -> list[str]:
    # Initial retrieval: over-fetch candidates
    candidates = vector_search(query, k=k * 4)

    # Score each (query, document) pair with the cross-encoder
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by rerank score, keep the top k
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in reranked[:k]]
```
Query Expansion
Improve recall with query rewriting:
```python
import json

def expand_query(original_query: str) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this question:

Original: {original_query}

Return as a JSON list of strings."""
    response = llm.generate(prompt)
    alternatives = json.loads(response)
    return [original_query] + alternatives
```
Parent Document Retrieval
Retrieve small chunks, return full context:
```python
# Store both small chunks and full documents
child_chunks = split(documents, chunk_size=500)
parent_docs = documents

# Map each child chunk back to the document it came from
# (assumes split() records the source document id on each chunk)
child_to_parent = {child.id: child.source_doc_id for child in child_chunks}

def retrieve_with_parents(query: str) -> list[str]:
    # Search the small chunks (precise matching)
    matches = vector_search(query, chunk_collection)

    # Return the full parent documents (broad context for generation)
    parent_ids = {child_to_parent[m.id] for m in matches}
    return [get_document(pid) for pid in parent_ids]
```
Evaluation
Retrieval Metrics
```python
def evaluate_retrieval(queries, ground_truth_docs):
    hits = 0
    mrr_sum = 0
    for query, expected in zip(queries, ground_truth_docs):
        retrieved = retrieve(query, k=10)
        if expected in retrieved:
            # Hit rate: expected doc appears anywhere in the top 10
            hits += 1
            # Reciprocal rank: 1 / position of the expected doc
            rank = retrieved.index(expected) + 1
            mrr_sum += 1 / rank
    return {
        "hit_rate": hits / len(queries),
        "mrr": mrr_sum / len(queries)
    }
```
Generation Quality
Use LLM-as-judge:
```python
def evaluate_answer(question, context, answer):
    prompt = f"""Evaluate this answer on a scale of 1-5:

Question: {question}
Context provided: {context}
Answer: {answer}

Criteria:
- Correctness (based on context)
- Completeness
- No hallucination

Return JSON: {{"score": n, "reason": "..."}}"""
    return llm.generate(prompt)
```
Production Considerations
Latency
Typical per-stage latency (rough orders of magnitude; your numbers will vary by model and infrastructure):
- Embedding: ~100 ms
- Vector search: ~50 ms
- Reranking: ~200 ms
- LLM generation: ~1-5 s
Cache aggressively, especially embeddings.
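A sketch of that caching advice for embeddings. This version is in-memory only (a production system would persist the cache, e.g. in Redis); `embed_fn` is assumed to be any text-to-vector function, such as `get_embedding` above:

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function with a cache keyed by a hash of the text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache: dict[str, list[float]] = {}
        self.hits = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: pay for the embedding call once, then reuse it
        vec = self.embed_fn(text)
        self.cache[key] = vec
        return vec
```

Document embeddings only change when documents change, and popular queries repeat, so hit rates are often high enough to remove the embedding step from the latency path entirely.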
Cost
| Component | Cost Driver |
|---|---|
| Embeddings | Per document + per query |
| Vector store | Storage + queries |
| LLM | Tokens (context + response) |
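To make the token line concrete, a rough per-query cost estimate. The rates here are illustrative placeholders, not real prices; check your provider's current pricing:

```python
def estimate_query_cost(
    prompt_tokens: int,
    completion_tokens: int,
    price_in_per_1k: float,   # $/1K prompt tokens (hypothetical rate)
    price_out_per_1k: float,  # $/1K completion tokens (hypothetical rate)
) -> float:
    """Dollar cost of one LLM call: tokens in each direction times its rate."""
    return ((prompt_tokens / 1000) * price_in_per_1k
            + (completion_tokens / 1000) * price_out_per_1k)

# A RAG query stuffing 4,000 context tokens and getting a 500-token answer,
# at placeholder rates of $0.03 in / $0.06 out per 1K tokens:
estimate_query_cost(4000, 500, price_in_per_1k=0.03, price_out_per_1k=0.06)
```

Note that retrieved context dominates the prompt side, so chunk size and `k` directly drive LLM cost, not just quality.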
Chunking Tuning
| Chunk Size (chars) | Trade-off |
|---|---|
| Small (~200) | More precise retrieval, but more fragmented context |
| Medium (~500) | Balanced |
| Large (~1000) | More context per chunk, but may dilute specifics |
Final Thoughts
RAG solves the “LLM doesn’t know my data” problem. The pattern is simple:
- Index your documents
- Retrieve relevant context
- Generate with context
Most production LLM applications use some form of RAG. Master this pattern.
RAG: The bridge between LLMs and your data.