Retrieval Augmented Generation (RAG) Explained


RAG became the standard architecture for production LLM applications in 2023. Here’s why and how to implement it properly.

The Problem

LLMs have two hard limitations: their knowledge is frozen at a training cutoff, and they have never seen your private data:

User: What's in our Q3 2023 financial report?
LLM: I don't have access to your internal documents.

The Solution: RAG

Retrieval-Augmented Generation adds a retrieval step:

User Query
      │
      ▼
┌─────────────┐
│   Retrieval │ ──► Fetch relevant documents
└─────────────┘
      │
      ▼ Documents
┌─────────────┐
│   LLM       │ ──► Generate answer using documents
└─────────────┘
      │
      ▼
Grounded Answer

The LLM answers using your data instead of making things up.

Basic RAG Pipeline

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
llm = ChatOpenAI(model="gpt-4")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# 5. Query
answer = qa_chain.invoke({"query": "What is our refund policy?"})["result"]

The Components

Embeddings

Convert text to vectors that capture semantic meaning:

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding
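
In practice you compare those vectors with cosine similarity: nearby vectors mean similar text. A dependency-free sketch of the measure:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal (unrelated) vectors score 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```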

Vector Store

Store and search embeddings efficiently:

| Store    | Type                 | Best For                |
|----------|----------------------|-------------------------|
| Chroma   | Open source          | Development             |
| Pinecone | Managed              | Production              |
| Weaviate | Open source          | Self-hosted production  |
| pgvector | PostgreSQL extension | Existing Postgres users |
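
Whichever you pick, the core operation is the same: nearest-neighbor search over embeddings. A brute-force in-memory sketch (fine for toy corpora; the stores above add ANN indexes, persistence, and metadata filtering):

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class TinyVectorStore:
    """Exact nearest-neighbor search over (vector, text) pairs."""

    def __init__(self) -> None:
        self.entries: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self.entries.append((vector, text))

    def search(self, query: list[float], k: int = 5) -> list[str]:
        # Rank every stored entry by similarity to the query vector
        ranked = sorted(self.entries, key=lambda e: _cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "refund policy")
store.add([0.0, 1.0], "shipping times")
print(store.search([0.9, 0.1], k=1))  # ['refund policy']
```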

Chunking

Split documents into appropriate pieces:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Characters per chunk
    chunk_overlap=200,     # Overlap to preserve context
    separators=["\n\n", "\n", " ", ""]  # Split hierarchy
)
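
To make chunk_size and chunk_overlap concrete, here is a character-level sliding-window sketch (the real splitter additionally tries each separator in the hierarchy before cutting mid-word):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a chunk_size window forward by (chunk_size - chunk_overlap) each step."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Windows start at 0, 2, 4, 6, 8 — each chunk repeats 2 chars of the previous one
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```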

Retrieval

Find the most relevant chunks:

def retrieve(query: str, k: int = 5) -> list[str]:
    query_embedding = get_embedding(query)
    # similarity_search expects raw text and embeds it itself;
    # for a precomputed vector, use similarity_search_by_vector
    results = vectorstore.similarity_search_by_vector(
        query_embedding,
        k=k
    )
    return [doc.page_content for doc in results]

Generation

Combine context with the question:

def generate(query: str, context: list[str]) -> str:
    prompt = f"""Answer based only on the following context:

{chr(10).join(context)}

Question: {query}

If the answer is not in the context, say "I don't know."
"""
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Advanced Patterns

Hybrid Search

Combine semantic and keyword search:

def hybrid_search(query: str, k: int = 5) -> list[str]:
    # Semantic search
    semantic_results = vector_search(query, k=k*2)
    
    # Keyword search (BM25)
    keyword_results = bm25_search(query, k=k*2)
    
    # Reciprocal rank fusion
    combined = reciprocal_rank_fusion([
        semantic_results,
        keyword_results
    ])
    
    return combined[:k]
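
The reciprocal_rank_fusion helper is assumed above; a standard implementation scores each document 1/(k0 + rank) in every list it appears in and sums, with k0 = 60 as the conventional constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k0: int = 60) -> list[str]:
    """Each doc scores sum of 1 / (k0 + rank) across every ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k0 + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears high in both lists, so it wins overall
print(reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]]))
# ['b', 'c', 'a', 'd']
```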

Reranking

Improve retrieval quality with a second pass:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_with_rerank(query: str, k: int = 5) -> list[str]:
    # Initial retrieval (more candidates)
    candidates = vector_search(query, k=k*4)
    
    # Rerank
    pairs = [(query, doc.content) for doc in candidates]
    scores = reranker.predict(pairs)
    
    # Sort by rerank score
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in reranked[:k]]

Query Expansion

Improve recall with query rewriting:

import json

def expand_query(original_query: str) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this question:
    
Original: {original_query}

Return as a JSON list of strings."""
    
    response = llm.generate(prompt)
    alternatives = json.loads(response)
    
    return [original_query] + alternatives
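
Each variant is then retrieved independently and the results merged; a sketch that deduplicates while preserving rank order (retrieve_fn stands in for the retrieve helper defined earlier):

```python
def multi_query_retrieve(queries: list[str], retrieve_fn, k: int = 5) -> list[str]:
    """Retrieve per query variant, then merge and deduplicate in rank order."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in queries:
        for doc in retrieve_fn(q, k=k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:k]

# Stub retriever standing in for the real vector search
fake_retrieve = lambda q, k: {"q1": ["a", "b"], "q2": ["b", "c"]}[q]
print(multi_query_retrieve(["q1", "q2"], fake_retrieve, k=5))  # ['a', 'b', 'c']
```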

Parent Document Retrieval

Retrieve small chunks, return full context:

# Store both small chunks and full documents
child_chunks = split(documents, chunk_size=500)
parent_docs = documents

# Map children to parents (assumes the splitter stamps each chunk
# with the id of the document it came from)
child_to_parent = {child.id: child.parent_id for child in child_chunks}

def retrieve_with_parents(query: str) -> list[str]:
    # Search small chunks
    matches = vector_search(query, chunk_collection)
    
    # Get parent documents
    parent_ids = set(child_to_parent[m.id] for m in matches)
    return [get_document(pid) for pid in parent_ids]

Evaluation

Retrieval Metrics

def evaluate_retrieval(queries, ground_truth_docs):
    hits = 0
    mrr_sum = 0

    for query, expected in zip(queries, ground_truth_docs):
        retrieved = retrieve(query, k=10)

        if expected in retrieved:
            # Hit rate: did the expected doc appear at all?
            hits += 1
            # Mean Reciprocal Rank: reward it for appearing early
            rank = retrieved.index(expected) + 1
            mrr_sum += 1 / rank

    return {
        "hit_rate": hits / len(queries),
        "mrr": mrr_sum / len(queries)
    }

Generation Quality

Use LLM-as-judge:

def evaluate_answer(question, context, answer):
    prompt = f"""Evaluate this answer on a scale of 1-5:

Question: {question}
Context provided: {context}
Answer: {answer}

Criteria:
- Correctness (based on context)
- Completeness
- No hallucination

Return JSON: {{"score": n, "reason": "..."}}"""
    
    return llm.generate(prompt)

Production Considerations

Latency

Embedding: ~100ms
Vector search: ~50ms
Reranking: ~200ms
LLM generation: ~1-5s

Cache aggressively, especially embeddings.
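
A minimal embedding cache keyed on the exact text (a plain dict here; swap in Redis or disk for production, and note embed_fn stands in for the get_embedding helper from earlier):

```python
_embedding_cache: dict[str, list[float]] = {}

def cached_embedding(text: str, embed_fn) -> list[float]:
    """Only call the (paid) embedding API for texts we haven't seen before."""
    if text not in _embedding_cache:
        _embedding_cache[text] = embed_fn(text)
    return _embedding_cache[text]

# Stub embedder that counts how often the "API" is actually hit
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cached_embedding("refund policy", fake_embed)
cached_embedding("refund policy", fake_embed)  # served from cache
print(len(calls))  # 1
```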

Cost

| Component    | Cost Driver                  |
|--------------|------------------------------|
| Embeddings   | Per document + per query     |
| Vector store | Storage + queries            |
| LLM          | Tokens (context + response)  |

Chunking Tuning

| Chunk Size   | Trade-off                        |
|--------------|----------------------------------|
| Small (200)  | Better retrieval, more fragments |
| Medium (500) | Balanced                         |
| Large (1000) | More context, may miss specifics |

Final Thoughts

RAG solves the “LLM doesn’t know my data” problem. The pattern is simple:

  1. Index your documents
  2. Retrieve relevant context
  3. Generate with context

Most production LLM applications use some form of RAG. Master this pattern.


RAG: The bridge between LLMs and your data.
