Retrieval-Augmented Generation (RAG) sounds straightforward: retrieve relevant docs, stuff them into the context, let the LLM answer. It works in demos. It breaks in production.
Here are the failure modes I've hit and how I fixed them.
Problem 1: Chunk Size Is a Decision, Not a Default
Most tutorials default to 512-token chunks. That's wrong for your data.
- Too small: loses context, splits concepts mid-sentence
- Too large: buries the relevant snippet, increases noise
The fix: use recursive character splitting with overlap, and tune chunk size per document type. Code files and prose have very different optimal sizes.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " "],
)
```

Problem 2: Semantic Search Alone Is Not Enough
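One way to make chunk size a decision rather than a default is a small config table keyed by document type. This is a minimal sketch: the `CHUNK_PARAMS` values are illustrative starting points, not tuned recommendations, and the naive character splitter is a stand-in for the recursive splitter above.

```python
# Illustrative per-document-type chunking parameters (assumed values,
# not tuned recommendations — tune against your own retrieval evals).
CHUNK_PARAMS = {
    "prose": {"chunk_size": 800, "chunk_overlap": 100},
    "code":  {"chunk_size": 1500, "chunk_overlap": 200},
    "faq":   {"chunk_size": 300, "chunk_overlap": 0},
}

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size character chunking with overlap (stand-in for
    a recursive splitter, to show how the parameters interact)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 2000
chunks = chunk_text(doc, **CHUNK_PARAMS["prose"])  # 3 chunks of <= 800 chars
```

The point is less the numbers than the shape: chunking parameters live in one place, per document type, where they can be swept during evaluation.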
Pure vector similarity retrieval misses exact matches — product names, IDs, error codes. Hybrid search (BM25 + vector) almost always outperforms either alone.
```python
from langchain.retrievers import EnsembleRetriever

retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],
)
```

Start at 30/70. Tune from there based on your query distribution.
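If you'd rather not tune weights at all, reciprocal rank fusion (RRF) is a common weight-free way to combine BM25 and vector rankings — it is in fact what `EnsembleRetriever` does under the hood. A minimal pure-Python sketch (the toy ranked lists are made up):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each doc scores sum(1 / (k + rank))
    across the input rankings; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword-heavy ranking
vector_hits = ["doc1", "doc5", "doc3"]  # semantic ranking
fused = rrf([bm25_hits, vector_hits])   # doc1 and doc3 rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that BM25 and cosine scores live on incomparable scales.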
Problem 3: No Re-Ranking
Top-k vector search returns the most similar chunks, not the most relevant ones. Add a cross-encoder re-ranker as a second pass:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
```

This consistently improves answer quality without touching the retrieval index.
Problem 4: No Evaluation Loop
RAG without evals is flying blind. At minimum, track:
- Retrieval recall: does the right chunk appear in top-k?
- Answer faithfulness: is the LLM answer grounded in the retrieved context?
- Answer relevancy: does the answer actually address the question?
Use RAGAS or build lightweight evals with your own labeled test set.
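A lightweight retrieval-recall eval needs nothing more than a labeled set of (query, relevant-chunk-id) pairs and your retriever. This sketch assumes a `retrieve` callable that returns ranked chunk ids; the function name, the fake index, and the toy labels are all illustrative:

```python
def recall_at_k(labeled: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of labeled queries whose relevant chunk appears in the top-k."""
    hits = sum(
        1 for query, relevant_id in labeled
        if relevant_id in retrieve(query)[:k]
    )
    return hits / len(labeled)

# Toy stand-in for a real retriever: a fixed ranking per query.
fake_index = {
    "reset password": ["chunk_a", "chunk_b", "chunk_c"],
    "error code 42": ["chunk_x", "chunk_y", "chunk_z"],
}
labeled = [("reset password", "chunk_b"), ("error code 42", "chunk_q")]
score = recall_at_k(labeled, lambda q: fake_index[q], k=3)  # 0.5
```

Even 30–50 labeled pairs like this will tell you whether a chunking or retrieval change helped, before any LLM-as-judge machinery gets involved.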
Getting RAG right is 20% building the pipeline and 80% iterating on the data, chunking, retrieval strategy, and evaluation. The infrastructure is the easy part.