RAG Architecture: The Pitfalls Nobody Talks About

Why naive retrieval-augmented generation fails in production and how to design a RAG pipeline that actually stays accurate.

January 22, 2025 · 2 min read

Retrieval-Augmented Generation (RAG) sounds straightforward: retrieve relevant docs, stuff them into the context, let the LLM answer. It works in demos. It breaks in production.

Here are the failure modes I've hit and how I fixed them.

Problem 1: Chunk Size Is a Decision, Not a Default

Most tutorials default to 512-token chunks. That's wrong for your data.

  • Too small: loses context, splits concepts mid-sentence
  • Too large: buries the relevant snippet, increases noise

The fix: use recursive character splitting with overlap, and tune chunk size per document type. Code files and prose have very different optimal sizes.

from langchain.text_splitter import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " "],
)
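One way to make the per-document-type tuning concrete is a small config lookup keyed by document type. The sizes below are illustrative starting points only, not benchmarked recommendations:

```python
# Illustrative per-document-type chunking config; tune these on your own corpus.
CHUNK_CONFIG = {
    "prose": {"chunk_size": 800, "chunk_overlap": 100},
    "code": {"chunk_size": 400, "chunk_overlap": 50},    # code tends to want smaller, denser chunks
    "tables": {"chunk_size": 1200, "chunk_overlap": 0},  # keep rows together, skip overlap duplication
}

def splitter_params(doc_type: str) -> dict:
    """Return splitter kwargs for a document type, falling back to prose settings."""
    return CHUNK_CONFIG.get(doc_type, CHUNK_CONFIG["prose"])
```

Feed the result straight into the splitter: `RecursiveCharacterTextSplitter(**splitter_params("code"))`.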

Problem 2: Semantic Search Alone Is Not Enough

Pure vector similarity retrieval misses exact matches — product names, IDs, error codes. Hybrid search (BM25 + vector) almost always outperforms either alone.

from langchain.retrievers import EnsembleRetriever
 
retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],
)

Start at 30/70. Tune from there based on your query distribution.
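To see what the weights actually do without depending on LangChain internals, here is a minimal sketch of weighted reciprocal rank fusion, which is how EnsembleRetriever combines ranked lists (the doc ids and rankings below are made up for illustration):

```python
def weighted_rrf(rankings, weights, k=60):
    """Weighted reciprocal rank fusion.

    rankings: one ranked list of doc ids per retriever.
    weights: per-retriever weights (e.g. [0.3, 0.7] for BM25 vs. vector).
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            # Earlier ranks contribute more; k dampens the head of the list.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_err_404", "doc_a", "doc_b"]    # exact-match hits rank high in BM25
vector_ranking = ["doc_a", "doc_c", "doc_err_404"]  # semantic neighbors rank high in vector search
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.3, 0.7])
```

Shifting weight toward BM25 pushes exact-match hits like error codes up the fused list; that is the knob you are tuning.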

Problem 3: No Re-Ranking

Top-k vector search returns the most similar chunks, not the most relevant ones. Add a cross-encoder re-ranker as a second pass:

from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Score every (query, chunk) pair jointly, then sort best-first.
scores = reranker.predict([(query, chunk) for chunk in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]  # keep only the best few for the prompt

This consistently improves answer quality without touching the retrieval index.

Problem 4: No Evaluation Loop

RAG without evals is flying blind. At minimum, track:

  • Retrieval recall: does the right chunk appear in top-k?
  • Answer faithfulness: is the LLM answer grounded in the retrieved context?
  • Answer relevancy: does the answer actually address the question?

Use RAGAS or build lightweight evals with your own labeled test set.
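The lightweight version of the retrieval-recall check needs nothing more than labeled (query, expected chunk id) pairs and your retriever's ranked output. The label format and recall@k definition here are a simple assumption, not the RAGAS schema:

```python
def recall_at_k(labeled_queries, retrieve, k=5):
    """Fraction of queries whose expected chunk appears in the top-k results.

    labeled_queries: list of (query, expected_chunk_id) pairs.
    retrieve: function mapping a query to a ranked list of chunk ids.
    """
    hits = sum(
        1 for query, expected in labeled_queries
        if expected in retrieve(query)[:k]
    )
    return hits / len(labeled_queries)

# Toy retriever stub standing in for a real pipeline.
def fake_retrieve(query):
    return {"refund policy?": ["c7", "c2"], "error E401?": ["c9"]}.get(query, [])

labeled = [("refund policy?", "c2"), ("error E401?", "c3")]
print(recall_at_k(labeled, fake_retrieve, k=5))  # 0.5: one of two expected chunks retrieved
```

Run this on every retrieval change; if recall@k drops, no amount of prompt tuning downstream will fix the answers.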

Getting RAG right is 20% building the pipeline and 80% iterating on the data, chunking, retrieval strategy, and evaluation. The infrastructure is the easy part.