RAG Architecture Overview
Lesson 1 set up what RAG is. This lesson is the architectural blueprint — the components every production RAG system uses, the data flow at indexing time vs query time, and the decisions you'll make at each layer. By the end you'll be able to read any commercial RAG system's marketing diagram and know exactly what's underneath.
1. Two Pipelines, Not One
Every RAG system has two distinct flows that share a vector store but otherwise run independently:
INDEXING PIPELINE (offline, run periodically)
───────────────────────────────────────────────
Documents → Loader → Cleaner → Chunker → Embedder → Vector Store
                                  ↓
                          (also) BM25 index

QUERY PIPELINE (online, every user request)
───────────────────────────────────────────
Query → Query Transform → Retriever (vector + BM25)
                                ↓
                          Re-ranker → Top-K
                                ↓
                          Prompt Builder → LLM
                                ↓
                          Answer + Citations
Confusing the two is the most common architecture mistake. Indexing is batch / offline; query is online and latency-sensitive. They have different scaling needs and failure modes.
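To make the split concrete, here is a minimal in-memory sketch of the two pipelines sharing one store. Everything in it is an illustrative assumption: `embed` is a toy bag-of-words stand-in for a real embedding model, and `InMemoryStore`, `run_indexing`, and `run_query` are hypothetical names, not any library's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """The one component both pipelines touch."""

    def __init__(self):
        self.rows = []  # list of (chunk_text, vector) pairs

    def add(self, text, vector):
        self.rows.append((text, vector))

    def search(self, vector, top_k):
        ranked = sorted(self.rows, key=lambda row: cosine(vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


def run_indexing(documents, store):
    """Offline pipeline: a batch job that only writes to the store."""
    for doc in documents:
        store.add(doc, embed(doc))


def run_query(query, store, top_k=2):
    """Online pipeline: runs per request and only reads from the store."""
    return store.search(embed(query), top_k)


store = InMemoryStore()
run_indexing(
    ["refunds are processed in five days",
     "shipping takes two weeks",
     "contact support by email"],
    store,
)
top_chunks = run_query("how long do refunds take", store, top_k=1)
```

Note that `run_indexing` and `run_query` never call each other; the store is their only point of contact, which is why they can be scaled and deployed separately.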
2. The Indexing Pipeline, in Detail
| Stage | What it does | Common tools |
|---|---|---|
| Loader | Read documents from PDFs, HTML, Word, Notion, S3, Confluence... | Unstructured, LangChain loaders, LlamaParse |
| Cleaner | Strip boilerplate sections, normalize whitespace | Custom regex / LLM-based cleanup |
| Chunker | Split long docs into 200-1000 token pieces | RecursiveCharacterTextSplitter, sentence-aware |
| Embedder | Map each chunk to a dense vector | OpenAI, Cohere, BGE, sentence-transformers |
| Vector store | Index vectors for fast nearest-neighbor search | Pinecone, Weaviate, Qdrant, Chroma, pgvector |
| Sparse index (optional) | BM25 / inverted index for keyword retrieval | Elasticsearch, OpenSearch, Vespa |
| Metadata store | Source URL, doc ID, timestamps, author for each chunk | Same DB as vector store, or separate Postgres |
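The Chunker row above can be sketched as a fixed-size splitter. This is a simplified illustration, not how any particular library does it: it counts whitespace-separated words rather than model tokens, and the `overlap` parameter (repeating a few words between neighbouring chunks) is an assumption added here, not something the table specifies.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

Sentence-aware or recursive splitters follow the same idea but prefer to cut at natural boundaries instead of at a fixed word count.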
Run the indexing pipeline on a schedule (hourly, daily, or weekly). Re-running the full pipeline on an unchanged corpus is wasteful, so most production systems use incremental indexing, where only changed documents get re-processed.
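One common way to detect changed documents is to compare content hashes between runs. The sketch below assumes you persist a `doc_id → hash` mapping from the previous run; `plan_incremental_update` is a hypothetical helper name, and real systems may rely on modification timestamps or change feeds instead.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (to_index, to_delete) doc IDs.

    current_docs:   doc_id -> text as it exists in the corpus now
    indexed_hashes: doc_id -> hash recorded during the previous indexing run
    """
    # New documents have no stored hash; changed documents have a different one.
    to_index = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    # Documents that disappeared from the corpus should leave the index too.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_index, to_delete


previous = {"a": content_hash("alpha"), "b": content_hash("beta"), "c": content_hash("gamma")}
corpus = {"a": "alpha", "b": "beta v2", "d": "delta"}
to_index, to_delete = plan_incremental_update(corpus, previous)
```

Here only `b` (changed) and `d` (new) would go through the loader, chunker, and embedder again, while `c` would be removed from the store.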
3. The Query Pipeline, in Detail
| Stage | What it does | Why |
|---|---|---|
| Query understanding | Spell-correct, expand abbreviations, classify intent | Better retrieval signal |
| Query transformation | Rewrite into multiple queries; HyDE; step-back prompting | Bridge query-doc vocabulary mismatch |
| Dense retrieval | Embed query, ANN search in vector store | Semantic match; the meat of RAG |
| Sparse retrieval | BM25 over inverted index | Catches exact-term queries dense misses |
| Hybrid fusion | Reciprocal Rank Fusion or weighted sum | Best of both retrieval families |
| Re-ranking | Cross-encoder re-scores the top-50 → top-10 | ~5-15% lift in retrieval quality |
| Prompt assembly | Format docs into a system prompt with the user query | The prompt template is a tunable artifact |
| Generation | Call the LLM with the assembled prompt | Self-explanatory |
| Citation extraction | Map model's claims back to retrieved docs | Auditability |
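The hybrid fusion row is easiest to understand in code. Below is a minimal sketch of Reciprocal Rank Fusion: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in. The constant `k=60` is the conventional default; the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["d1", "d2", "d3"]   # hypothetical vector-search results
sparse = ["d3", "d1", "d4"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF uses only ranks, never raw scores, so it avoids having to normalize cosine similarities against BM25 scores. That is its main advantage over a weighted sum.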
4. Latency Budget at Query Time
Target: ~2-5 seconds end-to-end (chat-style UX)
Query embedding: 30-100 ms
Vector store search: 50-200 ms
BM25 search: 20-100 ms
Hybrid fusion: 5-20 ms
Re-ranking (CE): 200-500 ms (slow; sometimes skipped)
Prompt assembly: 10-50 ms
LLM (first token): 500-2000 ms
LLM (full response): 1-3 s
Citation extraction: 20-100 ms
─────────────────────────────────
Total p50: ~2-3 s
Total p99: ~5-8 s
The LLM call dominates. Streaming the response (first token in ~500 ms) makes the system feel responsive even at multi-second total latencies. The retrieval half rarely exceeds 500 ms in production.
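As a worked check on the retrieval half, the sketch below adds up the ranges from the budget above. It assumes the dense path (query embedding, then vector search) and the BM25 search run concurrently, which is a modelling assumption here rather than something the budget states.

```python
# (low, high) latency in milliseconds, copied from the budget above.
BUDGET_MS = {
    "query_embedding": (30, 100),
    "vector_search": (50, 200),
    "bm25_search": (20, 100),
    "hybrid_fusion": (5, 20),
    "reranking": (200, 500),
}


def retrieval_latency(budget, bound, use_reranker=True):
    """Sum the retrieval stages; bound=0 picks the low end, bound=1 the high end."""
    dense_path = budget["query_embedding"][bound] + budget["vector_search"][bound]
    sparse_path = budget["bm25_search"][bound]
    # Assumed: dense and sparse retrieval run in parallel, so the slower one counts.
    total = max(dense_path, sparse_path) + budget["hybrid_fusion"][bound]
    if use_reranker:
        total += budget["reranking"][bound]
    return total


with_reranker = (retrieval_latency(BUDGET_MS, 0), retrieval_latency(BUDGET_MS, 1))
without_reranker = (
    retrieval_latency(BUDGET_MS, 0, use_reranker=False),
    retrieval_latency(BUDGET_MS, 1, use_reranker=False),
)
```

Under these assumptions the range is 285-820 ms with the cross-encoder and 85-320 ms without it, so the re-ranker is the stage that decides whether retrieval stays under 500 ms.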
5. The Three RAG Quality Metrics
| Metric | What it measures | Where it can fail |
|---|---|---|
| Retrieval quality | Did we fetch the right docs? | Bad embeddings, bad chunking, missing docs |
| Faithfulness | Does the answer use only the retrieved docs? | Model hallucinates beyond context |
| Answer correctness | Is the answer right? | Either of the above, or model reasoning error |
Lesson 18 covers evaluation in depth. The 2026 standard is Ragas / TruLens for automated metrics, plus a small golden set for ground-truth evaluation.
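For the retrieval-quality row, a golden set can be scored with something as simple as recall@k. This is a hand-rolled sketch, not the Ragas or TruLens API; `golden`, `fake_results`, and the document IDs are invented for illustration, and `search_fn` stands in for whatever retriever you are testing.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k retrieved docs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(golden_set, search_fn, k=5):
    """Average recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(search_fn(query), relevant, k) for query, relevant in golden_set]
    return sum(scores) / len(scores) if scores else 0.0


golden = [
    ("refund policy", ["doc1", "doc2"]),
    ("shipping time", ["doc3"]),
]
fake_results = {
    "refund policy": ["doc1", "doc9", "doc2"],
    "shipping time": ["doc7", "doc8"],
}
score = mean_recall_at_k(golden, lambda query: fake_results[query], k=2)
```

A low score here points at the indexing or retrieval layers (embeddings, chunking, missing docs) before the LLM is involved at all, which makes it a useful first metric to track.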
6. The Standard Failure Modes
7. The "Hello World" Mental Model
def rag_answer(query, k=5):
    # retriever, llm, and extract_citations are placeholders for your own components.
    chunks = retriever.search(query, top_k=k)
    prompt = format_prompt(query, chunks)
    answer = llm.generate(prompt)
    citations = extract_citations(answer, chunks)
    return answer, citations


def format_prompt(query, chunks):
    # Number each chunk so the model can cite it as [N].
    docs = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    return (
        "Use the documents below to answer the question.\n"
        "If the documents don't contain the answer, say \"I don't know.\"\n"
        "Cite sources as [N].\n\n"
        f"DOCUMENTS:\n{docs}\n\n"
        f"QUESTION: {query}\n"
        "ANSWER:"
    )
Every commercial RAG system is a variation on this 10-line skeleton. The interesting engineering happens at retriever.search (Sections 2-3) and at the prompt template (Section 4).
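The skeleton leaves `extract_citations` undefined. One plausible minimal version, assuming the model follows the "Cite sources as [N]" instruction and that N matches the zero-based labels produced by `format_prompt`, parses the markers with a regular expression. The `Chunk` dataclass and its `source` field are hypothetical stand-ins for whatever your retriever returns.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical metadata field, e.g. a file path or URL


def extract_citations(answer, chunks):
    """Return the chunks cited as [N] in the answer, in order of first mention."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        index = int(match.group(1))
        # Ignore markers that point outside the retrieved set (likely hallucinated).
        if index < len(chunks) and index not in seen:
            seen.append(index)
    return [chunks[i] for i in seen]


chunks = [Chunk("Shipping takes two weeks.", "shipping.md"),
          Chunk("Refunds take five days.", "refunds.md")]
answer = "Refunds take five days [1]. Shipping is slower [0], see also [7] and [1]."
cited_sources = [c.source for c in extract_citations(answer, chunks)]
```

Marker parsing only shows which chunks the model claims to have used. Checking that each claim is actually supported by its cited chunk is a separate, harder problem.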
8. Architecture Variations You'll See
- Single-vector RAG — what we've described. The standard.
- Hybrid RAG — dense + BM25 with fusion. Production default in 2026.
- Re-ranked RAG — adds cross-encoder after retrieval. Standard for high-quality production.
- Multi-vector RAG — represent each chunk by multiple vectors (ColBERT-style late interaction).
- Hierarchical RAG — retrieve summary first, drill down to detail chunks. Helps with very long documents.
- Agentic RAG — the LLM iteratively issues searches and refines. Section 4.
- Graph RAG — represent the corpus as a knowledge graph; retrieve subgraphs. Microsoft's 2024 paper.