RAG Architecture Overview

Lesson 1 set up what RAG is. This lesson is the architectural blueprint — the components every production RAG system uses, the data flow at indexing time vs query time, and the decisions you'll make at each layer. By the end you'll be able to read any commercial RAG system's marketing diagram and know exactly what's underneath.

1. Two Pipelines, Not One

Every RAG system has two distinct flows that share a vector store but otherwise run independently:

INDEXING PIPELINE (offline, run periodically)
───────────────────────────────────────────────
Documents → Loader → Cleaner → Chunker → Embedder → Vector Store
                                                  ↓
                                          (also) BM25 index

QUERY PIPELINE (online, every user request)
───────────────────────────────────────────
Query → Query Transform → Retriever (vector + BM25)
                                  ↓
                          Re-ranker → Top-K
                                  ↓
                        Prompt Builder → LLM
                                  ↓
                              Answer + Citations

Confusing the two is the most common architecture mistake. Indexing is batch / offline; query is online and latency-sensitive. They have different scaling needs and failure modes.

2. The Indexing Pipeline, in Detail

Stage	What it does	Common tools
Loader	Read documents from PDFs, HTML, Word, Notion, S3, Confluence...	Unstructured, LangChain loaders, LlamaParse
Cleaner	Strip boilerplate sections, normalize whitespace	Custom regex / LLM-based cleanup
Chunker	Split long docs into 200-1000 token pieces	RecursiveCharacterTextSplitter, sentence-aware
Embedder	Map each chunk to a dense vector	OpenAI, Cohere, BGE, sentence-transformers
Vector store	Index vectors for fast nearest-neighbor search	Pinecone, Weaviate, Qdrant, Chroma, pgvector
Sparse index (optional)	BM25 / inverted index for keyword retrieval	Elasticsearch, OpenSearch, Vespa
Metadata store	Source URL, doc ID, timestamps, author for each chunk	Same DB as vector store, or separate Postgres
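The chunker stage can be sketched in a few lines. This is a minimal illustration, not how any of the listed tools work internally: it counts whitespace-separated words as a stand-in for tokens, and the `overlap` parameter and the function name `chunk_words` are assumptions added for the example.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words.

    Words are a rough proxy for tokens; a real pipeline would count tokens
    with the tokenizer of its embedding model.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of some duplicated text in the index.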

The indexing pipeline runs on a schedule (hourly, daily, or weekly). Re-running the full pipeline on an unchanged corpus is wasteful, so most production systems use incremental indexing, where only changed documents get re-processed.
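One common way to detect changed documents is to store a content hash per document and compare it on each run. The sketch below assumes documents arrive as a `doc_id -> text` mapping and that the previous run's hashes were persisted somewhere; `plan_incremental_update` and its return shape are hypothetical names for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(documents: dict[str, str],
                            indexed_hashes: dict[str, str]):
    """Decide which documents need work on this indexing run.

    documents:      current corpus, doc_id -> text
    indexed_hashes: hashes stored by the previous run, doc_id -> hash
    Returns (to_index, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or modified documents: hash is missing or differs from the stored one.
    to_index = sorted(d for d, h in new_hashes.items() if indexed_hashes.get(d) != h)
    # Documents that disappeared from the corpus: their chunks should be removed.
    to_delete = sorted(d for d in indexed_hashes if d not in new_hashes)
    return to_index, to_delete, new_hashes
```

Only the documents in `to_index` go through the loader, chunker, and embedder again; the rest of the vector store is left untouched.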

3. The Query Pipeline, in Detail

Stage	What it does	Why
Query understanding	Spell-correct, expand abbreviations, classify intent	Better retrieval signal
Query transformation	Rewrite into multiple queries; HyDE; step-back prompting	Bridge query-doc vocabulary mismatch
Dense retrieval	Embed query, ANN search in vector store	Semantic match; the meat of RAG
Sparse retrieval	BM25 over inverted index	Catches exact-term queries dense misses
Hybrid fusion	Reciprocal Rank Fusion or weighted sum	Best of both retrieval families
Re-ranking	Cross-encoder re-scores the top-50 → top-10	~5-15% lift in retrieval quality
Prompt assembly	Format docs into a system prompt with the user query	The prompt template is a tunable artifact
Generation	Call the LLM with the assembled prompt	Self-explanatory
Citation extraction	Map model's claims back to retrieved docs	Auditability
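The hybrid fusion row is easy to make concrete. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below assumes each retriever returns a list of document IDs, best first; the constant k = 60 is the value commonly used with RRF, and the tie-break on document ID is a choice made here for determinism.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one ranking.

    rankings: one list per retriever, best result first.
    k:        damping constant; larger values flatten the rank differences.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["d1", "d2", "d3"]    # hypothetical vector-search results
sparse = ["d3", "d1", "d4"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is a large part of its appeal over a weighted sum.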

4. Latency Budget at Query Time

Target: ~2-5 seconds end-to-end (chat-style UX)

Query embedding:      30-100 ms
Vector store search:  50-200 ms
BM25 search:          20-100 ms
Hybrid fusion:         5-20 ms
Re-ranking (CE):     200-500 ms (slow; sometimes skipped)
Prompt assembly:      10-50 ms
LLM (first token):  500-2000 ms
LLM (full response): 1-3 s
Citation extraction:  20-100 ms
─────────────────────────────────
Total p50:            ~2-3 s
Total p99:            ~5-8 s

The LLM call dominates. Streaming the response (first token in ~500 ms) makes the system feel responsive even at multi-second total latencies. The retrieval half rarely exceeds 500 ms in production.
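The retrieval side of the budget can be checked with simple arithmetic. The sketch below copies the stage ranges from the table and makes two assumptions that the table does not state: the dense path (embedding plus vector search) runs in parallel with BM25, and prompt assembly counts toward the retrieval half.

```python
# (low_ms, high_ms) per stage, copied from the budget table above.
BUDGET_MS = {
    "query_embedding": (30, 100),
    "vector_search": (50, 200),
    "bm25_search": (20, 100),
    "hybrid_fusion": (5, 20),
    "reranking": (200, 500),
    "prompt_assembly": (10, 50),
}


def retrieval_latency_ms(budget: dict, worst_case: bool, rerank: bool = True) -> int:
    """Sum the pre-LLM stages, taking the low or high end of every range."""
    i = 1 if worst_case else 0
    dense_path = budget["query_embedding"][i] + budget["vector_search"][i]
    sparse_path = budget["bm25_search"][i]
    # Assumption: dense and sparse retrieval run concurrently,
    # so only the slower of the two paths counts.
    total = max(dense_path, sparse_path)
    total += budget["hybrid_fusion"][i] + budget["prompt_assembly"][i]
    if rerank:
        total += budget["reranking"][i]
    return total


best = retrieval_latency_ms(BUDGET_MS, worst_case=False)                 # 295
worst = retrieval_latency_ms(BUDGET_MS, worst_case=True)                 # 870
worst_no_rerank = retrieval_latency_ms(BUDGET_MS, worst_case=True,
                                       rerank=False)                     # 370
```

Under these assumptions the cross-encoder accounts for more than half of the pre-LLM time at either end of the range, which is consistent with re-ranking being the stage that sometimes gets skipped.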

5. The Three RAG Quality Metrics

Metric	What it measures	Where it can fail
Retrieval quality	Did we fetch the right docs?	Bad embeddings, bad chunking, missing docs
Faithfulness	Does the answer use only the retrieved docs?	Model hallucinates beyond context
Answer correctness	Is the answer right?	Either of the above, or model reasoning error

Lesson 18 covers evaluation in depth. The 2026 standard is Ragas / TruLens for automated metrics + a small golden set for ground-truth evaluation.
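As a small illustration of the first metric, retrieval quality on a golden set is often reported as recall@k: the fraction of known-relevant documents that appear in the top k results. The sketch below is plain Python and does not use the Ragas or TruLens APIs; the golden-set format and the `search` callable are assumptions for the example.

```python
from typing import Callable, Iterable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc IDs found in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each golden-set entry needs at least one relevant doc")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(golden_set: Iterable[tuple[str, set[str]]],
                     search: Callable[[str], list[str]],
                     k: int = 5) -> float:
    """Average recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(search(query), relevant, k) for query, relevant in golden_set]
    return sum(scores) / len(scores)


# Hypothetical golden set and a fake retriever backed by a lookup table.
golden = [("q1", {"a", "b"}), ("q2", {"c"})]
fake_results = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}
score_at_2 = mean_recall_at_k(golden, fake_results.__getitem__, k=2)
score_at_3 = mean_recall_at_k(golden, fake_results.__getitem__, k=3)
```

Faithfulness and answer correctness need a judgment about the generated text, so they are usually scored by a model or a human rather than by set arithmetic like this.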

6. The Standard Failure Modes

7. The "Hello World" Mental Model

def rag_answer(query, k=5):
    chunks = retriever.search(query, top_k=k)
    prompt = format_prompt(query, chunks)
    answer = llm.generate(prompt)
    citations = extract_citations(answer, chunks)
    return answer, citations

def format_prompt(query, chunks):
    docs = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    return (
        "Use the documents below to answer the question.\n"
        "If the documents don't contain the answer, say \"I don't know.\"\n"
        "Cite sources as [N].\n\n"
        f"DOCUMENTS:\n{docs}\n\n"
        f"QUESTION: {query}\n"
        "ANSWER:"
    )

Every commercial RAG system is a variation on this short skeleton. The interesting engineering happens at retriever.search (Sections 2-3) and at the prompt template (Section 4).
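The skeleton leaves `retriever`, `llm`, and `extract_citations` undefined. To run it end to end, one option is to plug in toy stand-ins, as sketched below. Everything here other than `rag_answer` and `format_prompt` is hypothetical: `Chunk`, `KeywordRetriever`, and `StubLLM` are illustrative classes, and the retriever and model are passed in as arguments rather than read from globals.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


class KeywordRetriever:
    """Toy retriever: ranks chunks by the number of words shared with the query."""

    def __init__(self, chunks):
        self.chunks = chunks

    def search(self, query, top_k=5):
        q = set(re.findall(r"\w+", query.lower()))
        overlap = lambda c: len(q & set(re.findall(r"\w+", c.text.lower())))
        return sorted(self.chunks, key=overlap, reverse=True)[:top_k]


class StubLLM:
    """Stand-in for a real model: always answers by citing document [0]."""

    def generate(self, prompt):
        return "See the first retrieved document [0]."


def format_prompt(query, chunks):
    docs = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    return (
        "Use the documents below to answer the question.\n"
        "If the documents don't contain the answer, say \"I don't know.\"\n"
        "Cite sources as [N].\n\n"
        f"DOCUMENTS:\n{docs}\n\n"
        f"QUESTION: {query}\n"
        "ANSWER:"
    )


def extract_citations(answer, chunks):
    """Map [N] markers in the answer back to the sources of the retrieved chunks."""
    ids = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [chunks[i].source for i in sorted(ids) if i < len(chunks)]


def rag_answer(query, retriever, llm, k=5):
    chunks = retriever.search(query, top_k=k)
    prompt = format_prompt(query, chunks)
    answer = llm.generate(prompt)
    return answer, extract_citations(answer, chunks)


corpus = [
    Chunk("BM25 is a sparse keyword ranking function", "bm25.md"),
    Chunk("Embeddings map text to dense vectors", "embed.md"),
]
answer, citations = rag_answer("What are dense vectors?", KeywordRetriever(corpus), StubLLM(), k=1)
```

Swapping `KeywordRetriever` for a vector store client and `StubLLM` for a real model client changes nothing in `rag_answer` itself, which is the point of the skeleton.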

8. Architecture Variations You'll See

Single-vector RAG — what we've described. The standard.
Hybrid RAG — dense + BM25 with fusion. Production default in 2026.
Re-ranked RAG — adds cross-encoder after retrieval. Standard for high-quality production.
Multi-vector RAG — represent each chunk by multiple vectors (ColBERT-style late interaction).
Hierarchical RAG — retrieve summary first, drill down to detail chunks. Helps with very long documents.
Agentic RAG — the LLM iteratively issues searches and refines. Section 4.
Graph RAG — represent the corpus as a knowledge graph; retrieve subgraphs. Microsoft's 2024 paper.

9. The Mental Model
