AIMaks

RAG Architecture Overview

35 min read · video · RAG Fundamentals
2 of 22 · Building RAG Applications

Lesson 1 set up what RAG is. This lesson is the architectural blueprint — the components every production RAG system uses, the data flow at indexing time vs query time, and the decisions you'll make at each layer. By the end you'll be able to read any commercial RAG system's marketing diagram and know exactly what's underneath.

1. Two Pipelines, Not One

Every RAG system has two distinct flows that share a vector store but otherwise run independently:

code
INDEXING PIPELINE (offline, run periodically)
───────────────────────────────────────────────
Documents → Loader → Cleaner → Chunker → Embedder → Vector Store
                                  ↓
                          (also) BM25 index

QUERY PIPELINE (online, every user request)
───────────────────────────────────────────
Query → Query Transform → Retriever (vector + BM25)
                                  ↓
                          Re-ranker → Top-K
                                  ↓
                        Prompt Builder → LLM
                                  ↓
                              Answer + Citations

Confusing the two is the most common architecture mistake. Indexing is batch / offline; query is online and latency-sensitive. They have different scaling needs and failure modes.
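To make the split concrete, here is a minimal sketch of the two flows sharing one store. Everything in it is an illustrative assumption: `embed` is a toy bag-of-words stand-in for a real embedding model, and `VectorStore`, `run_indexing`, and `run_query` are hypothetical names, not any library's API.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class VectorStore:
    """The only component the two pipelines share."""
    def __init__(self):
        self.rows = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.rows.append((text, vector))

    def search(self, vector, top_k):
        ranked = sorted(self.rows, key=lambda row: cosine(vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

def run_indexing(documents, store):
    """Offline batch job: embed every document and write it to the store."""
    for doc in documents:
        store.add(doc, embed(doc))

def run_query(query, store, top_k=2):
    """Online request path: embed the query and read from the store."""
    return store.search(embed(query), top_k)

store = VectorStore()
run_indexing(
    [
        "Refunds are issued within 14 days.",
        "Shipping takes 3 to 5 business days.",
        "Support is available by email.",
    ],
    store,
)
hits = run_query("how many days for refunds", store, top_k=1)
print(hits)
```

Note that `run_indexing` and `run_query` never call each other. In a real deployment they would typically be separate processes with separate scaling and monitoring.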

2. The Indexing Pipeline, in Detail

| Stage | What it does | Common tools |
|---|---|---|
| Loader | Read documents from PDFs, HTML, Word, Notion, S3, Confluence... | Unstructured, LangChain loaders, LlamaParse |
| Cleaner | Strip boilerplate text and sections, normalize whitespace | Custom regex / LLM-based cleanup |
| Chunker | Split long docs into 200-1000 token pieces | RecursiveCharacterTextSplitter, sentence-aware |
| Embedder | Map each chunk to a dense vector | OpenAI, Cohere, BGE, sentence-transformers |
| Vector store | Index vectors for fast nearest-neighbor search | Pinecone, Weaviate, Qdrant, Chroma, pgvector |
| Sparse index (optional) | BM25 / inverted index for keyword retrieval | Elasticsearch, OpenSearch, Vespa |
| Metadata store | Source URL, doc ID, timestamps, author for each chunk | Same DB as vector store, or separate Postgres |

Run the indexing pipeline on a schedule (hourly, daily, or weekly). Re-running the full pipeline on an unchanged corpus is wasteful, so most production systems use incremental indexing, where only changed documents are re-processed.
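One common way to implement incremental indexing is to store a content hash per document and re-process only documents whose hash changed. The sketch below assumes that approach. `content_hash`, `chunk_words`, and `plan_incremental_index` are hypothetical helpers, and the chunker counts words rather than tokens to stay dependency-free.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (a real chunker would count tokens)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
        start += max_words - overlap
    return chunks

def plan_incremental_index(corpus, seen_hashes):
    """Compare the current corpus against hashes saved by the previous run.

    corpus:      {doc_id: text} as it exists now
    seen_hashes: {doc_id: hash} recorded at the last indexing run
    Returns (doc IDs to re-index, doc IDs to delete from the index).
    """
    to_index = [
        doc_id for doc_id, text in corpus.items()
        if seen_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in seen_hashes if doc_id not in corpus]
    return to_index, to_delete

seen = {"a": content_hash("alpha"), "b": content_hash("beta"), "gone": content_hash("x")}
corpus = {"a": "alpha", "b": "beta v2", "c": "gamma"}
to_index, to_delete = plan_incremental_index(corpus, seen)
print(to_index, to_delete)
```

Only the documents in `to_index` need to go back through the chunker and embedder. Chunks belonging to `to_delete` should be removed from both the vector store and the sparse index.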

3. The Query Pipeline, in Detail

| Stage | What it does | Why |
|---|---|---|
| Query understanding | Spell-correct, expand abbreviations, classify intent | Better retrieval signal |
| Query transformation | Rewrite into multiple queries; HyDE; step-back prompting | Bridge query-doc vocabulary mismatch |
| Dense retrieval | Embed query, ANN search in vector store | Semantic match; the meat of RAG |
| Sparse retrieval | BM25 over inverted index | Catches exact-term queries dense misses |
| Hybrid fusion | Reciprocal Rank Fusion or weighted sum | Best of both retrieval families |
| Re-ranking | Cross-encoder re-scores the top 50 and keeps the top 10 | ~5-15% lift in retrieval quality |
| Prompt assembly | Format docs into a system prompt with the user query | The prompt template is a tunable artifact |
| Generation | Call the LLM with the assembled prompt | Self-explanatory |
| Citation extraction | Map model's claims back to retrieved docs | Auditability |
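The hybrid fusion stage is the easiest one to show in code. Below is a minimal sketch of Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. The function name and the example document IDs are illustrative. `k = 60` is the constant commonly used with RRF, treated here as a default you can tune.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    rankings: list of ranked lists, e.g. [dense_results, bm25_results]
    k:        damping constant; larger values flatten the rank differences
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_results = ["d1", "d2", "d3"]   # hypothetical output of the vector search
bm25_results = ["d3", "d1", "d4"]    # hypothetical output of the BM25 search
fused = reciprocal_rank_fusion([dense_results, bm25_results])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

RRF only looks at ranks, so you don't need to normalize dense similarity scores against BM25 scores. That is the main reason it is often preferred over a weighted sum as a first implementation.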

4. Latency Budget at Query Time

code
Target: ~2-5 seconds end-to-end (chat-style UX)

Query embedding:      30-100 ms
Vector store search:  50-200 ms
BM25 search:          20-100 ms
Hybrid fusion:         5-20 ms
Re-ranking (CE):     200-500 ms (slow; sometimes skipped)
Prompt assembly:      10-50 ms
LLM (first token):  500-2000 ms
LLM (full response): 1-3 s
Citation extraction:  20-100 ms
─────────────────────────────────
Total p50:            ~2-3 s
Total p99:            ~5-8 s

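The LLM call dominates. Streaming the response (first token in ~500 ms) makes the system feel responsive even at multi-second total latencies. The retrieval half rarely exceeds 500 ms in production when the re-ranker is skipped; with it, the upper ends of the budget above add up to roughly 900 ms.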
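As a quick sanity check, you can add up the retrieval-side rows of the budget above. The sketch assumes the stages run one after another (in practice the vector and BM25 searches often run in parallel) and that the ranges are simple lower and upper bounds, not percentiles. The dictionary keys and `budget_range` are illustrative names.

```python
# (low, high) latency in milliseconds, copied from the budget above
RETRIEVAL_STAGES_MS = {
    "query_embedding": (30, 100),
    "vector_search": (50, 200),
    "bm25_search": (20, 100),
    "hybrid_fusion": (5, 20),
    "re_ranking": (200, 500),
}

def budget_range(stages, skip=()):
    """Sum the low and high ends of every stage not listed in `skip`."""
    low = sum(lo for name, (lo, _) in stages.items() if name not in skip)
    high = sum(hi for name, (_, hi) in stages.items() if name not in skip)
    return low, high

with_rerank = budget_range(RETRIEVAL_STAGES_MS)
without_rerank = budget_range(RETRIEVAL_STAGES_MS, skip=("re_ranking",))
print(with_rerank)     # (305, 920)
print(without_rerank)  # (105, 420)
```

Under these assumptions the cross-encoder accounts for more than half of the retrieval-side time, which is why it is the first stage teams drop when they need to cut latency.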

5. The Three RAG Quality Metrics

| Metric | What it measures | Where it can fail |
|---|---|---|
| Retrieval quality | Did we fetch the right docs? | Bad embeddings, bad chunking, missing docs |
| Faithfulness | Does the answer use only the retrieved docs? | Model hallucinates beyond context |
| Answer correctness | Is the answer right? | Either of the above, or model reasoning error |

Lesson 18 covers evaluation in depth. The 2026 standard is Ragas / TruLens for automated metrics + a small golden set for ground-truth evaluation.
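Retrieval quality is the one metric you can measure without calling an LLM. A minimal sketch, assuming a golden set where each query is labeled with the IDs of its relevant documents: recall@k, averaged over the set. This is hand-rolled for illustration and is not the Ragas or TruLens API. The golden set and the `fake_retrieve` function are made up.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k retrieved docs."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over a golden set of {"query", "relevant_ids"} examples."""
    scores = [
        recall_at_k(retrieve(example["query"]), example["relevant_ids"], k)
        for example in golden_set
    ]
    return sum(scores) / len(scores)

# Hypothetical golden set and a canned retriever standing in for the real one
golden_set = [
    {"query": "q1", "relevant_ids": ["a", "b"]},
    {"query": "q2", "relevant_ids": ["c"]},
]
canned_results = {"q1": ["a", "x", "y"], "q2": ["z", "c", "w"]}

def fake_retrieve(query):
    return canned_results[query]

score = mean_recall_at_k(golden_set, fake_retrieve, k=2)
print(score)  # 0.75
```

Faithfulness and answer correctness need either human labels or an LLM judge, which is where the automated tools come in.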

6. The Standard Failure Modes

7. The "Hello World" Mental Model

code
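# `retriever`, `llm`, and `extract_citations` are placeholders for whatever
# retrieval client, model client, and citation parser your stack provides.
def rag_answer(query, k=5):
    """Answer a query from the top-k retrieved chunks and return the cited sources."""
    chunks = retriever.search(query, top_k=k)       # fetch the k best-matching chunks
    prompt = format_prompt(query, chunks)           # ground the model in those chunks
    answer = llm.generate(prompt)
    citations = extract_citations(answer, chunks)   # map [N] markers back to chunks
    return answer, citations

def format_prompt(query, chunks):
    """Number each chunk (from 0) so the model can cite it as [N]."""
    docs = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    return (
        "Use the documents below to answer the question.\n"
        "If the documents don't contain the answer, say \"I don't know.\"\n"
        "Cite sources as [N].\n\n"
        f"DOCUMENTS:\n{docs}\n\n"
        f"QUESTION: {query}\n"
        "ANSWER:"
    )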

Every commercial RAG system is a variation on this 10-line skeleton. The interesting engineering happens at retriever.search (Sections 2-3) and at the prompt template (Section 4).
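The skeleton leaves `extract_citations` undefined. One simple way to fill it in, assuming the model follows the "Cite sources as [N]" instruction and that N is the 0-based index assigned in `format_prompt`, is a regex pass over the answer. The `Chunk` dataclass and its `source` field are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # hypothetical metadata field, e.g. a URL or file path

def extract_citations(answer, chunks):
    """Return the chunks referenced by [N] markers in the answer, in order of first use."""
    cited = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        # Ignore markers that point past the retrieved chunks (model made them up)
        if index < len(chunks) and chunks[index] not in cited:
            cited.append(chunks[index])
    return cited

chunks = [
    Chunk("Refunds are issued within 14 days.", "policy.md"),
    Chunk("Shipping takes 3 to 5 business days.", "shipping.md"),
]
answer = "Orders ship in 3 to 5 business days [1]. Refunds take up to 14 days [0][1]. See also [7]."
citations = extract_citations(answer, chunks)
print([c.source for c in citations])  # ['shipping.md', 'policy.md']
```

Dropping out-of-range markers silently is a design choice. In production you may prefer to log them, since a citation that points nowhere is a useful hallucination signal.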

8. Architecture Variations You'll See

  • Single-vector RAG — what we've described. The standard.
  • Hybrid RAG — dense + BM25 with fusion. Production default in 2026.
  • Re-ranked RAG — adds cross-encoder after retrieval. Standard for high-quality production.
  • Multi-vector RAG — represent each chunk by multiple vectors (ColBERT-style late interaction).
  • Hierarchical RAG — retrieve summary first, drill down to detail chunks. Helps with very long documents.
  • Agentic RAG — the LLM iteratively issues searches and refines. Section 4.
  • Graph RAG — represent the corpus as a knowledge graph; retrieve subgraphs. Microsoft's 2024 paper.
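Of these variations, agentic RAG differs most from the linear pipeline, so a sketch of its control flow may help. This is one possible shape under stated assumptions, not a reference design: `search` stands for any retriever, and `decide` stands for an LLM call that returns either `("answer", text)` or `("search", new_query)`. Both are stubbed with canned functions here.

```python
def agentic_rag(question, search, decide, max_steps=3):
    """Let the model alternate between searching and answering, up to max_steps."""
    notes = []            # everything retrieved so far
    query = question
    for _ in range(max_steps):
        notes.extend(search(query))
        action, payload = decide(question, notes)
        if action == "answer":
            return payload
        query = payload   # the model asked for a refined search
    return "I don't know."  # step budget exhausted without an answer

# Canned stand-ins for a retriever and an LLM decision step
def fake_search(query):
    index = {
        "capital of France": ["France is a country in Europe."],
        "France capital city": ["Paris is the capital of France."],
    }
    return index.get(query, [])

def fake_decide(question, notes):
    for note in notes:
        if "Paris" in note:
            return ("answer", "Paris")
    return ("search", "France capital city")

result = agentic_rag("capital of France", fake_search, fake_decide)
print(result)  # Paris
```

The `max_steps` cap matters: every extra iteration adds a full retrieval plus an LLM call to the latency budget from Section 4 of this lesson.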

9. The Mental Model
