Building Your First RAG Pipeline

Time to build one end to end. This notebook walks the full pipeline (load documents, chunk them, embed them, build a vector index, retrieve at query time, and generate an answer with citations) in roughly 100 lines of Python. The same recipe scales to production with bigger embedding models, dedicated vector databases, and re-ranking layered on top. By the end you will have a working "chat with your docs" system.

code

pip install "sentence-transformers==3.3.1" \
            "chromadb==0.5.18" \
            "openai==1.55.0" \
            "tiktoken==0.8.0"

Set OPENAI_API_KEY in your environment. Total cost for this notebook on a few thousand short docs: well under $0.10. To use Anthropic or a self-hosted LLM instead, route the call through litellm; the pipeline keeps the same shape.

1. The Toy Corpus

code

docs = [
    "Our return window is 30 days for unopened items.",
    "Opened laptops can be returned within 14 days for a 15% restocking fee.",
    "All software downloads are non-refundable once accessed.",
    "Free shipping is available on orders over $50 in the continental US.",
    "International shipping costs vary by destination and weight.",
    "Customer support is available Monday-Friday, 9am-6pm ET.",
    "Premium support customers have 24/7 phone access.",
    "Refunds are processed within 5-7 business days after we receive the return.",
    "Gift cards cannot be returned but can be transferred to another account.",
    "Order tracking is available from your account dashboard.",
]
print(f"{len(docs)} documents")

Ten short docs are enough to demonstrate the full pipeline. Real corpora can run to millions of chunks; the API is the same.

2. Chunking

code

# Our docs are already short. For longer text, split on sentences or tokens.
chunks = [{"text": d, "id": str(i), "source": f"doc_{i}.md"} for i, d in enumerate(docs)]

In production, longer documents get split into 200-800 token chunks with some overlap (Lesson 9 covers chunking strategies). Each chunk carries metadata back to its source — that's how citations work later.
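To make the overlap idea concrete, here is a minimal sliding-window sketch. It is an assumption-laden stand-in, not the chunker this course uses later: it counts whitespace-separated words instead of real tokens (tiktoken would give actual token counts), and the names `chunk_words` and `chunk_document` are hypothetical.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping windows of about `size` words each."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows


def chunk_document(text, source, size=200, overlap=40):
    """Attach an id and a source to every window, like the chunk dicts above."""
    return [
        {"text": w, "id": f"{source}#{i}", "source": source}
        for i, w in enumerate(chunk_words(text, size, overlap))
    ]


# Tiny demo: 10 "words", windows of 4 with 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_document(sample, "doc_0.md", size=4, overlap=1)
for p in pieces:
    print(p["id"], "->", p["text"])
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.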

3. Embedding

code

from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embed_model.encode(
    [c["text"] for c in chunks],
    normalize_embeddings=True,            # cosine similarity friendly
    show_progress_bar=True,
)
print("embedding shape:", embeddings.shape)
# (10, 384)

BGE-small is a strong open-source default as of 2026: 384 dimensions, fast, and comparable in quality to OpenAI text-embedding-3-small. Always normalize embeddings if your vector store uses dot product, because for unit-length vectors the dot product equals cosine similarity.
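The normalization claim is easy to check by hand. The sketch below uses two made-up 2-d vectors and plain Python; the helper names are hypothetical and nothing here touches the embedding model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                               # depends on vector length
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # length factored out
cos = cosine(a, b)

print(f"raw dot:  {raw_dot:.2f}")
print(f"unit dot: {unit_dot:.2f}")
print(f"cosine:   {cos:.2f}")
```

The raw dot product changes if you scale either vector; the unit-length dot product and the cosine agree, which is why `normalize_embeddings=True` lets a dot-product index behave like a cosine one.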

4. The Vector Store

code

import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(
    name="returns_kb",
    metadata={"hnsw:space": "cosine"},
)

collection.upsert(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    embeddings=embeddings.tolist(),
    metadatas=[{"source": c["source"]} for c in chunks],
)
print("indexed:", collection.count(), "chunks")

Chroma is one of the simplest vector DBs to learn: it installs with pip and runs embedded in your Python process, with no separate service. Lesson 6 covers Pinecone / Weaviate / Qdrant for production. The core operations (upsert and query) look roughly the same across all of them.
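If it helps to see what upsert and query amount to, here is a toy in-memory store. It is a hypothetical sketch, not how Chroma works internally: it does an exact linear scan, whereas the `hnsw:space` setting above refers to an approximate HNSW index. The class name `TinyVectorStore` and the 2-d vectors are made up, and the code assumes every vector is already unit length.

```python
class TinyVectorStore:
    """Hypothetical brute-force store; assumes unit-length vectors."""

    def __init__(self):
        self._rows = {}  # id -> (embedding, document, metadata)

    def upsert(self, ids, embeddings, documents, metadatas):
        for id_, emb, doc, meta in zip(ids, embeddings, documents, metadatas):
            self._rows[id_] = (list(emb), doc, meta)  # same id overwrites

    def count(self):
        return len(self._rows)

    def query(self, query_embedding, n_results=3):
        scored = []
        for id_, (emb, doc, meta) in self._rows.items():
            similarity = sum(q * e for q, e in zip(query_embedding, emb))
            scored.append((1.0 - similarity, id_, doc, meta))  # cosine distance
        scored.sort(key=lambda row: row[0])  # smallest distance first
        return scored[:n_results]


store = TinyVectorStore()
store.upsert(
    ids=["0", "1", "2"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]],
    documents=["returns", "shipping", "refunds"],
    metadatas=[{"source": "doc_0.md"}, {"source": "doc_1.md"}, {"source": "doc_2.md"}],
)
# Upserting an existing id replaces the row instead of adding a new one.
store.upsert(ids=["1"], embeddings=[[0.0, 1.0]], documents=["shipping v2"],
             metadatas=[{"source": "doc_1.md"}])

hits = store.query([1.0, 0.0], n_results=2)
for distance, id_, doc, meta in hits:
    print(f"{id_} {doc} ({meta['source']}) dist={distance:.2f}")
```

A linear scan is fine for ten chunks and hopeless for millions, which is the gap a real vector database fills.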

5. Retrieval

code

def retrieve(query, k=3):
    q_emb = embed_model.encode(query, normalize_embeddings=True).tolist()
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    return [
        # "score" holds Chroma's cosine distance (1 - similarity): lower is closer.
        {"text": doc, "source": meta["source"], "score": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

for r in retrieve("Can I return my laptop after a month?"):
    print(f"[{r['source']}] {r['text']}  (dist={r['score']:.3f})")

The top two results should be the laptop-return policy and the general return policy. Semantic matching is what makes RAG work: the query says "month", the doc says "30 days", and embedding similarity bridges them.
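One thing to keep in mind: retrieve always returns k results, even when nothing in the corpus is relevant. With the cosine space configured above, the value stored under "score" is a distance (1 minus cosine similarity), so a cutoff is one simple guard. The sketch below is an assumption, not part of the original pipeline: `filter_hits` is a hypothetical helper, `max_distance=0.5` is an arbitrary placeholder you would tune on your own data, and the sample distances are illustrative numbers, not real model output.

```python
def filter_hits(hits, max_distance=0.5):
    """Keep hits at or below max_distance and add a similarity field."""
    kept = []
    for h in hits:
        if h["score"] <= max_distance:  # "score" is a distance: lower is closer
            kept.append({**h, "similarity": 1.0 - h["score"]})
    return kept


# Illustrative values only; real distances depend on the embedding model.
sample_hits = [
    {"text": "Opened laptops can be returned within 14 days for a 15% restocking fee.",
     "source": "doc_1.md", "score": 0.25},
    {"text": "Our return window is 30 days for unopened items.",
     "source": "doc_0.md", "score": 0.5},
    {"text": "Order tracking is available from your account dashboard.",
     "source": "doc_9.md", "score": 0.75},
]

kept = filter_hits(sample_hits)
for h in kept:
    print(f"[{h['source']}] similarity={h['similarity']:.2f}")
```

If the filter leaves nothing, you can skip the LLM call entirely and tell the user the docs do not cover the question.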

6. The Prompt

code

SYSTEM_PROMPT = (
    "You are a helpful customer-support assistant. Answer the user's "
    "question using ONLY the documents below. If the documents do not "
    "contain enough information, say so explicitly and do not guess. "
    "Cite sources by their bracketed number, e.g., [1] or [2]."
)

def format_prompt(query, chunks):
    docs_str = "\n\n".join(
        f"[{i+1}] (from {c['source']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        f"DOCUMENTS:\n{docs_str}\n\n"
        f"QUESTION: {query}\n\n"
        "ANSWER:"
    )

Three things this prompt does well: (a) tells the model to only use the retrieved docs, (b) gives an explicit "I don't know" escape, (c) asks for citations the user can verify. Skip any one and you invite a failure: ungrounded answers, confident guesses, or claims nobody can check.

7. Generation

code

from openai import OpenAI
client_llm = OpenAI()

def answer(query, k=3):
    chunks = retrieve(query, k=k)
    prompt = format_prompt(query, chunks)
    response = client_llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content, chunks

reply, sources = answer("Can I return my laptop after a month?")
print(reply)
print("\nSources:")
for s in sources:
    print(f"  - {s['source']}: {s['text']}")

Expected output: something like "Opened laptops can only be returned within 14 days, so a return after a month is not possible per [1]. Unopened items have a 30-day window per [2]." — the model uses the retrieved docs and cites them back.
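Citations are only useful if they point at something. A small post-check can confirm that every bracketed number in the reply maps to one of the retrieved sources. This is a sketch under assumptions: the helper names are hypothetical, the reply string is hand-written for the demo (with a deliberately bogus [5]), and the regex only recognizes the plain [N] form the system prompt asks for.

```python
import re


def cited_numbers(reply):
    """Return the distinct bracketed numbers in the reply, in order of first use."""
    seen = []
    for match in re.findall(r"\[(\d+)\]", reply):
        n = int(match)
        if n not in seen:
            seen.append(n)
    return seen


def check_citations(reply, sources):
    """Split cited numbers into those that map to a source and those that do not."""
    valid, invalid = [], []
    for n in cited_numbers(reply):
        (valid if 1 <= n <= len(sources) else invalid).append(n)
    return valid, invalid


# Hand-written reply for the demo; [5] does not match any retrieved source.
reply = ("Opened laptops can only be returned within 14 days [1]. "
         "Unopened items have a 30-day window [2]. See also [5].")
sources = [{"source": "doc_1.md"}, {"source": "doc_0.md"}, {"source": "doc_7.md"}]

valid, invalid = check_citations(reply, sources)
print("valid:", [(n, sources[n - 1]["source"]) for n in valid])
print("invalid:", invalid)
```

This checks that a citation exists, not that the cited chunk actually supports the claim; that stronger check needs a human or a second model pass.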

8. The Whole Pipeline, in 50 Lines

code

# rag.py — production-shaped
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

EMBED = SentenceTransformer("BAAI/bge-small-en-v1.5")
CLIENT = OpenAI()
# Cosine space, matching the collection built in section 4.
DB = chromadb.PersistentClient(path="./chroma").get_or_create_collection(
    "kb", metadata={"hnsw:space": "cosine"})

def index(docs):
    embs = EMBED.encode([d["text"] for d in docs],
                        normalize_embeddings=True).tolist()
    DB.upsert(ids=[d["id"] for d in docs],
              documents=[d["text"] for d in docs],
              embeddings=embs,
              metadatas=[{"source": d["source"]} for d in docs])

def retrieve(query, k=3):
    q = EMBED.encode(query, normalize_embeddings=True).tolist()
    r = DB.query(query_embeddings=[q], n_results=k)
    return list(zip(r["documents"][0], r["metadatas"][0]))

def answer(query, k=3):
    docs = retrieve(query, k=k)
    formatted = "\n".join(f"[{i+1}] ({m['source']}) {t}"
                          for i, (t, m) in enumerate(docs))
    prompt = f"DOCUMENTS:\n{formatted}\n\nQUESTION: {query}\nANSWER:"
    out = CLIENT.chat.completions.create(
        model="gpt-4o-mini", temperature=0,
        messages=[
            {"role": "system", "content":
             "Answer using ONLY the documents. If they don't contain the "
             "answer, say so and do not guess. Cite as [N]."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
    return out, docs

Three functions: index, retrieve, answer. ~50 lines. Wrap with FastAPI or Streamlit and you have a working "chat with my docs" service. Sections 2-5 of this course turn the same shape into a production-grade system.

9. Common Pitfalls

10. Exercises
