Building Your First RAG Pipeline
Time to build one end-to-end. This notebook walks the full pipeline (load documents, chunk them, embed them, build a vector index, retrieve at query time, and generate an answer with citations) in roughly 100 lines of Python. The same recipe scales to production with bigger embedding models, real vector databases, and re-ranking layered on top. By the end you will have a working "chat with your docs" system.
pip install "sentence-transformers==3.3.1" \
    "chromadb==0.5.18" \
    "openai==1.55.0" \
    "tiktoken==0.8.0"
Set OPENAI_API_KEY in your env. Total cost for this notebook on a few thousand short docs: well under $0.10. Or swap to Anthropic / a self-hosted LLM via litellm; the pipeline keeps the same shape.
1. The Toy Corpus
docs = [
    "Our return window is 30 days for unopened items.",
    "Opened laptops can be returned within 14 days for a 15% restocking fee.",
    "All software downloads are non-refundable once accessed.",
    "Free shipping is available on orders over $50 in the continental US.",
    "International shipping costs vary by destination and weight.",
    "Customer support is available Monday-Friday, 9am-6pm ET.",
    "Premium support customers have 24/7 phone access.",
    "Refunds are processed within 5-7 business days after we receive the return.",
    "Gift cards cannot be returned but can be transferred to another account.",
    "Order tracking is available from your account dashboard.",
]
print(f"{len(docs)} documents")
Ten short docs are enough to demonstrate the full pipeline. Real corpora can run to millions of chunks; the API is the same.
2. Chunking
# Our docs are already short. For longer text, split on sentences or tokens.
chunks = [{"text": d, "id": str(i), "source": f"doc_{i}.md"} for i, d in enumerate(docs)]
In production, longer documents get split into 200-800 token chunks with some overlap (Lesson 9 covers chunking strategies). Each chunk carries metadata back to its source — that's how citations work later.
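The splitting step described above can be sketched as a sliding window with overlap. This is only an illustration: it counts whitespace-separated words as a stand-in for tokens, and the names `chunk_words` and `to_records` are hypothetical helpers, not part of any library used in this lesson.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def to_records(text, source, size=200, overlap=40):
    """Attach an id and the source file to every chunk so a citation
    can point back to where the text came from."""
    return [
        {"text": piece, "id": f"{source}#{n}", "source": source}
        for n, piece in enumerate(chunk_words(text, size, overlap))
    ]


# Tiny demo: 10 "words", windows of 4 with 1 word of overlap.
long_doc = " ".join(f"w{i}" for i in range(10))
records = to_records(long_doc, "policy.md", size=4, overlap=1)
for r in records:
    print(r["id"], "->", r["text"])
```

For real token counts you would swap `text.split()` for a tokenizer; the windowing logic stays the same.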
3. Embedding
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embed_model.encode(
    [c["text"] for c in chunks],
    normalize_embeddings=True,  # cosine similarity friendly
    show_progress_bar=True,
)
print("embedding shape:", embeddings.shape)
# (10, 384)
BGE-small is a strong open-source default in 2026: 384 dimensions, fast, and comparable in quality to OpenAI text-embedding-3-small. Always normalize embeddings if your vector store uses dot product, because for unit-length vectors the dot product equals cosine similarity.
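To see why normalization matters, here is a small standalone check in plain Python, with no embedding model involved. The two vectors are made up for the demonstration; the point is only that the dot product of the unit-length versions matches the cosine similarity of the originals, while the raw dot product does not.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                          # depends on vector length
unit_dot = dot(normalize(a), normalize(b))   # length factored out
cos_ab = cosine(a, b)

print(f"raw dot product:        {raw_dot:.2f}")
print(f"dot of unit vectors:    {unit_dot:.2f}")
print(f"cosine similarity:      {cos_ab:.2f}")
```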
4. The Vector Store
import chromadb

client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(
    name="returns_kb",
    metadata={"hnsw:space": "cosine"},
)
collection.upsert(
    ids=[c["id"] for c in chunks],
    documents=[c["text"] for c in chunks],
    embeddings=embeddings.tolist(),
    metadatas=[{"source": c["source"]} for c in chunks],
)
print("indexed:", collection.count(), "chunks")
Chroma is the simplest vector DB to learn: pip-installable, embedded in your Python process, no separate service. Lesson 6 covers Pinecone / Weaviate / Qdrant for production. The API is roughly the same across all of them.
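It can help to see what a vector store computes under the hood. The sketch below is an exact, brute-force nearest-neighbour search in plain Python: score every stored vector against the query and keep the best k. It is not how Chroma works internally (the `hnsw:space` setting above indicates an approximate HNSW index); `top_k` is a hypothetical helper and the 2-dimensional vectors are toy data.

```python
import heapq


def top_k(query_vec, vectors, k=3):
    """Return [(index, score), ...] for the k vectors with the highest
    dot product against query_vec. Assumes all vectors are unit length,
    so the dot product is the cosine similarity."""
    scored = (
        (sum(q * x for q, x in zip(query_vec, vec)), i)
        for i, vec in enumerate(vectors)
    )
    best = heapq.nlargest(k, scored)  # highest scores first
    return [(i, score) for score, i in best]


# Toy unit vectors standing in for chunk embeddings.
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
query = [0.8, 0.6]

hits = top_k(query, vectors, k=2)
for i, score in hits:
    print(f"vector {i}: similarity {score:.2f}")
```

A full scan like this is fine for a few thousand chunks; an index trades a little accuracy for speed once the collection grows.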
5. Retrieval
def retrieve(query, k=3):
    # Embed the query with the same model and normalization used for the chunks.
    q_emb = embed_model.encode(query, normalize_embeddings=True).tolist()
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    # Chroma returns distances, so a lower "score" means a closer match.
    return [
        {"text": doc, "source": meta["source"], "score": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

for r in retrieve("Can I return my laptop after a month?"):
    print(f"[{r['source']}] {r['text']} (dist={r['score']:.3f})")
The top two results should be the laptop-return policy and the general return policy. The semantic match is what makes RAG work: the query says "month", the doc says "30 days", and embedding similarity bridges them.
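Note that the "score" printed above is a distance, so smaller is better. One optional refinement is to convert it to a similarity and drop weak matches before they reach the prompt. The sketch below assumes the cosine space configured earlier, where distance = 1 - cosine similarity. The helper `filter_hits`, the 0.5 cutoff, and the distance values in the sample data are all hypothetical and chosen only for illustration.

```python
def filter_hits(hits, min_similarity=0.5):
    """Keep hits whose cosine similarity clears the cutoff.

    Assumes each hit's "score" is a cosine distance (1 - similarity),
    as returned by a collection created with hnsw:space = cosine.
    """
    kept = []
    for hit in hits:
        similarity = 1.0 - hit["score"]
        if similarity >= min_similarity:
            kept.append({**hit, "similarity": similarity})
    return kept


# Made-up distances in the same shape retrieve() returns.
sample_hits = [
    {"text": "Opened laptops can be returned within 14 days for a 15% restocking fee.",
     "source": "doc_1.md", "score": 0.25},
    {"text": "Our return window is 30 days for unopened items.",
     "source": "doc_0.md", "score": 0.5},
    {"text": "Order tracking is available from your account dashboard.",
     "source": "doc_9.md", "score": 0.75},
]

kept = filter_hits(sample_hits, min_similarity=0.5)
for h in kept:
    print(f"[{h['source']}] similarity={h['similarity']:.2f}")
```

The right cutoff depends on the embedding model and the corpus, so treat any fixed number as a starting point to tune against real queries.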
6. The Prompt
SYSTEM_PROMPT = (
    "You are a helpful customer-support assistant. Answer the user's "
    "question using ONLY the documents below. If the documents do not "
    "contain enough information, say so explicitly and do not guess. "
    "Cite sources by their bracketed number, e.g., [1] or [2]."
)

def format_prompt(query, chunks):
    docs_str = "\n\n".join(
        f"[{i+1}] (from {c['source']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        f"DOCUMENTS:\n{docs_str}\n\n"
        f"QUESTION: {query}\n\n"
        "ANSWER:"
    )
Three things this prompt does well: (a) tells the model to only use the retrieved docs, (b) gives an explicit "I don't know" escape, (c) asks for citations the user can verify. Skipping any of these creates a footgun.
7. Generation
from openai import OpenAI

client_llm = OpenAI()

def answer(query, k=3):
    chunks = retrieve(query, k=k)
    prompt = format_prompt(query, chunks)
    response = client_llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content, chunks

reply, sources = answer("Can I return my laptop after a month?")
print(reply)
print("\nSources:")
for s in sources:
    print(f"  - {s['source']}: {s['text']}")
Expected output: something like "Opened laptops can only be returned within 14 days, so a return after a month is not possible per [1]. Unopened items have a 30-day window per [2]." — the model uses the retrieved docs and cites them back.
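Because the citations are just bracketed numbers, they can be checked mechanically. The sketch below pulls every `[N]` out of a reply and flags numbers that do not correspond to a retrieved chunk. The helper names are hypothetical, and the check only confirms that a cited number exists; it says nothing about whether the cited chunk actually supports the claim.

```python
import re


def cited_numbers(reply):
    """Return the distinct citation numbers found in the reply, sorted."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", reply)})


def check_citations(reply, sources):
    """Compare cited numbers against the 1-based positions of `sources`."""
    cited = cited_numbers(reply)
    invalid = [n for n in cited if not 1 <= n <= len(sources)]
    return {
        "cited": cited,
        "invalid": invalid,      # numbers with no matching chunk
        "uncited": not cited,    # True when the reply cites nothing
    }


# Placeholder chunks; only the count matters for this check.
sources = [{"source": "doc_1.md"}, {"source": "doc_0.md"}, {"source": "doc_7.md"}]

good = check_citations(
    "Opened laptops can only be returned within 14 days, so a return "
    "after a month is not possible per [1]. Unopened items have a "
    "30-day window per [2].",
    sources,
)
bad = check_citations("Returns are always free, see [5].", sources)

print(good)
print(bad)
```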
8. The Whole Pipeline, in 50 Lines
# rag.py: production-shaped
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

EMBED = SentenceTransformer("BAAI/bge-small-en-v1.5")
CLIENT = OpenAI()
# No hnsw:space is set here, so Chroma falls back to its default metric (L2).
# With normalized embeddings that yields the same ranking as cosine.
DB = chromadb.PersistentClient(path="./chroma").get_or_create_collection("kb")

def index(docs):
    embs = EMBED.encode([d["text"] for d in docs],
                        normalize_embeddings=True).tolist()
    DB.upsert(ids=[d["id"] for d in docs],
              documents=[d["text"] for d in docs],
              embeddings=embs,
              metadatas=[{"source": d["source"]} for d in docs])

def retrieve(query, k=3):
    q = EMBED.encode(query, normalize_embeddings=True).tolist()
    r = DB.query(query_embeddings=[q], n_results=k)
    return list(zip(r["documents"][0], r["metadatas"][0]))

def answer(query, k=3):
    docs = retrieve(query, k=k)
    formatted = "\n".join(f"[{i+1}] ({m['source']}) {t}"
                          for i, (t, m) in enumerate(docs))
    prompt = f"DOCUMENTS:\n{formatted}\n\nQUESTION: {query}\nANSWER:"
    out = CLIENT.chat.completions.create(
        model="gpt-4o-mini", temperature=0,
        messages=[
            {"role": "system", "content":
                "Use the documents to answer. If they don't help, say so. "
                "Cite as [N]."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
    return out, docs
Three functions: index, retrieve, answer. ~50 lines. Wrap with FastAPI or Streamlit and you have a working "chat with my docs" service. Sections 2-5 of this course turn the same shape into a production-grade system.