Introduction to ML System Design

ML system design is the engineering discipline of turning a model into a system someone can actually ship, scale, and operate. As a rule of thumb, modeling is roughly 10% of the work in production ML; the other 90% is everything around the model: data pipelines, feature stores, serving infrastructure, monitoring, evaluation, and experimentation. This course covers that 90%, the part that decides whether your project ends in a paper or a product. It is also one of the most commonly asked interview topics for senior ML roles at FAANG-tier companies.

1. The Three Realities

Reality	What it means
The model is a small part of the system	Sculley et al. (2015), "Hidden Technical Debt in Machine Learning Systems", famously drew the ML code as one small box in a diagram dominated by the infrastructure around it: configuration, data collection, feature extraction, data verification, serving infrastructure, and monitoring
Production ML is software engineering first	Reliability, latency, cost, observability, and security usually matter more than a 1% accuracy gain
Design decisions compound	The architecture you pick on day one shapes a year of velocity; bad early choices are expensive to undo

2. What "ML System Design" Actually Covers

Six concerns, in roughly the order you'd think about them on day one of a real project:

Requirements — what is the system supposed to do, for whom, with what constraints (latency, cost, compliance)?
Data — where does training data come from, how does it flow, how is it versioned, how do features arrive at serving time?
Modeling — what models are appropriate given data scale, latency budget, and accuracy needs?
Serving — how do predictions reach users (online API, batch table, edge device, hybrid)?
Monitoring & Feedback — how do you know the system still works tomorrow?
Operations — deployment, rollback, retraining, on-call, compliance.
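
One way to make the requirements step concrete is to write the constraints down as data and check candidate designs against them. The sketch below is illustrative only: the class names, field names, and all numbers are assumptions, not part of any real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical constraint sheet; fields and values are illustrative."""
    p99_latency_ms: float       # latency budget for one prediction
    monthly_budget_usd: float   # cost ceiling for serving
    needs_audit_log: bool       # stand-in for a compliance constraint


@dataclass(frozen=True)
class CandidateDesign:
    """Hypothetical summary of one serving option."""
    name: str
    p99_latency_ms: float
    monthly_cost_usd: float
    has_audit_log: bool


def violations(req: Requirements, design: CandidateDesign) -> list:
    """Return the constraints a design breaks (an empty list means it fits)."""
    problems = []
    if design.p99_latency_ms > req.p99_latency_ms:
        problems.append("latency")
    if design.monthly_cost_usd > req.monthly_budget_usd:
        problems.append("cost")
    if req.needs_audit_log and not design.has_audit_log:
        problems.append("compliance")
    return problems


req = Requirements(p99_latency_ms=100, monthly_budget_usd=5_000, needs_audit_log=True)
online = CandidateDesign("online API", 80, 7_500, True)
batch = CandidateDesign("nightly batch table", 5, 900, True)

print(violations(req, online))  # ['cost']
print(violations(req, batch))   # []
```

Writing constraints this way forces the "for whom, with what constraints" question to be answered with numbers before any modeling starts.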

3. The Canonical Production ML System

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Sources     │  │  Sources     │  │   Sources    │
│ (events, DB) │  │ (3rd-party)  │  │   (logs)     │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       └─────────────────┼─────────────────┘
                         ▼
                ┌──────────────────┐
                │  Data pipeline   │   batch + stream
                └────────┬─────────┘
                         ▼
                ┌──────────────────┐
                │  Feature store   │   offline + online
                └────────┬─────────┘
            offline ──┬───┴─── online
                      ▼          ▼
              ┌────────────┐  ┌─────────────┐
              │ Training   │  │ Online API  │
              │ pipeline   │  │ (low-latency│
              └─────┬──────┘  │  serving)   │
                    ▼         └─────┬───────┘
              ┌────────────┐        │
              │ Model      │◀───────┤
              │ Registry   │        │
              └─────┬──────┘        ▼
                    └──────▶  ┌────────────┐
                              │ Predictions│
                              │ + outcomes │
                              └─────┬──────┘
                                    ▼
                              Monitoring + retrain trigger

Almost every production ML system has this shape. Variations (batch-only, edge, real-time-only) drop or expand boxes. Knowing the canonical picture gives you a vocabulary for every later conversation.
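
The last box in the diagram, "Monitoring + retrain trigger", can be sketched as a rolling comparison between logged predictions and their eventual outcomes. This is a minimal sketch under stated assumptions: the class name, the baseline accuracy, the tolerance, and the window size are all hypothetical, and real systems often monitor input drift as well as accuracy.

```python
from collections import deque


class RetrainTrigger:
    """Track recent prediction outcomes and flag when accuracy degrades.

    baseline_accuracy, tolerance, and window are illustrative assumptions.
    """

    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05, window: int = 100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # 1 if the prediction matched the outcome, else 0; old entries fall off.
        self.hits = deque(maxlen=window)

    def record(self, prediction: int, outcome: int) -> None:
        self.hits.append(1 if prediction == outcome else 0)

    def should_retrain(self) -> bool:
        # Wait for a full window so a few early misses do not fire the trigger.
        if len(self.hits) < self.hits.maxlen:
            return False
        rolling_accuracy = sum(self.hits) / len(self.hits)
        return rolling_accuracy < self.baseline - self.tolerance


trigger = RetrainTrigger(baseline_accuracy=0.90, tolerance=0.05, window=10)
for _ in range(9):
    trigger.record(prediction=1, outcome=1)   # nine correct predictions
trigger.record(prediction=1, outcome=0)       # one miss: rolling accuracy 0.9
print(trigger.should_retrain())               # False

for _ in range(3):
    trigger.record(prediction=1, outcome=0)   # three more misses: 0.6 over the window
print(trigger.should_retrain())               # True
```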

4. Why Design Matters: Cost of Getting It Wrong

5. The Trade-Offs You'll Make Repeatedly

Axis	Trade-off
Latency vs accuracy	Larger models score better but cost more at serving time
Freshness vs cost	Real-time features beat hourly features, and both beat daily ones; as a rough rule of thumb, each step up in freshness costs about 10× more
Online vs batch	Online is responsive; batch is cheaper if predictions can be precomputed
Build vs buy	Custom infrastructure vs SageMaker / Vertex AI / Databricks
Coverage vs confidence	Predict for everyone with low confidence, or fewer with high — affects fallback design
Generalization vs personalization	One global model vs per-segment models

Senior ML system design is largely the ability to articulate these trade-offs and pick the right side for the specific problem.
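
To make one of these axes concrete, the coverage vs confidence row can be expressed as a threshold on model confidence, with everything below the threshold routed to a fallback. The function names, the scores, and the thresholds below are hypothetical and chosen only for illustration.

```python
def route(confidence: float, threshold: float) -> str:
    """Serve the model's prediction only when it is confident enough."""
    return "model" if confidence >= threshold else "fallback"


def coverage(confidences: list, threshold: float) -> float:
    """Fraction of requests answered by the model rather than the fallback."""
    served = sum(1 for c in confidences if c >= threshold)
    return served / len(confidences)


# Hypothetical confidence scores for eight requests.
scores = [0.95, 0.80, 0.65, 0.55, 0.99, 0.70, 0.40, 0.90]

for threshold in (0.5, 0.7, 0.9):
    print(threshold, coverage(scores, threshold))
# 0.5 0.875
# 0.7 0.625
# 0.9 0.375
```

Raising the threshold shrinks coverage from 87.5% to 37.5% of requests in this toy data, which is why the fallback path (a rule, a popularity baseline, a human reviewer) has to be designed alongside the model.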

6. Examples of Real Production Systems

YouTube recommendations — two-stage candidate generation + ranking; billions of users; sub-100ms latency; the canonical recsys design.
Google Search ranking — multi-stage retrieval → first-pass ranking → fine ranking → reranking; each stage shrinks the candidate set with cheaper-then-richer features.
Stripe fraud detection — sub-100ms decisions on payment authorization; combination of rules + multiple models; explicit human-in-the-loop for borderline cases.
Tesla Autopilot — on-device perception with intermittent cloud retraining; data flywheel where rare events get prioritized for labeling.
OpenAI ChatGPT serving — KV-cache management, speculative decoding, batching; latency and cost dominate over all else at scale.
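
The multi-stage pattern shared by the recommendation and search examples can be sketched as a funnel: a cheap score prunes the catalog, and a more expensive score runs only on the survivors. This is a toy illustration, not how any of the companies above implement it; the catalog, both scoring functions, and the cutoffs are assumptions.

```python
import heapq


def cheap_score(item: dict, query_tags: set) -> int:
    """Stage 1: a cheap feature, here simple tag overlap (illustrative)."""
    return len(item["tags"] & query_tags)


def rich_score(item: dict, query_tags: set) -> float:
    """Stage 2: stands in for an expensive model; runs only on survivors."""
    return cheap_score(item, query_tags) + 0.1 * item["quality"]


def recommend(catalog: list, query_tags: set, k_candidates: int = 3, n_final: int = 2) -> list:
    # Candidate generation: keep the k best items by the cheap score.
    candidates = heapq.nlargest(
        k_candidates, catalog, key=lambda it: cheap_score(it, query_tags)
    )
    # Ranking: re-order the small candidate set with the richer score.
    ranked = sorted(candidates, key=lambda it: rich_score(it, query_tags), reverse=True)
    return [it["id"] for it in ranked[:n_final]]


# Hypothetical catalog.
CATALOG = [
    {"id": "a", "tags": {"ml", "python"}, "quality": 2},
    {"id": "b", "tags": {"ml"}, "quality": 9},
    {"id": "c", "tags": {"cooking"}, "quality": 10},
    {"id": "d", "tags": {"ml", "python", "systems"}, "quality": 8},
    {"id": "e", "tags": {"go"}, "quality": 5},
]

print(recommend(CATALOG, {"ml", "python"}))  # ['d', 'a']
```

Item "c" has the highest quality but never reaches the expensive stage, which is the point of the funnel: the rich scorer's cost scales with `k_candidates`, not with catalog size.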

7. The "Smallest Useful System"

You don't need every box on day one. The smallest end-to-end ML system that earns its keep:

One source table.
A SQL feature pipeline.
A training script logging to MLflow.
A FastAPI service loading the latest model.
Prometheus + Grafana on the API.

This is roughly six tools and three weeks of work. Ship it, learn from it, then grow into the rest of the canonical architecture deliberately.
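
The five pieces listed above can be sketched end to end with nothing but the standard library. Every component here is a deliberate stand-in, labeled as such: a Python list instead of a source table, a function instead of SQL, a JSON file instead of an MLflow run, a plain function instead of a FastAPI route, and a dict counter instead of a Prometheus metric. The data and the one-number "model" are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean

# 1. One source table (hypothetical rows: amount spent, churn label).
ROWS = [
    {"spend": 5.0, "churned": 1},
    {"spend": 8.0, "churned": 1},
    {"spend": 40.0, "churned": 0},
    {"spend": 55.0, "churned": 0},
]


def features(row: dict) -> float:
    """2. Feature pipeline: stand-in for a SQL query."""
    return row["spend"]


def train(rows: list, model_dir: Path, version: int) -> Path:
    """3. Training script: fit a one-number threshold model and save it."""
    churn = mean(features(r) for r in rows if r["churned"] == 1)
    stay = mean(features(r) for r in rows if r["churned"] == 0)
    path = model_dir / f"model_v{version:04d}.json"
    path.write_text(json.dumps({"threshold": (churn + stay) / 2}))
    return path


def load_latest(model_dir: Path) -> dict:
    """4. Serving: load the newest model file (zero-padded names sort correctly)."""
    latest = sorted(model_dir.glob("model_v*.json"))[-1]
    return json.loads(latest.read_text())


METRICS = {"requests": 0}  # 5. Monitoring: stand-in for a request counter.


def predict(model: dict, row: dict) -> int:
    """Predict churn (1) when spend falls below the learned threshold."""
    METRICS["requests"] += 1
    return 1 if features(row) < model["threshold"] else 0


model_dir = Path(tempfile.mkdtemp())
train(ROWS, model_dir, version=1)
model = load_latest(model_dir)
print(model["threshold"], predict(model, {"spend": 10.0}))  # 27.0 1
```

Swapping each stand-in for the real tool changes the plumbing but not the shape, which is what makes this a reasonable first skeleton.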

8. Why ML System Design Interviews Exist

Senior ML interviews almost always include a system design round. The interviewer wants to see whether you can:

Decompose a vague problem into concrete sub-problems (scoping).
Articulate trade-offs between architectural choices.
Reason about scale (10K users vs 100M users).
Identify where things will go wrong before they go wrong.
Communicate clearly while drawing on a whiteboard.
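
Reasoning about scale usually starts with a back-of-envelope estimate like the one below. All inputs are assumptions picked for illustration: 20 requests per user per day, a peak-to-average ratio of 3, and 200 requests per second per serving replica.

```python
import math


def peak_qps(daily_users: int, requests_per_user: float, peak_factor: float = 3.0) -> float:
    """Average requests/sec over a day, scaled by an assumed peak-to-average ratio."""
    seconds_per_day = 24 * 60 * 60
    return daily_users * requests_per_user / seconds_per_day * peak_factor


def replicas_needed(qps: float, qps_per_replica: float) -> int:
    """Round up, and never go below one replica."""
    return max(1, math.ceil(qps / qps_per_replica))


for users in (10_000, 100_000_000):
    qps = peak_qps(users, requests_per_user=20)
    print(users, round(qps, 1), replicas_needed(qps, qps_per_replica=200))
# 10000 6.9 1
# 100000000 69444.4 348
```

Under these assumptions the 10K-user system fits on a single replica, while the 100M-user system needs hundreds, so load balancing, autoscaling, and caching stop being optional.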

Section 6 of this course is a focused interview-practice block. The same skills are what you exercise in real design reviews at work.

9. The Course Map

By Lesson 22 you will have:

Learned a framework for gathering requirements and translating them into architecture (Section 1).
Worked hands-on with data pipelines, feature stores, and stream vs batch processing (Section 2).
Built a complete model serving pipeline (Section 3).
Designed a recommendation system end-to-end (Section 4).
Designed a search ranking system (Section 5).
Practiced the system design interview format and shipped a complete design document (Section 6).

10. The Mindset
