ML System Architecture Patterns

25 min readreadingSystem Design Principles

3 of 22ML System Design

ML System Architecture Patterns

Once requirements are pinned down, the next move is picking an architectural pattern. Most production ML systems are built from a small number of recurring shapes; recognizing them and knowing when to reach for each is most of the senior-design skill. This reading is the catalogue: six dominant patterns, the trade-offs each makes, and how they compose.

1. Pattern 1: Pure Batch Prediction

code

Sources → Pipeline → Train → Score everyone → Write results table
                                                       ↓
                                    Serving = SQL SELECT from results

Predictions are computed for every entity (user, item) on a schedule (daily / hourly) and written to a results table. The "serving" layer is a thin SQL query.

Pros	Cons
Simple operations; cheap	Predictions are stale by definition
Compute scales with population, not request rate	No personalization on the in-session signal
Easy auditability	Can't handle new entities not in last batch

Examples: nightly recommendation refresh, weekly churn scores, monthly customer-tier assignments. Default for any ML system whose predictions don't change within hours.

2. Pattern 2: Online Real-Time Prediction

code

Request → API → Feature lookup → Model inference → Response
              (online feature store, low-latency cache)

Predictions are computed at request time. Features come from an online feature store; the model runs in milliseconds; the response is the prediction.

Latency budget: 10-200 ms p99.
Compute scales with request rate, not population.
Can use request-time signals (e.g., the cart contents).
Required when predictions change in-session.

Examples: fraud check at payment, ad ranking per impression, search autocomplete. Default when freshness matters.

3. Pattern 3: Hybrid (Precompute + Lookup)

code

Batch job → score everyone → write to KV store
                                ↓
Request →  GET key  → KV store  → response  (1-5 ms)

Pre-compute predictions in batch; serve them via a key-value store. Looks online to the user (millisecond lookups) but costs like batch (no per-request model inference).

Examples: most production recommendation systems, churn-flag serving, propensity-score serving. The default for any finite-population recsys.

4. Pattern 4: Streaming Real-Time

code

Event stream → Consumer → Feature update → Decision → Output stream
              (Kafka)    (Flink/Spark)              (Kafka topic)

Events flow continuously through a streaming compute layer that maintains feature state and emits predictions on a downstream topic. Used when both inputs and outputs are inherently streaming.

Examples: real-time fraud scoring on payment streams; anomaly detection on metrics; price-stream classification for trading.

5. Pattern 5: Edge / On-Device

code

Device captures input → Local model inference → Local action
                       (no network round-trip)

The model runs on the user's device — phone, browser, embedded hardware. Used when latency is critical, network is unreliable, or privacy demands data stays local.

Latency: 1-100 ms with no network.
Privacy: data never leaves the device.
Trade-off: model must fit on-device hardware (quantization, distillation often required).

Examples: face unlock on phones, voice wake-word, on-device autocomplete, in-browser image generation, Tesla autopilot perception.

6. Pattern 6: Multi-Stage (Cascade) Architecture

code

Request → Coarse retrieval (cheap) → Re-ranking (medium) → Fine ranking (expensive)
          (millions → 1000s)         (1000s → 100s)         (100s → 10s)

Common in search and recommenders. Each stage uses cheaper features but a larger candidate pool; later stages use richer features on smaller candidate sets. Total compute stays bounded while quality stays high.

Stage 1: ANN / inverted index / heuristic filter. Sub-millisecond per request.
Stage 2: small ranking model on the shortlist. ~10 ms.
Stage 3: large ranking model + business rules + diversification. ~100 ms.

Examples: YouTube recommendations, Google Search, Spotify Daily Mix. The dominant production pattern at FAANG-scale.

7. Composing Patterns

Real systems combine these. A typical e-commerce recommendation system:

Offline — nightly batch retraining; nightly candidate generation per user (Pattern 1).
Hybrid lookup — pre-computed candidate lists stored in Redis (Pattern 3).
Online — re-rank with session features at request time (Pattern 2 + 6).

No single pattern fits a non-trivial system. The architectural skill is composing them based on freshness needs of each piece.

8. The Decision Tree

9. Capacity Planning Math: From Requirements to Hardware

Picking a pattern is half the work; sizing it is the other half. Senior interviewers expect you to translate the three scale numbers (DAU, peak QPS, population) into a concrete shopping list: how many GPUs, how much Redis memory, what Kafka throughput. The arithmetic is grade-school but the ratios you apply matter. Memorize them.

Start from the canonical request-rate identity. If $D$ is daily active users, $r$ is requests per active user per day, and the traffic peak is $p$ × the average (typical $p \approx 3$ – $5$ for consumer surfaces, up to $10$ × for time-zoned events), then:

peak QPS = \frac{D \cdot r}{86 400} \cdot p

A dating app with $D = 5, 000, 000$ , $r = 30$ recommendation fetches per user per day, and a peak ratio of $p = 4$ needs to hold roughly $\tfrac{5{,}000{,}000 \cdot 30}{86{,}400} \cdot 4 \approx 6{,}944$ QPS at peak. Not a back-of-the-napkin number — a capacity-planning input.

python

import math


def peak_qps(dau: int, requests_per_user: float,
             peak_ratio: float = 4.0) -> float:
    # Translate a usage spec into a peak-second request rate.
    return (dau * requests_per_user / 86_400) * peak_ratio


def replicas_needed(peak_qps: float, qps_per_replica: float,
                    headroom: float = 0.5) -> int:
    # Number of model-server replicas at peak.
    # headroom: fraction of capacity to keep idle as a safety buffer.
    #   0.5 means you size to 2x the steady-state need so a
    #   replica can die without paging anyone.
    effective_per_replica = qps_per_replica * (1 - headroom)
    return math.ceil(peak_qps / effective_per_replica)


# Worked example: 7K peak QPS, each GPU pod handles 80 QPS at the
# latency target, 50% headroom for redundancy.
qps = peak_qps(dau=5_000_000, requests_per_user=30, peak_ratio=4.0)
n_replicas = replicas_needed(qps, qps_per_replica=80, headroom=0.5)
print(f"Peak QPS: {qps:,.0f}")
print(f"Replicas required: {n_replicas}")
# Peak QPS: 6,944
# Replicas required: 174

Memory sizing for the online feature store follows the same shape. If the population is $N$ entities, each entity has a feature vector of $F$ bytes (commonly 1-10 KB for production recsys), and the replication factor is $k$ (typical 3 for Redis Cluster), the cluster needs:

M = N \cdot F \cdot k (bytes)

100M users × 4 KB per user × 3 replicas ≈ 1.2 TB of RAM — a meaningful number for a procurement conversation.

10. Architecture Decision Records (ADRs)

An architecture only survives the year if the reasons for each decision survive too. ADRs are the lightweight documentation form that captures why — written once at decision time, then frozen. The next engineer who picks up the system reads them instead of guessing.

The canonical ADR template is short on purpose: status, context, decision, consequences. Anything longer gets skimmed. Keep ADRs in docs/adr/ alongside code, numbered sequentially, filename slug for searchability:

code

# docs/adr/0007-hybrid-recsys-serving.md

# ADR-0007: Use precompute + Redis lookup for the homepage feed

## Status
Accepted (2024-09-12). Supersedes ADR-0003.

## Context
Homepage feed serves 8K QPS at peak for 50M users. Latency
budget p99 < 80 ms. Item catalog refreshes nightly; user
preferences drift on a multi-day timescale. We considered:

- Pure online (Pattern 2) — meets latency but doubles GPU spend
  vs precompute, and ranking model is too large to fit budget.
- Pure batch with SQL serving (Pattern 1) — fails latency: warehouse
  query is 200 ms+.
- Hybrid: nightly batch scoring -> Redis -> online re-ranker
  (Pattern 3 + 6).

## Decision
Adopt the hybrid pattern. Nightly Spark job scores the top-500
candidates per user, writes them to a Redis cluster keyed by
user_id, and a thin re-ranker reorders the top-50 using
session-time features (last 5 viewed items, current device).

## Consequences
+ p99 latency budget met (~35 ms p99 measured in load test).
+ GPU footprint stays small: only re-ranker runs online.
+ Cold-start users (no batch precompute) need a fallback path
  -> popularity-based defaults; tracked in ADR-0008.
- Predictions are at most 24 h stale on the long tail; acceptable
  per Product (signed off in PR #2341).
- Operational complexity: now we own a Redis cluster (sized in
  capacity-planning doc).

The discipline pays back at three predictable moments: when a new hire asks "why did we do it this way?", when a postmortem needs to attribute a design choice, and when a future migration needs to know which constraints are still binding. Don't skip.

11. A Worked Example: Putting It Together

Pull the pieces together with a single mock requirement. The product team wants a real-time fraud-scoring service for a payments platform. Three scale numbers handed to you on a sticky note: $D = 2, 000, 000$ DAU, average user makes 4 payments a day, peak ratio is 5 (paychecks land mid-day on Fridays). Latency budget: p99 < 150 ms, end-to-end. Data freshness: must reflect the last 5 minutes of the user's transaction history.

Decision tree:

Predictions are session-time and depend on a feature (transaction history) that updates every 5 minutes → pure batch is out.
The freshness window (5 min) is too tight for a nightly precompute → hybrid lookup alone is out.
Inputs are an event stream (payment events) — streaming maintenance of features is natural.
Decision must return synchronously to the payment flow (approve / decline) — the request path itself must be online.

Composition: streaming feature pipeline (Pattern 4) feeding an online feature store, with online prediction (Pattern 2) at request time. Capacity sizing:

python

qps = peak_qps(dau=2_000_000, requests_per_user=4, peak_ratio=5.0)
print(f"Peak QPS: {qps:,.0f}")  # Peak QPS: 463

# Each gradient-boosted-tree pod handles ~600 QPS at p99 = 80 ms.
n = replicas_needed(qps, qps_per_replica=600, headroom=0.5)
print(f"Replicas: {n}")  # Replicas: 2 -- but Min 3 for redundancy.

# Online feature store: 2M users * 8 KB feature vector * 3 replicas
mem_gb = 2_000_000 * 8_000 * 3 / 1e9
print(f"Feature store RAM: {mem_gb:.1f} GB")  # 48 GB -- 3x r6g.2xlarge

Result: a single-page architecture sketch you could whiteboard in 20 minutes — Kafka in front, Flink for feature aggregation, Redis for the online feature store, three GBT serving pods behind a load balancer, predictions logged to a delta-lake table for offline retraining. Each choice traceable to a requirement.

12. Anti-Patterns to Avoid

Warning

Common Mistakes

"Real-time everything" — over-engineered; most predictions don't need millisecond freshness.
"Train/serve skew" — different feature code paths at training and serving. The #1 cause of production accuracy loss.
"One mega-pipeline" — ingestion + features + training + serving in one DAG. Coupling kills velocity.
"No fallback" — when the model is unavailable, what happens? Always design a fallback (rule, last-known-good, default).
"No model registry" — hardcoded model paths in serving code. Promotion / rollback becomes a deploy.
"Sized for average load" — peak ratio is usually 3-5×. Provisioning for the average means paging your team every Friday at noon.
"No ADRs" — by month six nobody remembers why you picked what you picked, and the next migration is built on guesses.

13. The Composability Mindset

← Previous lessonRequirements Gathering for ML Systems

Up next · Quiz: System Design Fundamentals