Requirements Gathering for ML Systems

Most failed ML projects fail at the requirements phase: the team built a clever model that solved the wrong problem, optimized the wrong metric, hit the wrong latency target, or overlooked a constraint that mattered. ML requirements are unusually treacherous because stakeholders specify a desired outcome, while the model lives in a world of probabilities, training data, and production constraints. This lesson gives a framework for pinning requirements down before any code is written.

1. The Five Requirement Categories

Category	Question	Example
Functional	What should the system do?	"Flag transactions likely to be fraud"
Performance	How well, how fast, how scaled?	"AUC ≥ 0.92, p99 latency < 100 ms, 10K req/s"
Constraints	What can't we do?	"No PII in training data, GDPR-compliant, AWS-only"
Business	What value does it deliver?	"Reduce fraud loss by $X/year"
Operational	Who owns it, what's the SLA?	"On-call by Trust & Safety; 99.9% availability"

Skipping any category invites surprises later. The operational category is skipped most often: who gets paged when this breaks?

2. The Critical Question: Is ML Even the Right Tool?

3. Translating Business Goals to ML Metrics

Stakeholders speak in business outcomes ("more conversions", "less spam", "happier customers"). ML lives in metrics (AUC, F1, NDCG). The translation is usually non-trivial.

Business goal	Direct ML metric	Pitfalls
Reduce fraud loss	Cost-weighted FP/FN (expected cost or Fβ)	Don't just optimize accuracy; a false negative is usually far more expensive than a false positive
Increase engagement	CTR or session length	Optimizing CTR alone produces clickbait
Improve search quality	NDCG, MRR, top-1 click rate	NDCG can rise while user-perceived quality drops; need offline eval + online experiment
Reduce customer churn	Calibrated probability + lift @ k	Calibration matters more than raw AUC for retention campaigns

Track two metrics for every project: an offline metric computable from labels (AUC, F1), and an online business metric measured via A/B testing (revenue, retention, user reports). The two correlate, but never perfectly.
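
To make the fraud row of the table concrete, here is a minimal sketch of why accuracy and business cost can disagree. The labels, the two models, and the dollar costs (`COST_FP`, `COST_FN`) are hypothetical values chosen for illustration, not figures from a real system.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Business cost: each false alarm and each missed fraud has a price."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return fp * cost_fp + fn * cost_fn


# Hypothetical data: 100 transactions, 5 of them fraudulent.
y_true = [1] * 5 + [0] * 95
model_a = [0] * 100                             # never flags anything
model_b = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85  # catches 4 of 5, 10 false alarms

COST_FP = 5.0    # assumed cost of reviewing one false alarm
COST_FN = 500.0  # assumed loss from one missed fraud

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(y_true, preds), total_cost(y_true, preds, COST_FP, COST_FN))
```

Under these assumed costs, the model that never flags anything has the higher accuracy (0.95 versus 0.89) but costs far more (2500 versus 550), which is the pitfall the table warns about.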

4. The Latency Budget Conversation

Latency requirements are inherited from the surrounding system, not picked freely:

Use case	Typical p99 budget
Search autocomplete	20-50 ms
Web recommender (page render path)	50-200 ms
Fraud check at payment	50-200 ms
Voice assistant response	200-500 ms
Email spam classification	1-5 s
Daily reporting batch	Hours

The model's allotment is usually 30-60% of the budget; the rest goes to feature lookup, network, and downstream calls. A common mistake is scoping a model that takes 200 ms into a page render path with a 100 ms p99 budget.
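
The budget arithmetic can be sketched as a quick check. The component names and their latencies below are hypothetical, and summing per-component estimates is only a rough planning approximation, since tail latencies of sequential components do not add exactly.

```python
def model_allotment_ms(p99_budget_ms, low_share=0.30, high_share=0.60):
    """Range of the end-to-end budget typically left for model inference."""
    return p99_budget_ms * low_share, p99_budget_ms * high_share


def check_budget(p99_budget_ms, component_ms):
    """Sum per-component latency estimates and compare against the budget.

    Returns (fits, total_ms). Treat the sum as a planning estimate only.
    """
    total = sum(component_ms.values())
    return total <= p99_budget_ms, total


page_render_budget_ms = 100

# Hypothetical per-component p99 estimates, in milliseconds.
components = {"feature_lookup": 25, "network": 15, "model_inference": 200}

low, high = model_allotment_ms(page_render_budget_ms)
fits, total = check_budget(page_render_budget_ms, components)

print(f"model allotment: {low:.0f}-{high:.0f} ms")
print(f"estimated total: {total} ms, fits budget: {fits}")
```

With these numbers the model alone is several times over its 30-60 ms allotment, so the conversation has to happen before modeling starts: shrink the model, move scoring off the render path, or renegotiate the budget.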

5. Scale: The Three Numbers

Always pin down three numbers early:

Daily active users (DAU) — drives feature store size and event volume.
Peak QPS — drives serving capacity. Peak typically 3-10× average.
Total population — drives offline batch compute (if you score all users nightly).

These numbers have order-of-magnitude effects on architecture. As rough rules of thumb: 10K DAU is a SQLite-and-FastAPI problem, 10M DAU needs a feature store and Kubernetes, and 1B DAU needs a custom serving stack.
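
A back-of-envelope sketch of how the three numbers turn into capacity estimates follows. The requests-per-user figure, the population size, and the batch scoring throughput are assumptions for illustration; only the 3-10× peak factor comes from the text above.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau, requests_per_user_per_day):
    """Average request rate implied by DAU and per-user activity."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps_range(avg_qps, low_factor=3, high_factor=10):
    """Peak is typically 3-10x average, so plan for a range."""
    return avg_qps * low_factor, avg_qps * high_factor


def batch_hours(total_population, rows_per_second):
    """Wall-clock hours to score every user once at a given throughput."""
    return total_population / rows_per_second / 3600


# Hypothetical inputs.
dau = 10_000_000
requests_per_user_per_day = 20
total_population = 50_000_000
scoring_rows_per_second = 5_000

avg = average_qps(dau, requests_per_user_per_day)
peak_low, peak_high = peak_qps_range(avg)
nightly = batch_hours(total_population, scoring_rows_per_second)

print(f"average QPS: {avg:,.0f}")
print(f"peak QPS:    {peak_low:,.0f} to {peak_high:,.0f}")
print(f"nightly batch: {nightly:.1f} hours")
```

Numbers this rough are still enough to tell whether serving capacity and the nightly batch window are in the right order of magnitude.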

6. Data Requirements

Always ask:

Where does training data come from? Logged predictions, human labels, distant supervision, synthetic.
How is it labeled? Auto from outcomes, sampled human review, full coverage.
How fresh must it be? Daily, hourly, real-time?
What's the volume? 1M rows trains in minutes on one machine; 1T rows needs distributed processing (e.g., Spark).
How stable is the schema? New columns may arrive monthly; renames break downstream consumers.
What about privacy / compliance? PII handling, retention, regional data residency.

7. The "Negative Requirements"

Constraints expressed as "do NOT" are as important as positive goals:

No discrimination on protected attributes (sometimes legally enforced).
No personally identifying information in logs.
Cannot use third-party libraries (regulated environments).
Must run on-prem (compliance, data sovereignty).
Must explain decisions (financial, medical).

These often dominate architecture choices. A model with no explainability requirements can use any state-of-the-art approach; a credit-decision model under the US Equal Credit Opportunity Act (ECOA) must produce reason codes.

8. The Stakeholder Matrix

Stakeholder	Cares about
Product manager	User experience metrics, A/B test wins
Data scientist	Offline accuracy, feature richness
ML engineer	Reliability, latency, observability
Data engineer	Feature pipelines, schema stability
SRE / on-call	SLOs, paging, runbooks
Legal / compliance	Privacy, fairness, auditability
Finance	Compute costs, ROI

Real ML systems serve all seven. Asking each what they care about (and what would make them say "block this launch") often surfaces requirements no one had documented.

9. Writing It Down: The Design Doc

By the end of requirements gathering you should have a 1-3 page design doc with:

Problem statement — one paragraph; who is the user and what need does the system address.
Success metrics — primary business metric + corresponding ML metric + acceptable values.
Constraints — latency, scale, cost, compliance, dependencies.
Out of scope — explicit list of things this system does NOT do.
Risks & alternatives — what could kill this; what we considered and rejected.
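
One way to keep the doc honest is to treat the five sections as a checklist. The sketch below is a hypothetical template, not a standard format: the `DesignDoc` class and its `missing_sections` helper are names invented here, and the example entries reuse the fraud example from the first table.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DesignDoc:
    """Hypothetical template mirroring the five sections listed above."""
    problem_statement: str = ""
    success_metrics: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)
    risks_and_alternatives: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


doc = DesignDoc(
    problem_statement="Flag transactions likely to be fraud.",
    success_metrics=["Reduce fraud loss by $X/year", "AUC >= 0.92"],
)

print(doc.missing_sections())
```

Here the draft still lacks constraints, an out-of-scope list, and risks, so it is not ready for review.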

10. The Discipline
