Requirements Gathering for ML Systems
Most failed ML projects fail at the requirements phase: the team builds a clever model that solves the wrong problem, optimizes the wrong metric, targets the wrong latency, or overlooks a critical constraint. ML requirements are unusually treacherous because stakeholders specify a desired outcome, while the model lives in a world of probabilities, training data, and production constraints. This lesson presents a framework for pinning requirements down before any code is written.
1. The Five Requirement Categories
| Category | Question | Example |
|---|---|---|
| Functional | What should the system do? | "Flag transactions likely to be fraud" |
| Performance | How well, how fast, how scaled? | "AUC ≥ 0.92, p99 latency < 100 ms, 10K req/s" |
| Constraints | What can't we do? | "No PII in training data, GDPR-compliant, AWS-only" |
| Business | What value does it deliver? | "Reduce fraud loss by $X/year" |
| Operational | Who owns it, what's the SLA? | "On-call by Trust & Safety; 99.9% availability" |
Skipping any category invites surprises later. The operational category is the one most often skipped: who pages whom when this breaks?
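One way to keep all five categories visible is to record them in a small structure that reports which ones are still empty. The sketch below is only an illustration: the `Requirements` class and its `missing_categories` method are hypothetical names, and the sample entries are copied from the table above.

```python
from dataclasses import dataclass, field
from typing import List

# The five categories from the table above, in the same order.
CATEGORIES = ("functional", "performance", "constraints", "business", "operational")


@dataclass
class Requirements:
    """Hypothetical container: one list of requirement statements per category."""
    functional: List[str] = field(default_factory=list)
    performance: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    business: List[str] = field(default_factory=list)
    operational: List[str] = field(default_factory=list)

    def missing_categories(self) -> List[str]:
        """Return the categories that have no requirement written down yet."""
        return [name for name in CATEGORIES if not getattr(self, name)]


# Example entries taken from the table; the operational row is left out on purpose.
fraud = Requirements(
    functional=["Flag transactions likely to be fraud"],
    performance=["AUC >= 0.92", "p99 latency < 100 ms", "10K req/s"],
    constraints=["No PII in training data", "GDPR-compliant", "AWS-only"],
    business=["Reduce fraud loss by $X/year"],
)

print(fraud.missing_categories())  # ['operational']
```

An empty list from `missing_categories()` does not mean the requirements are good, only that no category was skipped outright.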
2. The Critical Question: Is ML Even the Right Tool?
3. Translating Business Goals to ML Metrics
Stakeholders speak in business outcomes ("more conversions", "less spam", "happier customers"). ML lives in metrics (AUC, F1, NDCG). The translation is usually non-trivial.
| Business goal | Direct ML metric | Pitfalls |
|---|---|---|
| Reduce fraud loss | Cost-weighted FP/FN loss, or F-beta with β > 1 | Don't just optimize accuracy; a false negative is usually far more expensive than a false positive |
| Increase engagement | CTR or session length | Optimizing CTR alone produces clickbait |
| Improve search quality | NDCG, MRR, top-1 click rate | NDCG can rise while user-perceived quality drops; need offline eval + online experiment |
| Reduce customer churn | Calibrated probability + lift @ k | Calibration matters more than raw AUC for retention campaigns |
Every project needs two metrics: an offline ML metric computable from labels (AUC, F1), and an online business metric measured via A/B testing (revenue, retention, user reports). The two correlate but never perfectly.
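The fraud row in the table can be made concrete with a small cost calculation. The sketch below picks the decision threshold that minimizes total error cost on a labeled sample. The function names, the toy labels and scores, and the 20:1 cost ratio are all assumptions chosen for illustration, not figures from a real system.

```python
def expected_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when every score >= threshold is flagged as positive."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp      # false positive: a legitimate case was flagged
        elif not flagged and label == 1:
            cost += cost_fn      # false negative: a positive case slipped through
    return cost


def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Try each observed score (plus 'flag nothing') and keep the cheapest threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, scores, t, cost_fp, cost_fn),
    )


# Toy labeled sample (1 = fraud) with model scores; purely illustrative.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.20, 0.30, 0.40, 0.55, 0.70, 0.80, 0.95]

# Assumed costs: a missed fraud costs 20x as much as a false alarm.
fn_heavy = best_threshold(y_true, scores, cost_fp=1, cost_fn=20)
# Reversed assumption: false alarms are the expensive error.
fp_heavy = best_threshold(y_true, scores, cost_fp=5, cost_fn=1)

print(fn_heavy, fp_heavy)  # 0.3 0.95
```

The same model yields very different operating points depending on the cost ratio, which is why the ratio itself is a requirement to collect from stakeholders rather than a modeling detail.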
4. The Latency Budget Conversation
Latency requirements are inherited from the surrounding system, not picked freely:
| Use case | Typical p99 budget |
|---|---|
| Search autocomplete | 20-50 ms |
| Web recommender (page render path) | 50-200 ms |
| Fraud check at payment | 50-200 ms |
| Voice assistant response | 200-500 ms |
| Email spam classification | 1-5 s |
| Daily reporting batch | Hours |
The model's allotment is usually 30-60% of the budget; the rest goes to feature lookup, network hops, and downstream calls. A common mistake is scoping a model with a 200 ms p99 latency into a 100 ms p99 page-render budget.
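The budget arithmetic is simple enough to check in a few lines. This is a minimal sketch with hypothetical helper names. It treats the model's share as a straight percentage of the end-to-end p99 budget, which is a planning simplification because percentile latencies of separate components do not add exactly.

```python
def model_allotment_ms(p99_budget_ms, model_share_pct):
    """Milliseconds left for model inference, given its percentage share of the budget."""
    return p99_budget_ms * model_share_pct / 100


def fits_budget(model_p99_ms, p99_budget_ms, model_share_pct):
    """True if the model's own p99 latency fits inside its allotment."""
    return model_p99_ms <= model_allotment_ms(p99_budget_ms, model_share_pct)


budget_ms = 100  # page-render p99 budget from the example above

low = model_allotment_ms(budget_ms, 30)   # pessimistic share
high = model_allotment_ms(budget_ms, 60)  # optimistic share
print(low, high)  # 30.0 60.0

# The 200 ms model does not fit even with the most generous share.
print(fits_budget(200, budget_ms, 60))  # False
```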
5. Scale: The Three Numbers
Always pin down three numbers early:
- Daily active users (DAU) — drives feature store size and event volume.
- Peak QPS — drives serving capacity. Peak typically 3-10× average.
- Total population — drives offline batch compute (if you score all users nightly).
These numbers have order-of-magnitude effects on architecture. 10K DAU is a SQLite-and-FastAPI problem. 10M DAU needs a feature store and Kubernetes. 1B DAU needs a custom serving stack.
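A back-of-the-envelope calculation connects DAU to serving capacity. The sketch below is a rough estimate under stated assumptions: the figure of 20 requests per user per day is invented for illustration, and the function names are hypothetical. Only the 3-10× peak multiplier comes from the list above.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau, requests_per_user_per_day):
    """Average queries per second if traffic were spread evenly over the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps_range(avg_qps, low_multiplier=3, high_multiplier=10):
    """Peak QPS range, using the 3-10x peak-to-average rule of thumb."""
    return avg_qps * low_multiplier, avg_qps * high_multiplier


# Assumed workload: 10M DAU, each user triggering 20 model calls per day.
avg = average_qps(dau=10_000_000, requests_per_user_per_day=20)
peak_low, peak_high = peak_qps_range(avg)

print(f"average ~{avg:,.0f} QPS, peak ~{peak_low:,.0f} to {peak_high:,.0f} QPS")
```

Serving capacity should be sized against the peak range, not the average, and the requests-per-user figure is worth confirming from logs before committing to hardware.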
6. Data Requirements
Always ask:
- Where does training data come from? Logged predictions, human labels, distant supervision, synthetic.
- How is it labeled? Auto from outcomes, sampled human review, full coverage.
- How fresh must it be? Daily, hourly, real-time?
- What's the volume? 1M rows trains in minutes; 1T rows needs Spark.
- How stable is the schema? Column additions may happen monthly and are usually safe; renames break downstream consumers.
- What about privacy / compliance? PII handling, retention, regional data residency.
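The schema-stability question can be backed by a cheap automated check. The following sketch compares the columns a pipeline expects against the columns that actually arrive. The `schema_diff` helper and the column names are hypothetical. Note that a rename shows up as one missing column plus one added column.

```python
def schema_diff(expected_columns, actual_columns):
    """Report columns that disappeared and columns that appeared."""
    expected, actual = set(expected_columns), set(actual_columns)
    return {
        "missing": sorted(expected - actual),  # breaks consumers that read these
        "added": sorted(actual - expected),    # usually harmless, still worth logging
    }


# Hypothetical feature table: 'merchant_id' was renamed and a new column was added.
expected = ["user_id", "amount", "merchant_id", "event_ts"]
actual = ["user_id", "amount", "merchant", "event_ts", "device_type"]

diff = schema_diff(expected, actual)
print(diff)  # {'missing': ['merchant_id'], 'added': ['device_type', 'merchant']}
```

Running a check like this before training or scoring turns a silent rename into an explicit failure.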
7. The "Negative Requirements"
Constraints expressed as "do NOT" are as important as positive goals:
- No discrimination on protected attributes (sometimes legally enforced).
- No personally identifying information in logs.
- Cannot use third-party libraries (regulated environments).
- Must run on-prem (compliance, data sovereignty).
- Must explain decisions (financial, medical).
These often dominate architecture choices. A model with no explainability requirements can use any state-of-the-art approach; a credit-decision model under the US Equal Credit Opportunity Act (ECOA) must produce reason codes for adverse decisions.
8. The Stakeholder Matrix
| Stakeholder | Cares about |
|---|---|
| Product manager | User experience metrics, A/B test wins |
| Data scientist | Offline accuracy, feature richness |
| ML engineer | Reliability, latency, observability |
| Data engineer | Feature pipelines, schema stability |
| SRE / on-call | SLOs, paging, runbooks |
| Legal / compliance | Privacy, fairness, auditability |
| Finance | Compute costs, ROI |
Real ML systems serve all seven. Asking each what they care about (and what would make them say "block this launch") often surfaces requirements no one had documented.
9. Writing It Down: The Design Doc
By the end of requirements gathering you should have a 1-3 page design doc with:
- Problem statement — one paragraph; who is the user and what need does the system address.
- Success metrics — primary business metric + corresponding ML metric + acceptable values.
- Constraints — latency, scale, cost, compliance, dependencies.
- Out of scope — explicit list of things this system does NOT do.
- Risks & alternatives — what could kill this; what we considered and rejected.