AIMaks

Requirements Gathering for ML Systems

Many failed ML projects fail at the requirements phase. The team built a clever model that solved the wrong problem, optimized the wrong metric, met the wrong latency target, or ignored the wrong constraint. ML requirements are unusually treacherous: stakeholders specify their desired outcome, but the model lives in a world of probabilities, training data, and production constraints. This lesson gives a framework for pinning requirements down before any code is written.

1. The Five Requirement Categories

| Category | Question | Example |
| --- | --- | --- |
| Functional | What should the system do? | "Flag transactions likely to be fraud" |
| Performance | How well, how fast, how scaled? | "AUC ≥ 0.92, p99 latency < 100 ms, 10K req/s" |
| Constraints | What can't we do? | "No PII in training data, GDPR-compliant, AWS-only" |
| Business | What value does it deliver? | "Reduce fraud loss by $X/year" |
| Operational | Who owns it, what's the SLA? | "On-call by Trust & Safety; 99.9% availability" |

Skipping any category invites surprise later. The operational category is the one most often skipped: who pages whom when this breaks?
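Performance requirements are easiest to enforce when they are written as checkable thresholds rather than prose. The sketch below encodes the example values from the Performance row and reports which ones a candidate system fails. The `Threshold` class, the metric names, and the measured numbers are illustrative assumptions, not part of any standard tooling.

```python
import operator
from dataclasses import dataclass

# Comparison operators a requirement may use.
_OPS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass(frozen=True)
class Threshold:
    """One performance requirement: metric name, comparison, limit."""
    name: str
    op: str
    limit: float


# Limits taken from the Performance example in the table above.
# The metric names themselves are hypothetical.
PERFORMANCE_REQUIREMENTS = [
    Threshold("auc", ">=", 0.92),
    Threshold("p99_latency_ms", "<", 100.0),
    Threshold("throughput_rps", ">=", 10_000.0),
]


def unmet_requirements(measured, requirements=PERFORMANCE_REQUIREMENTS):
    """Return a human-readable list of requirements that are violated
    or that have no measurement at all."""
    failures = []
    for req in requirements:
        value = measured.get(req.name)
        if value is None:
            failures.append(f"{req.name}: not measured")
        elif not _OPS[req.op](value, req.limit):
            failures.append(f"{req.name}: {value} violates {req.op} {req.limit}")
    return failures


# Illustrative measurements: AUC passes, latency fails, throughput is missing.
measured = {"auc": 0.93, "p99_latency_ms": 120.0}
report = unmet_requirements(measured)
print(report)
```

Treating an unmeasured metric as a failure is a deliberate choice here: a requirement nobody measured is usually a requirement nobody owns.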

2. The Critical Question: Is ML Even the Right Tool?

3. Translating Business Goals to ML Metrics

Stakeholders speak in business outcomes ("more conversions", "less spam", "happier customers"). ML lives in metrics (AUC, F1, NDCG). The translation is usually non-trivial.

| Business goal | Direct ML metric | Pitfalls |
| --- | --- | --- |
| Reduce fraud loss | F1 with cost-weighted FP/FN | Don't just optimize accuracy; FN is much more expensive |
| Increase engagement | CTR or session length | Optimizing CTR alone produces clickbait |
| Improve search quality | NDCG, MRR, top-1 click rate | NDCG can rise while user-perceived quality drops; need offline eval + online experiment |
| Reduce customer churn | Calibrated probability + lift @ k | Calibration matters more than raw AUC for retention campaigns |

Track two metrics for every project: an offline ML metric computable from labels (AUC, F1) and an online business metric measured via A/B testing (revenue, retention, user reports). The two correlate, but never perfectly.
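To make the fraud row concrete, here is a minimal sketch of how asymmetric error costs change the decision threshold. The scores, labels, candidate thresholds, and the 20:1 cost ratio are all made-up illustration values; in practice the costs would come from the business requirement.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of flagging every score >= threshold.

    labels: 1 = fraud, 0 = legitimate.
    A false positive is a legitimate item that gets flagged;
    a false negative is a fraud item that is not flagged.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest total cost
    (the first one wins on ties)."""
    return min(
        candidates,
        key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Toy validation set (hypothetical values).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
candidates = [0.3, 0.5, 0.6, 0.7]

# Treating both error types as equally bad favors a high threshold...
symmetric = best_threshold(scores, labels, cost_fp=1, cost_fn=1, candidates=candidates)
# ...while pricing a missed fraud at 20x a false alarm pushes it down.
weighted = best_threshold(scores, labels, cost_fp=1, cost_fn=20, candidates=candidates)
print(symmetric, weighted)
```

On this toy data the symmetric costs select 0.6 and the weighted costs select 0.3: the model is identical, but the requirement changes how it is used.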

4. The Latency Budget Conversation

Latency requirements are inherited from the surrounding system, not picked freely:

| Use case | Typical p99 budget |
| --- | --- |
| Search autocomplete | 20-50 ms |
| Web recommender (page render path) | 50-200 ms |
| Fraud check at payment | 50-200 ms |
| Voice assistant response | 200-500 ms |
| Email spam classification | 1-5 s |
| Daily reporting batch | Hours |

The model's allotment is usually 30-60% of the budget; the rest goes to feature lookup, network, and downstream calls. A common mistake: scoping a model with 200 ms inference latency into a 100 ms p99 page render budget.
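The budget arithmetic is simple enough to write down before any model exists. The sketch below subtracts the non-model components from the end-to-end budget and checks whether a candidate model fits. The component names and millisecond values are assumptions chosen for illustration, and summing per-component p99 figures is only a rough approximation, since tail latencies do not add exactly.

```python
def remaining_for_model(p99_budget_ms, other_components_ms):
    """Milliseconds left for model inference after the other
    components on the request path have taken their share."""
    return p99_budget_ms - sum(other_components_ms.values())


# Hypothetical breakdown of a 100 ms page render budget.
page_budget_ms = 100.0
other_components_ms = {
    "feature_lookup": 25.0,
    "network": 15.0,
    "downstream_calls": 20.0,
}

model_budget_ms = remaining_for_model(page_budget_ms, other_components_ms)
model_share = model_budget_ms / page_budget_ms

# The mistake described above: a 200 ms model in a 100 ms budget.
candidate_model_p99_ms = 200.0
fits = candidate_model_p99_ms <= model_budget_ms

print(model_budget_ms, model_share, fits)
```

With these numbers the model gets 40 ms, or 40% of the budget, and the 200 ms candidate misses it by a factor of five.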

5. Scale: The Three Numbers

Always pin down three numbers early:

  • Daily active users (DAU) — drives feature store size and event volume.
  • Peak QPS — drives serving capacity. Peak typically 3-10× average.
  • Total population — drives offline batch compute (if you score all users nightly).

These numbers have order-of-magnitude effects on architecture. 10K DAU is a SQLite-and-FastAPI problem. 10M DAU needs a feature store and Kubernetes. 1B DAU needs a custom serving stack.
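A back-of-the-envelope estimate links the first two numbers. The sketch below derives average QPS from DAU and then applies the 3-10× peak factor mentioned above. The requests-per-user figure is an assumption that has to come from product analytics, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau, requests_per_user_per_day):
    """Average queries per second, assuming traffic spread evenly over the day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps_range(dau, requests_per_user_per_day, low_factor=3.0, high_factor=10.0):
    """Plausible peak QPS range, using the 3-10x peak-to-average rule of thumb."""
    avg = average_qps(dau, requests_per_user_per_day)
    return avg * low_factor, avg * high_factor


# Illustrative input: 10M DAU, 20 scored requests per user per day (assumed).
avg = average_qps(10_000_000, 20)
low, high = peak_qps_range(10_000_000, 20)
print(f"average ~{avg:.0f} QPS, peak ~{low:.0f}-{high:.0f} QPS")
```

The spread between the low and high estimates is itself a requirement to resolve: provisioning for 3× versus 10× peak is a large cost difference.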

6. Data Requirements

Always ask:

  • Where does training data come from? Logged predictions, human labels, distant supervision, synthetic.
  • How is it labeled? Auto from outcomes, sampled human review, full coverage.
  • How fresh must it be? Daily, hourly, real-time?
  • What's the volume? 1M rows trains in minutes; 1T rows needs Spark.
  • What's the schema stability? Adding a column happens monthly; renames break things.
  • What about privacy / compliance? PII handling, retention, regional data residency.

7. The "Negative Requirements"

Constraints expressed as "do NOT" are as important as positive goals:

  • No discrimination on protected attributes (sometimes legally enforced).
  • No personally identifying information in logs.
  • Cannot use third-party libraries (regulated environments).
  • Must run on-prem (compliance, data sovereignty).
  • Must explain decisions (financial, medical).

These often dominate architecture choices. A model with no explainability requirements can use any state-of-the-art approach; a credit-decision model under ECOA must produce reason codes.
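Some negative requirements can be turned into automated checks that run before training or deployment. The sketch below rejects feature lists that contain protected attributes and log schemas that contain PII. Every field name in it is a hypothetical example, and the real lists would come from legal and compliance. Note that dropping protected attributes does not by itself rule out discrimination, because other features can act as proxies.

```python
# Hypothetical deny-lists; the real ones are defined by legal/compliance.
PROTECTED_ATTRIBUTES = frozenset({"race", "gender", "age", "religion"})
PII_FIELDS = frozenset({"email", "phone_number", "full_name"})


def negative_requirement_violations(feature_names, logged_fields):
    """Return a sorted list of violations of two 'do NOT' rules:
    no protected attributes as model features, no PII in logs."""
    issues = []
    for name in sorted(set(feature_names) & PROTECTED_ATTRIBUTES):
        issues.append(f"protected attribute used as feature: {name}")
    for name in sorted(set(logged_fields) & PII_FIELDS):
        issues.append(f"PII field written to logs: {name}")
    return issues


# Illustrative inputs.
features = ["transaction_amount", "merchant_category", "age"]
log_schema = ["request_id", "score", "email"]

issues = negative_requirement_violations(features, log_schema)
for issue in issues:
    print(issue)
```

A check like this only covers exact field-name matches; it is a guard against accidents, not a substitute for a fairness or privacy review.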

8. The Stakeholder Matrix

| Stakeholder | Cares about |
| --- | --- |
| Product manager | User experience metrics, A/B test wins |
| Data scientist | Offline accuracy, feature richness |
| ML engineer | Reliability, latency, observability |
| Data engineer | Feature pipelines, schema stability |
| SRE / on-call | SLOs, paging, runbooks |
| Legal / compliance | Privacy, fairness, auditability |
| Finance | Compute costs, ROI |

Real ML systems serve all seven. Asking each what they care about (and what would make them say "block this launch") often surfaces requirements no one had documented.

9. Writing It Down: The Design Doc

By the end of requirements gathering you should have a 1-3 page design doc with:

  1. Problem statement — one paragraph; who is the user and what need does the system address.
  2. Success metrics — primary business metric + corresponding ML metric + acceptable values.
  3. Constraints — latency, scale, cost, compliance, dependencies.
  4. Out of scope — explicit list of things this system does NOT do.
  5. Risks & alternatives — what could kill this; what we considered and rejected.

10. The Discipline
