The ML Lifecycle and Technical Debt
ML systems accumulate hidden costs faster than most software. Two Google papers, "Machine Learning: The High-Interest Credit Card of Technical Debt" (2014) and its follow-up "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015), are required reading because teams keep rediscovering their lessons the hard way. This lesson walks through the full ML lifecycle and names the specific debts that pile up at each stage, so you can pay them down deliberately instead of stumbling into them.
1. The Lifecycle
- Problem framing — translate a business question into something a model can predict.
- Data collection — gather, label, validate.
- Feature engineering — transform raw data into model inputs.
- Model training — fit candidate models, tune hyperparameters.
- Evaluation — measure on held-out data, compare to baselines.
- Packaging — bundle model + dependencies + serving code.
- Deployment — ship to production environment.
- Monitoring — track inputs, outputs, performance over time.
- Retraining — refresh on new data, redeploy.
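The first five steps (problem framing through evaluation) typically happen in notebooks. The last four (packaging through retraining) are everything that notebooks alone don't give you, and they are where most MLOps work happens.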
2. The Hidden Costs (Sculley et al., 2015)
| Debt category | What it looks like | Why it bites |
|---|---|---|
| Boundary erosion | Model entangled with surrounding code; "changing anything changes everything" (CACE) | Refactoring blocked; tests fragile |
| Data dependencies | Hidden upstream data sources; consuming features from other teams' models | Upstream change silently breaks you |
| Feedback loops | Model output influences future training data | Bias amplifies; drift accelerates |
| Anti-patterns | Glue code, pipeline jungles, dead experimental code paths | Surface area for bugs grows superlinearly |
| Configuration debt | Hyperparameters spread across YAMLs, env vars, hardcoded constants | Reproducibility impossible |
| Real-world testing | No way to safely shadow-deploy or A/B test | Every deploy is a leap of faith |
3. The CACE Principle
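CACE stands for "changing anything changes everything", as glossed in the table above: because a model's learned weights depend jointly on all of its inputs, changing one input can shift what the model learns about the others. The toy sketch below tries to make that concrete with a two-feature least-squares fit. The data values and the helper name `fit_two_feature_ols` are made up for illustration, and a real model would be far larger, but the effect is the same in kind.

```python
def fit_two_feature_ols(x1, x2, y):
    """Least-squares weights (w1, w2) for y ~ w1*x1 + w2*x2, no intercept.

    Solves the 2x2 normal equations directly so the example needs no libraries.
    """
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    p = sum(u * t for u, t in zip(x1, y))
    q = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    return (d * p - b * q) / det, (a * q - b * p) / det


# Illustrative data: the target is exactly x1 + x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [u + v for u, v in zip(x1, x2)]

before = fit_two_feature_ols(x1, x2, y)  # weights of (1.0, 1.0)

# An upstream change perturbs x1 only. x2 and y are untouched.
x1_changed = [2.0, 1.0, 4.0, 3.0, 5.0]
after = fit_two_feature_ols(x1_changed, x2, y)  # roughly (0.6, 1.33)

print("weights before:", before)
print("weights after: ", after)
```

Only `x1` changed, yet the weight on `x2` moved as well. That coupling is why a seemingly local change to one feature can invalidate assumptions made elsewhere in the system.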
4. Pipeline Jungles
A pipeline jungle is what happens when "I'll just add one more preprocessing step" runs for two years. Symptoms:
- Several scripts produce overlapping outputs nobody can disentangle.
- "Run script A, then notebook B, then SQL query C, then notebook D" is the deployment doc.
- No one knows which version of which script produced the model currently in production.
The fix is rarely "rewrite the jungle". The fix is to contain the jungle behind a single orchestrated pipeline (Airflow, Prefect, Kubeflow), then strangle it from the outside, replacing one piece at a time.
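As a rough sketch of what "containing" can mean before you adopt a real orchestrator, the snippet below wraps each legacy step in a function and declares the dependencies explicitly, using the standard library's `graphlib` (Python 3.9+) in place of Airflow or Prefect. The step names mirror the "script A, notebook B, SQL query C, notebook D" example; their bodies are placeholders, not real work.

```python
from graphlib import TopologicalSorter


# Placeholder wrappers around the legacy steps (hypothetical bodies).
def script_a(inputs):
    return [3, 1, 2]  # stand-in for a raw extract


def notebook_b(inputs):
    return sorted(inputs["script_a"])  # stand-in for cleaning


def sql_query_c(inputs):
    return [v * 10 for v in inputs["notebook_b"]]  # stand-in for a join


def notebook_d(inputs):
    return {"training_rows": inputs["sql_query_c"]}


STEPS = {
    "script_a": script_a,
    "notebook_b": notebook_b,
    "sql_query_c": sql_query_c,
    "notebook_d": notebook_d,
}

# Each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "notebook_b": {"script_a"},
    "sql_query_c": {"notebook_b"},
    "notebook_d": {"sql_query_c"},
}


def run_pipeline(steps, dependencies):
    """Run every step once, in dependency order, passing upstream outputs in."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](inputs)
    return order, results


order, results = run_pipeline(STEPS, DEPENDENCIES)
print(order)
print(results["notebook_d"])
```

Once the order lives in one declared graph instead of a deployment doc, individual steps can be replaced behind the same interface without touching the rest.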
5. Glue Code and Configuration Debt
In most ML codebases the model itself is a small fraction of the code. A common rule of thumb is 10% model and 90% glue: data loaders, preprocessing, post-processing, format conversions, retry logic. Glue code is unavoidable but should be recognized and contained:
- Move format conversions to a shared utility module.
- Use a single config object (Hydra, Pydantic Settings) instead of ten YAMLs and a constants.py.
- Treat data preprocessing as code that ships and is tested, not as a one-time script.
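A minimal sketch of the single-config idea follows, using a standard-library dataclass as a stand-in for Hydra or Pydantic Settings. The field names (`learning_rate`, `n_estimators`, `seed`, `data_version`) and the `config.json` filename are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TrainConfig:
    """Everything needed to reproduce a training run, in one place."""
    learning_rate: float = 0.05
    n_estimators: int = 200
    seed: int = 42
    data_version: str = "unversioned"


def save_config(config, model_dir):
    """Write the config next to the model artifact so the two travel together."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "config.json"
    path.write_text(json.dumps(asdict(config), indent=2, sort_keys=True))
    return path


def load_config(model_dir):
    """Rebuild the exact config a stored model was trained with."""
    data = json.loads((Path(model_dir) / "config.json").read_text())
    return TrainConfig(**data)
```

Freezing the dataclass means a run cannot quietly mutate its own settings after they have been saved.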
6. The Data Dependency Problem
Code dependencies can be caught by static analysis: compilers, linkers, and type checkers flag a broken import. Data dependencies have no equivalent tooling by default. Three defenses:
- Schema validation: every dataset has an explicit schema (column names, types, ranges, null rules). Tools like Great Expectations or Pandera enforce this on every batch.
- Data versioning: pin a snapshot for every training run (Lesson 8 covers DVC).
- Lineage tracking: every model artifact records which data version produced it.
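The sketch below illustrates the first and third defenses with a hand-rolled check and no third-party libraries. It is not the Great Expectations or Pandera API. The schema, the column names, and both helper functions are hypothetical, and the fingerprint is only a crude stand-in for a proper data version.

```python
import hashlib
import json

# Hypothetical schema for illustration: column -> rules.
SCHEMA = {
    "user_id": {"type": int, "nullable": False},
    "amount": {"type": float, "nullable": False, "min": 0.0},
    "country": {"type": str, "nullable": True},
}


def validate_batch(rows, schema):
    """Return a list of readable violations; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        for col in sorted(set(schema) - set(row)):
            errors.append(f"row {i}: missing column {col!r}")
        for col in sorted(set(row) - set(schema)):
            errors.append(f"row {i}: unexpected column {col!r}")
        for col, rule in schema.items():
            if col not in row:
                continue
            value = row[col]
            if value is None:
                if not rule.get("nullable", False):
                    errors.append(f"row {i}: {col!r} is null")
                continue
            if not isinstance(value, rule["type"]):
                errors.append(f"row {i}: {col!r} has type {type(value).__name__}")
                continue
            if "min" in rule and value < rule["min"]:
                errors.append(f"row {i}: {col!r}={value} below {rule['min']}")
    return errors


def dataset_fingerprint(rows):
    """Content hash of a batch, suitable for storing in a model's metadata."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

A training job could refuse to start when `validate_batch` returns any errors, and write `dataset_fingerprint(rows)` into the model's metadata so the artifact records which data produced it.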
7. Feedback Loops Are Sneaky
When a model's output influences future inputs, you have a feedback loop. Examples:
- A recommender shows item X. Users click X (because it was shown). Next training set has more X-clicks. Recommender shows X even more.
- A fraud detector blocks suspicious transactions. Blocked transactions never complete, so they never receive a confirmed "fraud" or "legitimate" label. The label distribution skews toward whatever the model let through.
- A self-driving car's "safe trajectory" model is trained on data from cars driving safely. It never sees the rare unsafe situations it most needs to handle.
Detecting feedback loops requires deliberate effort: random treatment holdouts, counterfactual logging, periodic full-population retraining.
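To see why a random holdout helps, here is a small simulation of the recommender example under made-up numbers. Two items have hypothetical true click rates (B is better than A), the recommender naively shows whichever item has the most clicks so far, and an optional fraction of traffic is shown a random item instead. Everything here (item names, rates, the `simulate` helper) is an assumption for illustration.

```python
import random


def simulate(rounds, holdout_rate, seed=0):
    """Simulate a click-count recommender with an optional random holdout slice."""
    rng = random.Random(seed)
    true_ctr = {"A": 0.3, "B": 0.6}  # hypothetical: B is actually the better item
    clicks = {"A": 1, "B": 0}        # A starts with one lucky click
    impressions = {"A": 0, "B": 0}
    holdout = {item: {"shown": 0, "clicked": 0} for item in true_ctr}

    for _ in range(rounds):
        in_holdout = rng.random() < holdout_rate
        if in_holdout:
            item = rng.choice(["A", "B"])       # random treatment
        else:
            item = max(clicks, key=clicks.get)  # greedy on past clicks
        impressions[item] += 1
        clicked = rng.random() < true_ctr[item]
        clicks[item] += clicked
        if in_holdout:
            holdout[item]["shown"] += 1
            holdout[item]["clicked"] += clicked
    return impressions, holdout


def holdout_ctr(holdout):
    """Click-through rate per item, measured only on randomly assigned traffic."""
    return {item: s["clicked"] / max(1, s["shown"]) for item, s in holdout.items()}


locked_in, _ = simulate(rounds=5000, holdout_rate=0.0)
explored, holdout_log = simulate(rounds=5000, holdout_rate=0.1)
print("no holdout, impressions:  ", locked_in)
print("with holdout, impressions:", explored)
print("holdout CTR estimates:    ", holdout_ctr(holdout_log))
```

Without the holdout, item B is never shown, so the logs contain no evidence that it is better. The random slice costs some traffic but yields click-rate estimates that do not depend on what the model chose to show.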
8. Practical Habits That Pay Down Debt
- One config object per training run; persist it alongside the model.
- Pinned random seeds, recorded in the config.
- Schema check at every pipeline boundary.
- Model registry with semantic version tags (1.4.0-staging, 1.4.0-canary).
- "Is this model better?" defined in writing, ideally as code, before training.
- Shadow deploys (model runs on real traffic but its predictions aren't used) before live traffic.
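One way to write "is this model better?" down as code is a small promotion gate agreed on before training starts. The sketch below is an assumption about what such a gate might look like: the metric name, thresholds, and the `latency_ms` key are hypothetical and would come from your own written criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionCriteria:
    """Written-down definition of 'better', fixed before any training run."""
    metric: str = "auc"
    min_improvement: float = 0.005  # candidate must beat baseline by at least this
    max_latency_ms: float = 50.0    # and must stay within the serving budget


def is_better(candidate, baseline, criteria):
    """Return True only if the candidate clears every agreed threshold."""
    gain = candidate[criteria.metric] - baseline[criteria.metric]
    within_budget = candidate["latency_ms"] <= criteria.max_latency_ms
    return gain >= criteria.min_improvement and within_budget


criteria = PromotionCriteria()
baseline = {"auc": 0.90, "latency_ms": 30.0}
candidate = {"auc": 0.91, "latency_ms": 35.0}
print(is_better(candidate, baseline, criteria))
```

Keeping the criteria in a frozen object makes it harder to move the goalposts after seeing results, and the same function can run in CI as a release check.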