The ML Lifecycle and Technical Debt


ML systems accumulate hidden cost faster than most software. The 2014 Google paper "Machine Learning: The High-Interest Credit Card of Technical Debt" (Sculley et al., expanded in 2015 as "Hidden Technical Debt in Machine Learning Systems") is required reading because teams keep rediscovering its lessons the hard way. This lesson walks the full ML lifecycle and names the specific debts that pile up at each stage, so you can pay them down deliberately instead of stumbling into them.

1. The Lifecycle

1. Problem framing: translate a business question into something a model can predict.
2. Data collection: gather, label, validate.
3. Feature engineering: transform raw data into model inputs.
4. Model training: fit candidate models, tune hyperparameters.
5. Evaluation: measure on held-out data, compare to baselines.
6. Packaging: bundle model + dependencies + serving code.
7. Deployment: ship to the production environment.
8. Monitoring: track inputs, outputs, and performance over time.
9. Retraining: refresh on new data, redeploy.

Steps 1–5 usually happen in notebooks. Steps 6–9 are what notebooks alone don't give you, and they are what MLOps is mostly about.

2. The Hidden Costs (Sculley et al., 2015)

Debt category	What it looks like	Why it bites
Boundary erosion	Model entangled with surrounding code; "changing anything changes everything" (CACE)	Refactoring blocked; tests fragile
Data dependencies	Hidden upstream data sources; consuming features from other teams' models	Upstream change silently breaks you
Feedback loops	Model output influences future training data	Bias amplifies; drift accelerates
Anti-patterns	Glue code, pipeline jungles, dead experimental code paths	Surface area for bugs grows superlinearly
Configuration debt	Hyperparameters spread across YAMLs, env vars, hardcoded constants	Reproducibility impossible
Real-world testing	No way to safely shadow-deploy or A/B test	Every deploy is a leap of faith

3. The CACE Principle
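
CACE stands for "changing anything changes everything": a model learns its inputs jointly, so no input is truly independent of the others. Adding a feature, removing one, or shifting one feature's distribution can change the weight the model gives to every other feature. The sketch below illustrates this with a pure-Python least-squares fit on synthetic data. The data-generating numbers (2.0, 3.0, 0.8) and the helper names are assumptions chosen for the illustration, not something taken from the paper.

```python
import random

def make_data(n=5000, seed=0):
    """Synthetic data: x2 is correlated with x1, and y depends on both."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.8 * x1 + 0.6 * rng.gauss(0, 1)   # correlated with x1
        y = 2.0 * x1 + 3.0 * x2 + rng.gauss(0, 0.1)
        rows.append((x1, x2, y))
    return rows

def fit_two_features(rows):
    """Least squares for y ~ w1*x1 + w2*x2 (no intercept), via the normal equations."""
    a = sum(x1 * x1 for x1, _, _ in rows)
    b = sum(x1 * x2 for x1, x2, _ in rows)
    c = sum(x2 * x2 for _, x2, _ in rows)
    d = sum(x1 * y for x1, _, y in rows)
    e = sum(x2 * y for _, x2, y in rows)
    det = a * c - b * b
    return (d * c - b * e) / det, (a * e - b * d) / det

def fit_one_feature(rows):
    """Least squares for y ~ w1*x1 after x2 has been dropped from the model."""
    return sum(x1 * y for x1, _, y in rows) / sum(x1 * x1 for x1, _, _ in rows)

rows = make_data()
w1_full, w2_full = fit_two_features(rows)
w1_alone = fit_one_feature(rows)
print(f"with x2:    w1={w1_full:.2f}, w2={w2_full:.2f}")
print(f"without x2: w1={w1_alone:.2f}")
```

Nothing about x1 changed between the two fits, yet its weight moves from roughly 2 to roughly 4.4 (2.0 + 3.0 × 0.8) once x2 is removed, because x1 absorbs the signal x2 used to carry. Any code or threshold tuned against the old weight is now silently wrong.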

4. Pipeline Jungles

A pipeline jungle is what happens when "I'll just add one more preprocessing step" runs for two years. Symptoms:

Several scripts produce overlapping outputs nobody can disentangle.
"Run script A, then notebook B, then SQL query C, then notebook D" is the deployment doc.
No one knows which version of which script produced the model currently in production.

The fix is rarely "rewrite the jungle". The fix is to contain the jungle behind a single orchestrated pipeline (Airflow, Prefect, Kubeflow), then strangle it from the outside, replacing one piece at a time.
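
As a rough sketch of what "containing" can look like, the example below wraps each legacy step in a callable and runs them from one declared dependency graph. It uses the standard library's `graphlib` (Python 3.9+) as a stand-in for a real orchestrator such as Airflow or Prefect; the step names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def run_pipeline(steps, deps):
    """Run each step once, in dependency order, and return the execution log.

    steps: name -> zero-argument callable (a wrapper around a legacy script).
    deps:  name -> set of step names that must finish first.
    """
    order = list(TopologicalSorter(deps).static_order())
    log = []
    for name in order:
        steps[name]()      # a real wrapper might shell out to the legacy script here
        log.append(name)
    return log

# Hypothetical legacy steps standing in for
# "script A, then notebook B, then SQL query C, then notebook D".
outputs = []
steps = {
    "script_a":   lambda: outputs.append("raw extract"),
    "notebook_b": lambda: outputs.append("cleaned table"),
    "query_c":    lambda: outputs.append("joined features"),
    "notebook_d": lambda: outputs.append("trained model"),
}
deps = {
    "script_a": set(),
    "notebook_b": {"script_a"},
    "query_c": {"notebook_b"},
    "notebook_d": {"query_c"},
}

execution_log = run_pipeline(steps, deps)
print(execution_log)
```

Once the order lives in `deps` rather than in a deployment doc, each wrapped step can be replaced independently without the rest of the pipeline noticing.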

5. Glue Code and Configuration Debt

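Most ML codebases are far more glue than model: data loaders, preprocessing, post-processing, format conversions, retry logic. A split of 10% model to 90% glue is a common rule of thumb, and Sculley et al. suggest a mature system can end up at most 5% ML code. Glue code is unavoidable, but it should be recognized and contained: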

Move format conversions to a shared utility module.
Use a single config object (Hydra, Pydantic Settings) instead of ten YAMLs and a constants.py.
Treat data preprocessing as code that ships and is tested, not as a one-time script.
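
A minimal sketch of the single-config idea, using a standard-library dataclass as a stand-in for Hydra or Pydantic Settings. The `TrainConfig` class, its field names, and its defaults are illustrative assumptions.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainConfig:
    """Every knob for one training run lives here (field names are illustrative)."""
    learning_rate: float = 0.001
    batch_size: int = 64
    epochs: int = 10
    seed: int = 42
    data_version: str = "unversioned"

    def __post_init__(self):
        # Fail at construction time instead of halfway through training.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.batch_size < 1 or self.epochs < 1:
            raise ValueError("batch_size and epochs must be at least 1")

def to_json(config):
    """Serialize the config so it can be stored next to the model artifact."""
    return json.dumps(asdict(config), sort_keys=True, indent=2)

def from_json(text):
    """Rebuild the exact config a past run used."""
    return TrainConfig(**json.loads(text))

config = TrainConfig(learning_rate=0.01, epochs=20)
restored = from_json(to_json(config))
print(restored == config)
```

Freezing the dataclass means nothing can mutate a hyperparameter mid-run, and the JSON round trip is what makes a run reproducible later.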

6. The Data Dependency Problem

Code dependencies can be checked by compilers, linters, and static analyzers. Data dependencies get no such checks by default. Three patterns to defend against:

Schema validation: every dataset has an explicit schema (column names, types, ranges, null rules). Tools like Great Expectations or Pandera enforce this on every batch.
Data versioning: pin a snapshot for every training run (Lesson 8 covers DVC).
Lineage tracking: every model artifact records which data version produced it.
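
The first and third patterns can be sketched in plain Python. This is a hand-rolled stand-in for what Great Expectations or Pandera do far more thoroughly, and the content hash is only a simplified illustration of lineage, not how DVC works internally. The `SCHEMA` contents and both function names are hypothetical.

```python
import hashlib
import json

# Hypothetical schema: column -> (type, allow_null, min, max).
SCHEMA = {
    "user_id": (int, False, 1, None),
    "age":     (int, True, 0, 120),
    "amount":  (float, False, 0.0, None),
}

def validate_batch(rows, schema=SCHEMA):
    """Return a list of readable violations; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            errors.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for col, (typ, allow_null, lo, hi) in schema.items():
            value = row[col]
            if value is None:
                if not allow_null:
                    errors.append(f"row {i}: {col} is null")
                continue
            if not isinstance(value, typ):
                errors.append(f"row {i}: {col} has type {type(value).__name__}")
                continue
            if lo is not None and value < lo:
                errors.append(f"row {i}: {col}={value} is below {lo}")
            if hi is not None and value > hi:
                errors.append(f"row {i}: {col}={value} is above {hi}")
    return errors

def data_fingerprint(rows):
    """Content hash of a batch, to record next to the model artifact as lineage."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

good = [{"user_id": 1, "age": 34, "amount": 19.99}]
bad = [{"user_id": 2, "age": 150, "amount": None}]

print(validate_batch(good))
print(validate_batch(bad))
print(data_fingerprint(good)[:12])
```

Running the check at the pipeline boundary turns a silent upstream change into a loud failure, and the fingerprint lets you answer "which data produced this model?" after the fact.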

7. Feedback Loops Are Sneaky

When a model's output influences future inputs, you have a feedback loop. Examples:

A recommender shows item X. Users click X (because it was shown). Next training set has more X-clicks. Recommender shows X even more.
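A fraud detector blocks suspicious transactions. Blocked transactions never complete, so they never receive a confirmed "actually fraud" or "actually legitimate" label. The label distribution skews toward the cases the model let through.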
A self-driving car's "safe trajectory" model is trained on data from cars driving safely. It never sees the rare unsafe situations it most needs to handle.

Detecting feedback loops requires deliberate effort: random treatment holdouts, counterfactual logging, periodic full-population retraining.
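
A random holdout with counterfactual logging might look roughly like the sketch below. It assumes a 5% holdout, a hash-based assignment so each user stays in the same group, and a deterministic model policy; the function names, the record layout, and the catalog are all hypothetical.

```python
import hashlib
import random

HOLDOUT_FRACTION = 0.05   # assumed share of users who get random recommendations

def in_holdout(user_id, fraction=HOLDOUT_FRACTION):
    """Deterministically map a user to the holdout using a hash of the id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < fraction * 10_000

def recommend(user_id, catalog, model_choice, rng):
    """Pick an item and return a log record that includes the propensity.

    Propensity is the probability that the logging policy showed this item.
    Recording it lets later analysis reweight clicks instead of trusting raw counts.
    """
    if in_holdout(user_id):
        item = rng.choice(catalog)
        propensity = 1.0 / len(catalog)
        policy = "random_holdout"
    else:
        item = model_choice
        propensity = 1.0   # a deterministic policy always shows its pick
        policy = "model"
    return {"user_id": user_id, "item": item, "policy": policy, "propensity": propensity}

rng = random.Random(0)
catalog = ["item_x", "item_y", "item_z"]
logs = [recommend(uid, catalog, "item_x", rng) for uid in range(10_000)]
holdout_share = sum(r["policy"] == "random_holdout" for r in logs) / len(logs)
print(f"holdout share: {holdout_share:.3f}")
```

Clicks from the holdout records are not shaped by the model's own choices, so comparing them with the model-served traffic is one way to see how much of an item's popularity the recommender created itself.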

8. Practical Habits That Pay Down Debt

One config object per training run; persist it alongside the model.
Pinned random seeds, recorded in the config.
Schema check at every pipeline boundary.
Model registry with semantic version tags (1.4.0-staging, 1.4.0-canary).
"Is this model better?" defined in writing, ideally as code, before training.
Shadow deploys (model runs on real traffic but its predictions aren't used) before live traffic.
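
The "defined in writing, ideally as code" habit can be as small as the sketch below: a set of promotion criteria committed before training, plus one function that applies them. The metric names, thresholds, and example numbers are illustrative assumptions, not recommended values.

```python
# Written down before training starts. Metric names and thresholds are illustrative.
PROMOTION_CRITERIA = {
    "min_auc_gain": 0.005,          # candidate must beat baseline AUC by this much
    "max_latency_ms": 50.0,         # absolute ceiling on p95 latency
    "max_calibration_error": 0.03,  # absolute ceiling on calibration error
}

def is_better(candidate, baseline, criteria=PROMOTION_CRITERIA):
    """Return (decision, reasons). The candidate must pass every check."""
    reasons = []
    gain = candidate["auc"] - baseline["auc"]
    if gain < criteria["min_auc_gain"]:
        reasons.append(f"AUC gain {gain:.4f} is below {criteria['min_auc_gain']}")
    if candidate["p95_latency_ms"] > criteria["max_latency_ms"]:
        reasons.append("p95 latency exceeds the ceiling")
    if candidate["calibration_error"] > criteria["max_calibration_error"]:
        reasons.append("calibration error exceeds the ceiling")
    return (not reasons, reasons)

baseline = {"auc": 0.910, "p95_latency_ms": 32.0, "calibration_error": 0.020}
candidate = {"auc": 0.918, "p95_latency_ms": 41.0, "calibration_error": 0.025}
slow_candidate = {"auc": 0.930, "p95_latency_ms": 80.0, "calibration_error": 0.025}

print(is_better(candidate, baseline))
print(is_better(slow_candidate, baseline))
```

Because the criteria exist before any results do, a model with a higher headline metric but unacceptable latency is rejected by the same rule every time, rather than argued into production.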
