Types of ML Systems
"Machine learning" is an umbrella over a dozen distinct problem shapes that each have their own algorithms, evaluation, and pitfalls. Knowing the taxonomy is the difference between picking the right tool in five minutes and burning a week chasing the wrong one. This lesson is the catalogue: the four big learning paradigms, the practical task types within each, and a decision tree for picking the right shape for a new problem.
1. The Four Learning Paradigms
| Paradigm | Training data | Goal | Examples |
|---|---|---|---|
| Supervised | (input, label) pairs | Predict label from new input | Spam detection, price prediction |
| Unsupervised | Inputs only | Discover structure | Clustering customers, anomaly detection |
| Self-supervised | Inputs only, but the task is to predict missing parts | Learn general representations | BERT (predict masked words), MAE (predict masked patches) |
| Reinforcement | State, action, reward over time | Maximise long-term reward | Game-playing, robotics, ad serving |
Most production ML in 2026 is supervised. Self-supervised learning drives the foundation models (LLMs, vision encoders) that we fine-tune on top of. Unsupervised methods cover most of what remains, and RL is a smaller but high-impact niche.
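To make the self-supervised row concrete, here is a minimal sketch of how unlabelled text can supply its own labels. The helper `make_masked_examples` and the `[MASK]` string are illustrative assumptions; real masked-language-model pipelines mask a random subset of sub-word tokens rather than every word in turn.

```python
def make_masked_examples(tokens, mask_token="[MASK]"):
    """Turn one unlabelled sentence into (masked_input, target) pairs.

    Hypothetical helper for illustration: it hides each position in turn,
    whereas real pipelines mask a random subset of tokens.
    """
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, target))
    return examples


sentence = "the cat sat on the mat".split()
pairs = make_masked_examples(sentence)

# Each pair looks like a supervised example, but no human labelled anything.
for masked, target in pairs[:2]:
    print(" ".join(masked), "->", target)
```

The training signal comes entirely from the input, which is why this paradigm scales to very large unlabelled corpora.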
2. Supervised: Regression vs Classification
| Property | Regression | Classification |
|---|---|---|
| Output | A real number (price, temperature, ETA) | A discrete label (spam / not spam, dog breed) |
| Loss | MSE, MAE, Huber | Cross-entropy, hinge, focal |
| Metric | R², RMSE, MAE | Accuracy, F1, AUC, log loss |
| Decision | Use the predicted value | Apply a threshold to predicted probability |
Sometimes the same problem can be framed either way: "predict click probability" is technically regression on [0, 1] but is usually framed as classification with cross-entropy loss. The framing affects which losses, metrics, and calibration techniques apply.
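The two framings of click prediction can be compared on the same predictions. This is a small sketch with made-up click data; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the regression view of the predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss: the classification view of the same predictions."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


clicks = [1, 0, 0, 1]             # made-up outcomes: 1 = clicked
p_click = [0.9, 0.2, 0.6, 0.4]    # made-up predicted click probabilities

print("MSE:", round(mse(clicks, p_click), 4))
print("BCE:", round(binary_cross_entropy(clicks, p_click), 4))

# Classification adds a decision step: threshold the probability.
threshold = 0.5
decisions = [int(p >= threshold) for p in p_click]
print("decisions:", decisions)
```

Cross-entropy penalises confident mistakes much more steeply than MSE does, which is one reason it is the usual choice when the output is meant to be read as a probability.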
3. Classification Sub-Types
- Binary — one of two classes (spam / not, fraud / legit). The default and simplest case.
- Multi-class — one of K mutually exclusive classes (dog breed). Use softmax + cross-entropy.
- Multi-label — N independent yes/no decisions (this image has both a dog and a frisbee). Use sigmoid per class + BCE.
- Imbalanced binary — one class dominates (fraud is 0.1% of transactions). Reach for class weighting, focal loss, or threshold tuning.
- Open-set — at inference, an example may belong to none of the training classes. Adds an "unknown" threshold or out-of-distribution detector.
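The difference between multi-class, multi-label, and open-set decisions shows up in how the same raw scores (logits) are turned into predictions. The sketch below uses invented logits and label names. Thresholding the top softmax probability is only one simple way to get an "unknown" option, and the 0.8 cut-off is an arbitrary assumption.

```python
import math


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single yes/no decision."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_open_set(logits, labels, min_confidence=0.8):
    """Return the top class, or "unknown" if the model is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= min_confidence else "unknown"


labels = ["dog", "cat", "frisbee"]
logits = [2.0, -1.0, 1.5]  # invented scores for one image

# Multi-class: exactly one winner.
probs = softmax(logits)
multi_class = labels[probs.index(max(probs))]

# Multi-label: every class whose own sigmoid clears 0.5.
multi_label = [lab for lab, z in zip(labels, logits) if sigmoid(z) >= 0.5]

print("multi-class:", multi_class)
print("multi-label:", multi_label)
print("open-set:", predict_open_set(logits, labels))
```

Here softmax forces "dog" and "frisbee" to split the probability mass, so the open-set rule abstains, while the per-class sigmoids are free to say yes to both.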
4. Unsupervised: Three Common Goals
| Goal | Methods | Use case |
|---|---|---|
| Clustering | K-Means, DBSCAN, Gaussian mixtures, hierarchical | Customer segmentation, document grouping |
| Dimensionality reduction | PCA, t-SNE, UMAP | Visualisation, denoising, compression |
| Anomaly detection | Isolation forest, one-class SVM, autoencoder reconstruction | Fraud / fault / intrusion detection |
Sections 5 and 7 cover all three, hands-on.
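As a taste of the clustering row, here is a minimal one-dimensional K-Means sketch in plain Python. The data, the starting centroids, and the function name `kmeans_1d` are assumptions for illustration; a real project would normally use a library implementation with better initialisation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters


monthly_spend = [10, 12, 11, 50, 52, 49]  # invented customer data, no labels
centroids, clusters = kmeans_1d(monthly_spend, centroids=[0.0, 60.0])

print("centroids:", centroids)
print("clusters:", clusters)
```

No labels were provided. The two customer segments emerge only from the distances between points.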
5. Online vs Batch Learning
| Property | Batch | Online |
|---|---|---|
| Training | Uses the full dataset, often retrained periodically | One example or mini-batch at a time, continuously |
| Adaptation | Slow; retrain to incorporate new data | Fast; the model updates as data arrives |
| Catastrophic forgetting | Not an issue | A real risk; old patterns can be overwritten |
| Examples | Most production ML | Real-time recsys, ad bidding, IoT anomaly detection |
Default to batch training. Reach for online only when freshness requirements or scale rule out batch retraining.
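The trade-off in the table can be seen on a toy one-parameter model. In this sketch the data, the learning rate, and the helper names are all assumptions: the batch fit is a closed-form least-squares slope, and the online learner takes one gradient step per arriving example.

```python
def fit_batch(xs, ys):
    """Batch: least-squares slope for y ~ w * x, computed from the full dataset."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def update_online(w, x, y, lr=0.05):
    """Online: one gradient step on 0.5 * (w * x - y) ** 2 for a single example."""
    return w - lr * (w * x - y) * x


xs = [1.0, 2.0, 3.0, 4.0]
old_ys = [2.0 * x for x in xs]  # the relationship the models first see
new_ys = [3.0 * x for x in xs]  # the relationship after the data drifts

w_batch = fit_batch(xs, old_ys)  # stays at the old slope until someone retrains

w_online = 0.0
for _ in range(50):  # stream of examples following the old pattern
    for x, y in zip(xs, old_ys):
        w_online = update_online(w_online, x, y)
w_before_drift = w_online

for _ in range(50):  # stream of examples following the new pattern
    for x, y in zip(xs, new_ys):
        w_online = update_online(w_online, x, y)

print("batch slope:", w_batch)
print("online slope before drift:", round(w_before_drift, 3))
print("online slope after drift:", round(w_online, 3))
```

The online model tracks the drift without a retraining job, but it also no longer fits the old pattern at all, which is the forgetting risk from the table in miniature.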
6. Instance-Based vs Model-Based
| Property | Instance-based | Model-based |
|---|---|---|
| Training | Memorise the dataset | Fit parameters that summarise the dataset |
| Prediction | Look up similar examples | Apply the learned function |
| Examples | k-NN, kernel methods | Linear models, tree-based models, neural networks |
| Inference cost | Grows with training-set size | Independent of training-set size |
Model-based dominates production. Instance-based methods (especially k-NN) remain strong baselines and surface again in modern retrieval systems (vector search ≈ k-NN over embeddings).
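The contrast between looking up examples and applying a learned function can be sketched on a tiny regression problem. The data and the function names below are invented for illustration, and the k-NN search is a brute-force scan rather than an indexed vector search.

```python
def knn_predict(train, x, k=3):
    """Instance-based: keep every example and scan them all at prediction time."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in neighbours) / k


def fit_line(train):
    """Model-based: compress the dataset into two numbers (slope, intercept)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in train)
    variance = sum((x - mean_x) ** 2 for x, _ in train)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # y = 2x + 1

query = 2.2
knn_answer = knn_predict(train, query)   # needs the whole training set

slope, intercept = fit_line(train)       # training set can be discarded after this
line_answer = slope * query + intercept

print("k-NN prediction:", knn_answer)
print("line prediction:", round(line_answer, 3))
```

`knn_predict` does work proportional to the number of stored examples on every call, while the fitted line costs one multiply and one add regardless of how much data it was trained on.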
7. Parametric vs Non-Parametric
- Parametric — fixed number of parameters regardless of dataset size (linear regression, logistic regression, fixed-architecture neural networks). Inference cost independent of data size.
- Non-parametric — model complexity grows with data (decision trees that grow until pure leaves, k-NN, kernel methods, Gaussian processes). More flexible; slower; needs more regularisation.
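One way to see the difference is to count what each model has to store as the dataset grows. The sketch below is a toy stand-in, not a real decision-tree implementation: `count_pure_leaves` counts the leaves of a one-feature tree that keeps splitting until every leaf is pure, and `noisy_labels` generates invented labels in which every seventh one is flipped to imitate label noise.

```python
def count_pure_leaves(labels):
    """Leaves of a toy 1-D tree grown until every leaf is pure.

    `labels` must be ordered by the value of the single feature.
    """
    if len(set(labels)) <= 1:
        return 1
    # Candidate split points: positions where the label changes.
    boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    # Split at the boundary closest to the middle, then recurse on both sides.
    split = min(boundaries, key=lambda i: abs(i - len(labels) / 2))
    return count_pure_leaves(labels[:split]) + count_pure_leaves(labels[split:])


def noisy_labels(n, flip_every=7):
    """Label is 1 in the upper half of the range; every 7th label is flipped."""
    return [int(i >= n // 2) ^ int(i % flip_every == 0) for i in range(n)]


LINEAR_MODEL_PARAMS = 2  # slope and intercept, whatever the dataset size

leaf_counts = {n: count_pure_leaves(noisy_labels(n)) for n in (70, 700, 7000)}
for n, leaves in leaf_counts.items():
    print(f"n={n}: linear params={LINEAR_MODEL_PARAMS}, tree leaves={leaves}")
```

The linear model's parameter count never moves, while the fully grown tree needs roughly ten times as many leaves for ten times as much noisy data.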
"Non-parametric" is misleading — these models often have more parameters than parametric ones. The distinction is whether the parameter count is fixed or grows with data.