Setting Up Your ML Environment

Before any modeling, the environment. This notebook is the one-time setup that pays back for the rest of the course (and for any ML work afterwards): an isolated Python environment, the standard libraries pinned to known versions, and a tiny end-to-end workflow that proves everything is wired correctly. The same setup works on Linux, macOS, Windows (native or via WSL), and Colab.

1. Python: Use a Virtual Environment

code

# Pick one tool; uv is the modern default in 2026
pip install uv               # one-time; the only package that goes into your system Python

uv venv .venv --python 3.11  # create an isolated env
source .venv/bin/activate    # activate (macOS / Linux / WSL)
# .venv\Scripts\activate     # activate (native Windows); run this instead of the line above

uv pip install -r requirements.txt

Three rules:

Never install packages globally. Conflicts will eventually break unrelated projects.
Pin versions in requirements.txt. Re-running an experiment six months later requires the same library versions.
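Use uv in 2026. By its own benchmarks it is 10-100x faster than pip, and uv pip accepts most of the same commands (it is a largely compatible drop-in, not identical in every edge case). conda still works, but uv now covers most of what pip, pip-tools, and virtualenv were used for.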
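A quick way to confirm the first rule is being followed is to ask the interpreter whether it is running inside a virtual environment. The sketch below relies on the standard-library convention that `sys.prefix` differs from `sys.base_prefix` inside a venv; the helper name `in_virtualenv` is made up for illustration, and the check does not detect conda environments.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

If this prints False after activation, the shell is probably still picking up the system Python.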

2. The Standard Stack

code

# requirements.txt
numpy==2.1.2
pandas==2.2.3
scikit-learn==1.5.2
matplotlib==3.9.2
seaborn==0.13.2
xgboost==2.1.2
lightgbm==4.5.0
jupyterlab==4.3.0

Eight packages cover roughly 95% of classical ML work:

Library	Role
numpy	n-dimensional arrays; foundation under everything
pandas	Tabular data (DataFrames); CSV / parquet I/O
scikit-learn	The classical-ML toolbox; algorithms, metrics, pipelines
matplotlib / seaborn	matplotlib is the default plotting library; seaborn (imported as sns) adds higher-level statistical plots
xgboost / lightgbm	Gradient-boosted trees; the most-used tabular ML model in production
jupyterlab	Notebook IDE; the de-facto exploration tool

3. Test the Stack

code

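import numpy as np, pandas as pd
import sklearn, matplotlib, matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb, lightgbm as lgb

# Every importable package pinned in requirements.txt should load and report its version
for name, mod in [("numpy", np), ("pandas", pd), ("sklearn", sklearn),
                  ("matplotlib", matplotlib), ("seaborn", sns),
                  ("xgboost", xgb), ("lightgbm", lgb)]:
    print(f"{name:<11}", mod.__version__)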

If any import fails, the install didn't take cleanly — re-create the venv before continuing. Better to spend two minutes here than chase a confusing error later.
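Beyond checking that imports succeed, it can help to confirm that the installed versions match the pins. The following is a minimal sketch, assuming a requirements.txt made only of simple `name==version` lines as in Section 2; the helper names `parse_pins` and `check_pins` are hypothetical, and the parser does not handle extras, environment markers, or version ranges.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def parse_pins(text: str) -> dict[str, str]:
    """Parse 'name==version' lines; skip blanks, comments, and unpinned lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every pin matches."""
    problems = []
    for name, pinned in pins.items():
        try:
            installed = version(name)  # looks up the installed distribution
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{name}: installed {installed}, pinned {pinned}")
    return problems


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        issues = check_pins(parse_pins(req.read_text()))
        print("\n".join(issues) if issues else "all pinned versions match")
    else:
        print("requirements.txt not found in the current directory")
```

Run it from the project root with the environment activated. Any line it prints is a mismatch worth fixing before moving on.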

4. The 5-Minute End-to-End Workflow

code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Data
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# 2. Preprocess
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Model
model = LogisticRegression(max_iter=1000)
model.fit(X_tr_s, y_tr)

# 4. Evaluate
pred = model.predict(X_te_s)
print(f"accuracy: {accuracy_score(y_te, pred):.3f}")
print(confusion_matrix(y_te, pred))

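Five steps (load data, split, preprocess, fit, evaluate): every classical-ML pipeline you'll ever write has roughly this skeleton. Run it; expect accuracy in the mid-to-high 90s on Iris. The test set holds only 30 samples, so each misclassification moves the score by about 3.3 points. If you see a number in that range, you're set.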
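One detail in the workflow deserves emphasis: the scaler is fitted on the training split only and then reused on the test split. The sketch below shows one way to make that hard to get wrong, by bundling the scaler and the model with scikit-learn's `make_pipeline`. It is an illustration under the same split settings as above, not a replacement for the course's later treatment of pipelines.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scaler and model become a single estimator: fit() learns the scaling
# statistics from the training data only, and score() reuses them on new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(f"test samples: {len(y_te)}")
print(f"accuracy: {pipe.score(X_te, y_te):.3f}")
```

Because the pipeline is one object, there is no separate `X_te_s` to forget about, and the raw test features never influence the scaling statistics.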

5. Project Layout That Survives

code

ml-fundamentals/
├── data/                    # raw / processed datasets (.gitignore)
├── notebooks/               # exploratory notebooks
├── src/
│   ├── data.py              # loaders + preprocessing
│   ├── features.py          # feature engineering
│   ├── train.py             # training entry point
│   └── evaluate.py          # evaluation script
├── models/                  # saved checkpoints (.gitignore)
├── tests/
├── requirements.txt
├── README.md
└── .gitignore

Why bother on day one: the layout you start with is the layout you keep. Adopt this structure even for "small" projects — they grow.
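If creating the tree by hand feels tedious, a few lines of Python can scaffold it. This is a minimal sketch: the function name `scaffold` and the starter contents of README.md and .gitignore are assumptions (the `.venv/` ignore entry in particular is an addition, not part of the layout above). Existing files are left untouched.

```python
import tempfile
from pathlib import Path

DIRS = ["data", "notebooks", "src", "models", "tests"]
FILES = {
    "src/data.py": "",
    "src/features.py": "",
    "src/train.py": "",
    "src/evaluate.py": "",
    "requirements.txt": "",
    "README.md": "# ml-fundamentals\n",
    # data/ and models/ are ignored as in the layout; .venv/ is an assumption
    ".gitignore": "data/\nmodels/\n.venv/\n",
}


def scaffold(root) -> Path:
    """Create the project skeleton under root without overwriting anything."""
    root = Path(root)
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for rel, content in FILES.items():
        path = root / rel
        if not path.exists():  # never clobber existing work
            path.write_text(content)
    return root


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is written to the current folder.
    with tempfile.TemporaryDirectory() as tmp:
        project = scaffold(Path(tmp) / "ml-fundamentals")
        for p in sorted(project.rglob("*")):
            print(p.relative_to(project))
```

For a real project, call `scaffold("ml-fundamentals")` once from the parent directory.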

6. Reproducibility: The Five Knobs

code

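import os, random, numpy as np

SEED = 42
random.seed(SEED)      # Python's built-in RNG
np.random.seed(SEED)   # NumPy's legacy global RNG
# PYTHONHASHSEED is read at interpreter startup, so setting it here only affects
# child processes. For this process, export it in the shell before launching Python.
os.environ["PYTHONHASHSEED"] = str(SEED)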

Always set seeds. Plus:

Pin library versions in requirements.txt.
Track the dataset version (date or content hash, not "current").
Save the trained model + the feature pipeline together (sklearn's Pipeline in Lesson 35).
Log every experiment's config + metrics (Section 4).
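Two of these knobs, the dataset content hash and the experiment log, need nothing beyond the standard library. The sketch below is one possible shape, not a prescribed format: the helper names `file_sha256` and `log_run`, the record fields, and the JSON-lines log file are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Content hash of a file, read in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, config: dict, metrics: dict, data_hash: str) -> dict:
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,
        "metrics": metrics,
        "data_sha256": data_hash,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


if __name__ == "__main__":
    # Demo with a tiny throwaway dataset; the numbers are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_text("x,y\n1,2\n")
        rec = log_run(Path(tmp) / "runs.jsonl",
                      config={"seed": 42, "model": "logreg"},
                      metrics={"accuracy": 0.0},
                      data_hash=file_sha256(data))
        print(rec)
```

An append-only file with one JSON object per line is easy to diff and to load later with pandas; a dedicated experiment tracker can replace it once the project outgrows it.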

7. Notebooks vs Scripts

Use a notebook for	Use a script for
Exploration, plotting, looking-at-data	Anything that runs more than once
Iterating on a model	Reproducible training
Sharing results visually	Production / scheduled jobs
Teaching	Anything you'll diff in Git

Pattern that scales: prototype in a notebook, port to src/train.py when stable, run from the script thereafter. Notebooks are great for thinking, terrible for pipelines.
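To make the porting step concrete, here is a sketch of what a first src/train.py might look like: the Section 4 workflow wrapped in a function, with the seed and split size exposed as command-line flags. The flag names `--seed` and `--test-size` and the function names are assumptions, not a course convention.

```python
"""Hypothetical src/train.py: the notebook workflow as a re-runnable script."""
import argparse

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train(seed: int = 0, test_size: float = 0.2) -> float:
    """Run load, split, scale, fit, evaluate; return test accuracy."""
    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    scaler = StandardScaler().fit(X_tr)  # statistics from the training split only
    model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return accuracy_score(y_te, model.predict(scaler.transform(X_te)))


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Train the Iris baseline.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--test-size", type=float, default=0.2)
    args = parser.parse_args(argv)
    acc = train(seed=args.seed, test_size=args.test_size)
    print(f"seed={args.seed} test_size={args.test_size} accuracy={acc:.3f}")


if __name__ == "__main__":
    main()
```

Invoked as `python src/train.py --seed 1`, the same configuration gives the same result on every run, which is the property a notebook with out-of-order cells cannot guarantee.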

8. Editor Setup (Optional but Worth It)

VS Code (with the Python extension) or PyCharm covers most needs. Three settings worth turning on from day one:

Format on save with ruff format or black.
Lint with ruff — fast, zero-config.
Type checking — basic mypy or Pyright; catches real bugs without much friction.

9. Where to Run It

Environment	When
Local laptop	This whole course; tabular ML doesn't need GPUs
Google Colab	Free CPU + occasional T4 GPU; great for casual experiments
Kaggle Notebooks (formerly Kernels)	Free CPU + P100 / T4; best when you also want the dataset
Cloud notebook (Vertex AI, SageMaker)	Production-grade; pricier; persistent storage

The whole ml-fundamentals course runs comfortably on a laptop. Reach for cloud only when training time becomes annoying.

10. Exercises
