Setting Up Your ML Environment
3 of 40 · Machine Learning Fundamentals
Before any modeling, the environment. This notebook is the one-time setup that pays back for the rest of the course (and for any ML work afterwards): an isolated Python environment, the standard libraries pinned to known versions, and a tiny end-to-end workflow that proves everything is wired correctly. Same setup works on Linux, macOS, Windows (via WSL), and Colab.
1. Python: Use a Virtual Environment
# Pick one tool; uv is the modern default in 2026
pip install uv                   # one-time, into your system Python
uv venv .venv --python 3.11      # create an isolated env
source .venv/bin/activate        # activate (macOS/Linux)
# .venv\Scripts\activate         # activate (Windows): run this instead of the line above
uv pip install -r requirements.txt
Three rules:
- Never install packages globally. Conflicts will eventually break unrelated projects.
- Pin versions in requirements.txt. Re-running an experiment six months later requires the same library versions.
- Use uv in 2026: it is often an order of magnitude or more faster than pip, and uv pip accepts the same everyday install commands. conda still works, but uv has taken over most of what pip, pip-tools, and virtualenv were used for.
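A quick way to confirm the first rule is being followed is to ask the interpreter whether it is running inside a virtual environment. The following is a minimal sketch using only the standard library; the helper name in_virtualenv is illustrative and not part of uv or pip. It assumes an environment created by venv, virtualenv, or uv venv; a conda environment would report False under this check.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints False after activation, the shell is probably still resolving a different python than the one in .venv.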
2. The Standard Stack
# requirements.txt
numpy==2.1.2
pandas==2.2.3
scikit-learn==1.5.2
matplotlib==3.9.2
seaborn==0.13.2
xgboost==2.1.2
lightgbm==4.5.0
jupyterlab==4.3.0
Eight packages cover roughly 95% of classical ML work:
| Library | Role |
|---|---|
| numpy | n-dimensional arrays; foundation under everything |
| pandas | Tabular data (DataFrames); CSV / parquet I/O |
| scikit-learn | The classical-ML toolbox; algorithms, metrics, pipelines |
| matplotlib / seaborn | matplotlib is the default plotting library; seaborn (conventionally imported as sns) adds higher-level statistical plots |
| xgboost / lightgbm | Gradient-boosted trees; among the most widely used models for tabular data in production |
| jupyterlab | Notebook IDE; the de-facto exploration tool |
3. Test the Stack
import numpy as np, pandas as pd
import sklearn, matplotlib.pyplot as plt
import xgboost as xgb
print("numpy ", np.__version__)
print("pandas ", pd.__version__)
print("sklearn ", sklearn.__version__)
print("xgboost ", xgb.__version__)
If any import fails, the install didn't take cleanly — re-create the venv before continuing. Better to spend two minutes here than chase a confusing error later.
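Successful imports show that the packages exist, but not that they match the pins. One way to check that is to compare installed versions against requirements.txt. This is a sketch under stated assumptions: it only understands plain name==version lines, and parse_pins and check_pins are hypothetical helper names, not part of any library above.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def parse_pins(text: str) -> dict:
    """Extract {package: version} from simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins: dict) -> list:
    """Return one message per package that is missing or at a different version."""
    problems = []
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (pinned {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{name}: installed {installed}, pinned {pinned}")
    return problems


req = Path("requirements.txt")
if req.exists():
    for problem in check_pins(parse_pins(req.read_text())):
        print(problem)
```

No output means every pinned package is installed at exactly the pinned version.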
4. The 5-Minute End-to-End Workflow
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# 1. Data
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
# 2. Preprocess
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
# 3. Model
model = LogisticRegression(max_iter=1000)
model.fit(X_tr_s, y_tr)
# 4. Evaluate
pred = model.predict(X_te_s)
print(f"accuracy: {accuracy_score(y_te, pred):.3f}")
print(confusion_matrix(y_te, pred))
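Five steps (load data, split, preprocess, fit, evaluate): every classical-ML pipeline you'll ever write has roughly this skeleton. Run it and expect accuracy around 0.97 on Iris; the test split holds only 30 samples, so a single misclassification moves the score by about 3 points. If you see a number in that range, you're set.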
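The preprocess and model steps above can also be bundled into a single estimator, so the scaler can never be fitted on test rows by accident. The following is a minimal sketch using scikit-learn's make_pipeline and cross_val_score; the 5-fold cross-validation is chosen here for illustration and is not part of the workflow above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler and model travel together, so the scaler is refit on the
# training portion of every fold and never sees the held-out rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy:   {scores.mean():.3f}")
```

Averaging over five folds gives a steadier estimate than a single 30-sample test split, at the cost of fitting the model five times.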
5. Project Layout That Survives
ml-fundamentals/
├── data/ # raw / processed datasets (.gitignore)
├── notebooks/ # exploratory notebooks
├── src/
│ ├── data.py # loaders + preprocessing
│ ├── features.py # feature engineering
│ ├── train.py # training entry point
│ └── evaluate.py # evaluation script
├── models/ # saved checkpoints (.gitignore)
├── tests/
├── requirements.txt
├── README.md
└── .gitignore
Why bother on day one: the layout you start with is the layout you keep. Adopt this structure even for "small" projects — they grow.
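The layout above is small enough to create by hand, but scripting it makes it repeatable across projects. This is a sketch under assumptions: scaffold is a hypothetical helper, and the .gitignore entries (including .venv/) are one reasonable reading of the "(.gitignore)" notes in the tree, not something the layout prescribes. The demo builds the tree in a temporary directory so nothing is written to your working directory.

```python
import tempfile
from pathlib import Path

DIRS = ["data", "notebooks", "src", "models", "tests"]
FILES = [
    "src/data.py",
    "src/features.py",
    "src/train.py",
    "src/evaluate.py",
    "requirements.txt",
    "README.md",
]


def scaffold(root: Path) -> Path:
    """Create the project tree under root; existing file contents are preserved."""
    root = Path(root)
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        (root / f).touch(exist_ok=True)
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        # Keep datasets, checkpoints, and the virtual env out of version control.
        gitignore.write_text("data/\nmodels/\n.venv/\n")
    return root


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = scaffold(Path(tmp) / "ml-fundamentals")
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    print("\n".join(created))
```

For a real project, call scaffold(Path("ml-fundamentals")) once from the directory where the project should live.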
6. Reproducibility: The Five Knobs
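import os, random, numpy as np
SEED = 42
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's legacy global RNG
# PYTHONHASHSEED is read at interpreter startup, so setting it here only
# affects subprocesses. Export it in the shell before launching Python to
# fix hash randomization in this process.
os.environ["PYTHONHASHSEED"] = str(SEED)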
Always set seeds. Plus:
- Pin library versions in requirements.txt.
- Track the dataset version (date or content hash, not "current").
- Save the trained model + the feature pipeline together (sklearn's Pipeline in Lesson 35).
- Log every experiment's config + metrics (Section 4).
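Two of the items above, the dataset content hash and the experiment log, need nothing beyond the standard library. The following is a minimal sketch: file_sha256 and log_run are hypothetical helper names, the JSON-lines log format is an assumption rather than anything the course prescribes, and the accuracy value in the demo is a placeholder, not a real result.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Content hash of a file, read in chunks so large datasets are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, config, metrics):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Demo in a throwaway directory with a tiny stand-in dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_bytes(b"x,y\n1,2\n")
    run = log_run(
        Path(tmp) / "runs.jsonl",
        config={"seed": 42, "data_sha256": file_sha256(data)},
        metrics={"accuracy": 0.0},  # placeholder value, not a real result
    )
    print(run["config"]["data_sha256"][:12])
```

Storing the hash inside the config ties each logged metric to the exact bytes it was trained on, which a date or the label "current" cannot do.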
7. Notebooks vs Scripts
| Use a notebook for | Use a script for |
|---|---|
| Exploration, plotting, looking-at-data | Anything that runs more than once |
| Iterating on a model | Reproducible training |
| Sharing results visually | Production / scheduled jobs |
| Teaching | Anything you'll diff in Git |
Pattern that scales: prototype in a notebook, port to src/train.py when stable, run from the script thereafter. Notebooks are great for thinking, terrible for pipelines.
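Porting a notebook to a script mostly means turning hand-edited cells into functions and command-line arguments. Here is a sketch of what a src/train.py might look like for the Iris workflow; the flag names --seed and --test-size and the function names are illustrative assumptions, not a structure the course prescribes.

```python
import argparse

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def parse_args(argv=None):
    """Expose the knobs you would otherwise edit by hand in a notebook cell."""
    parser = argparse.ArgumentParser(description="Train a baseline classifier on Iris.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--test-size", type=float, default=0.2)
    return parser.parse_args(argv)


def train(seed, test_size):
    """Run the split / fit / evaluate steps and return test accuracy."""
    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


def main(argv=None):
    args = parse_args(argv)
    acc = train(args.seed, args.test_size)
    print(f"seed={args.seed} test_size={args.test_size} accuracy={acc:.3f}")
    return acc


if __name__ == "__main__":
    main()
```

Run it as python src/train.py --seed 1. Because every setting arrives through an argument, two runs with the same flags are directly comparable, and the file diffs cleanly in Git.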
8. Editor Setup (Optional but Worth It)
VS Code with the Python extension, or PyCharm, covers most needs. Three settings worth turning on from day one:
- Format on save with ruff format or black.
- Lint with ruff: fast, zero-config.
- Type checking: basic mypy or Pyright; catches real bugs without much friction.
9. Where to Run It
| Environment | When |
|---|---|
| Local laptop | This whole course; tabular ML doesn't need GPUs |
| Google Colab | Free CPU + occasional T4 GPU; great for casual experiments |
| Kaggle Kernels | Free CPU + P100 / T4; best when you also want the dataset |
| Cloud notebook (Vertex AI, SageMaker) | Production-grade; pricier; persistent storage |
The whole ml-fundamentals course runs comfortably on a laptop. Reach for cloud only when training time becomes annoying.