AIMaks

Setting Up Your ML Environment

35 min read · notebook · The ML Landscape
3 of 40 · Machine Learning Fundamentals


Before any modeling, the environment. This notebook is the one-time setup that pays back for the rest of the course (and for any ML work afterwards): an isolated Python environment, the standard libraries pinned to known versions, and a tiny end-to-end workflow that proves everything is wired correctly. Same setup works on Linux, macOS, Windows (via WSL), and Colab.

1. Python: Use a Virtual Environment

code
# Pick one tool; uv is the modern default in 2026
pip install uv               # one-time, into your system Python (installs the uv tool itself)

uv venv .venv --python 3.11  # create an isolated env
source .venv/bin/activate    # activate (macOS / Linux / WSL)
# .venv\Scripts\activate     # activate (native Windows; run this instead of the line above)

uv pip install -r requirements.txt  # file contents are listed in Section 2

Three rules:

  • Never install packages globally. Conflicts will eventually break unrelated projects.
  • Pin versions in requirements.txt. Re-running an experiment six months later requires the same library versions.
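  • Use uv in 2026: it is typically far faster than pip (the uv project reports 10-100x) and its uv pip interface is largely, though not perfectly, pip-compatible. conda still works, but uv has taken over most of what pip, pip-tools, and virtualenv were used for.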
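
The first rule is easy to break by accident, for example by starting Python from a shell where the environment was never activated. A minimal sketch of a guard, assuming an environment created by venv, virtualenv, or uv venv (these make sys.prefix differ from sys.base_prefix); the function name in_virtualenv is illustrative, and conda environments are not detected this way:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a virtual environment sys.prefix points at the env directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"OK: using environment at {sys.prefix}")
    else:
        print("Warning: no virtual environment active; activate .venv first")
```

Running this at the top of a notebook or script is a cheap way to catch an install that is about to land in the wrong Python.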

2. The Standard Stack

code
# requirements.txt
numpy==2.1.2
pandas==2.2.3
scikit-learn==1.5.2
matplotlib==3.9.2
seaborn==0.13.2
xgboost==2.1.2
lightgbm==4.5.0
jupyterlab==4.3.0

These eight packages cover the large majority of classical ML work:

Library             Role
numpy               n-dimensional arrays; the foundation under everything else
pandas              Tabular data (DataFrames); CSV / Parquet I/O
scikit-learn        The classical-ML toolbox: algorithms, metrics, pipelines
matplotlib          Default plotting library; seaborn (imported as sns) builds on it for higher-level plots
xgboost / lightgbm  Gradient-boosted trees; among the most widely used models for tabular data in production
jupyterlab          Notebook IDE; the de facto exploration tool

3. Test the Stack

code
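import numpy as np
import pandas as pd
import sklearn
import matplotlib
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb

# Print each installed version; these should match the pins in requirements.txt
for name, mod in [("numpy", np), ("pandas", pd), ("sklearn", sklearn),
                  ("matplotlib", matplotlib), ("seaborn", sns),
                  ("xgboost", xgb), ("lightgbm", lgb)]:
    print(f"{name:<12}{mod.__version__}")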

If any import fails, the install didn't take cleanly — re-create the venv before continuing. Better to spend two minutes here than chase a confusing error later.
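
A successful import does not prove that the installed versions match the pins. A minimal sketch of that check, using only the standard library's importlib.metadata; the helper names parse_pins and check_pins are illustrative, only name==version lines are handled, and versions are compared as exact strings (so 1.0 and 1.0.0 would count as different):

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Parse 'name==version' lines; skip blanks, comments, and unpinned lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Return one message per package that is missing or at a different version."""
    problems = []
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (pinned {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{name}: installed {installed}, pinned {pinned}")
    return problems


if __name__ == "__main__":
    from pathlib import Path

    req = Path("requirements.txt")
    if req.exists():
        issues = check_pins(parse_pins(req.read_text()))
        print("\n".join(issues) if issues else "all pins match")
```

An empty result means the environment matches requirements.txt; anything else is a reason to re-create the venv.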

4. The 5-Minute End-to-End Workflow

code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Data
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# 2. Preprocess
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Model
model = LogisticRegression(max_iter=1000)
model.fit(X_tr_s, y_tr)

# 4. Evaluate
pred = model.predict(X_te_s)
print(f"accuracy: {accuracy_score(y_te, pred):.3f}")
print(confusion_matrix(y_te, pred))

Five steps (load data, split, preprocess, fit, evaluate): every classical-ML pipeline you'll ever write has roughly this skeleton. Run it; expect roughly 97% accuracy on Iris, though the exact figure depends on the split and the library versions. If you see a number in that range, you're set.
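
The same skeleton can also be written with scikit-learn's Pipeline, which keeps the scaler and the model in one object so the scaler is only ever fitted on training data. A minimal sketch, assuming the scikit-learn version pinned in Section 2; the 5-fold cross-validation step is an addition for illustration, not part of the workflow above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scaler + model as a single estimator: fit() fits both, predict() applies both
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validate on the training split; the scaler is refit inside each fold
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
print(f"cv accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Final fit on the full training split, then one evaluation on the held-out test set
pipe.fit(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
print(f"test accuracy: {test_acc:.3f}")
```

Because preprocessing lives inside the estimator, there is no separate scaler object to forget when the model is saved or reused.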

5. Project Layout That Survives

code
ml-fundamentals/
├── data/                    # raw / processed datasets (.gitignore)
├── notebooks/               # exploratory notebooks
├── src/
│   ├── data.py              # loaders + preprocessing
│   ├── features.py          # feature engineering
│   ├── train.py             # training entry point
│   └── evaluate.py          # evaluation script
├── models/                  # saved checkpoints (.gitignore)
├── tests/
├── requirements.txt
├── README.md
└── .gitignore

Why bother on day one: the layout you start with is the layout you keep. Adopt this structure even for "small" projects — they grow.
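
Creating the skeleton by hand takes a minute; a short script makes it repeatable. A minimal sketch using pathlib, with directory and file names taken from the tree above. The scaffold function and the .venv/ and __pycache__/ entries in the generated .gitignore are assumptions added for illustration:

```python
from pathlib import Path

# Directory and file names mirror the tree above; adjust to taste
DIRS = ["data", "notebooks", "src", "models", "tests"]
FILES = [
    "src/data.py", "src/features.py", "src/train.py", "src/evaluate.py",
    "requirements.txt", "README.md",
]
# data/ and models/ are ignored as in the tree; the last two entries are extras
GITIGNORE = "data/\nmodels/\n.venv/\n__pycache__/\n"


def scaffold(root) -> Path:
    """Create the project skeleton under `root` without overwriting file contents."""
    root = Path(root)
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        (root / f).touch(exist_ok=True)  # creates empty files, keeps existing ones
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text(GITIGNORE)
    return root
```

Calling scaffold("ml-fundamentals") from the parent directory creates the layout; running it again on an existing project leaves file contents untouched.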

6. Reproducibility: The Five Knobs

code
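import os
import random

import numpy as np

SEED = 42
random.seed(SEED)       # Python's built-in RNG
np.random.seed(SEED)    # NumPy's global RNG; scikit-learn falls back to it when random_state=None
# PYTHONHASHSEED is read at interpreter startup, so setting it here only affects
# subprocesses. To fix hash ordering for this process, export it before launching Python.
os.environ["PYTHONHASHSEED"] = str(SEED)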

Always set seeds. Plus:

  • Pin library versions in requirements.txt.
  • Track the dataset version (date or content hash, not "current").
  • Save the trained model + the feature pipeline together (sklearn's Pipeline in Lesson 35).
  • Log every experiment's config + metrics (Section 4).
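
Two of these knobs, the dataset content hash and the experiment log, fit in a few lines of standard-library Python. A minimal sketch; the function names, the JSON Lines file format, and the record fields are illustrative assumptions rather than a course convention, and dedicated experiment trackers do considerably more:

```python
import hashlib
import json
import time
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20) -> str:
    """Content hash of a file, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, config: dict, metrics: dict, dataset_path) -> dict:
    """Append one JSON line describing a run: config, metrics, dataset hash."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,
        "metrics": metrics,
        "dataset_sha256": file_sha256(dataset_path),
    }
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Appending rather than overwriting keeps a history of runs, and the hash makes it obvious when two results were produced from different data.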

7. Notebooks vs Scripts

Use a notebook for                        Use a script for
Exploration, plotting, looking at data    Anything that runs more than once
Iterating on a model                      Reproducible training
Sharing results visually                  Production / scheduled jobs
Teaching                                  Anything you'll diff in Git

Pattern that scales: prototype in a notebook, port to src/train.py when stable, run from the script thereafter. Notebooks are great for thinking, terrible for pipelines.
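
What the ported script might look like, as a minimal sketch rather than the course's actual src/train.py: the Iris workflow from Section 4 wrapped in a function, with the seed, split size, and output path exposed as command-line flags. The flag names and the default output path are assumptions; joblib is installed as a dependency of scikit-learn.

```python
import argparse
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train(seed: int, test_size: float, out_path: Path) -> float:
    """Fit scaler + model as one pipeline, save it, and return test accuracy."""
    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, out_path)  # preprocessing and model are saved together
    return pipe.score(X_te, y_te)


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Train the Iris baseline.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--test-size", type=float, default=0.2)
    parser.add_argument("--out", type=Path, default=Path("models/iris.joblib"))
    args = parser.parse_args(argv)
    acc = train(args.seed, args.test_size, args.out)
    print(f"accuracy: {acc:.3f}  saved: {args.out}")


if __name__ == "__main__":
    main()
```

Run it as python src/train.py --seed 0. Every run is then described by its command line, which is easy to log and to diff, unlike notebook cell state.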

8. Editor Setup (Optional but Worth It)

VS Code (with the Python extension) or PyCharm covers most needs. Three settings worth turning on from day one:

  • Format on save with ruff format or black.
  • Lint with ruff — fast, zero-config.
  • Type checking — basic mypy or Pyright; catches real bugs without much friction.

9. Where to Run It

Environment                            When
Local laptop                           This whole course; tabular ML doesn't need GPUs
Google Colab                           Free CPU + occasional T4 GPU; great for casual experiments
Kaggle Kernels                         Free CPU + P100 / T4; best when you also want the dataset
Cloud notebook (Vertex AI, SageMaker)  Production-grade; pricier; persistent storage

The whole ml-fundamentals course runs comfortably on a laptop. Reach for cloud only when training time becomes annoying.

10. Exercises
