Prerequisite: Docker & Containerization | Monitoring & Observability

ML code is code. This seems obvious, but in practice ML research and production systems are full of notebooks that cannot be re-run, experiments with no record of their hyperparameters, models uploaded to a shared drive with names like model_final_v3_FINAL.pt, and data pipelines with hardcoded paths. Sculley et al.’s 2015 paper “Hidden Technical Debt in Machine Learning Systems” documented these failure patterns at Google. A decade later, they’re still endemic. Software engineering discipline - versioning, testing, monitoring, reproducibility - applies to ML just as much as to web services.

Project Structure

Flat directories with notebooks at the root scale poorly. A clean ML project separates concerns:

my-ml-project/
├── data/           # raw and processed data (or DVC pointers)
├── src/
│   ├── data/       # data loading, preprocessing
│   ├── models/     # model definitions
│   ├── training/   # training loops, losses
│   └── evaluation/ # metrics, evaluation scripts
├── notebooks/      # exploration only, not production code
├── tests/
│   ├── test_data.py
│   ├── test_models.py
│   └── test_training.py
├── configs/        # YAML/TOML config files
├── pyproject.toml
└── Makefile

Notebooks are for exploration. Once a pattern is useful, move it into src/ where it can be imported, tested, and versioned properly.

Config Management

Hardcoded hyperparameters scattered through code are a maintenance nightmare. When you want to compare learning_rate=1e-3 versus 3e-3, you end up with copied files, magic constants, and no record of which run used which value.

Python dataclasses provide type-checked config with no extra dependencies:

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    learning_rate: float = 1e-3
    batch_size: int = 32
    num_epochs: int = 100
    weight_decay: float = 1e-4
    gradient_clip: float = 1.0
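
For example, downstream functions can take the config object directly, so every hyperparameter is named, typed, and recorded in one place (build_optimizer is a hypothetical helper):

import torch

def build_optimizer(model: torch.nn.Module, config: TrainingConfig) -> torch.optim.Optimizer:
    # Hyperparameters come from the config object, never from magic constants.
    return torch.optim.AdamW(
        model.parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
    )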

Hydra composes configs from YAML files with command-line overrides:

python train.py learning_rate=3e-3 model=transformer data=wikitext

Each run’s full resolved config is logged automatically. Never hardcode hyperparameters - pass config objects everywhere.
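
A minimal Hydra entry point might look like the following sketch (assuming configs/train.yaml exists; train() is a hypothetical function that reads only from the config):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg is configs/train.yaml merged with any command-line overrides.
    print(OmegaConf.to_yaml(cfg))   # the resolved config for this run
    train(cfg)

if __name__ == "__main__":
    main()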

Experiment Tracking

When you train a hundred models over a month, you need to remember what each one was. Experiment tracking tools log hyperparameters, metrics, system information, and artifacts for every run, and provide a UI to compare them.

Weights & Biases (W&B) is the most widely used:

import wandb
from dataclasses import asdict

wandb.init(project="my-model", config=asdict(config))

for epoch in range(config.num_epochs):
    train_loss = train_one_epoch(model, loader, config)
    val_loss = evaluate(model, val_loader)
    wandb.log({"train_loss": train_loss, "val_loss": val_loss, "epoch": epoch})

wandb.finish()

W&B stores the full config, all logged metrics, system metrics (GPU utilization, memory), and any artifacts (model checkpoints, confusion matrices) you upload. Later, you can filter runs by metric, compare hyperparameter sweeps, and reproduce any run exactly.

MLflow is an open-source, self-hostable alternative with a similar API.
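
A roughly equivalent MLflow setup might look like this sketch (experiment and file names are illustrative):

import mlflow
from dataclasses import asdict

mlflow.set_experiment("my-model")

with mlflow.start_run():
    mlflow.log_params(asdict(config))
    for epoch in range(config.num_epochs):
        train_loss = train_one_epoch(model, loader, config)
        val_loss = evaluate(model, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    mlflow.log_artifact("checkpoint.pt")  # attach the saved checkpoint to the run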

Reproducibility

A result that cannot be reproduced is a result that cannot be trusted. For ML, reproducibility requires:

  1. Seed everything. Random operations in Python, NumPy, and your deep learning framework must all be seeded.
import random, numpy as np, torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
  2. Pin package versions. Use a lockfile (uv.lock, requirements.txt from pip-compile) and commit it.

  3. Log the environment. At the start of every run, log pip freeze output or the conda environment to W&B or MLflow (a minimal sketch follows this list).

  4. Version your data. A model checkpoint is meaningless if you don’t know exactly which data it was trained on.
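
A minimal sketch of step 3, assuming an active W&B run (the filename is illustrative):

import subprocess
import wandb

def log_environment() -> None:
    # Capture the exact package versions used by this run and attach them to the experiment.
    frozen = subprocess.run(["pip", "freeze"], capture_output=True, text=True).stdout
    with open("requirements_frozen.txt", "w") as f:
        f.write(frozen)
    wandb.save("requirements_frozen.txt")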

Data Versioning with DVC

DVC (Data Version Control) tracks large files (datasets, model checkpoints) in a separate storage backend (S3, GCS, local) while storing lightweight .dvc pointer files in git. This gives you git’s versioning semantics for data that is too large to commit directly.

dvc init
dvc add data/train.csv            # creates data/train.csv.dvc
git add data/train.csv.dvc .gitignore
git commit -m "add training data"

dvc remote add -d storage s3://my-bucket/dvc-store   # one-time: configure remote storage (example URL)
dvc push                          # upload to remote storage
# Later, on another machine:
dvc pull                          # download the data matching the current commit

DVC also defines pipelines: stages with explicit inputs and outputs. It tracks which outputs are stale (because an input changed) and re-runs only what’s needed - like make for ML pipelines.

Model Registry

A model registry stores versioned model artifacts with associated metadata (training metrics, data version, config). The lifecycle stages - Staging, Production, Archived - provide a controlled promotion process.

MLflow’s Model Registry and W&B Artifacts both serve this role. The pattern is: training job registers a model to the registry after evaluation; a human (or automated evaluation) promotes it from Staging to Production; the serving system loads the Production version.
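
A sketch of that flow using MLflow’s Model Registry (the model name and run_id are illustrative; recent MLflow versions are moving from stages to aliases, so the exact promotion call may differ):

import mlflow
from mlflow.tracking import MlflowClient

# run_id identifies the finished training run that logged the model.
version = mlflow.register_model(f"runs:/{run_id}/model", "image-classifier")

# A human or an automated evaluation gate promotes the new version.
client = MlflowClient()
client.transition_model_version_stage(
    name="image-classifier", version=version.version, stage="Production"
)

# The serving system always loads whatever is currently in Production.
model = mlflow.pyfunc.load_model("models:/image-classifier/Production")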

This replaces the alternative: emailing model .pt files and hoping the right one makes it to production.

Testing ML Code

ML code is harder to test than conventional software because “correctness” is often statistical. But many components are deterministic and should have unit tests:

Test data transforms:

import numpy as np

# normalize is the project's own transform (lives under src/data/).
def test_normalize_clips_to_unit_range():
    x = np.array([-10., 0., 10.])
    out = normalize(x)
    assert out.min() >= -1.0 and out.max() <= 1.0

Test model output shapes:

import torch

def test_model_output_shape():
    model = MyTransformer(vocab_size=1000, d_model=128, n_heads=4, n_layers=2)
    x = torch.randint(0, 1000, (4, 32))  # batch=4, seq_len=32
    logits = model(x)
    assert logits.shape == (4, 32, 1000)  # (batch, seq_len, vocab_size)

Test the training loop runs without error:

def test_training_step_runs():
    model = MyModel()
    batch = make_fake_batch(size=4)
    loss = training_step(model, batch)
    assert loss.item() == loss.item()  # not NaN
    loss.backward()

The most important test: overfit on a single batch. A model that can’t drive training loss to near-zero on a single batch has a bug in the forward pass or loss function.
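
A sketch of that check, reusing the hypothetical helpers above (step count and threshold are arbitrary):

import torch

def test_can_overfit_single_batch():
    model = MyModel()
    batch = make_fake_batch(size=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        optimizer.zero_grad()
        loss = training_step(model, batch)
        loss.backward()
        optimizer.step()
    # If the model cannot memorize four examples, the forward pass or loss is broken.
    assert loss.item() < 1e-2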

CI for ML

A CI pipeline for ML should run:

  1. Unit tests for data transforms, model shapes, and utilities.
  2. A smoke test: train for exactly 1 step and verify loss is finite.
  3. Data validation: check that new data files conform to expected schema and value ranges (see the sketch after this list).
  4. Linting and type checking: ruff, mypy.
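
A minimal data-validation test, assuming a tabular dataset with text and label columns (column names and ranges are illustrative):

import pandas as pd

def test_processed_data_schema():
    df = pd.read_csv("data/processed/train.csv")
    # Schema: required columns present, labels in range, no missing text.
    assert {"text", "label"} <= set(df.columns)
    assert df["label"].between(0, 9).all()
    assert df["text"].notna().all()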

Full training runs are too slow for CI. Reserve those for scheduled nightly jobs or manual triggers.
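
A CI configuration along these lines, assuming GitHub Actions and the project layout above (a sketch, not a drop-in workflow):

# .github/workflows/ci.yaml
name: ci
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"   # assumes dev extras are declared in pyproject.toml
      - run: ruff check src tests
      - run: mypy src
      - run: pytest tests              # unit tests, shape checks, and the 1-step smoke test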

Technical Debt in ML Systems

Sculley et al. identified several failure patterns specific to ML:

  • Entangled data and model code: The preprocessing pipeline baked into the training script doesn’t match the one deployed in production - training-serving skew.
  • Undeclared consumers: Other teams depend on a model’s output format. Changing the model silently breaks them.
  • Dead experimental code: if use_old_attention_v2: branches accumulate until no one knows what’s enabled.
  • Glue code: 95% of an “ML system” is data pipelines, feature engineering, and serving - not the model.

Each of these is a software engineering problem, not an ML problem. Version your interfaces, delete dead code, and separate concerns.

Examples

W&B Experiment Tracking Setup

import wandb
from dataclasses import asdict

config = TrainingConfig(learning_rate=3e-3, batch_size=64)
run = wandb.init(project="image-classifier", config=asdict(config))

for step, (x, y) in enumerate(train_loader):
    loss = train_step(model, x, y, optimizer)
    if step % 100 == 0:
        val_acc = evaluate(model, val_loader)
        wandb.log({"loss": loss, "val_accuracy": val_acc, "step": step})

# Save model checkpoint as artifact
artifact = wandb.Artifact("model", type="model")
artifact.add_file("checkpoint.pt")
wandb.log_artifact(artifact)
run.finish()  # flush pending logs and wait for the artifact upload to complete

DVC Pipeline

# dvc.yaml
stages:
  preprocess:
    cmd: python src/data/preprocess.py
    deps:
      - data/raw/
      - src/data/preprocess.py
    outs:
      - data/processed/

  train:
    cmd: python src/training/train.py
    deps:
      - data/processed/
      - src/training/train.py
      - configs/train.yaml
    outs:
      - models/checkpoint.pt
    metrics:
      - metrics/val_loss.json

Running dvc repro executes only the stages whose inputs have changed.

pytest Fixture for Model Tests

import pytest
import torch
from src.models import TransformerClassifier

@pytest.fixture
def model():
    return TransformerClassifier(vocab_size=1000, d_model=64, n_heads=4, n_classes=10)

@pytest.fixture
def fake_batch():
    return {
        "input_ids": torch.randint(0, 1000, (8, 16)),
        "labels": torch.randint(0, 10, (8,)),
    }

def test_forward_shape(model, fake_batch):
    logits = model(fake_batch["input_ids"])
    assert logits.shape == (8, 10)

def test_loss_is_finite(model, fake_batch):
    logits = model(fake_batch["input_ids"])
    loss = torch.nn.functional.cross_entropy(logits, fake_batch["labels"])
    assert torch.isfinite(loss)

Applying software engineering practices to ML is not overhead - it is how you build systems that stay working as data changes, models evolve, and teams grow.


Read Next: GPU vs TPU Architectures | Diagnosing DL Failures