Software Engineering for ML - Making Research Survive Contact With Production // Megha Bose

Helpful context:

You train a model that achieves 94% accuracy. Six months later, the same model evaluated on new data achieves 87%. You go to investigate. The original experiment ran in a Jupyter notebook that has been modified since. The weights file is named model_final_v3_FINAL.pt. There is no record of which dataset split it was trained on, no log of the hyperparameters, no indication of which preprocessing pipeline was applied. You cannot reproduce the result. You cannot diagnose the regression. You are flying blind.

This is why MLOps exists - not as bureaucratic overhead, but as the discipline that makes ML systems trustworthy over time. The core insight from Sculley et al.’s 2015 paper “Hidden Technical Debt in Machine Learning Systems” (written by Google engineers, presented at NeurIPS, and still widely cited) is that the ML code itself is a small fraction of a production ML system. The rest is data ingestion, feature engineering, monitoring, and serving infrastructure. And that infrastructure, if built without engineering discipline, accrues debt quickly and quietly.

The Problem Is Reproducibility

Before any specific tool, the foundational question is: can you reproduce this result?

Reproducing a training run requires knowing four things precisely:

The exact code that ran (git commit hash)
The exact data that was used (data version and split)
The exact hyperparameters and configuration
The exact software environment (library versions, CUDA version, Python version)

Any of these missing means you cannot reproduce, which means you cannot diagnose failures, compare experiments fairly, or audit production models. This sounds obvious. In practice, teams violate all four constantly.
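One way to make the four items concrete is to write a small manifest next to every checkpoint. The sketch below is a minimal illustration using only the standard library; the function names (build_manifest, file_sha256, package_versions) and the manifest layout are assumptions made for this example, not part of any particular tool.

```python
import hashlib
import json
import platform
import subprocess
import sys
from importlib import metadata


def file_sha256(path):
    """Content hash of a data file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit():
    """Current commit hash, or 'unknown' when git or a repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def package_versions(names):
    """Installed version of each named package, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(data_path, config, packages=()):
    """Collect code, data, config, and environment into one JSON-serializable dict."""
    return {
        "code": {"git_commit": git_commit()},
        "data": {"path": data_path, "sha256": file_sha256(data_path)},
        "config": config,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "packages": package_versions(packages),
        },
    }
```

Calling json.dumps(build_manifest("data/train.csv", config, packages=["torch", "numpy"])) and saving the result beside the weights file would answer most of the questions in the opening scenario. The CUDA version is not captured here; it would have to be added from whatever the framework reports.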

Seeding is table stakes:

import random

import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

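This alone is insufficient. cudnn.deterministic = True forces cuDNN to use deterministic algorithms, which can reduce throughput (by roughly 20-30% on some workloads). Other PyTorch operations have their own non-deterministic kernels; torch.use_deterministic_algorithms(True) switches them to deterministic implementations where available and raises an error otherwise. Different CUDA versions implement some operations differently, so the same seed can produce different results on CUDA 11.8 vs CUDA 12.2. Hardware differences matter too - in multi-GPU training with torch.distributed, gradient aggregation can be non-deterministic by default, because the order in which contributions are summed is not fixed and floating-point addition is order-sensitive. “Reproducible” in practice means “reproducible on the same hardware and software version,” which is already useful, and requires genuine effort to achieve.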
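Why would the order of gradient aggregation change the result at all? Floating-point addition is not associative, so summing the same numbers in a different order can round differently. The sketch below uses deliberately extreme, made-up values so the effect is visible in one step; in real training the per-step differences are tiny, but they compound over thousands of updates.

```python
def reduce_in_order(contributions, order):
    """Sum per-worker gradient contributions in the given arrival order."""
    total = 0.0
    for worker in order:
        total += contributions[worker]
    return total


# Hypothetical contributions to one gradient entry from three workers.
contributions = [1e16, 1.0, -1e16]

sum_a = reduce_in_order(contributions, [0, 1, 2])  # (1e16 + 1.0) - 1e16
sum_b = reduce_in_order(contributions, [0, 2, 1])  # (1e16 - 1e16) + 1.0

print(sum_a, sum_b)  # prints: 0.0 1.0
```

The same three numbers give 0.0 in one order and 1.0 in the other, because 1e16 + 1.0 rounds back to 1e16 in double precision. Seeding the random number generators does nothing about this; only fixing the reduction order does.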

Experiment Tracking: The Audit Log for ML

Experiment tracking tools log hyperparameters, metrics, artifacts, system information, and code state for every training run. The essential functionality: given any two past runs, you can reproduce either one and understand exactly what differed between them.

Weights & Biases (W&B) is among the most widely used in practice:

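import wandb
from dataclasses import asdict, dataclass

@dataclass
class TrainingConfig:
    learning_rate: float
    batch_size: int
    num_epochs: int

config = TrainingConfig(learning_rate=3e-3, batch_size=64, num_epochs=50)
run = wandb.init(project="image-classifier", config=asdict(config))

# model, train_loader, val_loader, train_one_epoch and evaluate are assumed
# to be defined elsewhere in the training script.
for epoch in range(config.num_epochs):
    train_loss = train_one_epoch(model, train_loader, config)
    val_acc = evaluate(model, val_loader)
    wandb.log({"train_loss": train_loss, "val_accuracy": val_acc, "epoch": epoch})

# Log model checkpoint as a versioned artifact
# (assumes the training loop saved checkpoint.pt)
artifact = wandb.Artifact("model", type="model")
artifact.add_file("checkpoint.pt")
run.log_artifact(artifact)
run.finish()  # flush pending data and mark the run as finished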

W&B automatically captures the git commit hash (when the script runs inside a git repository), the installed Python packages, GPU utilization, and system memory, and it stores any artifacts you explicitly log. The UI allows filtering runs by metric, creating scatter plots of hyperparameter vs performance, and running sweeps (automated hyperparameter search across many configurations).

MLflow is the open-source, self-hosted alternative. The API is similar, the UI is less polished, but it runs on your own infrastructure - important for teams with data privacy constraints that prevent using external services.

The critical habit: start every experiment by calling wandb.init or mlflow.start_run. Do this automatically in your training script so it can’t be forgotten. Log the full resolved config object, not just selected hyperparameters. Future you, six months from now, will thank present you.
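The reason to log the full resolved config is that it makes “what differed between these two runs?” a mechanical question. The sketch below is a minimal illustration of that comparison on plain nested dictionaries; flatten and diff_configs are hypothetical helper names, and tracking tools offer their own run comparison views.

```python
MISSING = "<missing>"


def flatten(config, prefix=""):
    """Turn a nested config dict into a flat {dotted.key: value} dict."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    flat_a, flat_b = flatten(a), flatten(b)
    diff = {}
    for key in sorted(set(flat_a) | set(flat_b)):
        left = flat_a.get(key, MISSING)
        right = flat_b.get(key, MISSING)
        if left != right:
            diff[key] = (left, right)
    return diff


run_a = {"optimizer": {"lr": 3e-3, "weight_decay": 0.0}, "batch_size": 64}
run_b = {"optimizer": {"lr": 1e-3, "weight_decay": 0.0}, "batch_size": 64, "seed": 7}

print(diff_configs(run_a, run_b))
```

If only selected hyperparameters had been logged, the seed difference between the two runs would be invisible.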

Data Versioning: The Ignored Half of the Problem

Model checkpoints get versioned. Datasets rarely do. This is backwards - the data is often the most important input and the hardest to recreate.

DVC (Data Version Control) tracks large files in external storage (S3, GCS, Azure Blob, local) while storing lightweight .dvc pointer files in git. You get git’s versioning semantics for data that is too large to commit directly.

dvc init
dvc add data/train.csv            # creates data/train.csv.dvc (committed to git)
git add data/train.csv.dvc data/.gitignore
git commit -m "add training data v1"

dvc remote add -d myremote s3://my-bucket/dvc-store
git commit .dvc/config -m "configure DVC remote"   # the remote setting lives in git too
dvc push                          # uploads data to S3

# On a new machine, or after git checkout of an older commit:
dvc pull                          # downloads exactly the data for this commit

DVC also defines reproducible pipelines: stages with explicit inputs, outputs, and commands. It tracks which outputs are stale and reruns only the stages whose inputs changed - like make for ML pipelines.

# dvc.yaml
stages:
  preprocess:
    cmd: python src/data/preprocess.py
    deps: [data/raw/, src/data/preprocess.py]
    outs: [data/processed/]

  train:
    cmd: python src/training/train.py --config configs/train.yaml
    deps: [data/processed/, src/training/train.py, configs/train.yaml]
    outs: [models/checkpoint.pt]
    metrics: [metrics/val_loss.json]

Running dvc repro walks the stages in dependency order, compares each stage’s dependencies against the hashes recorded in dvc.lock, and reruns only the stages that are stale. This is the difference between ML pipelines as a series of manual steps you have to remember, and ML pipelines as a versioned, reproducible artifact.
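The staleness check itself is a simple idea: hash every dependency, compare against the hashes recorded after the last successful run, and rerun on any mismatch. The sketch below illustrates that idea only; it is not DVC’s implementation, and hash_path, record_lock, and stale_stages are hypothetical names. It also ignores the stage graph: a real tool reruns downstream stages once an upstream output changes.

```python
import hashlib
from pathlib import Path


def hash_path(path):
    """Hash a file, or every file under a directory in sorted order."""
    root = Path(path)
    if root.is_dir():
        files = sorted(f for f in root.rglob("*") if f.is_file())
    else:
        files = [root]
    digest = hashlib.sha256()
    for f in files:
        label = f.relative_to(root).as_posix() if root.is_dir() else f.name
        digest.update(label.encode("utf-8"))
        digest.update(f.read_bytes())  # fine for a sketch; chunk this for large files
    return digest.hexdigest()


def record_lock(stages):
    """Snapshot the current dependency hashes of every stage."""
    return {
        name: {dep: hash_path(dep) for dep in stage["deps"]}
        for name, stage in stages.items()
    }


def stale_stages(stages, lock):
    """Names of stages whose current dependency hashes differ from the lock."""
    current = record_lock(stages)
    return [name for name in stages if lock.get(name) != current[name]]
```

After a successful run, the result of record_lock would be written to disk; the next invocation loads it and reruns only what stale_stages reports.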

Delta Lake (from Databricks, now open-source) solves a related problem: versioning for large tabular datasets that change incrementally. It adds ACID transactions and time travel to Parquet files on object storage. You can query “what did my training dataset look like three months ago?” without maintaining separate copies.

Feature Stores: Closing the Training-Serving Skew

Here is a subtle failure mode that causes silent accuracy degradation without any obvious error.

At training time, you compute features from raw data. At serving time, the same features need to be computed for incoming requests. If the training pipeline and serving pipeline are implemented separately - which they almost always are - they will drift. Different normalization logic. Different handling of missing values. Different timestamp semantics. The model was trained on features computed one way and served on features computed another way. This is training-serving skew, and it is extremely common.

Feature stores solve this by centralizing feature computation logic so that training and serving use the same code path.

Feast (open source) separates feature definitions (what a feature is and how to compute it) from the physical storage (where historical feature values are stored for training, and where current values are stored for serving). The same feature definition produces both the training dataset and the online serving lookup:

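# Feature definition: same logic used for training and serving.
# Written against the older Feast API (Feature / ValueType); newer releases
# declare features with Field objects and a schema argument instead.
# FeatureView, Feature, ValueType and the source classes come from feast, and
# `store` below is assumed to be a feast.FeatureStore for this repository.
from datetime import timedelta
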
user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=30),
    features=[
        Feature(name="purchase_count_7d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    batch_source=BigQuerySource(table="analytics.user_features"),
    stream_source=KafkaSource(topic="user_events"),
)

# At training time: retrieve historical features for labeled examples
training_df = store.get_historical_features(
    entity_df=labeled_examples,
    features=["user_features:purchase_count_7d", "user_features:avg_order_value"],
).to_df()

# At serving time: retrieve current features for incoming request
feature_vector = store.get_online_features(
    entity_rows=[{"user_id": user_id}],
    features=["user_features:purchase_count_7d"],
).to_dict()

Tecton is the commercial version of this idea, used at companies like Spotify, Atlassian, and Faire. It adds managed infrastructure, feature monitoring, and streaming feature pipelines. The tradeoff: Feast is free and self-managed; Tecton is expensive and managed.

Whether you need a feature store depends on scale. For a team with one or two models and a simple batch training pipeline, a shared preprocessing library may be sufficient. For teams running dozens of models with real-time feature requirements, a feature store is load-bearing infrastructure.
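For the small-team case, a shared preprocessing library can be as simple as one transform function plus statistics that are fitted once at training time and shipped with the model. The sketch below is a minimal illustration under that assumption; fit_stats and transform are hypothetical names, and mean imputation is just one possible policy for missing values.

```python
import json
import math


def fit_stats(values):
    """Compute normalization statistics from training data, ignoring missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    var = sum((v - mean) ** 2 for v in present) / len(present)
    return {"mean": mean, "std": math.sqrt(var) or 1.0}  # avoid dividing by zero


def transform(value, stats):
    """The single code path used by both the training job and the serving handler."""
    if value is None:  # missing values are imputed with the training mean
        value = stats["mean"]
    return (value - stats["mean"]) / stats["std"]


# Training side: fit once, transform, and store the stats next to the checkpoint.
raw_column = [10.0, 20.0, None, 30.0]
stats = fit_stats(raw_column)
train_features = [transform(v, stats) for v in raw_column]
serialized = json.dumps(stats)

# Serving side: load the same stats and call the same function.
serving_stats = json.loads(serialized)
served = transform(20.0, serving_stats)
```

Skew appears the moment the serving side re-derives the mean from live traffic or fills missing values with zero instead. Keeping one function and one stats file removes both opportunities.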

Model Registry: Controlling What Goes to Production

A model registry stores versioned model artifacts with associated metadata: training metrics, training data version, git commit, config, and lifecycle stage. The lifecycle stages - Staging, Production, Archived - provide a formal promotion process that replaces the informal “upload the weights file to S3 and hope” approach.

The pattern:

Training job trains a model, evaluates it, and registers it to the model registry with status Staging.
An automated evaluation job (or a human reviewer) compares the new model against the current Production model on a held-out evaluation set.
If the new model passes evaluation criteria, it is promoted to Production. The old production model moves to Archived.
The serving system is configured to load the Production model from the registry.

MLflow’s Model Registry implements this with a REST API and web UI (recent MLflow releases deprecate the fixed stages in favor of model version aliases and tags, but the promotion workflow is the same). W&B Artifacts provides a similar capability through artifact lineage and aliases. AWS SageMaker Model Registry and GCP Vertex AI Model Registry provide managed versions of the same concept integrated with their respective cloud platforms.

The value of a model registry is not primarily technical - it is organizational. It forces a formal approval process for model promotion, creates an audit trail (“who promoted this model and when?”), and makes rollback trivial (“revert Production to the previous Archived version”).
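The promotion pattern, audit trail, and rollback can be sketched with a toy in-memory registry. This is an illustration of the workflow only, not the API of MLflow or any managed registry; the class names, the single higher-is-better metric, and the rule that rollback restores the highest-numbered Archived version are all assumptions of this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelVersion:
    version: int
    metric: float  # held-out evaluation metric, higher is better
    stage: str = "Staging"


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, metric):
        """New models always enter the registry in Staging."""
        version = len(self.versions) + 1
        self.versions[version] = ModelVersion(version, metric)
        return version

    def production(self):
        for mv in self.versions.values():
            if mv.stage == "Production":
                return mv
        return None

    def _set_production(self, version, actor, action):
        current = self.production()
        if current is not None:
            current.stage = "Archived"
        self.versions[version].stage = "Production"
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((timestamp, actor, action, version))

    def promote(self, version, actor, min_gain=0.0):
        """Promote only if the candidate beats current Production by min_gain."""
        candidate = self.versions[version]
        current = self.production()
        if current is not None and candidate.metric < current.metric + min_gain:
            return False
        self._set_production(version, actor, "promote")
        return True

    def rollback(self, actor):
        """Restore the highest-numbered Archived version to Production."""
        archived = [mv for mv in self.versions.values() if mv.stage == "Archived"]
        if not archived:
            return None
        previous = max(archived, key=lambda mv: mv.version)
        self._set_production(previous.version, actor, "rollback")
        return previous.version
```

The serving system only ever asks the registry for production(), so promotion and rollback never require touching serving code.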

The MLOps Stack: AWS and GCP

If you are building a production ML system on AWS, the relevant services are:

SageMaker Experiments: experiment tracking (MLflow alternative)
SageMaker Feature Store: online and offline feature storage
SageMaker Model Registry: versioned model catalog with lifecycle management
SageMaker Pipelines: orchestrated ML workflows (training → evaluation → registration → deployment)
SageMaker Endpoints: managed model serving with autoscaling

The advantage: everything is integrated and managed. The disadvantage: you are locked into AWS’s abstraction layer, which has different APIs, different concepts, and different limitations than the open-source ecosystem. Teams often run SageMaker for serving while using W&B for experiment tracking and DVC for data versioning, mixing managed and open-source tools.

On GCP:

Vertex AI Pipelines: Kubeflow-compatible ML workflow orchestration
Vertex AI Feature Store: fully managed feature store
Vertex AI Model Registry: model versioning and deployment
Vertex AI Experiments: integrated experiment tracking

Uber built Michelangelo as an internal ML platform that provided essentially all of these capabilities in-house, before managed equivalents were widely available. DoorDash built a similar system. These internal platforms are the origin of most MLOps concepts now commercialized as managed services.

Testing ML Code: What’s Different

ML code is harder to test than deterministic software because “correctness” is often statistical. But many components are deterministic and should have unit tests. The principle: separate the numerics (which are hard to unit test) from the logic (which isn’t).

import numpy as np
import torch

# normalize, MyTransformer, MyModel, make_fake_batch and training_step are
# the project's own code under test, imported from your package.

# Test data transforms - deterministic, should be tested
def test_normalize_clips_to_unit_range():
    x = np.array([-10., 0., 10.])
    out = normalize(x)
    assert out.min() >= -1.0 and out.max() <= 1.0

# Test model output shapes - deterministic, should be tested
def test_model_output_shape():
    model = MyTransformer(d_model=128, n_heads=4, n_layers=2)
    x = torch.randint(0, 1000, (4, 32))  # batch=4, seq_len=32
    logits = model(x)
    assert logits.shape == (4, 32, 1000)  # (batch, seq_len, vocab_size)

# Smoke test: does the training loop run without error?
def test_training_step_runs():
    model = MyModel()
    batch = make_fake_batch(size=4)
    loss = training_step(model, batch)
    assert torch.isfinite(loss)  # neither NaN nor inf
    loss.backward()
    # gradients actually reached the parameters
    assert any(p.grad is not None for p in model.parameters())

The most important test that is not a unit test is the “overfit a single batch” test: take one batch of data, run gradient descent on it for 200+ steps, and verify that training loss approaches zero. A model that cannot memorize 4 examples almost certainly has a bug in the forward pass, the loss function, or the optimizer wiring. This catches silent correctness failures before they waste GPU hours.
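The check is framework-agnostic: it only needs a function that performs one optimization step on a fixed batch and returns the loss. The sketch below uses a tiny pure-Python linear model as a stand-in for a real network so that it runs anywhere; overfits_single_batch and make_linear_step are hypothetical names, and the train_bias=False switch simulates a bug where one parameter never receives updates.

```python
import math


def overfits_single_batch(step_fn, n_steps=200, tol=1e-3):
    """Call step_fn repeatedly on one fixed batch; pass if the loss ends below tol."""
    loss = math.inf
    for _ in range(n_steps):
        loss = step_fn()
        if not math.isfinite(loss):  # NaN or inf means training diverged
            return False
    return loss < tol


def make_linear_step(xs, ys, lr=0.2, train_bias=True):
    """Toy training step: linear model, MSE loss, plain gradient descent."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = [0.0]  # wrapped in a list so the closure can update it

    def step():
        preds = [sum(w * x for w, x in zip(weights, row)) + bias[0] for row in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        for j in range(n_features):
            grad = 2 * sum(e * row[j] for e, row in zip(errors, xs)) / len(xs)
            weights[j] -= lr * grad
        if train_bias:
            bias[0] -= lr * 2 * sum(errors) / len(xs)
        return loss  # loss measured before this step's update

    return step


# One fixed batch of 4 examples generated by y = 2*x0 - x1 + 0.5.
batch_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
batch_y = [2.5, -0.5, 1.5, 3.5]

healthy = overfits_single_batch(make_linear_step(batch_x, batch_y))
buggy = overfits_single_batch(make_linear_step(batch_x, batch_y, train_bias=False))
print(healthy, buggy)  # prints: True False
```

With a real model, step_fn would wrap the usual forward pass, loss, backward pass, and optimizer step on the same batch each time. The buggy variant still trains and its loss still goes down, which is exactly why a plain smoke test would not catch it.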

For CI pipelines: run unit tests and smoke tests on every commit. Reserve full training runs for nightly jobs or manual triggers - they are too slow for pre-commit.

When MLOps is Overkill

MLOps tooling is fragmented, expensive to set up, and often over-engineered for small teams. The maturity question: which tools do you actually need?

Stage 1 (one person, one model): git, pinned requirements, a training script that logs to a file, model checkpoints saved with meaningful names. That’s it. Adding W&B is easy and worth it. Anything else is premature.

Stage 2 (small team, multiple models): W&B or MLflow for experiment tracking, DVC for data versioning, a simple model registry (W&B Artifacts or MLflow). Shared preprocessing library to avoid training-serving skew.

Stage 3 (large team, production ML): Full MLOps stack - feature store, automated pipelines, model registry with promotion workflow, drift monitoring, automated retraining. This is where Feast, SageMaker Pipelines, or internal platforms make sense.

The cost of going to Stage 3 too early is real: engineers spend months building MLOps infrastructure instead of building models, the tools add cognitive overhead to every workflow, and small teams lack the model volume to amortize the setup cost.

LLM-Specific MLOps: Prompt Versioning and Evaluation

Large language models introduce new challenges that traditional MLOps tooling wasn’t designed for.

Prompt versioning: LLM behavior is controlled by prompts as much as by weights. Prompt changes are code changes - they should be versioned, reviewed, and tested. Tools like PromptLayer and W&B Prompts treat prompt templates as versioned artifacts.
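A minimal form of prompt versioning is to derive a version identifier from the template text itself and attach it to every call, so logged outputs can always be traced back to the exact prompt that produced them. The sketch below assumes nothing beyond the standard library; prompt_version, render, and the example template are hypothetical and not the API of any prompt-management tool.

```python
import hashlib
from string import Template


def prompt_version(template_text):
    """Short content hash: any edit to the template yields a new version id."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:12]


def render(template_text, **params):
    """Fill in the template and return it together with its version id."""
    return {
        "prompt_version": prompt_version(template_text),
        "prompt": Template(template_text).substitute(**params),
    }


SUMMARIZE_V1 = "Summarize the following text in $n_sentences sentences:\n$text"

call = render(SUMMARIZE_V1, n_sentences=2, text="DVC tracks data.")
```

Keeping the templates in files under git gives review and history; the hash gives a compact key to log alongside each model response.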

Evaluation frameworks: Traditional classification metrics (accuracy, F1) don’t capture LLM quality. HELM (Holistic Evaluation of Language Models) standardizes evaluation across multiple dimensions: accuracy, calibration, robustness, fairness, efficiency. Evaluating a fine-tuned model against HELM before promotion is the LLM analog of running an evaluation suite before promoting a traditional model.

LLM observability: Production LLM systems need monitoring for output quality, latency, cost, and safety. Latency and cost matter for any served model, but output quality and safety have no ground-truth label to compare against at serving time. Tools like Langfuse, Helicone, and Arize Phoenix provide tracing and monitoring for LLM applications. Detecting when a model’s output distribution shifts (is it generating more refusals? more hallucinations?) requires a different kind of monitoring than watching prediction accuracy on a test set.
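As one concrete example of output-distribution monitoring, the refusal rate in a recent window can be compared against a baseline window with a two-proportion z-test. The sketch below is a rough illustration: the keyword-based refusal detector is a crude, hypothetical stand-in for a real classifier, and the alert threshold is something each team would have to choose.

```python
import math

# Crude, hypothetical markers; a production system would use a classifier.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def count_refusals(outputs):
    """Number of outputs containing at least one refusal marker."""
    return sum(any(m in text.lower() for m in REFUSAL_MARKERS) for text in outputs)


def rate_shift_z(baseline_hits, baseline_n, current_hits, current_n):
    """Two-proportion z-score for the change in rate from baseline to current."""
    p_base = baseline_hits / baseline_n
    p_curr = current_hits / current_n
    pooled = (baseline_hits + current_hits) / (baseline_n + current_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / current_n))
    return 0.0 if se == 0 else (p_curr - p_base) / se


# Example: 50 refusals in 1000 baseline responses vs 120 in the latest 1000.
z = rate_shift_z(50, 1000, 120, 1000)
alert = abs(z) > 3.0  # threshold is an assumption, tune per application
```

The same comparison works for any per-response flag, such as a hallucination grader’s verdict or a safety filter hit.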

These are genuinely hard problems with no settled solutions. The LLM MLOps space in 2025 is where traditional MLOps was in 2018: a proliferation of tools, no dominant standard, and most of the hard work done by internal platforms at large companies.

Long-running LLM agent systems. Deploying a model as a chatbot is fundamentally different from deploying it as an autonomous agent that runs for hours, uses tools, and produces complex artifacts. A serious long-running agent is closer to a small software team running inside a sandbox than it is to a chatbot.

The three-component architecture. Production agentic systems typically use a planner-worker-evaluator loop. The planner takes a high-level goal and produces a structured task graph: a set of tasks with dependencies, success criteria, and tool requirements. The worker (usually the same or smaller LM) executes individual tasks from the graph, calling tools (code execution, web search, file read/write, API calls) and writing structured artifacts. The evaluator checks whether each task’s output meets the success criteria - this can be automated (unit tests, linting, schema validation) or LM-based (an LM grades the output against a rubric). The harness runs the loop: scheduler picks the next ready task, worker executes it, evaluator scores it, harness decides whether to accept, retry, or escalate to a human.
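The loop described above can be sketched in a few lines. In this minimal illustration the planner’s output is hard-coded as a list of tasks, and the worker and evaluator are plain functions standing in for LM calls; Task, run_harness, and the retry limit are hypothetical names and choices, not a specific framework’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)
    status: str = "pending"  # pending -> done | escalated
    attempts: int = 0
    output: object = None


def run_harness(tasks, worker, evaluator, max_attempts=3):
    """Scheduler picks a ready task, worker runs it, evaluator accepts or rejects."""
    by_name = {t.name: t for t in tasks}
    while True:
        ready = [
            t for t in tasks
            if t.status == "pending"
            and all(by_name[d].status == "done" for d in t.deps)
        ]
        if not ready:
            break
        task = ready[0]
        task.attempts += 1
        output = worker(task)
        if evaluator(task, output):
            task.status, task.output = "done", output   # accept
        elif task.attempts >= max_attempts:
            task.status = "escalated"                   # hand over to a human
        # otherwise the task stays pending and is retried on the next pass
    return {t.name: t.status for t in tasks}


def fake_worker(task):
    """Stand-in for an LM worker: one task is flaky, one never succeeds."""
    if task.name == "deploy":
        return None
    if task.name == "write_tests" and task.attempts < 2:
        return None
    return f"artifact for {task.name}"


def fake_evaluator(task, output):
    """Stand-in for automated checks: accept any non-empty output."""
    return output is not None


tasks = [
    Task("write_code"),
    Task("write_tests", deps=["write_code"]),
    Task("deploy", deps=["write_tests"]),
    Task("announce", deps=["deploy"]),
]
result = run_harness(tasks, fake_worker, fake_evaluator)
```

Here write_tests succeeds on its second attempt, deploy is escalated after three failures, and announce is never started because its dependency was not completed.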

State and checkpointing. The hardest engineering problem in long-running agents is state management. The agent cannot keep all context in a single prompt - a multi-hour coding session has far more context than any context window. The solution is external memory: structured artifacts written to disk at each step (task status, intermediate results, error logs, next-step instructions). Every step writes its outputs explicitly before proceeding. The harness saves a checkpoint after each accepted task, so the agent can resume from any point without re-executing completed work.
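A minimal version of this checkpointing discipline writes the full state to disk after every accepted task and skips completed tasks on restart. The sketch below assumes a JSON file as the external memory and uses hypothetical names (save_checkpoint, load_checkpoint, run_tasks); the write goes to a temporary file first so a crash mid-write cannot corrupt the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or an empty state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"completed": {}}
    with open(path) as f:
        return json.load(f)


def run_tasks(task_names, execute, checkpoint_path):
    """Execute tasks in order, skipping any that a previous run already finished."""
    state = load_checkpoint(checkpoint_path)
    for name in task_names:
        if name in state["completed"]:
            continue  # resume: do not redo accepted work
        state["completed"][name] = execute(name)
        save_checkpoint(checkpoint_path, state)
    return state
```

If the process dies during the third task, rerunning run_tasks with the same checkpoint path picks up at the third task; the first two results are read back from disk rather than recomputed.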

Human-in-the-loop. Fully autonomous agents make mistakes that compound. Production systems define escalation criteria: actions above a certain risk level (deleting files, pushing to production, making external API calls with side effects) require human approval before execution. The harness pauses the loop, surfaces the pending action with context, waits for approval or rejection, then resumes. This is not a failure mode - it is a design choice that makes the system safe to deploy on consequential tasks.
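The escalation rule can be expressed as a gate in front of action execution. The sketch below is a minimal illustration; the set of high-risk action kinds and the request_approval callback (which in a real system would block on a human decision) are assumptions of this example.

```python
# Hypothetical action kinds that require human sign-off before execution.
HIGH_RISK_ACTIONS = {"delete_file", "push_to_production", "external_api_write"}


def execute_with_gate(action, run_action, request_approval):
    """Run low-risk actions directly; pause high-risk ones for approval."""
    if action["kind"] in HIGH_RISK_ACTIONS:
        if not request_approval(action):
            return {"status": "rejected", "action": action}
    return {"status": "executed", "result": run_action(action)}
```

Because rejection is returned as an ordinary result rather than raised as an error, the harness can record it, replan, and continue with the remaining tasks.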

Why the harness matters more than the model. The harness decides what files the agent can see, which tools it can call, where progress is written, how failures are detected, and when to involve a human. Without a well-engineered harness, even a capable model tends to do too much at once, lose context, leave work undocumented, or declare tasks done prematurely. Anthropic’s engineering work on long-running agents describes these as the failure modes that harness design solves - not model capability problems, but system design problems.

Summary

Tool Category	Open Source Options	Managed Options	When You Need It
Experiment tracking	MLflow, DVC	W&B, Comet	Always
Data versioning	DVC, Delta Lake	LakeFS, Pachyderm	When data changes
Feature store	Feast	Tecton, Vertex AI	Multiple models, real-time features
Model registry	MLflow	W&B, SageMaker	When models go to production
Pipeline orchestration	Kubeflow, Airflow	SageMaker Pipelines, Vertex AI	Production automation
LLM observability	Langfuse	Helicone, Arize	LLM production serving

MLOps Maturity Stage	Team Size	Tools Needed
Stage 1	1 person	git + W&B + checkpoint names
Stage 2	2 - 10 people	+ DVC + model registry + shared preprocessing
Stage 3	10+ people	Full stack + feature store + automated retraining
