Software Engineering for ML - Making Research Survive Contact With Production
Helpful context:
- Docker & Containerization - Packaging Code So It Runs the Same Everywhere
- Monitoring & Observability - Knowing What Your System Is Doing
You train a model that achieves 94% accuracy. Six months later, the same model evaluated on new data achieves 87%. You go to investigate. The original experiment ran in a Jupyter notebook that has been modified since. The weights file is named model_final_v3_FINAL.pt. There is no record of which dataset split it was trained on, no log of the hyperparameters, no indication of which preprocessing pipeline was applied. You cannot reproduce the result. You cannot diagnose the regression. You are flying blind.
This is why MLOps exists - not as bureaucratic overhead, but as the discipline that makes ML systems trustworthy over time. The core insight from Sculley et al.’s 2015 paper “Hidden Technical Debt in Machine Learning Systems” (written by a team at Google, published at NIPS 2015, and still widely cited) is that ML code is a small fraction of a real ML system. The rest is data ingestion, feature engineering, monitoring, and serving infrastructure. And that infrastructure, if built without engineering discipline, accrues technical debt quickly.
The Problem Is Reproducibility
Before any specific tool, the foundational question is: can you reproduce this result?
Reproducing a training run requires knowing four things precisely:
- The exact code that ran (git commit hash)
- The exact data that was used (data version and split)
- The exact hyperparameters and configuration
- The exact software environment (library versions, CUDA version, Python version)
If any one of these is missing, you cannot reproduce the run, which means you cannot diagnose failures, compare experiments fairly, or audit production models. This sounds obvious. In practice, teams violate all four constantly.
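As a rough illustration of recording those four inputs together, here is a minimal sketch using only the standard library. The function names (`git_commit`, `build_manifest`) and the manifest layout are assumptions made for this example, not part of any tool discussed here; a real manifest would also capture library versions (for example from `pip freeze`) and the CUDA version.

```python
import hashlib
import json
import platform
import subprocess
import sys


def git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_manifest(config: dict, data_bytes: bytes) -> dict:
    """Collect code, data, config, and environment into one JSON-serializable record."""
    return {
        "code": {"git_commit": git_commit()},
        "data": {"sha256": hashlib.sha256(data_bytes).hexdigest()},
        "config": config,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }


# Hypothetical config and a tiny in-memory stand-in for a dataset file.
manifest = build_manifest({"learning_rate": 3e-3, "batch_size": 64}, b"id,label\n1,0\n")
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Saving a record like this next to every checkpoint is the low-tech version of what the tracking tools below automate.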
Seeding is table stakes:
import random
import numpy as np
import torch
def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
This alone is insufficient. Setting cudnn.deterministic = True forces cuDNN to use deterministic algorithms, which can reduce throughput noticeably (slowdowns in the range of 20 - 30% are plausible, though the cost depends heavily on the model). Some operations are implemented differently across CUDA and cuDNN versions, so the same seed can produce different results on CUDA 11.8 vs CUDA 12.2. Hardware differences matter too - in multi-GPU training with torch.distributed, the floating-point order in which gradients are aggregated can change with the number of devices and the communication setup. “Reproducible” in practice means “reproducible on the same hardware and software version,” which is already useful, and requires genuine effort to achieve.
Experiment Tracking: The Audit Log for ML
Experiment tracking tools log hyperparameters, metrics, artifacts, system information, and code state for every training run. The essential functionality: given any two past runs, you can reproduce either one and understand exactly what differed between them.
Weights & Biases (W&B) is one of the most widely used in practice:
import wandb
from dataclasses import asdict, dataclass
@dataclass
class TrainingConfig:
    learning_rate: float
    batch_size: int
    num_epochs: int
# model, train_loader, val_loader, train_one_epoch, and evaluate are assumed to be defined elsewhere
config = TrainingConfig(learning_rate=3e-3, batch_size=64, num_epochs=50)
run = wandb.init(project="image-classifier", config=asdict(config))
for epoch in range(config.num_epochs):
    train_loss = train_one_epoch(model, train_loader, config)
    val_acc = evaluate(model, val_loader)
    wandb.log({"train_loss": train_loss, "val_accuracy": val_acc, "epoch": epoch})
# Log model checkpoint as a versioned artifact (assumes checkpoint.pt was saved during training)
artifact = wandb.Artifact("model", type="model")
artifact.add_file("checkpoint.pt")
run.log_artifact(artifact)
W&B automatically captures the git commit hash, the pip environment, GPU utilization, system memory, and any artifacts you explicitly log. The UI allows filtering runs by metric, creating scatter plots of hyperparameter vs performance, and running sweeps (automated hyperparameter search across many configurations).
MLflow is the open-source, self-hosted alternative. The API is similar, the UI is less polished, but it runs on your own infrastructure - important for teams with data privacy constraints that prevent using external services.
The critical habit: start every experiment by calling wandb.init or mlflow.start_run. Do this automatically in your training script so it can’t be forgotten. Log the full resolved config object, not just selected hyperparameters. Future you, six months from now, will thank present you.
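To make "log the full resolved config" concrete, here is a small sketch that resolves defaults plus one override and flattens the nested result into dotted keys, which most trackers display well. The config classes and the `flatten` helper are hypothetical names introduced for this example; the resulting dict is what you would pass as the tracker's config.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class OptimizerConfig:
    name: str = "adamw"
    learning_rate: float = 3e-3
    weight_decay: float = 0.01


@dataclass
class TrainingConfig:
    batch_size: int = 64
    num_epochs: int = 50
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)


def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {"optimizer.name": "adamw"}."""
    flat = {}
    for key, value in d.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{full_key}."))
        else:
            flat[full_key] = value
    return flat


# Defaults plus one override, resolved before anything is logged.
config = TrainingConfig(optimizer=OptimizerConfig(learning_rate=1e-4))
resolved = flatten(asdict(config))
print(resolved)
```

The point of logging `resolved` rather than only the overridden value is that defaults such as `weight_decay` are recorded too, so a later change to a default does not silently alter what an old run appears to have used.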
Data Versioning: The Ignored Half of the Problem
Model checkpoints get versioned. Datasets rarely do. This is backwards - the data is often the most important input and the hardest to recreate.
DVC (Data Version Control) tracks large files in external storage (S3, GCS, Azure Blob, local) while storing lightweight .dvc pointer files in git. You get git’s versioning semantics for data that is too large to commit directly.
dvc init
dvc add data/train.csv # creates data/train.csv.dvc (committed to git)
git add data/train.csv.dvc .gitignore
git commit -m "add training data v1"
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push # uploads data to S3
# On a new machine, or after git checkout of an older commit:
dvc pull # downloads exactly the data for this commit
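The pointer-file idea can be sketched in a few lines of Python. This is an illustration of the concept only, not DVC's actual file format or hashing scheme: the `.pointer.json` suffix, the helper names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to the data file and return its path."""
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def matches_pointer(data_path: Path, pointer_path: Path) -> bool:
    """True if the data on disk is exactly what the pointer recorded."""
    return json.loads(pointer_path.read_text())["sha256"] == file_sha256(data_path)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,label\n1,0\n2,1\n")
    ptr = write_pointer(data)
    unchanged = matches_pointer(data, ptr)
    data.write_text("id,label\n1,0\n2,1\n3,1\n")  # simulate a silent data edit
    changed = not matches_pointer(data, ptr)
```

Only the tiny pointer file needs to live in git; the hash inside it is what ties a commit to one exact version of the data.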
DVC also defines reproducible pipelines: stages with explicit inputs, outputs, and commands. It tracks which outputs are stale and reruns only the stages whose inputs changed - like make for ML pipelines.
# dvc.yaml
stages:
  preprocess:
    cmd: python src/data/preprocess.py
    deps: [data/raw/, src/data/preprocess.py]
    outs: [data/processed/]
  train:
    cmd: python src/training/train.py --config configs/train.yaml
    deps: [data/processed/, src/training/train.py, configs/train.yaml]
    outs: [models/checkpoint.pt]
    metrics: [metrics/val_loss.json]
dvc repro checks which stage outputs are stale and reruns only those. This is the difference between ML pipelines as a series of manual steps you have to remember, and ML pipelines as a versioned, reproducible artifact.
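The staleness check behind this make-like behavior can be approximated as "fingerprint each stage's dependencies and compare against what was recorded after the last run". The sketch below shows that idea with in-memory byte strings standing in for files; the function names and the lock structure are assumptions for illustration and do not mirror DVC's internals.

```python
import hashlib


def fingerprint(deps: dict[str, bytes]) -> str:
    """Hash dependency names and contents into one stage fingerprint."""
    digest = hashlib.sha256()
    for name in sorted(deps):
        digest.update(name.encode())
        digest.update(hashlib.sha256(deps[name]).digest())
    return digest.hexdigest()


def stale_stages(stages: dict[str, dict[str, bytes]], lock: dict[str, str]) -> list[str]:
    """Return stages whose current fingerprint differs from the recorded one."""
    return [name for name, deps in stages.items() if lock.get(name) != fingerprint(deps)]


stages = {
    "preprocess": {"data/raw": b"raw-v1", "preprocess.py": b"code-v1"},
    "train": {"data/processed": b"processed-v1", "train.py": b"code-v1"},
}
lock = {name: fingerprint(deps) for name, deps in stages.items()}  # state after a full run
nothing_stale = stale_stages(stages, lock)

stages["train"]["train.py"] = b"code-v2"  # edit only the training script
after_edit = stale_stages(stages, lock)
print(nothing_stale, after_edit)
```

Because only `train` changed, only `train` is reported as stale; the expensive preprocessing stage is skipped.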
Delta Lake (from Databricks, now open-source) solves a related problem: versioning for large tabular datasets that change incrementally. It adds ACID transactions and time travel to Parquet files on object storage. You can query “what did my training dataset look like three months ago?” without maintaining separate copies, provided the table's history retention settings keep the older versions around.
Feature Stores: Closing the Training-Serving Skew
Here is a subtle failure mode that causes silent accuracy degradation without any obvious error.
At training time, you compute features from raw data. At serving time, the same features need to be computed for incoming requests. If the training pipeline and serving pipeline are implemented separately - which they almost always are - they will drift. Different normalization logic. Different handling of missing values. Different timestamp semantics. The model was trained on features computed one way and served on features computed another way. This is training-serving skew, and it is extremely common.
Feature stores solve this by centralizing feature computation logic so that training and serving use the same code path.
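Before reaching for a feature store, the same-code-path principle can be shown with a single shared function that both the offline and online paths import. The feature names and transformations below are hypothetical; the only point is that missing-value handling and normalization live in exactly one place.

```python
import math
from typing import Optional


def compute_features(order_total: Optional[float], orders_7d: Optional[int]) -> list[float]:
    """Single source of truth for feature logic, imported by training and serving."""
    total = 0.0 if order_total is None else order_total  # missing value becomes 0
    count = 0 if orders_7d is None else orders_7d
    return [math.log1p(max(total, 0.0)), float(min(count, 50))]


def build_training_rows(records: list[dict]) -> list[list[float]]:
    """Offline path: batch over historical records."""
    return [compute_features(r.get("order_total"), r.get("orders_7d")) for r in records]


def handle_request(payload: dict) -> list[float]:
    """Online path: one incoming request."""
    return compute_features(payload.get("order_total"), payload.get("orders_7d"))


# Parity check: the same raw record must yield identical features on both paths.
record = {"order_total": 120.0, "orders_7d": None}
parity_ok = build_training_rows([record])[0] == handle_request(record)
```

A parity check like the last line is worth keeping as a regression test even after a feature store is adopted.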
Feast (open source) separates feature definitions (what a feature is and how to compute it) from the physical storage (where historical feature values are stored for training, and where current values are stored for serving). The same feature definition produces both the training dataset and the online serving lookup:
# Feature definition: same logic used for training and serving
# Note: this follows an older Feast-style API for illustration; class and argument names differ across Feast versions
from datetime import timedelta
user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=30),
    features=[
        Feature(name="purchase_count_7d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    batch_source=BigQuerySource(table="analytics.user_features"),
    stream_source=KafkaSource(topic="user_events"),
)
# store is assumed to be an initialized Feast FeatureStore; labeled_examples is a DataFrame of entity keys and event timestamps
# At training time: retrieve historical features for labeled examples
training_df = store.get_historical_features(
    entity_df=labeled_examples,
    features=["user_features:purchase_count_7d", "user_features:avg_order_value"],
).to_df()
# At serving time: retrieve current features for incoming request
feature_vector = store.get_online_features(
    entity_rows=[{"user_id": user_id}],
    features=["user_features:purchase_count_7d"],
).to_dict()
Tecton is the commercial version of this idea, used at companies like Spotify, Atlassian, and Faire. It adds managed infrastructure, feature monitoring, and streaming feature pipelines. The tradeoff: Feast is free and self-managed; Tecton is expensive and managed.
Whether you need a feature store depends on scale. For a team with one or two models and a simple batch training pipeline, a shared preprocessing library may be sufficient. For teams running dozens of models with real-time feature requirements, a feature store is load-bearing infrastructure.
Model Registry: Controlling What Goes to Production
A model registry stores versioned model artifacts with associated metadata: training metrics, training data version, git commit, config, and lifecycle stage. The lifecycle stages - Staging, Production, Archived - provide a formal promotion process that replaces the informal “upload the weights file to S3 and hope” approach.
The pattern:
- Training job trains a model, evaluates it, and registers it to the model registry with status Staging.
- An automated evaluation job (or a human reviewer) compares the new model against the current Production model on a held-out evaluation set.
- If the new model passes evaluation criteria, it is promoted to Production. The old production model moves to Archived.
- The serving system is configured to load the Production model from the registry.
MLflow’s Model Registry implements this with a REST API and web UI (newer MLflow releases favor model version aliases over fixed stage names, but the promotion workflow is the same). W&B Artifacts provides a similar capability through artifact lineage and aliases. AWS SageMaker Model Registry and GCP Vertex AI Model Registry provide managed versions of the same concept integrated with their respective cloud platforms.
The value of a model registry is not primarily technical - it is organizational. It forces a formal approval process for model promotion, creates an audit trail (“who promoted this model and when?”), and makes rollback trivial (“revert Production to the previous Archived version”).
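The promotion, audit, and rollback mechanics can be illustrated with a toy in-memory registry. This is a sketch of the workflow only; the class and method names are invented for the example and do not correspond to the MLflow, W&B, SageMaker, or Vertex AI APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ModelRegistry:
    """Toy registry: version -> stage, plus an append-only audit log."""
    stages: dict[int, str] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, action: str, version: int, user: str) -> None:
        self.audit_log.append({
            "action": action, "version": version, "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def register(self, version: int, user: str) -> None:
        self.stages[version] = "Staging"
        self._record("register", version, user)

    def production_version(self) -> Optional[int]:
        return next((v for v, s in self.stages.items() if s == "Production"), None)

    def promote(self, version: int, user: str) -> None:
        """Move `version` to Production and archive the current Production model."""
        current = self.production_version()
        if current is not None:
            self.stages[current] = "Archived"
        self.stages[version] = "Production"
        self._record("promote", version, user)


registry = ModelRegistry()
registry.register(1, user="train-job")
registry.promote(1, user="alice")
registry.register(2, user="train-job")
registry.promote(2, user="bob")
registry.promote(1, user="alice")  # rollback: re-promote the archived version
```

Rollback is just another promotion, and the audit log answers who changed Production and when without anyone having to remember.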
The MLOps Stack: AWS and GCP
If you are building a production ML system on AWS, the relevant services are:
- SageMaker Experiments: experiment tracking (MLflow alternative)
- SageMaker Feature Store: online and offline feature storage
- SageMaker Model Registry: versioned model catalog with lifecycle management
- SageMaker Pipelines: orchestrated ML workflows (training → evaluation → registration → deployment)
- SageMaker Endpoints: managed model serving with autoscaling
The advantage: everything is integrated and managed. The disadvantage: you are locked into AWS’s abstraction layer, which has different APIs, different concepts, and different limitations than the open-source ecosystem. Teams often run SageMaker for serving while using W&B for experiment tracking and DVC for data versioning, mixing managed and open-source tools.
On GCP:
- Vertex AI Pipelines: Kubeflow-compatible ML workflow orchestration
- Vertex AI Feature Store: fully managed feature store
- Vertex AI Model Registry: model versioning and deployment
- Vertex AI Experiments: integrated experiment tracking
Uber built Michelangelo as an internal ML platform that essentially reinvented all of these capabilities. DoorDash built a similar system. These internal platforms are the origin of most MLOps concepts now commercialized as managed services.
Testing ML Code: What’s Different
ML code is harder to test than deterministic software because “correctness” is often statistical. But many components are deterministic and should have unit tests. The principle: separate the numerics (which are hard to unit test) from the logic (which isn’t).
import numpy as np
import torch
# normalize, MyTransformer, MyModel, make_fake_batch, and training_step are assumed to come from the project under test
# Test data transforms - deterministic, should be tested
def test_normalize_clips_to_unit_range():
    x = np.array([-10., 0., 10.])
    out = normalize(x)
    assert out.min() >= -1.0 and out.max() <= 1.0
# Test model output shapes - deterministic, should be tested
def test_model_output_shape():
    model = MyTransformer(d_model=128, n_heads=4, n_layers=2)
    x = torch.randint(0, 1000, (4, 32))  # batch=4, seq_len=32
    logits = model(x)
    assert logits.shape == (4, 32, 1000)  # (batch, seq_len, vocab), assuming a vocabulary of 1000
# Smoke test: does the training loop run without error?
def test_training_step_runs():
    model = MyModel()
    batch = make_fake_batch(size=4)
    loss = training_step(model, batch)
    assert torch.isfinite(loss)  # neither NaN nor infinite
    loss.backward()  # raises if the loss is detached from the graph
The most important test that is not a unit test is the “overfit a single batch” test: take one batch of data, run gradient descent on it for 200+ steps, and verify that training loss approaches zero. A model that cannot memorize 4 examples almost certainly has a bug in the forward pass, the loss function, or the optimizer wiring. This catches silent correctness failures before they waste GPU hours.
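Here is a framework-free sketch of that check, using a tiny linear model with hand-written gradients so it runs anywhere. In a real project you would substitute your own model, loss, and optimizer; the data, learning rate, step count, and threshold below are arbitrary choices for the illustration.

```python
def overfit_single_batch(steps: int = 500, lr: float = 0.1) -> float:
    """Fit a tiny linear model to one fixed batch of 4 examples; return the final MSE."""
    xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    ys = [1.0, 3.0, -2.0, 0.0]  # generated by y = 2*x1 - 3*x2 + 1
    w1, w2, b = 0.0, 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w1 * x1 + w2 * x2 + b - y for (x1, x2), y in zip(xs, ys)]
        # Gradients of mean squared error with respect to each parameter.
        g1 = sum(2 * e * x[0] for e, x in zip(errors, xs)) / n
        g2 = sum(2 * e * x[1] for e, x in zip(errors, xs)) / n
        gb = sum(2 * e for e in errors) / n
        w1, w2, b = w1 - lr * g1, w2 - lr * g2, b - lr * gb
    errors = [w1 * x1 + w2 * x2 + b - y for (x1, x2), y in zip(xs, ys)]
    return sum(e * e for e in errors) / n


final_loss = overfit_single_batch()
assert final_loss < 1e-6, f"model failed to memorize one batch: loss={final_loss}"
```

If you flip the sign of one gradient or drop the bias update, the assertion fails, which is exactly the class of bug this test is meant to surface.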
For CI pipelines: run unit tests and smoke tests on every commit. Reserve full training runs for nightly jobs or manual triggers - they are too slow to run on every commit.
When MLOps is Overkill
MLOps tooling is fragmented, expensive to set up, and often over-engineered for small teams. The maturity question: which tools do you actually need?
Stage 1 (one person, one model): git, pinned requirements, a training script that logs to a file, model checkpoints saved with meaningful names. That’s it. Adding W&B is easy and worth it. Anything else is premature.
Stage 2 (small team, multiple models): W&B or MLflow for experiment tracking, DVC for data versioning, a simple model registry (W&B Artifacts or MLflow). Shared preprocessing library to avoid training-serving skew.
Stage 3 (large team, production ML): Full MLOps stack - feature store, automated pipelines, model registry with promotion workflow, drift monitoring, automated retraining. This is where Feast, SageMaker Pipelines, or internal platforms make sense.
The cost of going to Stage 3 too early is real: engineers spend months building MLOps infrastructure instead of building models, the tools add cognitive overhead to every workflow, and small teams lack the model volume to amortize the setup cost.
LLM-Specific MLOps: Prompt Versioning and Evaluation
Large language models introduce new challenges that traditional MLOps tooling wasn’t designed for.
Prompt versioning: LLM behavior is controlled by prompts as much as by weights. Prompt changes are code changes - they should be versioned, reviewed, and tested. Tools like PromptLayer and W&B Prompts treat prompt templates as versioned artifacts.
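One lightweight way to treat prompts as versioned artifacts, independent of any particular tool, is to derive a version id from the template's content and log it with every request. The sketch below assumes a hypothetical `prompt_version` helper and two made-up templates; it is not how PromptLayer or W&B implement versioning.

```python
import hashlib
from string import Template


def prompt_version(template_text: str) -> str:
    """Short content hash: any edit to the template yields a new version id."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:12]


SUMMARIZE_V1 = "Summarize the following text in $n sentences:\n$text"
SUMMARIZE_V2 = "Summarize the following text in at most $n sentences:\n$text"

# Render the prompt and record which template version produced it.
rendered = Template(SUMMARIZE_V2).substitute(n=2, text="MLOps is ...")
log_record = {"prompt_version": prompt_version(SUMMARIZE_V2), "prompt": rendered}
print(log_record["prompt_version"])
```

Keeping the templates themselves in git gives you review and history; the hash in the log lets you join production outputs back to the exact wording that produced them.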
Evaluation frameworks: Traditional classification metrics (accuracy, F1) don’t capture LLM quality. HELM (Holistic Evaluation of Language Models) standardizes evaluation across multiple dimensions: accuracy, calibration, robustness, fairness, efficiency. Evaluating a fine-tuned model against HELM before promotion is the LLM analog of running an evaluation suite before promoting a traditional model.
LLM observability: Production LLM systems need monitoring for output quality, latency, cost, and safety - dimensions that don’t exist for traditional models. Tools like Langfuse, Helicone, and Arize Phoenix provide tracing and monitoring for LLM applications. Detecting when a model’s output distribution shifts (is it generating more refusals? more hallucinations?) requires a different kind of monitoring than watching prediction accuracy on a test set.
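As a simplified picture of output-distribution monitoring, the sketch below tracks the refusal rate over a sliding window and flags when it moves away from a baseline. The keyword-based refusal detector, the window size, and the tolerance are assumptions chosen for illustration; production systems typically rely on classifiers or LM-based graders rather than string matching.

```python
from collections import deque

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")  # illustrative, not exhaustive


def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


class RefusalRateMonitor:
    """Track refusal rate over a sliding window and flag drift from a baseline."""

    def __init__(self, baseline_rate: float, window: int = 100, tolerance: float = 0.10):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def observe(self, output_text: str) -> None:
        self.recent.append(is_refusal(output_text))

    def rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def drifted(self) -> bool:
        return abs(self.rate() - self.baseline_rate) > self.tolerance


monitor = RefusalRateMonitor(baseline_rate=0.05, window=10)
for text in ["Here is the summary."] * 6 + ["I cannot help with that."] * 4:
    monitor.observe(text)
print(monitor.rate(), monitor.drifted())
```

The same structure works for other output properties (response length, citation count, safety flags) by swapping out the per-output check.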
These are genuinely hard problems with no settled solutions. The LLM MLOps space in 2025 is where traditional MLOps was in 2018: a proliferation of tools, no dominant standard, and most of the hard work done by internal platforms at large companies.
Long-running LLM agent systems. Deploying a model as a chatbot is fundamentally different from deploying it as an autonomous agent that runs for hours, uses tools, and produces complex artifacts. A serious long-running agent is closer to a small software team running inside a sandbox than it is to a chatbot.
The three-component architecture. Production agentic systems typically use a planner-worker-evaluator loop. The planner takes a high-level goal and produces a structured task graph: a set of tasks with dependencies, success criteria, and tool requirements. The worker (usually the same or smaller LM) executes individual tasks from the graph, calling tools (code execution, web search, file read/write, API calls) and writing structured artifacts. The evaluator checks whether each task’s output meets the success criteria - this can be automated (unit tests, linting, schema validation) or LM-based (an LM grades the output against a rubric). The harness runs the loop: scheduler picks the next ready task, worker executes it, evaluator scores it, harness decides whether to accept, retry, or escalate to a human.
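A stripped-down version of that loop might look like the following. The `Task` structure, the `run_harness` function, and the stand-in worker and evaluator are all hypothetical; a real harness would call an LM in the worker and run real checks in the evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    name: str
    deps: list[str] = field(default_factory=list)
    status: str = "pending"  # pending -> done | escalated
    attempts: int = 0


def run_harness(tasks: dict[str, Task],
                worker: Callable[[Task], str],
                evaluator: Callable[[Task, str], bool],
                max_attempts: int = 2) -> list[str]:
    """Run ready tasks until none remain; return the order tasks were accepted in."""
    accepted = []
    while True:
        # Scheduler: a task is ready when it is pending and all dependencies are done.
        ready = [t for t in tasks.values() if t.status == "pending"
                 and all(tasks[d].status == "done" for d in t.deps)]
        if not ready:
            return accepted
        task = ready[0]
        task.attempts += 1
        output = worker(task)
        if evaluator(task, output):
            task.status = "done"
            accepted.append(task.name)
        elif task.attempts >= max_attempts:
            task.status = "escalated"  # hand off to a human; dependents stay blocked


graph = {
    "write_code": Task("write_code"),
    "write_tests": Task("write_tests", deps=["write_code"]),
    "run_tests": Task("run_tests", deps=["write_code", "write_tests"]),
}
order = run_harness(graph,
                    worker=lambda t: f"output of {t.name}",
                    evaluator=lambda t, out: out.endswith(t.name))
print(order)
```

Note that a failed task is retried up to `max_attempts` and then escalated, which leaves its dependents pending rather than letting the agent build on unverified work.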
State and checkpointing. The hardest engineering problem in long-running agents is state management. The agent cannot keep all context in a single prompt - a multi-hour coding session has far more context than any context window. The solution is external memory: structured artifacts written to disk at each step (task status, intermediate results, error logs, next-step instructions). Every step writes its outputs explicitly before proceeding. The harness saves a checkpoint after each accepted task, so the agent can resume from any point without re-executing completed work.
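The checkpoint-and-resume behavior can be sketched with a JSON state file that is rewritten after every accepted task. File names, the state layout, and the placeholder "work" are assumptions for the example; the write-to-temp-then-rename step is a common way to avoid leaving a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot corrupt the last good checkpoint."""
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(state))
    os.replace(tmp_path, path)


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "results": {}}


def run(tasks: list[str], path: Path, executed: list[str]) -> dict:
    """Execute tasks in order, skipping any that a previous run already completed."""
    state = load_checkpoint(path)
    for task in tasks:
        if task in state["completed"]:
            continue
        state["results"][task] = f"result of {task}"  # stand-in for real work
        state["completed"].append(task)
        executed.append(task)
        save_checkpoint(path, state)  # checkpoint after each accepted task
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "agent_state.json"
    first_run, second_run = [], []
    run(["plan", "implement"], ckpt, first_run)  # session ends early
    final = run(["plan", "implement", "test"], ckpt, second_run)  # resume
```

On the second call only `test` is executed, because the first two tasks were already recorded as complete on disk.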
Human-in-the-loop. Fully autonomous agents make mistakes that compound. Production systems define escalation criteria: actions above a certain risk level (deleting files, pushing to production, making external API calls with side effects) require human approval before execution. The harness pauses the loop, surfaces the pending action with context, waits for approval or rejection, then resumes. This is not a failure mode - it is a design choice that makes the system safe to deploy on consequential tasks.
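An approval gate of this kind can be as simple as a wrapper around action execution. The risk list, function names, and the deny-everything reviewer below are illustrative assumptions; a real system would block on a UI or chat prompt and log the decision.

```python
from typing import Callable

# Illustrative set of actions that require human sign-off.
HIGH_RISK_ACTIONS = {"delete_file", "push_to_production", "external_api_write"}


def execute_with_gate(action: str, args: dict,
                      run_action: Callable[[str, dict], str],
                      ask_human: Callable[[str, dict], bool]) -> str:
    """Run low-risk actions directly; pause for approval on high-risk ones."""
    if action in HIGH_RISK_ACTIONS and not ask_human(action, args):
        return f"rejected: {action}"
    return run_action(action, args)


executed = []


def fake_run(action: str, args: dict) -> str:
    """Stand-in executor that records what was actually run."""
    executed.append(action)
    return f"ok: {action}"


def deny_all(action: str, args: dict) -> bool:
    """Stand-in reviewer that approves nothing."""
    return False


r1 = execute_with_gate("read_file", {"path": "README.md"}, fake_run, deny_all)
r2 = execute_with_gate("delete_file", {"path": "data/"}, fake_run, deny_all)
print(r1, r2, executed)
```

The important property is that the rejected action never reaches the executor, so the gate cannot be bypassed by the worker simply proceeding.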
Why the harness matters more than the model. The harness decides what files the agent can see, which tools it can call, where progress is written, how failures are detected, and when to involve a human. Without a well-engineered harness, even a capable model tends to do too much at once, lose context, leave work undocumented, or declare tasks done prematurely. Anthropic’s internal long-running agent work explicitly describes these as the failure modes that harness design solves - not model capability problems, but system design problems.
Summary
| Tool Category | Open Source Options | Managed Options | When You Need It |
|---|---|---|---|
| Experiment tracking | MLflow, DVC | W&B, Comet | Always |
| Data versioning | DVC, Delta Lake | LakeFS, Pachyderm | When data changes |
| Feature store | Feast | Tecton, Vertex AI | Multiple models, real-time features |
| Model registry | MLflow | W&B, SageMaker | When models go to production |
| Pipeline orchestration | Kubeflow, Airflow | SageMaker Pipelines, Vertex AI | Production automation |
| LLM observability | Langfuse | Helicone, Arize | LLM production serving |

| MLOps Maturity Stage | Team Size | Tools Needed |
|---|---|---|
| Stage 1 | 1 person | git + pinned requirements + W&B + meaningful checkpoint names |
| Stage 2 | 2 - 10 people | + DVC + model registry + shared preprocessing |
| Stage 3 | 10+ people | Full stack + feature store + automated retraining |