Diagnosing DL Failures - Reading the Symptoms Your Model Leaves Behind // Megha Bose

Your loss is NaN on epoch 3. Or it plateaus at 0.8 when the paper reports 0.3. Or it’s oscillating wildly but trending down, and you can’t tell if you should wait or intervene. Your model trains for six hours and produces garbage predictions. You don’t have a traceback. You have a number - and the number is wrong.

These failures are not mysterious. They are diagnosable. Each symptom has a short list of likely causes, and there is a systematic order in which to test them. What separates engineers who iterate quickly from those who burn compute on broken experiments is not luck - it is a debugging workflow applied consistently.

The Hierarchy: Where to Look First

Before touching hyperparameters, follow this order:

1. Data - wrong labels, normalization bugs, leakage, NaN in inputs
2. Architecture - wrong output shapes, missing activations, gradient-blocking layers
3. Training procedure - loss function mismatch, optimizer misconfiguration, backward() called incorrectly
4. Hyperparameters - learning rate, batch size, weight decay, scheduling

Most failures are at level 1 or 2. Jumping to level 4 first (the tempting “just tune the learning rate” response) wastes time and sometimes coincidentally masks the real bug.

The Golden Rule: Overfit a Single Batch

Before running any full training job, run this first:

# Take exactly one batch. Train on it for 200+ steps.
batch = next(iter(train_loader))
for step in range(200):
    optimizer.zero_grad()
    loss = model_loss(model, batch)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step}: loss={loss.item():.6f}")
# Expected: loss should approach ~0 (or the irreducible floor)

A model that cannot memorize a single batch has a structural bug. The overfit test finds it cheaply - in seconds, not hours. If this test passes, the model, loss function, and backward pass are wired together correctly for that batch, and any remaining problems are most likely in the training process or in the data at scale.

This test catches: wrong loss function (cross-entropy applied to a regression target won’t converge to zero), disconnected forward pass (a layer accidentally returns zeros), gradient not flowing (forgetting to call loss.backward()), and data pipeline bugs (targets not matching inputs). Run it before every serious training job.
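A self-contained version of the check might look like the sketch below. The tiny MLP, the random 8-sample batch, and the `overfit_one_batch` helper are illustrative stand-ins, not part of any particular codebase; the step count and learning rate are assumptions chosen for a toy problem.

```python
import torch
import torch.nn as nn

def overfit_one_batch(model, loss_fn, x, y, steps=300, lr=1e-2):
    """Train on a single fixed batch and return the loss at every step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history

torch.manual_seed(0)
# Hypothetical toy setup: 8 random samples, 10 features, 3 classes.
x = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))

history = overfit_one_batch(model, nn.CrossEntropyLoss(), x, y)
print(f"first loss={history[0]:.4f}, last loss={history[-1]:.4f}")
```

If the last loss is not far below the first, suspect the model or the loss wiring before anything else.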

Training Loss Goes to NaN

NaN loss is the loudest failure, but the cause is usually one of four things.

Learning rate too high. The most common cause. An LR too large causes gradient updates to overshoot, producing activations that overflow to infinity, which propagate as NaN. Fix: reduce LR by 10x and retry. If NaN disappears, you found it.

Exploding gradients. In deep networks and RNNs, gradients compound multiplicatively over many layers or time steps. The gradient norm grows exponentially until it overflows. Diagnose by logging the gradient norm before every optimizer step:

def compute_grad_norm(model):
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# In training loop, before optimizer.step():
norm = compute_grad_norm(model)
wandb.log({"grad_norm": norm})
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

Log the pre-clip norm so you can see when it spikes; that is why the norm is computed before clip_grad_norm_ runs. (clip_grad_norm_ also returns the total pre-clip norm, so its return value can be logged directly.) Gradient clipping with max_norm=1.0 is standard for Transformers. For RNNs, max_norm=5.0 is conventional.
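To see the difference between the pre-clip and post-clip norm concretely, here is a small sketch on a toy linear layer. The inflated loss scale and the `total_grad_norm` helper are assumptions made only to force a large gradient.

```python
import torch
import torch.nn as nn

def total_grad_norm(parameters):
    """L2 norm of all gradients, viewed as one flat vector."""
    grads = [p.grad.detach().flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm(2).item()

torch.manual_seed(0)
model = nn.Linear(4, 2)
# Deliberately large loss scale so the gradient norm exceeds max_norm.
loss = (model(torch.randn(16, 4)) * 100.0).pow(2).mean()
loss.backward()

manual_norm = total_grad_norm(model.parameters())          # before clipping
pre_clip = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
post_clip = total_grad_norm(model.parameters())            # after clipping

print(f"manual={manual_norm:.2f}, returned={pre_clip:.2f}, post-clip={post_clip:.4f}")
```

The value returned by clip_grad_norm_ matches the manually computed pre-clip norm, while the gradients left on the parameters have norm at most max_norm.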

Numerical instability in the loss. Cross-entropy involves log(softmax(logits)). If one logit is much larger than the rest, softmax underflows to exactly 0 for the other classes, and the log of that 0 is -inf. Always use the framework’s fused loss functions rather than composing softmax and log manually:

# WRONG: numerically unstable
loss = -torch.log(torch.softmax(logits, dim=-1)[range(B), labels]).mean()

# CORRECT: uses log-sum-exp internally
loss = torch.nn.functional.cross_entropy(logits, labels)

The safe version computes $\log\sum_i e^{x_i} = x_{\max} + \log\sum_i e^{x_i - x_{\max}}$, which is stable regardless of the magnitude of the logits.
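The identity can be checked in plain Python. This is a minimal sketch with hypothetical helper names (`naive_logsumexp`, `stable_logsumexp`); it is not how any framework implements the fused loss, only an illustration of why subtracting the maximum helps.

```python
import math

def naive_logsumexp(xs):
    # Direct evaluation: exp() overflows for large inputs.
    return math.log(sum(math.exp(x) for x in xs))

def stable_logsumexp(xs):
    # Subtract the maximum so the largest exponent is exp(0) = 1.
    x_max = max(xs)
    return x_max + math.log(sum(math.exp(x - x_max) for x in xs))

logits = [1000.0, 0.0]
try:
    naive_logsumexp(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_value = stable_logsumexp(logits)
print(f"naive overflowed: {naive_overflowed}, stable value: {stable_value}")
```

For small inputs the two functions agree; they only diverge once exp() leaves the representable range.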

NaN in the data. A single NaN in an input propagates through every operation, poisoning the loss. Check:

assert not torch.isnan(batch["x"]).any(), "NaN in input"
assert not torch.isnan(batch["y"]).any(), "NaN in target"

Add this assertion at the start of your training loop. The overhead is negligible; the diagnostic value is high.
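A slightly broader check can also catch infinities and out-of-range labels. The sketch below assumes a classification batch with float inputs and integer class labels; `validate_batch` and its `num_classes` argument are hypothetical names, not an existing API.

```python
import torch

def validate_batch(x, y, num_classes):
    """Raise ValueError on NaN/Inf inputs or out-of-range class labels."""
    if not torch.isfinite(x).all():
        raise ValueError("non-finite value (NaN or Inf) in input")
    if y.min() < 0 or y.max() >= num_classes:
        raise ValueError(f"label outside [0, {num_classes - 1}]")

good_x = torch.randn(4, 3)
good_y = torch.tensor([0, 1, 2, 1])
validate_batch(good_x, good_y, num_classes=3)  # passes silently

# Corrupt one input value and confirm the check fires.
bad_x = good_x.clone()
bad_x[0, 0] = float("nan")
try:
    validate_batch(bad_x, good_y, num_classes=3)
    caught_nan = False
except ValueError:
    caught_nan = True
```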

When you encounter NaN, torch.autograd.detect_anomaly() is invaluable:

with torch.autograd.detect_anomaly():
    loss = model_loss(model, batch)
    loss.backward()

This checks the backward pass and raises an exception at the first backward operation that produced NaN, with a stack trace pointing to the forward operation that created the problematic tensor. It is slow (training runs 10 - 100x slower), but you only need it for diagnosis.
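A tiny reproduction shows what the tool reports. The example is contrived: the forward value is finite, but the backward pass of sqrt at 0 evaluates 0 / 0. The exact wording of the error message may differ between PyTorch versions.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
anomaly_message = None
try:
    with torch.autograd.detect_anomaly():
        # Forward: sqrt(0) * 0 = 0, which is finite.
        # Backward: sqrt's gradient is grad / (2 * sqrt(x)) = 0 / 0 = NaN.
        y = (torch.sqrt(x) * 0.0).sum()
        y.backward()
except RuntimeError as err:
    anomaly_message = str(err)

print(anomaly_message)
```

Without the context manager, the same code finishes silently and leaves NaN in x.grad.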

Loss Does Not Decrease

A flat or slowly decreasing loss when the single-batch test passes means the training loop is correct but generalization is failing - or there is a subtler bug.

Learning rate too low. The model updates in the right direction but imperceptibly slowly. The LR range test (Leslie Smith, 2015) finds a good learning rate systematically: increase LR exponentially from $10^{-7}$ to $10^{-1}$ over a few hundred steps and plot loss vs LR. The loss will first decrease, then increase sharply as the LR becomes unstable. A common heuristic for the peak LR of a 1-cycle schedule is roughly one-tenth of the LR where the loss starts to increase.
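The sweep itself is only a few lines. The following is a simplified sketch, assuming plain SGD and a single fixed batch of toy regression data; a real range test would draw successive mini-batches from the training loader. `lr_range_test` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

def lr_range_test(model, loss_fn, x, y, lr_min=1e-7, lr_max=1e-1, num_steps=100):
    """Sweep the LR exponentially from lr_min to lr_max, recording (lr, loss)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    growth = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    lrs, losses = [], []
    lr = lr_min
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= growth
    return lrs, losses

torch.manual_seed(0)
x, y = torch.randn(64, 10), torch.randn(64, 1)  # hypothetical toy data
lrs, losses = lr_range_test(nn.Linear(10, 1), nn.MSELoss(), x, y)
```

Plot losses against lrs on a log-scaled x-axis to read off where the loss starts to climb. The sweep modifies the model's weights, so run it on a copy or reload the initial checkpoint afterwards.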

import math

from torch.optim.lr_scheduler import LambdaLR

# LR warmup followed by cosine decay - standard for Transformers
# (warmup_steps and total_steps must be defined for your run)
def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

LR warmup is not optional for large models with Adam. At initialization, model parameters are random, gradients are large and noisy, and the Adam optimizer hasn’t accumulated sufficient gradient history to estimate variance well. Starting with a small LR and warming up over a few thousand steps avoids early divergence.
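To get a feel for the shape of a warmup-plus-cosine schedule, the multiplier can be evaluated at a few steps in plain Python. The values warmup_steps=1000 and total_steps=10000 are assumptions for illustration only.

```python
import math

warmup_steps = 1_000    # assumed value for illustration
total_steps = 10_000    # assumed value for illustration

def lr_multiplier(step):
    """Linear warmup from 0 to 1, then cosine decay from 1 to 0."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, 500, 1_000, 5_500, 10_000):
    print(step, round(lr_multiplier(step), 4))
```

The multiplier is exactly 0 at step 0, so the very first update does nothing; some implementations use (step + 1) / warmup_steps to avoid that.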

Wrong loss function. Regression with cross-entropy won’t converge. Classification with MSE trains but produces poorly calibrated probabilities. Check that the loss function matches the task. Then check that the loss is computed on the right output: torch.nn.functional.cross_entropy and binary_cross_entropy_with_logits expect raw logits, while binary_cross_entropy expects probabilities, and it is easy to pass one where the other is expected.
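The logits-versus-probabilities mix-up is easy to demonstrate. In this sketch the random logits and labels are made up; the point is only which calls agree and which silently compute something different.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)
labels = torch.tensor([0, 2, 1, 1, 0])

# cross_entropy expects raw logits and applies log_softmax internally.
correct = F.cross_entropy(logits, labels)
same_thing = F.nll_loss(F.log_softmax(logits, dim=-1), labels)

# Bug: softmax applied twice (once by hand, once inside cross_entropy).
double_softmax = F.cross_entropy(torch.softmax(logits, dim=-1), labels)

# Binary case: the "with_logits" variant takes logits, the plain one probabilities.
z = torch.randn(5)
t = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])
bce_logits = F.binary_cross_entropy_with_logits(z, t)
bce_probs = F.binary_cross_entropy(torch.sigmoid(z), t)

print(correct.item(), same_thing.item(), double_softmax.item())
```

The double-softmax version raises no error and still trains, just badly, which is what makes this bug hard to spot.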

Data not shuffled. If training data is ordered by class and each batch is a single class, the model sees adversarial updates (it learns to predict class A, then class B, then forgets class A). Always shuffle.

Dead weights. If all weights in a layer are initialized identically, every neuron receives the same gradient and updates identically - the layer computes the same function in every neuron and has the capacity of a single unit. This is the symmetry-breaking problem. Random initialization breaks the symmetry; Xavier/Glorot and He initialization additionally scale the random weights so that activations and gradients keep a stable variance across layers. If you initialized a layer manually with a constant (e.g., torch.zeros), you may have created such a layer.
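The symmetry problem can be reproduced in a few lines. This sketch fills a toy two-layer network with the constant 0.5 (an arbitrary choice) and checks that every hidden neuron receives the same gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# Deliberately bad initialization: every weight and bias gets the same constant.
with torch.no_grad():
    for p in model.parameters():
        p.fill_(0.5)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()

hidden_grad = model[0].weight.grad  # shape (3, 4): one row per hidden neuron
rows_identical = bool(
    torch.allclose(hidden_grad[0], hidden_grad[1])
    and torch.allclose(hidden_grad[1], hidden_grad[2])
)
print(f"all hidden neurons get the same gradient: {rows_identical}")
```

Because the rows start equal and receive equal updates, they stay equal after every optimizer step.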

Vanishing and Exploding Gradients: What They Look Like

Gradient problems have diagnostic signatures in the loss curve that you can recognize without looking at the actual gradient values.

Vanishing gradients produce a training loss that decreases initially but then plateaus well above acceptable performance, even with a small learning rate. Early layers stop updating; only the last few layers change. The model effectively has far fewer parameters than its architecture suggests. The loss curve often looks like it’s converging to a local minimum, but the minimum is bad.

Exploding gradients produce a loss that decreases for a while, then suddenly spikes to a very large value, often followed by NaN. If you’re logging gradient norms, you’ll see a sudden jump from normal values (0.1 - 5.0 for a healthy Transformer) to thousands, followed by NaN.

Detect vanishing gradients by checking gradient norms per layer:

for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm().item():.4f}")

If early layers (embedding, first attention block) have gradient norms of $10^{-6}$ while later layers have $10^{-1}$, you have vanishing gradients. Solutions: residual connections (the primary cure - gradients flow through the skip connection even when the main path is saturated), layer norm or batch norm (reduces saturation), or GELU/SiLU activations instead of sigmoid/tanh.
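The effect is easy to provoke on purpose. The sketch below stacks 20 sigmoid layers without skip connections (depth and width are arbitrary illustrative choices) and compares the gradient norm of the first and last linear layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 16  # illustrative sizes
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(width, 1))

loss = model(torch.randn(32, width)).pow(2).mean()
loss.backward()

linear_layers = [m for m in model if isinstance(m, nn.Linear)]
first_norm = linear_layers[0].weight.grad.norm().item()
last_norm = linear_layers[-1].weight.grad.norm().item()
print(f"first layer: {first_norm:.2e}, last layer: {last_norm:.2e}")
```

Each sigmoid contributes a derivative of at most 0.25, so the gradient reaching the first layer is many orders of magnitude smaller than the one at the output.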

Activation Saturation: The Silent Killer

Sigmoid and tanh saturate: for large positive inputs, sigmoid outputs 1.0; for large negative inputs, it outputs 0.0. In both saturated regions, the derivative is essentially zero, which means gradients don’t flow through saturated neurons. A neuron that is always saturated is dead - it never updates and contributes nothing to the model’s capacity.

Batch normalization and layer normalization address this at the root. By normalizing activations to have mean 0 and variance 1 before each layer, they keep inputs in the active (non-saturated) region of sigmoid/tanh. You can verify this empirically:

import torch.nn as nn

# Register a forward hook to monitor activation statistics
activation_stats = {}

def make_hook(name):
    def hook(module, input, output):
        activation_stats[name] = {
            "mean": output.mean().item(),
            "std": output.std().item(),
            "frac_dead": (output == 0).float().mean().item()  # for ReLU
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, (nn.ReLU, nn.GELU, nn.Sigmoid, nn.Tanh)):
        module.register_forward_hook(make_hook(name))

If frac_dead is consistently above 0.5 for ReLU layers, more than half your neurons are dead. Solutions: lower the learning rate (high LR drives neurons into always-negative territory), switch to LeakyReLU or GELU (which never fully die), use better initialization, or add batch normalization.

Overfitting vs Underfitting: Reading the Divergence

The train/validation loss divergence is the primary signal:

Training loss decreasing, validation loss increasing: overfitting. The model is memorizing training data.
Both losses flat well above target: underfitting. The model lacks capacity or training is insufficient.
Validation loss flat while training loss decreases from step 1: distribution mismatch. The validation set has different characteristics than training - often caused by preprocessing applied to train but not val, or data leakage in the wrong direction.
Both losses flat from step 1: zero gradient. Check that loss.backward() is being called, that optimizer.zero_grad() runs before backward() rather than between backward() and optimizer.step(), and that parameters require gradients.

For overfitting remedies, the order of effectiveness in practice (not in textbooks):

More data: the most reliable regularizer. If you can get more, get more.
Data augmentation: synthetic expansion. For images: random crops, flips, color jitter. For text: back-translation, synonym replacement.
Weight decay: use AdamW rather than Adam with an L2 penalty (AdamW decouples the decay from the adaptive gradient scaling), with weight_decay=1e-4 as a starting point.
Dropout: effective for Transformers (post-attention, post-FFN) and fully-connected layers. Less effective for convolutions.
Early stopping: monitor validation loss and stop when it stops improving for $k$ epochs.
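Early stopping in particular is simple to implement by hand. The class below is a minimal sketch, with `patience` playing the role of $k$; the `min_delta` threshold and the validation losses are made up to exercise the logic.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, purely to exercise the logic.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
```

In a real loop you would also save a checkpoint whenever `best` improves, so that stopping restores the best model rather than the last one.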

W&B Gradient Histograms

W&B can log gradient histograms automatically:

wandb.init(...)
wandb.watch(model, log="gradients", log_freq=100)

This records a histogram of gradient values for every parameter every 100 steps. Healthy gradients: roughly Gaussian, mean near zero, standard deviation in the range 0.01 - 1.0. Pathological gradients: all zeros (dead layer), all very small ($<10^{-5}$, vanishing), bimodal (two groups of weights receiving very different updates, worth inspecting layer by layer), or very large ($>100$, exploding).

The histogram view is more informative than scalar gradient norms because it distinguishes a uniformly small gradient (a quietly dying layer) from a gradient that is large on average but has many zeros (sparse gradient, common in embeddings and attention).

The Reproducibility Problem: The Uncomfortable Truth

Deep learning results are less reproducible than they appear. The same training script, same seed, and same data can produce different results across:

Different CUDA versions (some operations are nondeterministic by design for performance)
Different PyTorch versions (optimizer implementations change, default initializations change)
Different hardware (A100 vs V100 - tensor core precision differs)
Different batch sizes (gradient noise scale changes; result: different minima)
Multiple GPUs (nondeterministic AllReduce unless explicitly configured otherwise)
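Some of this variation can be reduced, though not eliminated, by pinning seeds and requesting deterministic kernels. The helper below is a sketch; `seed_everything` is a hypothetical name, and it does not seed NumPy or data-loader workers, which a real project would also need to handle.

```python
import random
import torch

def seed_everything(seed):
    """Seed Python and PyTorch RNGs and request deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and, if present, CUDA generators
    # Disable cuDNN autotuning, which can pick different kernels between runs.
    torch.backends.cudnn.benchmark = False
    # warn_only=True warns instead of raising on ops with no deterministic version.
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(123)
a = torch.rand(3)
seed_everything(123)
b = torch.rand(3)
print(torch.equal(a, b))
```

This makes repeated runs on the same machine and software stack comparable; it does nothing about differences across hardware or library versions.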

The practical implication: a single reported training run is not a reproducible result. The appropriate practice is to run training 5 times with different random seeds and report mean ± standard deviation. If the standard deviation is large relative to the improvement you’re claiming, the result is noise, not signal.

This is uncomfortable because it multiplies training compute by 5. But claiming a 0.3% accuracy improvement from a single run is claiming a result that may not replicate - and the ML literature is full of such claims.
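The comparison itself takes a few lines of standard-library Python. The accuracies below are made-up numbers for illustration, not results from any experiment.

```python
import statistics

# Made-up accuracies (%) from 5 seeds each, for illustration only.
baseline = [91.2, 90.8, 91.5, 90.9, 91.3]
candidate = [91.6, 91.0, 91.9, 91.1, 91.4]

def summarize(runs):
    """Return (mean, sample standard deviation) over seeds."""
    return statistics.mean(runs), statistics.stdev(runs)

base_mean, base_std = summarize(baseline)
cand_mean, cand_std = summarize(candidate)
improvement = cand_mean - base_mean

print(f"baseline:    {base_mean:.2f} ± {base_std:.2f}")
print(f"candidate:   {cand_mean:.2f} ± {cand_std:.2f}")
print(f"improvement: {improvement:.2f}")
```

In this made-up case the improvement is smaller than the seed-to-seed standard deviation of the baseline, so a single lucky run could easily have produced it.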

For hyperparameter tuning: use W&B Sweeps or Optuna with a small seed set (3 seeds per configuration) and budget for multiple runs. Don’t select hyperparameters based on a single seed’s result.

Loss Curve Pathologies: A Reference

| Pattern | Likely Cause | First Check |
| --- | --- | --- |
| NaN at step 1 | NaN in data or weights | `assert not torch.isnan(batch["x"]).any()` |
| NaN after K steps | LR too high or exploding gradients | Log grad norms; reduce LR by 10x |
| Flat from step 1 | Zero gradient | Is `backward()` called? Do params require grad? |
| Spikes in loss | Bad batches or LR near instability | Check data quality; add gradient clipping |
| Decreasing then plateau | LR too small or capacity limit | LR range test; try larger model |
| Train ↓, val flat from start | Distribution mismatch | Check preprocessing parity between train/val |
| Train ↓, val ↑ after epoch K | Overfitting after epoch K | More data, augmentation, weight decay |
| Oscillating wildly | LR too high without diverging | Reduce LR by 3x; add LR warmup |

Summary

| Failure Mode | Symptom | Fix |
| --- | --- | --- |
| Exploding gradients | Grad norm spikes, then NaN | Gradient clipping, reduce LR |
| Vanishing gradients | Early layer norms near 0 | Residual connections, LayerNorm |
| Dead ReLUs | High fraction of zero activations | LeakyReLU/GELU, lower LR |
| Activation saturation | Sigmoid/tanh outputs near 0 or 1 | BatchNorm/LayerNorm |
| Numerical instability | NaN from log(0) or softmax overflow | Use fused loss functions |
| LR too high | Oscillating or diverging loss | LR range test, reduce by 10x |
| LR too low | Slow or no improvement | LR range test, cosine annealing |
| Training-serving skew | Validation accuracy doesn’t match production | Shared preprocessing code |
| Non-reproducibility | Results vary across runs | Run 5 seeds, report mean ± std |

The overfit-one-batch test catches most level 1 and 2 failures. Gradient norm logging catches level 3 failures early, before they manifest as NaN or wasted compute. Everything else is pattern matching against the table above.
