Neural Networks as ODEs - Depth as Continuous Time // Megha Bose

Helpful context:

Look at this update rule for a deep residual network (ResNet):

$$\mathbf{x_{k+1}} = \mathbf{x_k} + f(\mathbf{x_k}, \theta_k)$$

The hidden state $\mathbf{x_k} \in \mathbb{R}^d$ at layer $k$ is updated by adding a residual block $f(\mathbf{x_k}, \theta_k)$ parameterized by $\theta_k$. This is the defining feature of ResNets - the architecture that first made very deep networks trainable.

Now look at Euler’s method for solving the ODE $\frac{d\mathbf{x}}{dt} = F(\mathbf{x}, t)$ with step size $h$:

$$\mathbf{x_{n+1}} = \mathbf{x_n} + h \cdot F(\mathbf{x_n}, t_n)$$

With step size $h = 1$ and $F(\mathbf{x}, t_k)$ identified with $f(\mathbf{x}, \theta_k)$, these two equations are identical.

A ResNet with $L$ layers computes exactly $L$ steps of Euler’s method on the vector field $F(\mathbf{x}, t)$, where the vector field is parameterized by the network weights. Layer depth corresponds to time. The number of layers corresponds to the number of Euler steps.

Once you see this, a natural question follows: what happens if you make the residual blocks thinner and take the number of layers to infinity? The discrete Euler steps become a continuous ODE. The discrete residual network becomes a Neural ODE.

Section 1: The Euler-ResNet Correspondence

Let us make the correspondence precise and examine what it implies.

Euler’s method for $\frac{d\mathbf{x}}{dt} = F(\mathbf{x}, t)$ with step size $h = 1$:

$$\mathbf{x}(t + 1) \approx \mathbf{x}(t) + F(\mathbf{x}(t), t)$$

Unrolling from $t = 0$ to $t = L$ through integer steps:

$$\mathbf{x}(k+1) = \mathbf{x}(k) + F(\mathbf{x}(k), k), \quad k = 0, 1, \ldots, L-1$$

The ResNet update at each layer $k$:

$$\mathbf{x_{k+1}} = \mathbf{x_k} + f(\mathbf{x_k}, \theta_k)$$

These are the same equation. Identify $F(\mathbf{x}, k) = f(\mathbf{x}, \theta_k)$: the vector field at time $k$ is the $k$-th residual block.

Discomfort check. Why does Euler’s method give exactly the ResNet update rule?

The key is the residual structure. A plain feedforward layer computes $\mathbf{x_{k+1}} = g(\mathbf{x_k}, \theta_k)$ - the output is an entirely new representation. A residual block computes $\mathbf{x_{k+1}} = \mathbf{x_k} + f(\mathbf{x_k}, \theta_k)$ - the output is the old representation plus a small correction. This correction $f(\mathbf{x_k}, \theta_k)$ plays the role of $h \cdot F(\mathbf{x_k})$ in Euler’s method: a step in the direction the vector field is pointing. When $f$ is small (thin, lightly-parameterized blocks), the correction is small - exactly the regime where the Euler approximation is accurate and the continuous ODE picture is valid.

The depth-time correspondence. In Euler’s method, $L$ steps with step size $h$ integrate from $t = 0$ to $t = Lh$. A ResNet with $L$ layers integrates from layer $0$ to layer $L$, with time horizon $T = L$. Making the network deeper is equivalent to integrating for longer. Adding more layers in a specific range corresponds to using smaller step sizes there - higher resolution integration.

What a ResNet is learning. Under this correspondence, a trained ResNet is learning a vector field $F(\mathbf{x}, t)$ such that the Euler integration of that field transforms the input representation $\mathbf{x_0}$ into the output representation $\mathbf{x_L}$ in a way that makes the classification loss small. The network is not learning arbitrary layer-to-layer mappings; it is learning a vector field that drives the representation toward the right region of space.

This explains an empirical observation: deep ResNets with many layers often have many layers where $f$ is nearly zero - “identity” layers that do almost nothing. In the ODE view, these are time intervals where the vector field is small and the representation is already near the right region.

Section 1b: What a ResNet Is Really Learning

Let us push the Euler-ResNet correspondence further. In Euler’s method, you choose a step size $h$ and a fixed number of steps $N$. The total integration time is $T = Nh$. You can get the same $T$ with many small steps or fewer large steps.

In a ResNet, the “step size” is implicitly determined by how strongly the residual block changes the representation. A very small, lightly-regularized residual block corresponds to small $h$. A large, expressive residual block corresponds to large $h$.

Standard ResNets use $h = 1$ (one step per layer). But nothing says $h$ must be $1$. You could multiply the residual by a small constant: $\mathbf{x_{k+1}} = \mathbf{x_k} + \epsilon f(\mathbf{x_k}, \theta_k)$ for small $\epsilon$. This corresponds to Euler’s method with step size $h = \epsilon$. With $L$ such layers, you integrate to time $T = \epsilon L$.

The limit $\epsilon \to 0$, $L \to \infty$ with $\epsilon L = T$ fixed. This is exactly the limit that converts a discrete Euler scheme into a continuous ODE. The discrete residual network becomes a continuous-time dynamical system. The weight matrices $\theta_k$ at each layer become a single weight function $\theta(t)$ - or, in the simplest Neural ODE, a single fixed weight $\theta$ with $t$ passed as an extra input.

Weight sharing versus independent weights. A key question: as we take more and more layers, do we use the same weights (weight sharing) or independent weights at each layer? With independent weights, each layer has its own $\theta_k$, and the limit is an infinite sequence of independent functions. With shared weights, the limit is a single function $f(\mathbf{x}, t, \theta)$ evaluated at different $t$ values - a Neural ODE.

The weight-sharing limit is the Neural ODE. It has far fewer parameters than the independent-weight limit. This parameter efficiency is both a strength (less overfitting) and a weakness (less capacity to represent arbitrary transformations).

Section 2: Neural ODEs

The ODE view of ResNets motivates a natural continuous limit.

The Neural ODE (Chen et al., 2018):

$$\frac{d\mathbf{x}(t)}{dt} = f(\mathbf{x}(t), t, \theta)$$

where $f(\cdot, \cdot, \theta)$ is a neural network with parameters $\theta$ (shared across all $t$). The hidden state $\mathbf{x}(t)$ starts at $\mathbf{x}(0)$ (the input, after an initial embedding) and evolves continuously.

The output is $\mathbf{x}(T)$ for some terminal time $T$, computed by solving the ODE numerically from $t = 0$ to $t = T$:

$$\mathbf{x}(T) = \mathbf{x}(0) + \int_0^T f(\mathbf{x}(t), t, \theta) dt$$

The forward pass is an ODE solve. You provide the vector field $f$ and initial condition $\mathbf{x}(0)$, and a numerical solver computes $\mathbf{x}(T)$.

Discomfort check. Is “continuous depth” just an infinitely deep network?

Not quite - and the difference matters. An infinitely deep network would have infinitely many independent parameter sets $\theta_0, \theta_1, \theta_2, \ldots$, one per layer. A Neural ODE has a single shared parameter $\theta$ that defines the vector field $f(\mathbf{x}, t, \theta)$ at all times $t$. The same network $f$ is evaluated at different $(t, \mathbf{x}(t))$ values as the state evolves. This is weight sharing across depth - the same function is applied repeatedly, with $t$ as an additional input encoding “where in the depth we are.” This is closer to a recurrent architecture (shared weights across time steps) than to a feedforward network (independent weights per layer).

Two perspectives on the same architecture. A Neural ODE can be seen as:

A continuous-depth residual network, where the “depth” is the integration time $T$.
A continuous-time recurrent network, where the “time” is the integration variable $t$ and the hidden state $\mathbf{x}(t)$ is the recurrent state.

Both perspectives are useful. The residual network perspective makes the connection to ResNets clear. The recurrent network perspective makes the connection to sequence models and time series clear.

Section 3: Why Neural ODEs Are Interesting

Reason 1: Adaptive computation. In a standard ResNet, the number of layers $L$ is fixed at design time. A Neural ODE’s computation can be adaptive: modern ODE solvers with error control automatically take more steps where the vector field changes rapidly, and fewer steps where it is smooth. Harder inputs can automatically receive more computation (more solver steps) than easier inputs. This is adaptive inference - something standard neural networks cannot do without explicit mechanisms.

Concretely: an adaptive solver like DOPRI5 estimates the local integration error at each step and adjusts $h$ to keep the error below a specified tolerance. For a complex input where $f(\mathbf{x}(t), t, \theta)$ varies rapidly in $t$, the solver takes many small steps. For a simple input where $f$ is nearly constant, it takes few large steps. The number of function evaluations - the computational cost - is therefore input-dependent.

Reason 2: Memory efficiency via the adjoint method. Standard backpropagation through a ResNet with $L$ layers requires storing all $L$ intermediate activation states - $O(L \cdot d)$ memory. For very deep networks, this is a bottleneck.

For a Neural ODE, the adjoint method (Section 4) computes gradients using only $O(d)$ memory, regardless of the number of solver steps. This makes Neural ODEs particularly attractive for memory-constrained settings or when very high accuracy (many solver steps) is needed.

Reason 3: Mathematical elegance and connections. Neural ODEs connect deep learning to the rich theory of dynamical systems, control theory, and variational calculus. This enables analysis and design tools that do not exist for discrete networks: stability theory, Lyapunov functions, Hamiltonian structure, controllability.

Section 4: The Adjoint Method

Training a Neural ODE requires $\frac{\partial \mathcal{L}}{\partial \theta}$, the gradient of the loss with respect to parameters. The loss $\mathcal{L}$ depends on $\mathbf{x}(T)$, which depends on $\theta$ through the ODE.

Why the standard approach is expensive. In a ResNet with $L$ layers, backpropagation stores the activations at all $L$ layers during the forward pass, then uses them during the backward pass. Memory is $O(L \cdot d)$. For a Neural ODE solved with $N$ steps, the naive analog - backpropagating through all $N$ solver steps - has the same $O(N \cdot d)$ cost. If the solver uses 500 steps and the state dimension is $d = 512$, this requires storing 256,000 floating-point values per example per forward pass.

Gradient checkpointing (used in large language models to manage memory) solves the same problem for discrete networks: instead of storing all activations, recompute them during the backward pass at the cost of extra forward passes. The adjoint method is exactly gradient checkpointing taken to the continuous-time limit.

The naive approach. Backpropagate through all solver steps. If the solver takes $N$ steps, store all $N$ intermediate states $\mathbf{x}(t_0), \mathbf{x}(t_1), \ldots, \mathbf{x}(t_N) = \mathbf{x}(T)$, then compute gradients backward through each step. Memory: $O(N \cdot d)$. For $N = 500$ steps and $d = 256$, this is 128,000 stored values - manageable but costly.

The adjoint method. Compute the same gradients with $O(d)$ memory.

Define the adjoint state $\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{x}(t)}$. This is the gradient of the loss with respect to the hidden state at time $t$. At the final time, we know $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{x}(T)}$ from the loss function.

The adjoint ODE. By differentiating the ODE constraint and applying the chain rule (the details require calculus of variations), the adjoint satisfies:

$$\frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}(t), t, \theta)$$

Starting from $\mathbf{a}(T)$ and integrating backward from $T$ to $0$ gives $\mathbf{a}(0)$.

Discomfort check. Why does the adjoint ODE run backward in time?

Because gradient information flows backward. In standard backpropagation through a discrete network, you compute the gradient at the output layer, then use the chain rule to propagate it backward through each preceding layer. The adjoint ODE is the continuous-time analog: $\mathbf{a}(t) = \frac{\partial \mathcal{L}}{\partial \mathbf{x}(t)}$ at a given time $t$ depends on $\mathbf{a}$ at a later time $t + dt$ (closer to the output), so the ODE must be integrated backward in time - from $T$ back to $0$. The minus sign and the backward integration direction are two sides of the same coin. The term $-\mathbf{a}^T (\partial f / \partial \mathbf{x})$ is the continuous-time chain rule: how does the gradient at time $t$ come from the gradient at time $t + dt$, mediated by the Jacobian of the vector field?

Gradient with respect to parameters. Simultaneously with the adjoint ODE, accumulate:

$$\frac{\partial \mathcal{L}}{\partial \theta} = -\int_T^0 \mathbf{a}(t)^T \frac{\partial f}{\partial \theta}(\mathbf{x}(t), t, \theta) dt$$

This integral runs backward from $T$ to $0$ and can be accumulated step by step during the backward solve.

The backward pass in practice. Three ODEs solved simultaneously in a single backward integration:

Re-compute $\mathbf{x}(t)$ by integrating the forward ODE backward (to recover the trajectory without storing it).
Integrate the adjoint ODE for $\mathbf{a}(t)$.
Accumulate the $\frac{\partial \mathcal{L}}{\partial \theta}$ integral.

All three share the same time grid. Memory usage at any moment: three state vectors of size $d$ each - $O(d)$ total.

The cost: running the forward ODE twice (forward pass and backward re-integration) plus the adjoint. The memory saving (from $O(N \cdot d)$ to $O(d)$) comes at the price of this extra computation.

Connection to optimal control. The adjoint method is identical to Pontryagin’s minimum principle in optimal control theory. The adjoint state $\mathbf{a}(t)$ is the “costate” or “costrate” variable. The gradient of the loss with respect to parameters is the gradient of the Hamiltonian with respect to the control (parameters). This connection - that neural network training via the adjoint method is a form of optimal control - is one of the beautiful mathematical structures revealed by the Neural ODE framework.

Section 4b: Deriving the Adjoint ODE

Where does $\frac{d\mathbf{a}}{dt} = -\mathbf{a}^T \frac{\partial f}{\partial \mathbf{x}}$ come from? The derivation uses the chain rule applied to the constraint that $\mathbf{x}(t)$ satisfies the ODE.

Consider the loss $\mathcal{L}[\mathbf{x}(T)]$. Perturb the solution by a small function $\delta\mathbf{x}(t)$ satisfying $\delta\mathbf{x}(0) = 0$ (we are not changing the initial condition). The perturbation propagates through the ODE:

$$\frac{d(\delta\mathbf{x})}{dt} = \frac{\partial f}{\partial \mathbf{x}} \delta\mathbf{x}$$

The resulting change in the loss is:

$$\delta\mathcal{L} = \mathbf{a}(T)^T \delta\mathbf{x}(T)$$

where $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{x}(T)}$. Using the Duhamel formula (or Gronwall’s inequality), $\delta\mathbf{x}(T) = \int_0^T \Phi(T, t) \delta\mathbf{x}(t) dt$ where $\Phi$ is the state transition matrix satisfying $\frac{d\Phi}{dt} = \frac{\partial f}{\partial \mathbf{x}} \Phi$. After integrating by parts and requiring that $\delta\mathcal{L}$ matches the gradient formula for all perturbations $\delta\mathbf{x}$, one obtains the adjoint equation. The adjoint state $\mathbf{a}(t)$ satisfies $\frac{d\mathbf{a}}{dt} = -\mathbf{a}^T \frac{\partial f}{\partial \mathbf{x}}$ with terminal condition $\mathbf{a}(T) = \frac{\partial \mathcal{L}}{\partial \mathbf{x}(T)}$.

This derivation is the continuous-time analog of the discrete backpropagation derivation. Where discrete backprop applies the chain rule through a sequence of matrix multiplications, the adjoint ODE applies it through a matrix-valued differential equation.

Section 5: Continuous Normalizing Flows

Neural ODEs have a particularly elegant application in generative modeling.

The setup. A normalizing flow transforms a simple base distribution $p_0(\mathbf{z})$ (a Gaussian, say) into a complex target distribution $p(\mathbf{x})$ via an invertible mapping $\mathbf{x} = g(\mathbf{z})$. The change-of-variables formula gives the log-likelihood:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}) - \log \left|\det \frac{\partial g}{\partial \mathbf{z}}\right|$$

The Jacobian determinant costs $O(d^3)$ to compute directly. Discrete normalizing flows avoid this by using specially structured $g$ (e.g., coupling layers with triangular Jacobians), but this restricts the expressiveness of the model.

The continuous approach. If $\mathbf{x}(t)$ evolves as a Neural ODE $\frac{d\mathbf{x}}{dt} = f(\mathbf{x}(t), t, \theta)$, the probability density $p(\mathbf{x}, t)$ evolves according to the continuity equation:

$$\frac{\partial p}{\partial t} + \nabla \cdot (f \cdot p) = 0$$

This is just conservation of probability: the total probability in any region changes only because probability flows in or out through the boundary. The same equation governs fluid density in fluid mechanics.

Taking $\log$ and working along the trajectory:

$$\frac{\partial \log p(\mathbf{x}(t), t)}{\partial t} = -\text{tr}\left(\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}(t), t, \theta)\right)$$

The log-density changes at a rate equal to the negative trace of the Jacobian - not the log-determinant, just the trace. The trace costs $O(d)$ per computation instead of $O(d^3)$ for the determinant.

Stochastic trace estimation. Even computing the full Jacobian to extract its trace is $O(d^2)$. Hutchinson’s estimator reduces this to $O(d)$: for $\epsilon$ with $\mathbb{E}[\epsilon\epsilon^T] = I$,

$$\text{tr}(J) = \mathbb{E}_\epsilon[\epsilon^T J \epsilon]$$

and $J\epsilon$ (a Jacobian-vector product) can be computed in $O(d)$ time via automatic differentiation (a single backward pass). Sample $\epsilon$ from a standard normal or Rademacher distribution, compute $\epsilon^T J \epsilon$, and average over a few samples.

FFJORD. Putting this together: to compute the log-likelihood of a sample $\mathbf{x}$, integrate the Neural ODE backward from $t = T$ to $t = 0$ to find $\mathbf{z} = \mathbf{x}(0)$, simultaneously accumulating $\int_T^0 \text{tr}(\partial f / \partial \mathbf{x}) dt$ via Hutchinson’s estimator. The log-likelihood is:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}) + \int_T^0 \text{tr}\left(\frac{\partial f}{\partial \mathbf{x}}\right) dt$$

This is FFJORD (Free-Form Jacobian of Reversible Dynamics, Grathwohl et al., 2018). The key advantage over discrete normalizing flows: no architectural constraints on $f$. The vector field can be any neural network.

Section 6: Hamiltonian Neural Networks

Neural ODEs are flexible but unconstrained. For physical systems with known conservation laws, we can embed that structure directly.

Hamiltonian mechanics. Many physical systems conserve energy $H(\mathbf{q}, \mathbf{p})$ (the Hamiltonian), where $\mathbf{q}$ is the position vector and $\mathbf{p}$ is the momentum vector. Hamilton’s equations describe the dynamics:

$$\dot{q}_i = \frac{\partial H}{\partial p_i}, \quad \dot{p}_i = -\frac{\partial H}{\partial q_i}$$

Key properties of Hamiltonian systems: $H$ is conserved ($\frac{dH}{dt} = 0$), phase space volume is preserved (Liouville’s theorem), and the vector field is divergence-free ($\nabla \cdot F = 0$).

A generic Neural ODE $\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t, \theta)$ has none of these properties by default. It can learn to conserve something approximately, but not exactly.

Hamiltonian Neural Networks (Greydanus et al., 2019) learn $H_\theta(\mathbf{q}, \mathbf{p})$ from trajectory data and define the dynamics via Hamilton’s equations. By construction, this system conserves $H_\theta$ exactly. The learned system is Hamiltonian. Not approximately Hamiltonian - exactly Hamiltonian.

The result: Hamiltonian Neural Networks generalize dramatically better to long-time prediction. A standard Neural ODE, trained on short trajectories, accumulates energy errors over time - it is not constrained to stay on any energy surface. A Hamiltonian Neural Network cannot accumulate energy errors because conservation is built into the architecture, not learned from data.

The training works like this: given position-velocity trajectory data $\{(\mathbf{q}(t), \mathbf{p}(t))\}$, compute the observed $\dot{\mathbf{q}}$ and $\dot{\mathbf{p}}$ from the data, and train $H_\theta$ so that $\partial H_\theta / \partial \mathbf{p} \approx \dot{\mathbf{q}}$ and $-\partial H_\theta / \partial \mathbf{q} \approx \dot{\mathbf{p}}$. The loss is on these partial derivatives, not on $H_\theta$ itself (which is unobservable from dynamics alone).

Lagrangian Neural Networks (Cranmer et al., 2020) take a complementary approach: learn $\mathcal{L}(\mathbf{q}, \dot{\mathbf{q}})$ and derive dynamics from the Euler-Lagrange equations $\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{q}_i} - \frac{\partial \mathcal{L}}{\partial q_i} = 0$. Both approaches enforce the same physical structure through different mathematical routes.

Section 6b: Why the Trace Instead of the Determinant?

The continuity equation gives $\partial \log p / \partial t = -\text{tr}(\partial f / \partial \mathbf{x})$ - the trace, not the log-determinant. Why?

For an infinitesimal time step, the flow map $g_{\Delta t}: \mathbf{x}(t) \mapsto \mathbf{x}(t + \Delta t) = \mathbf{x}(t) + \Delta t \cdot f(\mathbf{x}(t), t, \theta)$ has Jacobian:

$$\frac{\partial g_{\Delta t}}{\partial \mathbf{x}} = I + \Delta t \frac{\partial f}{\partial \mathbf{x}} + O(\Delta t^2)$$

The determinant of this Jacobian:

$$\det\left(I + \Delta t \frac{\partial f}{\partial \mathbf{x}}\right) = 1 + \Delta t \cdot \text{tr}\left(\frac{\partial f}{\partial \mathbf{x}}\right) + O(\Delta t^2)$$

(using the identity $\det(I + \epsilon A) = 1 + \epsilon \text{tr}(A) + O(\epsilon^2)$, which follows from the characteristic polynomial expansion). Taking the log:

$$\log \det\left(I + \Delta t \frac{\partial f}{\partial \mathbf{x}}\right) \approx \Delta t \cdot \text{tr}\left(\frac{\partial f}{\partial \mathbf{x}}\right)$$

Dividing by $\Delta t$ and taking $\Delta t \to 0$ gives the instantaneous rate $\text{tr}(\partial f / \partial \mathbf{x})$. The trace is the infinitesimal version of the log-determinant. For finite flows (finite integration time), the log-determinant accumulates as $\int_0^T \text{tr}(\partial f / \partial \mathbf{x}) dt$, which is what FFJORD computes via ODE integration.

This explains why continuous normalizing flows can use unrestricted architectures: the trace, unlike the determinant, is cheap to compute via automatic differentiation and stochastic estimation.

Section 7: Stability and Stiffness

Not all Neural ODEs are well-behaved in practice. The dynamical properties of the learned vector field $f$ determine whether training is easy or nearly impossible.

Why stability matters. For a Neural ODE to be trainable and useful, the forward ODE must be stable: solutions should not explode during integration. If the Jacobian $\partial f / \partial \mathbf{x}$ at typical trajectory points has eigenvalues with large positive real parts, perturbations to the state grow exponentially. Concretely: if $\lambda_{\max}$ is the largest real part of the Jacobian’s eigenvalues, perturbations grow like $e^{\lambda_{\max} t}$. For stable representations, we need $\lambda_{\max} \leq 0$, or at least small and positive.

Antisymmetric RNN (Chang et al., 2019) enforces stability by design: use weight matrix $W - W^T$ (antisymmetric, so all eigenvalues are purely imaginary) for the hidden state update. This guarantees neutral stability - no explosion, no decay. A related idea for Neural ODEs: if $\partial f / \partial \mathbf{x}$ is antisymmetric, the ODE is Hamiltonian and trajectories lie on energy surfaces.

Stiffness. An ODE is stiff when the Jacobian $\partial f / \partial \mathbf{x}$ has eigenvalues with very different magnitudes - say, one eigenvalue with real part $-1$ and another with real part $-1000$. Explicit solvers like Euler and RK4 require step sizes $h < 2/1000$ for stability, forcing tiny steps even when the interesting dynamics evolve on timescale $1$.

For Neural ODEs, stiffness translates to a large number of function evaluations per forward pass. Instead of 50-100 evaluations (typical for a well-conditioned ODE), a stiff Neural ODE might require 1000-10000 evaluations. Since each evaluation runs the neural network $f$, this is expensive.

The stiffness-expressivity trade-off. More expressive vector fields (larger weight matrices, more layers in $f$) tend to produce stiffer dynamics. The network has more capacity to create rapid variations in $\mathbf{x}(t)$, which the solver must resolve with small steps. Designing $f$ to produce well-conditioned, non-stiff dynamics while retaining expressivity is an open research problem.

Implicit solvers for stiff Neural ODEs. Implicit methods (backward Euler, implicit Runge-Kutta) can handle stiff dynamics with much larger step sizes. Each implicit step requires solving a nonlinear system $$\mathbf{y}{n+1} = \mathbf{y}n + hf(t{n+1}, \mathbf{y}{n+1})$$ for $\mathbf{y}_{n+1}$, typically via Newton’s method. The cost per step is higher, but far fewer steps are needed. Libraries like diffrax provide implicit solvers (Kvaerno3, implicit Euler) for stiff Neural ODEs.

Section 7b: Numerical Solver Comparison

Understanding the differences between solvers is essential for using Neural ODEs effectively.

Fixed-step explicit methods.

Euler’s method: $\mathbf{y_{n+1}} = \mathbf{y_n} + h f(t_n, \mathbf{y_n})$. One function evaluation per step. Global error $O(h)$. Stability region: $h\lambda$ must lie in the unit disk centered at $-1$ in the complex plane (for the test equation $y' = \lambda y$). For purely imaginary eigenvalues (oscillatory systems), Euler is unstable for any finite $h$ - the solution spirals outward instead of on a closed orbit.

Runge-Kutta 4 (RK4): four function evaluations per step, global error $O(h^4)$. Stability region is much larger than Euler’s but still bounded. For neural network vector fields with large Jacobian eigenvalues, even RK4 may require very small $h$.

Explicit midpoint (RK2): $k_1 = f(t_n, \mathbf{y_n})$, $k_2 = f(t_n + h/2, \mathbf{y_n} + (h/2)k_1)$, $\mathbf{y_{n+1}} = \mathbf{y_n} + hk_2$. Two evaluations, $O(h^2)$ error. Larger stability region than Euler, cheaper than RK4.

Adaptive explicit methods.

DOPRI5 (Dormand-Prince): the standard adaptive solver for non-stiff Neural ODEs. Uses a 5th-order Runge-Kutta method with an embedded 4th-order method for error estimation. Takes 6 function evaluations per accepted step. The error estimate drives the step size adaptation: if the estimated error exceeds the tolerance, reject the step and halve $h$; if it is well below tolerance, accept the step and increase $h$.

DOPRI5 is implemented as the default in both torchdiffeq and diffrax. The adaptive strategy means the number of function evaluations (NFE) varies with the input. Measuring NFE during inference gives a proxy for “input difficulty” in the ODE sense.

Implicit methods for stiff problems.

Backward Euler: $\mathbf{y_{n+1}} = \mathbf{y_n} + h f(t_{n+1}, \mathbf{y_{n+1}})$. Stable for all $h$ (A-stable). Requires solving a nonlinear equation at each step - typically via Newton’s method, which itself requires the Jacobian $\partial f / \partial \mathbf{x}$. In the neural network context, computing the Jacobian is expensive: $O(d^2)$ memory and $O(d^2)$ computation via automatic differentiation.

Kvaerno3: a third-order implicit Runge-Kutta method with good stability properties. Available in diffrax. Used for stiff Neural ODEs where explicit methods would require thousands of steps for stability.

The stiffness detector. Some libraries automatically detect stiffness during the forward pass (by monitoring whether the solver is repeatedly rejecting steps and shrinking $h$) and switch from an explicit to an implicit solver. This automatic switching is not yet standard in neural network ODE libraries but is available in scientific computing packages.

Section 8: Practical Implementation

The ODE solver interface. In torchdiffeq (PyTorch):

from torchdiffeq import odeint_adjoint as odeint

def f(t, x):          # x: (batch, d), t: scalar
    return net(t, x)  # any nn.Module

x_T = odeint(f, x_0, t_span, method='dopri5')

The solver handles adaptive step sizing, error control, and gradient computation via the adjoint. The user provides the vector field $f$ (as a neural network) and the time span to integrate over.

In diffrax (JAX):

from diffrax import diffeqsolve, ODETerm, Dopri5

term = ODETerm(lambda t, x, args: net(t, x))
sol  = diffeqsolve(term, Dopri5(), t0=0, t1=T, dt0=0.1, y0=x_0)
x_T  = sol.ys[-1]

Both provide JIT compilation and automatic differentiation through the solver.

Solver selection:

Non-stiff, smooth dynamics: dopri5 (adaptive Runge-Kutta 4/5). Default for most Neural ODEs.
Stiff dynamics: kvaerno3 or radau (implicit Runge-Kutta). Necessary when the Jacobian has large eigenvalues.
Fixed-step for speed: euler or rk4 when the integration interval is short and accuracy requirements are modest.
FFJORD (normalizing flows): dopri5 with Hutchinson trace estimation.

Memory vs. speed. The adjoint method reduces memory from $O(N \cdot d)$ to $O(d)$, where $N$ is the number of solver steps. For a solver taking 500 steps with $d = 512$ dimensional states, this is a 500x reduction in peak memory. The cost is extra computation: the forward ODE is re-integrated during the backward pass.

For batch training with many examples, the memory saving is the dominant consideration - storing full trajectories for thousands of examples simultaneously is often infeasible. For inference or fine-tuning with a few examples, standard backpropagation through the solver may be faster.

Section 8b: Latent Neural ODEs for Irregular Time Series

The most compelling practical application of Neural ODEs is irregular time series modeling. Many real-world time series are observed at non-uniform intervals: medical measurements taken at different times for different patients, financial data with trading halts, sensor data with intermittent failures.

The problem with standard recurrent networks. An RNN processes inputs at fixed time steps. To handle irregular intervals, you either bin observations into fixed time windows (losing precision), interpolate to fill in missing values (introducing error), or use a special-purpose architecture (complex engineering). None of these is natural.

The Latent Neural ODE approach. Encode the observed sequence into a latent initial state $\mathbf{z}(t_0)$ using a recognition network (e.g., another RNN run backward over the observations). Then evolve $\mathbf{z}(t)$ via a Neural ODE:

$$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), \theta)$$

Decode $\mathbf{z}(t_i)$ at any desired time $t_i$ to produce a prediction. The ODE solver can evaluate $\mathbf{z}$ at arbitrary times without any special handling of gaps - it simply integrates from the last observed time to the next desired time.

This makes the Latent Neural ODE a natural generative model for irregular time series: the latent dynamics are continuous, predictions can be made at any time, and the uncertainty is naturally encoded in the latent state.

Latent Neural ODE in the variational framework. In practice (Rubanova et al., 2019), the model is trained as a variational autoencoder (VAE):

Encoder: A bidirectional RNN encodes the observed sequence $\{(t_i, \mathbf{x}(t_i))\}$ into the approximate posterior over the initial latent state $q(\mathbf{z}(t_0) | \{(t_i, \mathbf{x}(t_i))\})$.
Dynamics: Sample $\mathbf{z}(t_0)$ from the posterior, then solve the Neural ODE forward in time.
Decoder: At each desired output time $t_j$, decode $\mathbf{z}(t_j)$ to a prediction $\hat{\mathbf{x}}(t_j)$.
Loss: ELBO = reconstruction loss + KL divergence from prior.

The result is a continuous-time latent dynamical model that handles irregular observations, missing data, and variable-length sequences naturally. On medical time series with highly irregular observation times, Latent Neural ODEs outperform discrete-time baselines by handling the time irregularity correctly rather than approximately.

Section 9: When Neural ODEs Work and When They Do Not

Where Neural ODEs shine.

Irregularly sampled time series. Standard recurrent networks assume uniform time steps. A Neural ODE can handle data observed at irregular times $t_1, t_2, \ldots$ by integrating from each $t_i$ to $t_{i+1}$ - the integration interval adapts to the observation spacing. This is natural and elegant; it requires no ad hoc handling of missing observations.

Continuous normalizing flows. For generative modeling where exact log-likelihood computation is needed and architectural flexibility is valued, FFJORD offers a clean alternative to architecturally constrained discrete flows.

Physics-informed learning. When the dynamics come from a physical system with known structure (Hamiltonian, Lagrangian, dissipative), the Neural ODE framework integrates smoothly with that structure. Physics-informed Neural Networks (PINNs) use similar ideas.

Memory-constrained settings. When the depth is large (many Euler steps) and the state dimension is large, the adjoint method’s $O(d)$ memory makes Neural ODEs feasible where backprop through the solver would run out of memory.

Where they do not.

Large-scale vision and language. Tasks where standard ResNets, transformers, and convolutional networks excel do not benefit from Neural ODEs. The ODE solver overhead - typically 5-10x more compute than the equivalent ResNet - makes them impractical. Batch processing becomes unpredictable when different examples take different numbers of solver steps.

Stiff problems. As discussed, stiff dynamics force tiny steps and make the forward pass expensive. For general-purpose architectures without physical priors, stiffness is difficult to avoid.

When interpretability is not needed. The ODE framework provides mathematical tools for analysis (stability, Lyapunov functions, conservation). If you do not need those tools, the framework provides no benefit over a plain ResNet.

Section 9b: Augmented Neural ODEs

A significant limitation of Neural ODEs: the solution of an ODE forms a flow - a diffeomorphism. This means trajectories starting from different initial conditions cannot intersect. If two distinct inputs map to two distinct outputs, the ODE must route them along non-intersecting paths.

For some tasks - particularly those requiring trajectories to “cross” in the state space - this constraint is too restrictive. The standard Neural ODE in $\mathbb{R}^d$ cannot, for example, learn the mapping that takes a 1D input $x$ to $\sin(x)$ if the training data has multiple inputs mapping to the same output range through different routes.

Augmented Neural ODEs (Dupont et al., 2019) resolve this by augmenting the state space. Instead of evolving $\mathbf{x}(t) \in \mathbb{R}^d$, augment to $[\mathbf{x}(t), \mathbf{a}(t)] \in \mathbb{R}^{d + p}$ where $\mathbf{a}(0) = \mathbf{0}$ is an auxiliary state initialized to zero. The ODE evolves in the higher-dimensional space:

$$\frac{d}{dt}\begin{pmatrix} \mathbf{x}(t) \\ \mathbf{a}(t) \end{pmatrix} = f\left(\begin{pmatrix} \mathbf{x}(t) \\ \mathbf{a}(t) \end{pmatrix}, t, \theta\right)$$

The output is still $\mathbf{x}(T)$ (ignoring the auxiliary dimensions). But now the trajectories have extra room to maneuver in the augmented space, bypassing the crossing constraint. Two trajectories that would cross in $\mathbb{R}^d$ can separate in $\mathbb{R}^{d+p}$ without crossing.

The tradeoff: more parameters, higher-dimensional ODE (more expensive to solve), and less interpretability. But augmentation often dramatically reduces the number of function evaluations needed (fewer NFE), making the overall computation faster despite the larger state.

Section 9c: NODE versus RNN - A Precise Comparison

It is worth being precise about what Neural ODEs offer compared to recurrent networks, since both process sequential data using shared weights.

RNN update: $\mathbf{h_t} = \sigma(W\mathbf{h_{t-1}} + U\mathbf{x_t} + b)$. The hidden state $\mathbf{h_t}$ is a nonlinear function of the previous hidden state and the current input $\mathbf{x_t}$. Time steps are assumed equally spaced (or the model is oblivious to the time gaps). All weights $(W, U, b)$ are shared across all time steps.

Neural ODE: $\frac{d\mathbf{h}}{dt} = f(\mathbf{h}(t), t, \theta)$. The hidden state evolves continuously. The vector field $f$ is shared across all time values. There is no notion of a discrete “input” at each time step - inputs are either encoded into the initial condition or provided as time-varying parameters.

Key differences:

Time representation: RNNs process sequences at discrete steps. Neural ODEs treat time as a continuous variable - they can output predictions at any $t$, not just at the training time steps.

Input handling: In a standard Neural ODE, there are no inputs during the integration (the dynamics are autonomous or time-parameterized). To handle input sequences, you need to either encode the input into the initial condition (latent ODE approach) or modify the dynamics at each input time (controlled ODE approach, using $d\mathbf{h}/dt = f(\mathbf{h}, t, \theta) + g(\mathbf{x}(t), \phi)$).

Memory: RNNs backpropagate through all time steps, storing all hidden states - $O(T \cdot d)$ memory. Neural ODEs with the adjoint method use $O(d)$ memory, at the cost of re-running the forward ODE during the backward pass.

Gradient flow: RNNs suffer from vanishing/exploding gradients when the eigenvalues of the recurrence Jacobian are far from $1$ in magnitude. Neural ODEs manage gradient flow differently - through the adjoint ODE, which also can be unstable if the forward ODE has unstable dynamics. The problems are analogous but the mechanisms differ.

Controlled Neural ODEs. To handle inputs at arbitrary times, model the dynamics as:

$$\frac{d\mathbf{h}}{dt} = f(\mathbf{h}(t), \theta) \cdot \frac{d\mathbf{x}}{dt}(t)$$

where $\mathbf{x}(t)$ is the input signal (e.g., linearly interpolated between observed values). This is a controlled ODE, driven by the derivative of the input. When integrated, this gives an output that depends on the entire path of the input, not just its value at specific times. Kidger et al. (2020) showed this approach - the Neural Controlled Differential Equation - outperforms standard RNNs and Neural ODEs on irregular time series.

Section 10: Connections to Other Ideas

Gradient flow and optimization. The continuous-time limit of gradient descent is the ODE $\frac{d\theta}{dt} = -\nabla_\theta L(\theta)$. Gradient descent with step size $h$ is Euler’s method on this ODE. Analysis of gradient descent - convergence rates, sensitivity to learning rate, behavior near saddles - is analysis of this dynamical system. Neural ODEs close the loop: the training dynamics of gradient descent are themselves an ODE, and the forward dynamics of the Neural ODE are also an ODE. Two ODEs, nested.

Score-based diffusion. Modern diffusion models (DDPM, score-matching) define a stochastic process that gradually adds noise to data: $d\mathbf{x} = f(\mathbf{x}, t) dt + g(t) d\mathbf{W}$ (a stochastic differential equation). The denoising score function $\nabla \log p(\mathbf{x}, t)$ learned by the network plays the role of the drift term. The probability flow ODE - the deterministic ODE that has the same marginal densities as the SDE - is a Neural ODE:

$$\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, t) - \frac{1}{2}g(t)^2 \nabla \log p(\mathbf{x}, t)$$

Solving this ODE backward from noise to data generates samples. Neural ODEs and diffusion models are deeply connected through this probability flow formulation.

The NTK and infinite width. As neural network width goes to infinity, the network dynamics during training become governed by the Neural Tangent Kernel (NTK). The evolution of the network’s predictions is a linear ODE in function space: $\frac{d\hat{f}}{dt} = -\eta K (\hat{f} - f^)$ where $K$ is the NTK and $f^$ is the target. This linearization - and its implications for training dynamics and generalization - is another place where ODE theory illuminates deep learning.

Second-order ODEs and acceleration. Gradient descent is the Euler method on the gradient flow ODE $\frac{d\theta}{dt} = -\nabla L$. Momentum-based methods like Nesterov acceleration correspond to a second-order ODE:

$$\ddot{\theta} + \frac{r}{t}\dot{\theta} = -\nabla L(\theta)$$

for some $r > 0$ (the continuous limit of Nesterov’s algorithm, due to Su, Boyd, and Candes 2016). The oscillatory behavior familiar from momentum methods - the loss decreases faster than gradient descent but occasionally “overshoots” - is exactly the underdamped behavior of the second-order ODE with a particular damping coefficient. The ODE framework lets you analyze convergence rates, optimal damping, and the connection between different optimization algorithms through their continuous-time limits.

Deep equilibrium models. A deep equilibrium model (DEQ) finds the fixed point $\mathbf{x}^* = f(\mathbf{x}^, \theta)$ directly, without running many forward passes. This is related to Neural ODEs via the connection $\mathbf{x}^ = \lim_{t \to \infty} \mathbf{x}(t)$ when the ODE is stable and has a unique equilibrium. DEQs trade the ODE solver for a root-finding algorithm (Newton’s method or Broyden’s method), which can be more efficient when only the fixed point is needed. The memory cost is also $O(d)$ via the implicit function theorem (not the adjoint). Both Neural ODEs and DEQs are members of the family of implicit depth models - networks whose effective depth is determined at inference time rather than fixed at design time.

Summary

Concept	Key idea
ResNet-Euler correspondence	$\mathbf{x_{k+1}} = \mathbf{x_k} + f(\mathbf{x_k}, \theta_k)$ is Euler’s method with $h = 1$ on $d\mathbf{x}/dt = F(\mathbf{x}, t)$
Neural ODE	Replace discrete layers with continuous ODE $d\mathbf{x}/dt = f(\mathbf{x}(t), t, \theta)$; output is $\mathbf{x}(T)$
Weight sharing	Same $\theta$ at all $t$; closer to RNN than deep feedforward network
Adaptive computation	Adaptive solvers take more steps for harder inputs; variable inference cost
Adjoint method	Backward ODE computes $\partial \mathcal{L}/\partial \theta$ with $O(d)$ memory instead of $O(N \cdot d)$
Adjoint ODE	$d\mathbf{a}/dt = -\mathbf{a}^T \partial f/\partial \mathbf{x}$; runs backward from $t = T$ to $t = 0$
Continuity equation	$\partial \log p / \partial t = -\text{tr}(\partial f / \partial \mathbf{x})$; enables FFJORD
Trace vs. determinant	Trace is $O(d)$ via Hutchinson estimator; log-det is $O(d^3)$; trace is the infinitesimal log-det
Stiffness	Large Jacobian eigenvalues force tiny steps; main practical barrier
Augmented NODEs	Add auxiliary dimensions to allow trajectory crossings in state space
Latent NODE	Encode sequence to initial condition; evolve with ODE; decode at arbitrary times
Controlled NODE	$d\mathbf{h}/dt = f(\mathbf{h}) \cdot d\mathbf{x}/dt$; handles irregular time series inputs
Hamiltonian NNs	Learn $H_\theta$; dynamics via Hamilton’s equations; exact energy conservation
Probability flow ODE	Diffusion model sampler is a Neural ODE with score function as vector field
Deep equilibrium	Find $\mathbf{x}^* = f(\mathbf{x}^*, \theta)$ directly; related to Neural ODE fixed point

A Closing Note on Why This Matters

The Neural ODE framework is not primarily useful because it produces better accuracy on benchmarks. In most settings, well-tuned ResNets or transformers outperform Neural ODEs on standard tasks.

What the framework offers is a different way of thinking about deep networks. By making the connection to continuous-time dynamical systems explicit, it opens up a toolkit: Lyapunov stability analysis, Hamiltonian mechanics, variational calculus, control theory, the theory of flows on manifolds. These tools were developed over centuries for physical systems, and they now apply directly to neural network architectures.

The deeper insight is that the discrete architecture choices in standard deep learning - number of layers, layer widths, skip connections, attention patterns - often correspond to continuous mathematical structures that are better studied in their natural, continuous form. Just as the physics of a string vibrating is better studied as a wave equation than as a limit of springs and masses, some neural network phenomena may be better understood through their continuous-time limits.

Whether this insight leads to practical advances, or remains primarily a mathematical elegance, will depend on how well the community can bridge the gap between the beautiful theory and the messy realities of large-scale training.

Read Next:

Recurrent Neural Networks - Memory Hidden in the Hidden State