

Paper Review: Neural Ordinary Differential Equations

Overview and Summary

“Neural Ordinary Differential Equations” is a paper published in 2018 by Ricky T. Q. Chen and three coauthors, and it was selected as one of the 2018 NeurIPS Best Papers. It proposes a method to approximate a simple first-order differential equation, a non-autonomous system of the following form, using neural networks.

$$ \dfrac{\mathrm{d}y}{\mathrm{d}t} = f(y, t) $$

Notably, what the neural network approximates (predicts) is not $y$ itself but the rate of change $f$. Integrating both sides of the above equation from $0$ to any $T$ gives the following:

$$ y(T) - y(0) = \int\limits_{0}^{T} f(y, t) \,\mathrm{d}t \implies y(T) = y(0) + \int\limits_{0}^{T} f(y, t) \,\mathrm{d}t $$

Therefore, if $f$ can be accurately approximated, we can obtain $y(T)$ for any $T$.
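To make this concrete, here is a minimal sketch (plain NumPy, not code from the paper) of recovering $y(T)$ from an approximation of $f$ by numerical integration. The function `f_hat` is a hypothetical stand-in for whatever model approximates $f$, and forward Euler is used only for simplicity; the paper relies on far better solvers.

```python
import numpy as np

def f_hat(y, t):
    # Hypothetical stand-in for a learned approximation of f(y, t).
    return -0.5 * y + np.sin(t)

def integrate(y0, T, n_steps=1000):
    """Approximate y(T) = y(0) + int_0^T f(y, t) dt with forward Euler."""
    y, t = y0, 0.0
    dt = T / n_steps
    for _ in range(n_steps):
        y = y + dt * f_hat(y, t)  # one Euler step: y <- y + dt * f(y, t)
        t += dt
    return y

print(integrate(y0=1.0, T=2.0))  # approximate y(2) given y(0) = 1
```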

1 Introduction

Consider a time series dataset consisting of 8 data points with equal time intervals, as depicted in the figure below.

In such a scenario, our typical objective is to predict the data $\mathbf{h}_{9}$, $\mathbf{h}_{10}$, $\mathbf{h}_{11}$, $\dots$ at times $t_{9}$, $t_{10}$, $t_{11}$, $\dots$. One straightforward approach is to model this with residual networks, RNNs, normalizing flows, etc., as follows:

$$ \mathbf{h}_{t+1} = \mathbf{h}_{t} + f(\mathbf{h}_{t}; \theta) \tag{1} $$

Here, $f$ is a neural network and $\theta$ denotes its parameters (weights) to be learned. While this method is simple, intuitive, and one of the first things to try, it has the following disadvantages:

  • It is difficult to interpolate the data. It is hard to predict values between $t_{3}$ and $t_{4}$ with a model like $(1)$, say at an arbitrary time $t_{3.472}$.
  • It is difficult to handle data with uneven time intervals. In $(1)$, the state $\mathbf{h}_{t+1}$ is updated by adding $f$ an integer number of times, so if the time intervals are not uniform, the recursion in $(1)$ no longer applies.

The method proposed in this paper is to conceptualize the discrete autonomous system $(1)$ as a continuous non-autonomous system as follows:

$$ \dfrac{\mathrm{d}\mathbf{h}}{\mathrm{d}t} = f(\mathbf{h}(t), t; \theta) \tag{2} $$

A “non-autonomous system whose external force (or velocity) $f$ is defined by an artificial neural network with parameters $\theta$” is referred to as a Neural Ordinary Differential Equation (Neural ODE, NODE).
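As the paper notes, the discrete update $(1)$ is exactly one forward Euler step of $(2)$ with unit step size, which is what motivates viewing residual networks as discretizations of a continuous transformation:

$$ \mathbf{h}_{t+1} = \mathbf{h}_{t} + \Delta t \, f(\mathbf{h}_{t}, t; \theta), \qquad \Delta t = 1 $$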

Many high-performance solvers (i.e., numerical integrators) for such ODEs have already been developed, so if $f$ is well approximated by a neural network, we can predict $\mathbf{h}(t)$ at any time $t$. Integrating both sides gives:

$$ \mathbf{h}(t) = \mathbf{h}(0) + \int\limits_{0}^{t} f(\mathbf{h}(s), s; \theta) \,\mathrm{d}s \tag{3} $$

The right-hand side can be evaluated by an ODE solver given the initial value $\mathbf{h}(0)$ and $f$, which the paper denotes as follows:

$$ \mathbf{h}(t_{1}) = \operatorname{ODESolve}(\mathbf{h}(t_{0}), f, t_{0}, t_{1}, \theta) $$
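In code, $\operatorname{ODESolve}$ corresponds to a call to a numerical integrator. Here is a minimal sketch using the torchdiffeq package; the toy network and values below are assumptions, and torchdiffeq expects the vector field with signature `func(t, h)`.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

# Toy f defining dh/dt = f(h, t; theta), wrapped to match torchdiffeq's (t, h) signature.
f = nn.Sequential(nn.Linear(2, 2), nn.Tanh())
func = lambda t, h: f(h)

h_t0 = torch.randn(1, 2)              # initial value h(t0)
t = torch.tensor([0.0, 1.0])          # [t0, t1]
h_t1 = odeint(func, h_t0, t)[-1]      # ODESolve(h(t0), f, t0, t1, theta)
```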

The paper explains the advantages of defining the model through an ODE solver and computing values with it as follows:

  • Memory efficiency:

    In Chapter 2 below, a method is introduced for computing the gradient of the loss function without backpropagating through the internal operations of the ODE solver. Intermediate computation results therefore do not need to be stored when evaluating the model, so memory cost stays constant with respect to depth.

  • Adaptive computation:

    In the more than 120 years since the Euler method, increasingly accurate and efficient ODE solvers have been developed; modern solvers monitor the approximation error and adapt their evaluation strategy on the fly to reach a requested accuracy.

  • Scalable and invertible normalizing flows:

    One advantage of continuous models is that the change of variables formula becomes much easier to compute, as explained in Chapter 4, which allows efficient normalizing flow models to be built.

  • Continuous time-series models:

    Unlike discrete RNNs, which require evenly spaced data for training and prediction, a continuously defined Neural ODE can make predictions at any time $t$, as discussed further in Chapter 5.

2 Reverse-mode Automatic Differentiation of ODE Solutions

The core idea and principle of Neural ODE are explained in the Introduction above. Chapter 2 covers the training method. The authors developed a method to implement Neural ODE and made it publicly available on GitHub, so if you do not need to implement it yourself and are not particularly interested in the training method, this chapter can be skipped. Chapters 3-5 cover potential applications of Neural ODE, so if you are interested in applying it to your own field, it is recommended to read them closely. Notably, Chapter 4 relates to flow matching, a leading trend in generative models following diffusion models.

The observed values (ground-truth data) are constants independent of the neural network parameters $\theta$, so the loss function can be written as follows:

$$ L(\mathbf{z}(t_{1})) = L \left( \mathbf{z}(t_{0}) + \int_{t_{0}}^{t_{1}} f(\mathbf{z}(t), t; \theta) \,\mathrm{d}t \right) = L \left( \operatorname{ODESolve}(\mathbf{z}(t_{0}), f_{\theta}, t_{0}, t_{1}) \right) $$

To compute the gradient of the loss function, the paper uses the adjoint sensitivity method, which integrates another ODE backward in time. The result is as follows:

$$ \dfrac{\mathrm{d}L}{\mathrm{d}\theta} = -\int\limits_{t_{1}}^{t_{0}} \left( \dfrac{\partial L}{\partial \mathbf{z}(t)} \right)^{\mathsf{T}} \dfrac{\partial f(\mathbf{z}(t), t; \theta)}{\partial \theta} \,\mathrm{d}t $$
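For reference, the term $\partial L / \partial \mathbf{z}(t)$ in the integrand is the adjoint state $\mathbf{a}(t)$ defined in the paper, which itself obeys an ODE solved backward in time alongside the integral above:

$$ \mathbf{a}(t) = \dfrac{\partial L}{\partial \mathbf{z}(t)}, \qquad \dfrac{\mathrm{d}\mathbf{a}(t)}{\mathrm{d}t} = -\mathbf{a}(t)^{\mathsf{T}} \dfrac{\partial f(\mathbf{z}(t), t; \theta)}{\partial \mathbf{z}(t)} $$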

Detailed information is provided in Appendices B, C, and D (Download Link). The code for the examples described below and for the training method is included in the torchdiffeq package, publicly available on GitHub.
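As a rough illustration of how training works in practice (a sketch under assumed architecture and data, not the authors' exact code), torchdiffeq's `odeint_adjoint` computes gradients with the adjoint method instead of backpropagating through the solver's internal operations:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # gradients via the adjoint ODE


class ODEFunc(nn.Module):
    """f(z, t; theta): right-hand side of the ODE (an nn.Module, as the adjoint call requires)."""

    def __init__(self, dim=2, hidden=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        return self.net(z)  # this toy field ignores t, i.e. autonomous dynamics


func = ODEFunc()
optimizer = torch.optim.Adam(func.parameters(), lr=1e-3)

z_t0 = torch.randn(16, 2)             # batch of initial values z(t0)
t = torch.linspace(0.0, 1.0, 10)      # evaluation times from t0 to t1
target = torch.randn(10, 16, 2)       # placeholder ground-truth trajectory

optimizer.zero_grad()
pred = odeint(func, z_t0, t, rtol=1e-5, atol=1e-7)  # shape (len(t), batch, dim)
loss = (pred - target).pow(2).mean()
loss.backward()      # solves the adjoint ODE backward in time; constant memory in depth
optimizer.step()
```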

3 Replacing Residual Networks with ODEs for Supervised Learning

If Neural ODE is viewed as replacing $(1)$ with $(3)$, it can be seen as a generalization of residual networks.

$$ \text{residual: } \mathbf{h}_{t+1} = \mathbf{h}_{t} + f(\mathbf{h}_{t}; \theta) \quad\overset{\text{generalization}}{\implies}\quad \dfrac{\mathrm{d}\mathbf{h}}{\mathrm{d}t} = f(\mathbf{h}(t), t; \theta) $$

Hence, Neural ODE can be used in supervised learning tasks that rely on ResNet. The authors compared performance by replacing the residual blocks of a ResNet with a Neural ODE on the MNIST classification problem.
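A sketch of what such a replacement can look like in code (the layers around the ODE block are placeholders, not the architecture used in the paper): the ODE block plays the role of an entire stack of residual blocks.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint


class ODEFunc(nn.Module):
    """The vector field f(h, t; theta) shared across the whole integration interval."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, t, h):
        return self.net(h)


class ODEBlock(nn.Module):
    """Drop-in replacement for a stack of residual blocks: h(1) = ODESolve(h(0), f, 0, 1)."""

    def __init__(self, func):
        super().__init__()
        self.func = func
        self.t = torch.tensor([0.0, 1.0])

    def forward(self, h):
        return odeint(self.func, h, self.t, rtol=1e-3, atol=1e-3)[-1]


# Placeholder layers around the ODE block (the paper uses convolutional layers instead).
model = nn.Sequential(
    nn.Flatten(),                # MNIST images -> vectors
    nn.Linear(28 * 28, 64),      # stand-in feature extractor
    ODEBlock(ODEFunc(64)),       # replaces the residual blocks
    nn.Linear(64, 10),           # classifier head
)
logits = model(torch.randn(8, 1, 28, 28))
```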

The solver chosen for the ODE initial value problems is the implicit Adams method provided by the scipy.integrate package. The adjoint sensitivity method used for optimization was implemented with PyTorch's autograd.

Experimental results showed that ODE-Net achieved nearly the same performance as ResNet despite having roughly $1/3$ as many parameters. $L$ denotes the number of layers in ResNet; the computation time and memory required for backpropagation in ResNet are proportional to $L$. $\tilde{L}$ denotes the ODE solver's number of function evaluations (NFE), which indicates how finely the integration interval is divided. Numerical ODE solvers partition the integration interval, and NFE counts these evaluations; for example, integrating the interval $[0, 1]$ with step size $0.05$ gives $\text{NFE} = \dfrac{1}{0.05} = 20$. The paper notes that NFE can be interpreted as the layer count of ODE-Net. A larger NFE allows more precise computation but takes longer, so choosing an appropriate number matters; with adaptive solvers, Neural ODE chooses the NFE automatically.

Figures 3a and 3b show, respectively, the error and the computation time as functions of NFE in the forward pass, and the results align well with intuition. Figure 3c shows that the backward pass requires roughly half as many function evaluations as the forward pass, indicating that the proposed adjoint sensitivity method is quite efficient. Figure 3d shows that NFE increases as training progresses, suggesting that the learned dynamics become more complex and the model refines itself to represent them.

4 Continuous Normalizing Flows

Discrete modeling of the form $(1)$ also appears in normalizing flows.

$$ \mathbf{z}_{1} = f(\mathbf{z}_{0}) $$

$$ \log p(\mathbf{z}_{1}) = \log p(\mathbf{z}_{0}) - \log \left| \det \dfrac{\partial f}{\partial \mathbf{z}_{0}} \right| \tag{a4} $$

Take the planar flow, for example:

$$ \mathbf{z}_{t+1} = \mathbf{z}_{t} + \mathbf{u} \tanh (\mathbf{w}^{\mathsf{T}}\mathbf{z}_{t} + b) $$

$$ \log p(\mathbf{z}_{t+1}) = \log p(\mathbf{z}_{t}) - \log \left| 1 + \mathbf{u}^{\mathsf{T}} \dfrac{\partial \tanh}{\partial \mathbf{z}} \right| $$

The most critical issue when training a normalizing flow is computing the determinant in $(a4)$, whose cost is cubic in the dimension of $\mathbf{z}$ (equivalently, the number of hidden units). Interestingly, switching from the discrete structure $(1)$ to the continuous structure $(3)$ simplifies this computation.


**Theorem 1** (Instantaneous Change of Variables).

Let $\mathbf{z}(t)$ be a continuous random variable and $p(\mathbf{z}(t))$ its probability density function. Suppose a non-autonomous differential equation is given by:

$$ \dfrac{\mathrm{d} \mathbf{z}}{\mathrm{d}t} = f(\mathbf{z}(t), t) \tag{a5} $$

Assuming $f$ is Lipschitz continuous in $\mathbf{z}$ and continuous in $t$, the time derivative of the log probability density satisfies:

$$ \dfrac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\Tr \left( \dfrac{\partial f}{\partial \mathbf{z}(t)} \right) $$

Here, $\Tr$ denotes the trace.


The determinant computation in $(a4)$ is thus reduced to a simple trace computation. Moreover, unlike in typical discrete normalizing flows, $f$ need not be bijective. Rewriting the planar flow example in the continuous form of **Theorem 1** yields the following:

$$ \dfrac{\mathrm{d}\mathbf{z}(t)}{\mathrm{d}t} = \mathbf{u} \tanh(\mathbf{w}^{\mathsf{T}}\mathbf{z}(t) + b), \qquad \dfrac{\partial \log p (\mathbf{z}(t))}{\partial t} = -\Tr \left( \mathbf{u} \dfrac{\partial \tanh(\mathbf{w}^{\mathsf{T}}\mathbf{z}(t) + b)}{\partial \mathbf{z}(t)} \right) $$
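As a concrete illustration of the trace term (a sketch assuming PyTorch; `u`, `w`, `b` stand for the planar-flow parameters above, initialized randomly), the instantaneous change of log-density can be evaluated with automatic differentiation:

```python
import torch
from torch.autograd.functional import jacobian

d = 2
u, w, b = torch.randn(d), torch.randn(d), torch.randn(())

def f(z):
    # Continuous planar flow field: f(z) = u * tanh(w^T z + b)
    return u * torch.tanh(torch.dot(w, z) + b)

z = torch.randn(d)
J = jacobian(f, z)             # (d, d) Jacobian df/dz evaluated at z
dlogp_dt = -torch.trace(J)     # Theorem 1: d log p / dt = -Tr(df/dz)
print(dlogp_dt)
```

For this particular field the trace also has the closed form $\mathbf{w}^{\mathsf{T}}\mathbf{u} \left( 1 - \tanh^{2}(\mathbf{w}^{\mathsf{T}}\mathbf{z} + b) \right)$, so autograd is only needed for a general $f$.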

Using multiple hidden units with linear cost

Although the determinant is not linear, the trace is, so $\Tr \sum\limits_{n} J_{n} = \sum\limits_{n} \Tr J_{n}$ holds. Therefore, if $f$ in $(a5)$ is expressed as a sum of functions, the derivative of the log density can be written just as easily:

$$ \dfrac{\mathrm{d} \mathbf{z}(t)}{\mathrm{d}t} = \sum\limits_{n=1}^{M} f_{n}(\mathbf{z}(t)), \qquad \dfrac{\mathrm{d}\log p(\mathbf{z}(t))}{\mathrm{d}t} = -\sum\limits_{n=1}^{M} \Tr \left( \dfrac{\partial f_{n}}{\partial \mathbf{z}(t)} \right) $$

This means the computational cost grows only linearly in the number of hidden units $M$, whereas standard normalizing flows are typically built from many stacked layers with a single hidden unit each because of the $\mathcal{O}(M^{3})$ determinant cost.

Time-dependent dynamics

The paper employs a method called gating to define flows as follows:

$$ \dfrac{\mathrm{d} \mathbf{z}}{\mathrm{d}t} = \sum\limits_{n} \sigma_{n}(t)f_{n}(\mathbf{z}), \qquad \sigma_{n}(t) \in (0, 1) $$

That is, $f(\mathbf{z}, t)$ is assumed to separate into time-dependent gates $\sigma_{n}(t)$ and state-dependent functions $f_{n}(\mathbf{z})$, where both $\sigma_{n}$ and $f_{n}$ are neural networks to be learned. The paper calls this a continuous normalizing flow (CNF). Flow matching is a method to train CNFs more efficiently.

4.1 Experiments with Continuous Normalizing Flows

The authors compare CNF with a discrete NF. The CNF is implemented as described above and trained for 10,000 iterations with the Adam optimizer; the NF is implemented as in its original paper and trained for 500,000 iterations with the RMSProp optimizer, so the CNF requires substantially fewer iterations. The results are shown in the figure below.

In the lower part of Figure 5, the first two rows appear to show results for the CNF and the last row for the NF. The CNF transforms the density smoothly, whereas the NF does not and fails to model the Two Moons dataset correctly.

5 A Generative Latent Function Time-Series Model

Since Neural ODE approximates non-autonomous systems with neural networks, it can be applied to forecasting, interpolation, and imputation of time series data. The training procedure proposed in the paper is as follows (a code sketch follows the list):

  1. Suppose a time series dataset $\left\{ \mathbf{x}_{i}, t_{i} \right\}_{i=1}^{N}$ is given.
  2. Use an encoder implemented with an RNN to extract latent variables $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$: $(\boldsymbol{\mu}, \log\boldsymbol{\sigma}^{2}) = \operatorname{RNNencoder}\left( \left\{ \mathbf{x}_{i}, t_{i} \right\} \right)$
  3. Sample $\boldsymbol{\epsilon}$ from the Gaussian distribution $N(\mathbf{0}, I)$ and generate an initial latent value $\mathbf{z}_{t_{0}}$ using $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$: $\mathbf{z}_{t_{0}} = \exp(0.5 \log\boldsymbol{\sigma}^{2}) \odot \boldsymbol{\epsilon} + \boldsymbol{\mu}$. This has the same effect as mapping the output of $\operatorname{RNNencoder}$ to the latent distribution $N(\boldsymbol{\mu}, \diag(\boldsymbol{\sigma}^{2}))$ while still allowing backpropagation (the reparameterization trick).
  4. Compute $\mathbf{z}_{t_{i}}$ from the initial latent value $\mathbf{z}_{t_{0}}$ using the Neural ODE: $(\mathbf{z}_{t_{1}}, \mathbf{z}_{t_{2}}, \dots, \mathbf{z}_{t_{N}}) = \operatorname{ODESolve}(\mathbf{z}_{t_{0}}, f, t_{0}, \dots, t_{N}, \theta)$
  5. Obtain the predicted time series $\mathbf{x}_{t_{i}}$ by feeding $\mathbf{z}_{t_{i}}$ into a fully connected network (FCN) that serves as the decoder: $\mathbf{x}_{t_{i}} = \operatorname{FCNdecoder}(\mathbf{z}_{t_{i}})$
  6. Maximize the ELBO: $\text{ELBO} = \sum\limits_{i} \log p(\mathbf{x}_{t_{i}} | \mathbf{z}_{t_{i}}) + \log p(\mathbf{z}_{t_{0}}) - \log q(\mathbf{z}_{t_{0}} | \left\{ \mathbf{x}_{t_{i}}, t_{i} \right\})$
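The following sketch strings these steps together. The dimensions, architectures, names (`LatentODEFunc`, `rnn_encoder`, `decoder`), and the Gaussian form of the reconstruction term are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

obs_dim, latent_dim, rnn_hidden = 2, 4, 25   # assumed dimensions (2D observations, 4D latent space)


class LatentODEFunc(nn.Module):
    """Dynamics f of the latent state z(t)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 20), nn.Tanh(), nn.Linear(20, latent_dim))

    def forward(self, t, z):
        return self.net(z)


rnn_encoder = nn.GRU(obs_dim, rnn_hidden, batch_first=True)      # step 2: RNN encoder
enc_out = nn.Linear(rnn_hidden, 2 * latent_dim)                  # -> (mu, log sigma^2)
ode_func = LatentODEFunc()                                       # step 4: Neural ODE
decoder = nn.Sequential(nn.Linear(latent_dim, 20), nn.Tanh(), nn.Linear(20, obs_dim))  # step 5

x = torch.randn(8, 30, obs_dim)           # placeholder observed trajectories {x_i}
t = torch.linspace(0.0, 1.0, 30)          # observation times {t_i}

_, h = rnn_encoder(torch.flip(x, dims=[1]))     # run the RNN backward in time
mu, logvar = enc_out(h[-1]).chunk(2, dim=-1)    # step 2: latent mean and log-variance
eps = torch.randn_like(mu)                      # step 3: reparameterization trick
z_t0 = mu + torch.exp(0.5 * logvar) * eps

z = odeint(ode_func, z_t0, t)                   # step 4: (len(t), batch, latent_dim)
x_pred = decoder(z).permute(1, 0, 2)            # step 5: (batch, len(t), obs_dim)

# Step 6: maximize the ELBO = E[log p(x | z)] - KL(q(z_t0 | x) || p(z_t0)), here with a
# unit-variance Gaussian likelihood and a standard normal prior (both assumptions).
recon = -0.5 * ((x_pred - x) ** 2).sum(dim=(1, 2))
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
loss = -(recon - kl).mean()
loss.backward()
```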

5.1 Time-series Latent ODE Experiments

The neural networks used for the experiment are as follows:

  • $\operatorname{RNNencoder}$: an RNN with 25 hidden nodes
  • The latent space is 4-dimensional: $\boldsymbol{\mu}, \boldsymbol{\sigma} \in \mathbb{R}^{4}$
  • Neural ODE $f$: an MLP with one hidden layer of 20 nodes
  • $\operatorname{FCNdecoder}$: another MLP with one hidden layer of 20 nodes

The dataset is constructed as follows:

  • Generate 1,000 samples of 2D spiral data.
  • Half are clockwise, the other half counterclockwise.
  • Each trajectory begins at a different initial value, sampling 100 points at equal time intervals.
  • Add Gaussian noise to observations.
  • During training, use 30/50/100 points, randomly drawn without replacement, among the 100 points in each trajectory.

The table below shows that Neural ODE performs significantly better than the RNN baseline, and that its performance holds up even when data are scarce compared to when they are abundant.

The figures below compare predictions made by the RNN and by Neural ODE, showing that Neural ODE produces smoother trajectories and extrapolates better in the outer regions not covered by the training data.

The final figure depicts the latent trajectories projected onto the first two dimensions; they split into two clusters corresponding to clockwise and counterclockwise spirals, indicating that the proposed method models the latent variables well.