
Paper Review: DeepONet

Overview and Summary

  • This review follows the references, equation numbers, and notation of the paper as closely as possible.

For accessibility, this review is based on the version available on arXiv rather than the journal-published version. Although the problems covered in the experimental section differ slightly between the two versions, the focus here is not on the experimental results and performance but on the explanation of the DeepONet method itself.

DeepONet is a deep learning technique proposed for learning operators. An operator is a function that maps functions to functions (explained in detail in the main text). Specifically, for a function $u$, an operator $G$ is defined as follows.

$$ G : u \mapsto G(u) $$

Here, $u$ is also a function, and $Gu = G(u)$ is a function as well. The first key point is that "DeepONet learns operators," and the second is that it "approximates $Gu$ as a series." Given an appropriate function space $X$, let $\left\{ \phi_{k} \right\}$ be its basis. Then $Gu \in X$ can be expressed as follows.

$$ Gu = \sum_{k=1}^{\infty} c_{k}\phi_{k} $$

DeepONet learns $c_{k}$ and $\phi_{k}$; the part that learns the coefficients $c_{k}$ is called the branch network, and the part that learns the basis $\left\{ \phi_{k} \right\}$ is called the trunk network.


1 Introduction

The universal approximation theorem guarantees that neural networks can approximate arbitrary continuous functions. This provides a theoretical basis for the effectiveness of artificial neural networks and deep learning techniques, which have been successful across various fields. Even more surprisingly, artificial neural networks can also approximate nonlinear continuous functionals and (nonlinear) operators.

For readers unfamiliar with the mathematics, let's briefly explain functions, functionals, and operators. All three are fundamentally functions (maps that send each element of the domain to exactly one element of the codomain), but the terms functional and operator carry a slightly more specific meaning in this context. Typically, a function maps numbers (or vectors) to numbers (or vectors); polynomial functions, trigonometric functions, and other commonly encountered functions fall into this category.

$$ \text{function}: \mathbb{R}^{n} \to \mathbb{R}^{m} $$

A function that maps functions to numbers (scalars) is specifically called a functional. A concrete example is the definite integral. If we define a functional $I_{[a,b]}$ by $\displaystyle I_{[a,b]}(f) = \int_{a}^{b} f(x)dx$, then for each given $f$, this functional returns the area under the curve of $f$ over the interval $[a, b]$. If $X$ denotes an appropriate function space, a functional can be expressed as follows.

$$ \text{functional}: X \to \mathbb{R} $$

Operators map functions to functions. Examples include indefinite integrals and derivatives.

$$ \text{operator}: X \to X $$

For a function $f$, the operator $D$ defined by $D(f) = \dfrac{df}{dx}$ is the differential operator, which maps a given function to its derivative. Similarly, the operator $I$ defined by $\displaystyle I(f) = \int f(x) dx$ maps a given function to its indefinite integral.
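To make the distinction concrete, here is a minimal numerical sketch (not from the paper; the grid and the example function are arbitrary choices) showing a functional and two operators acting on a discretized function.

```python
import numpy as np

# Arbitrary example: f(x) = sin(2*pi*x) sampled on a grid over [0, 1].
x = np.linspace(0.0, 1.0, 101)
f = np.sin(2 * np.pi * x)

# Functional I_[0,1]: f -> integral of f over [0, 1], a single number (trapezoid rule).
definite_integral = float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(x)))

# Differential operator D: f -> df/dx, again a function (an array of samples).
Df = np.gradient(f, x)

# Integral operator: f -> cumulative integral of f from 0 to x, also a function on the grid.
If = np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) / 2 * np.diff(x))])

print(definite_integral)   # scalar output of the functional
print(Df.shape, If.shape)  # function-valued outputs of the operators
```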

From here on, the terms function, functional, and operator are used with the meanings given above. Before delving into the main discussion, let's introduce the notation used throughout the paper. $G$ denotes an operator whose variable is a function $u$.

$$ G : u \mapsto G(u) $$

Since $G$ is an operator, its value $G(u)$ is also a function, and the variable of $G(u)$ is denoted by $y$.

$$ G(u) : y \mapsto G(u)(y) $$

Therefore, both $y$ and $G(u)(y)$ are real numbers.

$$ y, G(u)(y) \in \mathbb{R} $$

The goal of this paper is to learn operators, and to that end, we consider a neural network that takes both $u$ and $y$ as inputs and outputs $G(u)(y)$.

$$ \text{network} : (u, y) \mapsto G(u)(y) $$

Theoretically, the operator $G$ takes the function $u$ itself as its variable, but computer simulation requires discretization, so a finite number of function values $u(x_{1})$, $u(x_{2})$, $\dots$, $u(x_{m})$ are used as inputs to the neural network. The locations $\{x_{1}, \dots, x_{m}\}$ are referred to as sensors in the paper. Thus, the proposed neural network has the following structure (Fig. 1A).
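For instance, fixing the sensor locations once and evaluating every input function there might look like the following sketch (the number of sensors and the example function are assumptions for illustration).

```python
import numpy as np

# m fixed sensor locations shared by every input function u.
m = 100
sensors = np.linspace(0.0, 1.0, m)

def encode(u):
    """Discretize a callable u into the finite vector [u(x_1), ..., u(x_m)]."""
    return u(sensors)

# The network never sees u itself, only its values at the sensors.
u_vec = encode(lambda x: np.sin(np.pi * x))   # shape (m,), fed to the network
```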

Figure 1A

Theorem 1 (Universal Approximation Theorem for Operator) Let $\sigma$ be a continuous non-polynomial function, $X$ a Banach space, and $K_{1} \subset X$, $K_{2} \subset \mathbb{R}^{d}$ compact sets. Let $V \subset C(K_{1})$ be a compact set, and $G : V \to C(K_{2})$ a nonlinear continuous operator.

Then, for any $\epsilon > 0$, there exist positive integers $n$, $p$, $m$, and constants $c_{i}^{k}$, $\xi_{ij}^{k}$, $\theta_{i}^{k}$, $\zeta_{k} \in \mathbb{R}$, $w_{k} \in \mathbb{R}^{d}$, $x_{j} \in K_{1}$ ($i=1,\dots,n$, $k = 1,\dots,p$, $j = 1,\dots,m$) such that the following holds.

$$ \left| G(u)(y) - \sum\limits_{k=1}^{p} \underbrace{\sum\limits_{i=1}^{n} c_{i}^{k}\sigma\left( \sum\limits_{j=1}^{m} \xi_{ij}^{k}u(x_{j}) + \theta_{i}^{k} \right)}_{branch} \underbrace{\sigma(w_{k} \cdot y + \zeta_{k})}_{trunk} \right| < \epsilon \quad \text{for all } u \in V, y \in K_{2} \tag{1} $$

In the paper, this approximation is divided into two parts, called the branch and the trunk.
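The structure of Eq. $(1)$ can be spelled out directly. The sketch below evaluates the branch-trunk sum for arbitrary (random) parameters, purely to show the shapes involved; the theorem asserts that some choice of these parameters makes the sum $\epsilon$-close to $G(u)(y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, d = 5, 4, 10, 1          # sizes from Theorem 1 (values chosen arbitrarily)
sigma = np.tanh                   # a continuous non-polynomial activation

u_sensors = rng.normal(size=m)    # [u(x_1), ..., u(x_m)]
y = rng.normal(size=d)

c = rng.normal(size=(p, n))       # c_i^k
xi = rng.normal(size=(p, n, m))   # xi_ij^k
theta = rng.normal(size=(p, n))   # theta_i^k
w = rng.normal(size=(p, d))       # w_k
zeta = rng.normal(size=p)         # zeta_k

branch = (c * sigma(xi @ u_sensors + theta)).sum(axis=1)   # p branch values
trunk = sigma(w @ y + zeta)                                # p trunk values
approximation = float(branch @ trunk)                      # plays the role of G(u)(y)
```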

Although the approximation theorem guarantees that neural networks can represent nonlinear operators, it says nothing about how to train them effectively. Likewise, even though the universal approximation theorem implies that an MLP can approximate any continuous function, CNNs and other architectures perform better on image-related tasks. A useful network should be easy to train and generalize well, and the authors aim to propose a new methodology that makes this possible for operator learning.

To demonstrate that the proposed method is suitable for learning nonlinear operators, they impose only very weak constraints on the data: the input functions $u_{i}$ must share the same sensors. These sensors need not lie on a uniform grid, and there are no constraints on the variable $y$. This condition is illustrated in Fig. 1B.

Figure 1B

The authors name the proposed architecture DeepONet (Deep Operator Network), which is composed of a branch net for the input function ($u(x_{1}), \dots, u(x_{m})$) and a trunk net for the output-function variable ($y$). Details are elaborated in Section 2.

The paper considers two types of operators, represented by ordinary differential equations (ODEs) and partial differential equations (PDEs).

2 Methodology

2.1 Deep operator networks (DeepONets)

The authors focus on operator learning in a general setting and impose only the constraint that the input functions $u$ share the same sensors. The inputs to the proposed neural network are divided into two parts, as seen in Fig. 1A: $[u(x_{1}), \dots, u(x_{m})]$ and $y$. There are no restrictions on the network architecture; the paper uses basic fully-connected neural networks (FNNs) to showcase the method's capabilities, noting that CNN or RNN architectures, or attention mechanisms, could be used instead if desired.

First, the trunk network takes $y$ as input and outputs $[t_{1}, t_{2}, \dots, t_{p}]^{T} \in \mathbb{R}^{p}$. Each of the $p$ branch networks takes $[u(x_{1}), \dots, u(x_{m})]$ as input and outputs a scalar $b_{k} \in \mathbb{R}$ ($k = 1,2,\dots,p$). These are combined as in Equation $(1)$ as follows.

$$ G(u)(y) \approx \sum_{k=1}^{p} b_{k}t_{k} = \sum_{k=1}^{p} b_{k}([u(x_{1}), u(x_{2}), \cdots, u(x_{m})])\, t_{k}(y) $$

It is worth noting that the activation function is applied even in the last layer of the trunk net. Although not explicitly evident in these equations, this approach can be viewed as approximating the function $Gu = G(u)$ as a series. Given an appropriate function space $X$ with basis $\left\{ \phi_{k} \right\}$, it can be expressed as follows.

$$ Gu = \sum_{k=1}^{\infty} b_{k}\phi_{k} $$

In other words, interpreting $t_{k} = \phi_{k}(y)$ as the basis and $b_{k}$ as the coefficients of the series, DeepONet approximates $Gu$ by decomposing it into a series instead of approximating it directly. Although Theorem 1 does not require it, adding a bias (constant term) as shown below improves generalization performance.

$$ G(u)(y) \approx \sum_{k=1}^{p} b_{k}t_{k} + b_{0} $$

In practice, $p$ should be at least 10, but a larger $p$ increases the computational cost. The paper therefore distinguishes the Stacked DeepONet, which employs a separate branch network for each $b_{k}$ (Fig. 1C), from the Unstacked DeepONet, in which a single branch network learns all of the $b_{k}$ (Fig. 1D). All code related to DeepONet can be found at https://github.com/lululxvi/deepxde, although it can be challenging to locate the relevant components among the author's other works featured there.
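As a concrete reference, here is a minimal unstacked DeepONet sketch in PyTorch; this is an illustration by the reviewer with assumed layer sizes and activations, not the authors' reference implementation (for that, see the deepxde repository linked above).

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal unstacked DeepONet: one branch net produces all p coefficients b_k."""

    def __init__(self, m: int, p: int, width: int = 40):
        super().__init__()
        # Branch net: [u(x_1), ..., u(x_m)] -> (b_1, ..., b_p)
        self.branch = nn.Sequential(
            nn.Linear(m, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p),
        )
        # Trunk net: y -> (t_1, ..., t_p); note the activation after the last layer.
        self.trunk = nn.Sequential(
            nn.Linear(1, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p), nn.ReLU(),
        )
        self.b0 = nn.Parameter(torch.zeros(1))   # the extra bias b_0

    def forward(self, u_sensors: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        b = self.branch(u_sensors)               # (batch, p)
        t = self.trunk(y)                        # (batch, p)
        return (b * t).sum(dim=-1, keepdim=True) + self.b0   # approximates G(u)(y)

model = DeepONet(m=100, p=10)
out = model(torch.randn(32, 100), torch.rand(32, 1))   # output shape (32, 1)
```

A stacked DeepONet would replace the single branch net with $p$ independent branch nets, each producing one $b_{k}$.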

Figure 1C and 1D

2.2 Data generation

The paper samples input functions from two function spaces: a Gaussian random field (GRF) and an orthogonal polynomial space. The authors used a GRF with mean $0$.

$$ u \sim \mathcal{G}(0, k_{l}(x_{1}, x_{2})) $$

Here, $k_{l}(x_{1}, x_{2}) = \exp (- \| x_{1} - x_{2} \|^{2} / 2l^{2})$ is the covariance kernel with length-scale $l$. For the orthogonal polynomial space, Chebyshev polynomials are chosen: let $M > 0$ and let $T_{i}$ denote the Chebyshev polynomials of the first kind.

$$ V_{\text{poly}} = \left\{ \sum\limits_{i=0}^{N-1} a_{i} T_{i}(x): |a_{i}| \le M \right\} $$

The dataset was generated by sampling $a_{i} \in [-M, M]$ uniformly at random. For each sampled input function, the ODE systems were solved with the Runge-Kutta method, and the finite difference method was used to compute reference solutions for the second-order PDE.
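The two input-function spaces can be sampled as in the sketch below (the length-scale $l$, bound $M$, number of terms $N$, sensor count, and the mapping of $[0,1]$ onto the Chebyshev domain $[-1,1]$ are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)                    # sensor locations

# Gaussian random field with kernel k_l(x1, x2) = exp(-||x1 - x2||^2 / 2l^2).
l = 0.2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l ** 2))
K += 1e-10 * np.eye(len(x))                       # small jitter for numerical stability
u_grf = rng.multivariate_normal(np.zeros(len(x)), K)     # one draw of u at the sensors

# Chebyshev space: u(x) = sum_{i<N} a_i T_i(x) with a_i drawn uniformly from [-M, M].
M, N = 1.0, 10
a = rng.uniform(-M, M, size=N)
u_cheb = np.polynomial.chebyshev.chebval(2 * x - 1, a)   # rescale [0,1] to [-1,1]
```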

3 Number of sensors for identifying nonlinear dynamic systems

This section discusses how many sensors are needed for DeepONet to identify nonlinear dynamic systems to within an arbitrary accuracy $\varepsilon$.

4 Simulation results

This section first confirms that DeepONet outperforms an FNN even for the simplest linear problem, and then presents results for three nonlinear ODE and PDE problems. For all problems, the optimizer is Adam with learning rate $0.001$, and unless explicitly mentioned otherwise, the network sizes are as shown in the table below.
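Tying the pieces together, a training step under these settings might look like the following sketch. It reuses the hypothetical `DeepONet` class sketched in Section 2.1 and uses placeholder tensors in place of the generated data, so it only illustrates the Adam setup described here.

```python
import torch

# Assumes the DeepONet class from the sketch in Section 2.1 is defined in scope,
# and that u_train (N, m), y_train (N, 1), s_train (N, 1) come from the data generation step.
model = DeepONet(m=100, p=10)
u_train = torch.randn(1000, 100)      # placeholder branch inputs
y_train = torch.rand(1000, 1)         # placeholder trunk inputs
s_train = torch.randn(1000, 1)        # placeholder targets G(u)(y)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, learning rate 0.001
loss_fn = torch.nn.MSELoss()

for step in range(10000):
    optimizer.zero_grad()
    pred = model(u_train, y_train)    # network prediction of G(u)(y)
    loss = loss_fn(pred, s_train)
    loss.backward()
    optimizer.step()
```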

Table 1 and 2

4.1 A simple 1D dynamic system

The one-dimensional dynamic system is expressed as follows.

$$ \begin{align*} \dfrac{ds(x)}{dx} &= g(s(x), u(x), x), \qquad x\in[0, 1] \\ s(0) &= 0 \end{align*} $$

The goal is to find the solution $s(x)$ on $[0,1]$ for any given $u$.

4.1.1 Linear case: $g(s(x), u(x), x) = u(x)$

First, let’s consider a very simple case.

$$ \begin{align*} \dfrac{ds(x)}{dx} &= u(x), \qquad x\in[0, 1] \\ s(0) &= 0 \end{align*} $$

In this case, the operator $G : u \mapsto s$ is the following indefinite integral operator.

$$ G : u(x) \mapsto s(x) = \int_{0}^{x} u(\tau)d\tau $$
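Built on this operator, one training example could be assembled as in the following sketch (the sensor grid and the particular $u$ are stand-ins; in the paper $u$ is drawn from a GRF or the Chebyshev space).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)                   # sensor locations
u = np.cos(2 * np.pi * x)                        # stand-in for a sampled input function

# s(x) = integral of u from 0 to x via the cumulative trapezoid rule (reference solution).
s = np.concatenate([[0.0], np.cumsum((u[1:] + u[:-1]) / 2 * np.diff(x))])

y = rng.uniform(0.0, 1.0)                        # query point; need not be a sensor
target = np.interp(y, x, s)                      # label G(u)(y)

example = (u, y, target)                         # (branch input, trunk input, label)
```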

For comparison, FNNs of varying depth and width were trained to learn $G$. Increasing the depth does not significantly affect performance, and while increasing the width reduces the training error, the generalization performance (test error) does not improve (Fig. 2).

Figure 2

In contrast, DeepONet shows little difference between training and test errors (Fig. 3A). Performance improves slightly when the bias $b_{0}$ is added. Moreover, the Unstacked DeepONet, though it has larger training errors, achieves lower test errors, which matter more. The Unstacked DeepONet is also faster and uses considerably less memory because it has fewer parameters.

Figure 3

4.1.2 Nonlinear case: $g(s(x), u(x), x) = -s^{2}(x) + u(x)$

In this case, the focus shifts to comparing the Unstacked and Stacked DeepONets. Observing the correlation between training and test errors shows that the correlation is stronger for the Unstacked DeepONet (Fig. 4A), and it remains stronger when tested with different learning rates and initializations (Fig. 4B).

Figure 4

4.2 Gravity pendulum with an external force

This subsection addresses the motion of a gravity pendulum driven by an external force, described as follows.

$$ \begin{align*} \dfrac{ds_{1}}{dt} &= s_{2} \\ \dfrac{ds_{2}}{dt} &= -k \sin s_{1} + u(t) \\ s_{1}(0) &= 0, \quad s_{2}(0) = 0 \end{align*} $$
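A reference trajectory for a given forcing $u$ can be generated with an off-the-shelf Runge-Kutta solver, as in this sketch (the value of $k$, the time interval, and the example forcing are assumptions; the paper samples $u$ from a GRF).

```python
import numpy as np
from scipy.integrate import solve_ivp

k = 1.0
t_grid = np.linspace(0.0, 1.0, 100)
u_vals = np.sin(2 * np.pi * t_grid)              # stand-in for a sampled forcing u(t)

def rhs(t, s):
    u_t = np.interp(t, t_grid, u_vals)           # evaluate the forcing at time t
    return [s[1], -k * np.sin(s[0]) + u_t]       # (ds1/dt, ds2/dt)

# RK45 (an explicit Runge-Kutta method) with zero initial conditions.
sol = solve_ivp(rhs, (0.0, 1.0), [0.0, 0.0], t_eval=t_grid, rtol=1e-8)
s1, s2 = sol.y                                   # s_1(t), s_2(t) on the time grid
```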

The remainder of the subsection demonstrates how well DeepONet works on this problem, discussing the number of sensors, error convergence, and so on.

4.3 Diffusion-reaction system with a source term

The following diffusion-reaction equation is addressed.

$$ \dfrac{\partial s}{\partial t} = D \dfrac{\partial^{2} s}{\partial x^{2}} + ks^{2} + u(x),\qquad x\in [0,1],\ t\in [0,1] $$

with zero initial/boundary conditions.

Unlike the previous examples, $u(x)$ has a one-dimensional variable, whereas $s(x, t)$ has a two-dimensional variable. It is shown that DeepONet also works well here. The training data for a single $u$ are as follows.

$$ \begin{align*} \big( (u, (x_{1}, t_{1})), s(x_{1}, t_{1}) \big) \\ \big( (u, (x_{2}, t_{2})), s(x_{2}, t_{2}) \big) \\ \vdots \\ \big( (u, (x_{p}, t_{p})), s(x_{p}, t_{p}) \big) \end{align*} $$

Here, $(u, (x_{i}, t_{i}))$ is the input to DeepONet and $s(x_{i}, t_{i})$ is the target output; concretely, $u$ (sampled at the sensors) is the input to the branch net and $(x_{i}, t_{i})$ is the input to the trunk net. Data of this form are generated and used for training for each different $u$.
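Assembling those triples for a single $u$ might look like the sketch below; the number of query points, the stand-in input function, and the placeholder reference solver are all assumptions (the paper obtains $s$ from a finite difference solver).

```python
import numpy as np

rng = np.random.default_rng(0)
m, num_points = 100, 100
sensors = np.linspace(0.0, 1.0, m)
u_sensors = np.sin(np.pi * sensors)              # stand-in for a sampled u at the sensors

def reference_solution(x, t):
    # placeholder for the finite-difference solution s(x, t) of the PDE
    return np.sin(np.pi * x) * np.exp(-t)

xt = rng.uniform(0.0, 1.0, size=(num_points, 2))         # random query points (x_i, t_i)
labels = reference_solution(xt[:, 0], xt[:, 1])          # s(x_i, t_i)

# Each row: branch input u_sensors, trunk input (x_i, t_i), target s(x_i, t_i).
dataset = [(u_sensors, xt[i], labels[i]) for i in range(num_points)]
```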

5 Conclusion

This paper proposes DeepONet, a method for learning nonlinear operators. DeepONet is composed of a branch net, which learns the coefficients, and a trunk net, which learns the basis. The body of the paper analyzes various factors affecting the test error (e.g., the number of sensors, the maximum prediction time, the complexity of the input function space, the size of the training dataset, and the network size), theoretically derives how the approximation error depends on these factors, and shows that the results align with the calculations.

However, there is still much to research regarding the theoretical analysis of DeepONet itself. While the paper uses only FNNs, future work could explore connections with CNNs, attention mechanisms, and other neural network architectures and techniques.