
Paper Review: DeepONet

Overview and Summary

  • This review follows the references, equation numbers, and notation of the paper as closely as possible.

For accessibility, this review is based on the version available on arXiv rather than the journal-published version. Although the problems covered in the experimental section differ slightly between the two versions, the focus here is not on the experimental results and performance but on the explanation of the DeepONet method itself.

DeepONet is a deep learning technique proposed for learning operators. An operator is a function that maps functions to functions (explained in detail in the main text). Specifically, for a function $u$, an operator $G$ is defined as follows.

$$ G : u \mapsto G(u) $$

Here, $u$ is also a function, and $Gu = G(u)$ is a function as well. The first key point is that "DeepONet learns operators," and the second is that it "approximates $Gu$ as a series." Given an appropriate function space $X$, let $\left\{ \phi_{k} \right\}$ be its basis. Then $Gu \in X$ can be expressed as follows.

$$ Gu = \sum_{k=1}^{\infty} c_{k}\phi_{k} $$

DeepONet learns $c_{k}$ and $\phi_{k}$; the part that learns the coefficients $c_{k}$ is called the branch network, and the part that learns the basis $\left\{ \phi_{k} \right\}$ is called the trunk network.


1 Introduction

The universal approximation theorem guarantees that neural networks can approximate arbitrary continuous functions. This provides a theoretical basis for the effectiveness of artificial neural networks and deep learning techniques, which have been successful across various fields. Even more surprisingly, artificial neural networks can also approximate nonlinear continuous functionals and (nonlinear) operators.

For readers unfamiliar with the mathematics, let's briefly explain functions, functionals, and operators. All three are fundamentally functions (maps that send each element of the domain to exactly one element of the codomain), but the terms functional and operator carry a slightly more specific meaning in this context. Typically, a function maps numbers (or vectors) to numbers (or vectors); polynomial functions, trigonometric functions, and other commonly encountered functions fall into this category.

$$ \text{function}: \mathbb{R}^{n} \to \mathbb{R}^{m} $$

A function that maps functions to numbers (scalars) is specifically called a functional. A concrete example is the definite integral. If we define a functional $I_{[a,b]}$ by $\displaystyle I_{[a,b]}(f) = \int_{a}^{b} f(x)dx$, then for each given $f$, this functional returns the area under the curve of $f$ over the interval $[a, b]$. If $X$ denotes an appropriate function space, a functional can be expressed as follows.

$$ \text{functional}: X \to \mathbb{R} $$

Operators map functions to functions. Examples include indefinite integrals and derivatives.

$$ \text{operator}: X \to X $$

For a function $f$, the operator $D$ defined by $D(f) = \dfrac{df}{dx}$ is the differential operator, which maps a given function to its derivative. Similarly, the operator $I$ defined by $\displaystyle I(f) = \int f(x) dx$ maps a given function to its indefinite integral.
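To make the distinction concrete, here is a minimal numerical sketch (not from the paper; the grid and the example function are arbitrary choices) showing a functional and two operators acting on a discretized function.

```python
import numpy as np

# Arbitrary example: f(x) = sin(2*pi*x) sampled on a grid over [0, 1].
x = np.linspace(0.0, 1.0, 101)
f = np.sin(2 * np.pi * x)

# Functional I_[0,1]: f -> integral of f over [0, 1], a single number (trapezoid rule).
definite_integral = float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(x)))

# Differential operator D: f -> df/dx, again a function (an array of samples).
Df = np.gradient(f, x)

# Integral operator: f -> cumulative integral of f from 0 to x, also a function on the grid.
If = np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) / 2 * np.diff(x))])

print(definite_integral)   # scalar output of the functional
print(Df.shape, If.shape)  # function-valued outputs of the operators
```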

From here on, the terms function, functional, and operator are used with the meanings given above. Before delving into the main discussion, let's introduce the notation used throughout the paper. $G$ denotes an operator whose variable is a function $u$.

$$ G : u \mapsto G(u) $$

Since $G$ is an operator, its value $G(u)$ is also a function, and the variable of $G(u)$ is denoted by $y$.

$$ G(u) : y \mapsto G(u)(y) $$

Therefore, both $y$ and $G(u)(y)$ are real numbers.

$$ y, G(u)(y) \in \mathbb{R} $$

The goal of this paper is to learn operators, and to that end, we consider a neural network that takes both $u$ and $y$ as inputs and outputs $G(u)(y)$.

$$ \text{network} : (u, y) \mapsto G(u)(y) $$

Theoretically, the operator $G$ takes the function $u$ itself as its variable, but computer simulation requires discretization, so a finite number of function values $u(x_{1})$, $u(x_{2})$, $\dots$, $u(x_{m})$ are used as inputs to the neural network. The locations $\{x_{1}, \dots, x_{m}\}$ are referred to as sensors in the paper. Thus, the proposed neural network has the following structure (Fig. 1A).
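For instance, fixing the sensor locations once and evaluating every input function there might look like the following sketch (the number of sensors and the example function are assumptions for illustration).

```python
import numpy as np

# m fixed sensor locations shared by every input function u.
m = 100
sensors = np.linspace(0.0, 1.0, m)

def encode(u):
    """Discretize a callable u into the finite vector [u(x_1), ..., u(x_m)]."""
    return u(sensors)

# The network never sees u itself, only its values at the sensors.
u_vec = encode(lambda x: np.sin(np.pi * x))   # shape (m,), fed to the network
```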

Figure 1A

Theorem 1 (Universal Approximation Theorem for Operator) Let $\sigma$ be a continuous non-polynomial function, $X$ a Banach space, and $K_{1} \subset X$, $K_{2} \subset \mathbb{R}^{d}$ compact sets. Let $V \subset C(K_{1})$ be a compact set, and $G : V \to C(K_{2})$ a nonlinear continuous operator.

Then, for any $\epsilon > 0$, there exist positive integers $n$, $p$, $m$, and constants $c_{i}^{k}$, $\xi_{ij}^{k}$, $\theta_{i}^{k}$, $\zeta_{k} \in \mathbb{R}$, $w_{k} \in \mathbb{R}^{d}$, $x_{j} \in K_{1}$ ($i=1,\dots,n$, $k = 1,\dots,p$, $j = 1,\dots,m$) such that the following holds.

$$ \left| G(u)(y) - \sum\limits_{k=1}^{p} \underbrace{\sum\limits_{i=1}^{n} c_{i}^{k}\sigma\left( \sum\limits_{j=1}^{m} \xi_{ij}^{k}u(x_{j}) + \theta_{i}^{k} \right)}_{branch} \underbrace{\sigma(w_{k} \cdot y + \zeta_{k})}_{trunk} \right| < \epsilon \quad \text{for all } u \in V, y \in K_{2} \tag{1} $$

In the paper, this approximation is divided into two parts, called the branch and the trunk.
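The structure of Eq. $(1)$ can be spelled out directly. The sketch below evaluates the branch-trunk sum for arbitrary (random) parameters, purely to show the shapes involved; the theorem asserts that some choice of these parameters makes the sum $\epsilon$-close to $G(u)(y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, d = 5, 4, 10, 1          # sizes from Theorem 1 (values chosen arbitrarily)
sigma = np.tanh                   # a continuous non-polynomial activation

u_sensors = rng.normal(size=m)    # [u(x_1), ..., u(x_m)]
y = rng.normal(size=d)

c = rng.normal(size=(p, n))       # c_i^k
xi = rng.normal(size=(p, n, m))   # xi_ij^k
theta = rng.normal(size=(p, n))   # theta_i^k
w = rng.normal(size=(p, d))       # w_k
zeta = rng.normal(size=p)         # zeta_k

branch = (c * sigma(xi @ u_sensors + theta)).sum(axis=1)   # p branch values
trunk = sigma(w @ y + zeta)                                # p trunk values
approximation = float(branch @ trunk)                      # plays the role of G(u)(y)
```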

Although the approximation theorem guarantees that neural networks can represent nonlinear operators, it says nothing about how to train them effectively. Likewise, even though the universal approximation theorem implies that an MLP can approximate any continuous function, CNNs and other architectures perform better on image-related tasks. A useful network should be easy to train and generalize well, and the authors aim to propose a new methodology that makes this possible for operator learning.

To demonstrate that the proposed method is suitable for learning nonlinear operators, they impose only very weak constraints on the data: the input functions $u_{i}$ must share the same sensors. These sensors need not lie on a uniform grid, and there are no constraints on the variable $y$. This condition is illustrated in Fig. 1B.

Figure 1B

The authors name the proposed architecture DeepONet (Deep Operator Network), which is composed of a branch net for the input function ($u(x_{1}), \dots, u(x_{m})$) and a trunk net for the output-function variable ($y$). Details are elaborated in Section 2.

The paper considers two types of operators, represented by ordinary differential equations (ODEs) and partial differential equations (PDEs).

2 Methodology

2.1 Deep operator networks (DeepONets)

The authors focus on operator learning in a general setting and impose only the constraint that the input functions $u$ share the same sensors. The inputs to the proposed neural network are divided into two parts, as seen in Fig. 1A: $[u(x_{1}), \dots, u(x_{m})]$ and $y$. There are no restrictions on the network architecture; the paper uses basic fully-connected neural networks (FNNs) to showcase the method's capabilities, noting that CNN or RNN architectures, or attention mechanisms, could be used instead if desired.

First, the trunk network takes $y$ as input and outputs $[t_{1}, t_{2}, \dots, t_{p}]^{T} \in \mathbb{R}^{p}$. Each of the $p$ branch networks takes $[u(x_{1}), \dots, u(x_{m})]$ as input and outputs a scalar $b_{k} \in \mathbb{R}$ ($k = 1,2,\dots,p$). These are combined as in Equation $(1)$ as follows.

$$ G(u)(y) \approx \sum_{k=1}^{p} b_{k}t_{k} = \sum_{k=1}^{p} b_{k}([u(x_{1}), u(x_{2}), \cdots, u(x_{m})])\, t_{k}(y) $$

It is worth noting that the activation function is applied even in the last layer of the trunk net. Although not explicitly evident in these equations, this approach can be viewed as approximating the function $Gu = G(u)$ as a series. Given an appropriate function space $X$ with basis $\left\{ \phi_{k} \right\}$, it can be expressed as follows.

$$ Gu = \sum_{k=1}^{\infty} b_{k}\phi_{k} $$

In other words, interpreting $t_{k} = \phi_{k}(y)$ as the basis and $b_{k}$ as the coefficients of the series, DeepONet approximates $Gu$ by decomposing it into a series instead of approximating it directly. Although Theorem 1 does not require it, adding a bias (constant term) as shown below improves generalization performance.

$$ G(u)(y) \approx \sum_{k=1}^{p} b_{k}t_{k} + b_{0} $$

In practice, $p$ should be at least 10, but a larger $p$ increases the computational cost. The paper therefore distinguishes the Stacked DeepONet, which employs a separate branch network for each $b_{k}$ (Fig. 1C), from the Unstacked DeepONet, in which a single branch network learns all of the $b_{k}$ (Fig. 1D). All code related to DeepONet can be found at https://github.com/lululxvi/deepxde, although it can be challenging to locate the relevant components among the author's other works featured there.
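As a concrete reference, here is a minimal unstacked DeepONet sketch in PyTorch; this is an illustration by the reviewer with assumed layer sizes and activations, not the authors' reference implementation (for that, see the deepxde repository linked above).

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal unstacked DeepONet: one branch net produces all p coefficients b_k."""

    def __init__(self, m: int, p: int, width: int = 40):
        super().__init__()
        # Branch net: [u(x_1), ..., u(x_m)] -> (b_1, ..., b_p)
        self.branch = nn.Sequential(
            nn.Linear(m, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p),
        )
        # Trunk net: y -> (t_1, ..., t_p); note the activation after the last layer.
        self.trunk = nn.Sequential(
            nn.Linear(1, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p), nn.ReLU(),
        )
        self.b0 = nn.Parameter(torch.zeros(1))   # the extra bias b_0

    def forward(self, u_sensors: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        b = self.branch(u_sensors)               # (batch, p)
        t = self.trunk(y)                        # (batch, p)
        return (b * t).sum(dim=-1, keepdim=True) + self.b0   # approximates G(u)(y)

model = DeepONet(m=100, p=10)
out = model(torch.randn(32, 100), torch.rand(32, 1))   # output shape (32, 1)
```

A stacked DeepONet would replace the single branch net with $p$ independent branch nets, each producing one $b_{k}$.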

Figure 1C and 1D

2.2 Data generation

The paper samples input functions from two function spaces: a Gaussian random field (GRF) and an orthogonal polynomial space. The authors used a GRF with mean $0$.

$$ u \sim \mathcal{G}(0, k_{l}(x_{1}, x_{2})) $$

Here, $k_{l}(x_{1}, x_{2}) = \exp (- \| x_{1} - x_{2} \|^{2} / 2l^{2})$ is the covariance kernel with length-scale $l$. For the orthogonal polynomial space, Chebyshev polynomials are chosen: let $M > 0$ and let $T_{i}$ denote the Chebyshev polynomials of the first kind.

$$ V_{\text{poly}} = \left\{ \sum\limits_{i=0}^{N-1} a_{i} T_{i}(x): |a_{i}| \le M \right\} $$

The dataset was generated by sampling $a_{i} \in [-M, M]$ uniformly at random. For each sampled input function, the ODE systems were solved with the Runge-Kutta method, and the finite difference method was used to compute reference solutions for the second-order PDE.
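The two input-function spaces can be sampled as in the sketch below (the length-scale $l$, bound $M$, number of terms $N$, sensor count, and the mapping of $[0,1]$ onto the Chebyshev domain $[-1,1]$ are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)                    # sensor locations

# Gaussian random field with kernel k_l(x1, x2) = exp(-||x1 - x2||^2 / 2l^2).
l = 0.2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l ** 2))
K += 1e-10 * np.eye(len(x))                       # small jitter for numerical stability
u_grf = rng.multivariate_normal(np.zeros(len(x)), K)     # one draw of u at the sensors

# Chebyshev space: u(x) = sum_{i<N} a_i T_i(x) with a_i drawn uniformly from [-M, M].
M, N = 1.0, 10
a = rng.uniform(-M, M, size=N)
u_cheb = np.polynomial.chebyshev.chebval(2 * x - 1, a)   # rescale [0,1] to [-1,1]
```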

3 Number of sensors for identifying nonlinear dynamic systems

This section discusses how many sensors are needed for DeepONet to identify nonlinear dynamic systems to within an arbitrary accuracy $\varepsilon$.

4 Simulation results

This section first confirms that DeepONet outperforms an FNN even for the simplest linear problem, and then presents results for three nonlinear ODE and PDE problems. For all problems, the optimizer is Adam with learning rate $0.001$, and unless explicitly mentioned otherwise, the network sizes are as shown in the table below.
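Tying the pieces together, a training step under these settings might look like the following sketch. It reuses the hypothetical `DeepONet` class sketched in Section 2.1 and uses placeholder tensors in place of the generated data, so it only illustrates the Adam setup described here.

```python
import torch

# Assumes the DeepONet class from the sketch in Section 2.1 is defined in scope,
# and that u_train (N, m), y_train (N, 1), s_train (N, 1) come from the data generation step.
model = DeepONet(m=100, p=10)
u_train = torch.randn(1000, 100)      # placeholder branch inputs
y_train = torch.rand(1000, 1)         # placeholder trunk inputs
s_train = torch.randn(1000, 1)        # placeholder targets G(u)(y)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, learning rate 0.001
loss_fn = torch.nn.MSELoss()

for step in range(10000):
    optimizer.zero_grad()
    pred = model(u_train, y_train)    # network prediction of G(u)(y)
    loss = loss_fn(pred, s_train)
    loss.backward()
    optimizer.step()
```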

Table 1 and 2

4.1 A simple 1D dynamic system

The one-dimensional dynamic system is expressed as follows.

$$ \begin{align*} \dfrac{ds(x)}{dx} &= g(s(x), u(x), x), \qquad x\in[0, 1] \\ s(0) &= 0 \end{align*} $$

The goal is to find the solution $s(x)$ on $[0,1]$ for any given $u$.

4.1.1 Linear case: $g(s(x), u(x), x) = u(x)$

First, let’s consider a very simple case.

$$ \begin{align*} \dfrac{ds(x)}{dx} &= u(x), \qquad x\in[0, 1] \\ s(0) &= 0 \end{align*} $$

In this case, the operator $G : u \mapsto s$ is the following indefinite integral operator.

$$ G : u(x) \mapsto s(x) = \int_{0}^{x} u(\tau)d\tau $$
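Built on this operator, one training example could be assembled as in the following sketch (the sensor grid and the particular $u$ are stand-ins; in the paper $u$ is drawn from a GRF or the Chebyshev space).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)                   # sensor locations
u = np.cos(2 * np.pi * x)                        # stand-in for a sampled input function

# s(x) = integral of u from 0 to x via the cumulative trapezoid rule (reference solution).
s = np.concatenate([[0.0], np.cumsum((u[1:] + u[:-1]) / 2 * np.diff(x))])

y = rng.uniform(0.0, 1.0)                        # query point; need not be a sensor
target = np.interp(y, x, s)                      # label G(u)(y)

example = (u, y, target)                         # (branch input, trunk input, label)
```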

For comparison, FNNs of varying depth and width were trained to learn $G$. Increasing the depth does not significantly affect performance, and while increasing the width reduces the training error, the generalization performance (test error) does not improve (Fig. 2).

Figure 2

In contrast, DeepONet shows little difference between training and test errors (Fig. 3A). Performance improves slightly when the bias $b_{0}$ is added. Moreover, the Unstacked DeepONet, though it has larger training errors, achieves lower test errors, which matter more. The Unstacked DeepONet is also faster and uses considerably less memory because it has fewer parameters.

Figure 3

4.1.2 Nonlinear case: $g(s(x), u(x), x) = -s^{2}(x) + u(x)$

In this case, the focus shifts to comparing the Unstacked and Stacked DeepONets. Observing the correlation between training and test errors shows that the correlation is stronger for the Unstacked DeepONet (Fig. 4A), and it remains stronger when tested with different learning rates and initializations (Fig. 4B).

Figure 4

4.2 Gravity pendulum with an external force

This subsection addresses the motion of a gravity pendulum driven by an external force, described as follows.

$$ \begin{align*} \dfrac{ds_{1}}{dt} &= s_{2} \\ \dfrac{ds_{2}}{dt} &= -k \sin s_{1} + u(t) \\ s_{1}(0) &= 0, \quad s_{2}(0) = 0 \end{align*} $$
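A reference trajectory for a given forcing $u$ can be generated with an off-the-shelf Runge-Kutta solver, as in this sketch (the value of $k$, the time interval, and the example forcing are assumptions; the paper samples $u$ from a GRF).

```python
import numpy as np
from scipy.integrate import solve_ivp

k = 1.0
t_grid = np.linspace(0.0, 1.0, 100)
u_vals = np.sin(2 * np.pi * t_grid)              # stand-in for a sampled forcing u(t)

def rhs(t, s):
    u_t = np.interp(t, t_grid, u_vals)           # evaluate the forcing at time t
    return [s[1], -k * np.sin(s[0]) + u_t]       # (ds1/dt, ds2/dt)

# RK45 (an explicit Runge-Kutta method) with zero initial conditions.
sol = solve_ivp(rhs, (0.0, 1.0), [0.0, 0.0], t_eval=t_grid, rtol=1e-8)
s1, s2 = sol.y                                   # s_1(t), s_2(t) on the time grid
```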

The remainder of the subsection demonstrates how well DeepONet works on this problem, discussing the number of sensors, error convergence, and so on.

4.3 Diffusion-reaction system with a source term

The following diffusion-reaction equation is addressed.

$$ \dfrac{\partial s}{\partial t} = D \dfrac{\partial^{2} s}{\partial x^{2}} + ks^{2} + u(x),\qquad x\in [0,1],\ t\in [0,1] $$

with zero initial/boundary conditions.

Unlike the previous examples, $u(x)$ has a one-dimensional variable, whereas $s(x, t)$ has a two-dimensional variable. It is shown that DeepONet also works well here. The training data for a single $u$ are as follows.

$$ \begin{align*} \big( (u, (x_{1}, t_{1})), s(x_{1}, t_{1}) \big) \\ \big( (u, (x_{2}, t_{2})), s(x_{2}, t_{2}) \big) \\ \vdots \\ \big( (u, (x_{p}, t_{p})), s(x_{p}, t_{p}) \big) \end{align*} $$

Here, $(u, (x_{i}, t_{i}))$ is the input to DeepONet and $s(x_{i}, t_{i})$ is the target output; concretely, $u$ (sampled at the sensors) is the input to the branch net and $(x_{i}, t_{i})$ is the input to the trunk net. Data of this form are generated and used for training for each different $u$.
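Assembling those triples for a single $u$ might look like the sketch below; the number of query points, the stand-in input function, and the placeholder reference solver are all assumptions (the paper obtains $s$ from a finite difference solver).

```python
import numpy as np

rng = np.random.default_rng(0)
m, num_points = 100, 100
sensors = np.linspace(0.0, 1.0, m)
u_sensors = np.sin(np.pi * sensors)              # stand-in for a sampled u at the sensors

def reference_solution(x, t):
    # placeholder for the finite-difference solution s(x, t) of the PDE
    return np.sin(np.pi * x) * np.exp(-t)

xt = rng.uniform(0.0, 1.0, size=(num_points, 2))         # random query points (x_i, t_i)
labels = reference_solution(xt[:, 0], xt[:, 1])          # s(x_i, t_i)

# Each row: branch input u_sensors, trunk input (x_i, t_i), target s(x_i, t_i).
dataset = [(u_sensors, xt[i], labels[i]) for i in range(num_points)]
```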

5 Conclusion

This paper proposes DeepONet, a method for learning nonlinear operators. DeepONet is composed of a branch net, which learns the coefficients, and a trunk net, which learns the basis. The body of the paper analyzes various factors affecting the test error (e.g., the number of sensors, the maximum prediction time, the complexity of the input function space, the size of the training dataset, and the network size), theoretically derives how the approximation error depends on these factors, and shows that the results align with the calculations.

However, there is still much to research regarding the theoretical analysis of DeepONet itself. While the paper uses only FNNs, future work could explore connections with CNNs, attention mechanisms, and other neural network architectures and techniques.