Mathematical Foundations of Deep Learning, Proof of the Universal Approximation Theorem

Theorem

If $\sigma$ is a continuous sigmoidal function, then $$ S := \left\{ G(x) = \sum_{k=1}^{N} \alpha_{k} \sigma \left( y_{k}^{T} x+ \theta_{k} \right) : y_{k} \in \mathbb{R}^{n} \land \alpha_{k} , \theta_{k} \in \mathbb{R} \land N \in \mathbb{N} \right\} $$ is uniformly dense in $C\left( I_{n} \right)$. In other words, for every $f \in C \left( I_{n} \right)$ and $\varepsilon > 0$, there exists a $G \in S$ that satisfies the following. $$ \left\| G - f \right\| < \varepsilon $$


  • A function $\sigma : \mathbb{R} \to \mathbb{R}$ that satisfies the following is called a sigmoidal function. $$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$
  • $I_{n} := [0,1]^{n}$ is the $n$-dimensional unit cube, which is the Cartesian product of $n$ closed unit intervals $[0,1]$.
  • $C \left( I_{n} \right)$ is the space of continuous functions on $I_{n}$, equipped with the uniform norm $\left\| \cdot \right\|$.
  • $y^{T}$ is the transpose of $y$, so $y^{T} x$ is the same as the inner product $\left< x, y \right>$ of $x$ and $y$.
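
To make the object in the theorem concrete, here is a minimal sketch in Python (NumPy) of a single element $G \in S$. It is only an illustration: the logistic function is used as one convenient continuous sigmoidal $\sigma$, and the parameters $\alpha_{k}, y_{k}, \theta_{k}$ are drawn at random, since the theorem asserts that good parameters exist but says nothing about how to find them.

```python
import numpy as np

# A one-hidden-layer network with N hidden units and a linear output:
# G(x) = sum_k alpha_k * sigma(y_k^T x + theta_k).

def sigma(t):
    """Logistic function: continuous, tends to 1 as t -> +inf and to 0 as t -> -inf."""
    return 1.0 / (1.0 + np.exp(-t))

def G(x, alpha, Y, theta):
    """Evaluate one element of S.

    x     : point of I_n, shape (n,)
    alpha : outer coefficients alpha_k, shape (N,)
    Y     : weight vectors y_k as rows, shape (N, n)
    theta : biases theta_k, shape (N,)
    """
    return np.dot(alpha, sigma(Y @ x + theta))

# Illustrative parameters only; the theorem guarantees that *some* choice works.
n, N = 2, 3
rng = np.random.default_rng(0)
alpha = rng.normal(size=N)
Y = rng.normal(size=(N, n))
theta = rng.normal(size=N)

x = np.array([0.3, 0.7])         # a point of the unit square I_2
print(G(x, alpha, Y, theta))     # the value G(x) of this particular element of S
```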

Why Does Deep Learning Work?

The Cybenko theorem, also known as the Universal Approximation Theorem, is a very important result that theoretically underpins why deep learning works, in the form of artificial neural networks with a single hidden layer. Cybenko proved it in 1989, and it only indirectly supports the many deep learning techniques achieving state-of-the-art (SOTA) results in the 2020s; in other words, despite the incredible performance of current deep learning techniques, their theoretical foundation is weak.

In the formula, $x \in I_{n}$ is the input data, $y \in \mathbb{R}^{n}$ is the weight vector, and $\theta \in \mathbb{R}$ is the bias, so the sigmoidal function $\sigma$ can be regarded as an activation function. The idea behind an activation function is to imitate a threshold: whether a signal is passed on or not is a kind of distinction, a discrimination, and this is precisely why discriminant functions from a measure-theoretic perspective enter the discussion. Simply put, Cybenko's theorem states that any desired $f$ can be approximated by a finite linear combination of discriminant functions with appropriate weights and biases.
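
As a rough numerical illustration of this reading (and not Cybenko's argument, which is non-constructive), the sketch below fixes random weights $y_{k}$ and biases $\theta_{k}$ on $I_{1} = [0,1]$ and fits only the outer coefficients $\alpha_{k}$ by least squares; the target function, the number of terms $N$, and the sampling ranges are all arbitrary choices for the demo.

```python
import numpy as np

def sigma(t):
    """Logistic function: a continuous sigmoidal function."""
    return 1.0 / (1.0 + np.exp(-t))

# Target f in C(I_1); chosen arbitrarily for the demo.
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

N = 50                                    # number of terms in the sum
rng = np.random.default_rng(1)
y = rng.uniform(-20, 20, size=N)          # random weights y_k
theta = rng.uniform(-20, 20, size=N)      # random biases theta_k

x = np.linspace(0, 1, 500)                # grid on I_1 = [0, 1]
Phi = sigma(np.outer(x, y) + theta)       # Phi[i, k] = sigma(y_k * x_i + theta_k)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)   # fit the coefficients alpha_k

G = Phi @ alpha                           # G(x_i) = sum_k alpha_k sigma(y_k x_i + theta_k)
print(np.max(np.abs(G - f(x))))           # sup-norm error on the grid
```

The printed number is the sup-norm error on the grid; increasing $N$ (and choosing the parameters better) drives it down, which is the qualitative content of the theorem.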

In the context of machine learning, $f$ is the function we are really after: the one that does what we want, whether that is telling dogs from cats in photos, translating Korean text into English, or something far more complex and interesting.

Cybenko's paper proves several further theorems, including versions aimed at classification problems and versions that allow the function being approximated to be discontinuous. We will not cover them here, but once the proof of the Cybenko theorem presented in this post is understood, those results should not be hard to follow.

Proof 1

Strategy: One must first be familiar with the proof of Auxiliary Lemma 1 and the proof of Auxiliary Lemma 2, both of which are long and demanding. They require a background in functional analysis and measure theory, including the theory of Banach spaces, and are difficult even for mathematics undergraduates. Once the lemmas are in hand, the Cybenko theorem simply combines them: a continuous sigmoidal function is a good discriminant function in the measure-theoretic sense, and linear combinations of discriminant functions can approximate any $f \in C \left( I_{n} \right)$ in the uniform norm.
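
For orientation, here is a compressed sketch of how the two lemmas fit together, roughly following the argument of Cybenko's paper; the details are in the linked proofs. Suppose, for contradiction, that the closure $\overline{S}$ were a proper closed subspace of $C \left( I_{n} \right)$. By the Hahn-Banach theorem there would be a bounded linear functional $L \ne 0$ on $C \left( I_{n} \right)$ vanishing on $\overline{S}$, and by the Riesz representation theorem $$ L (h) = \int_{I_{n}} h (x) d \mu (x) , \qquad h \in C \left( I_{n} \right) $$ for some finite signed regular Borel measure $\mu$. Every $\sigma \left( y^{T} x + \theta \right)$ belongs to $S$, so $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \qquad \text{for all } y \in \mathbb{R}^{n} ,\ \theta \in \mathbb{R}. $$ Since a continuous sigmoidal function is a discriminant function, meaning that these integrals can all vanish only when $\mu = 0$, we get $\mu = 0$ and hence $L = 0$, a contradiction. Therefore $\overline{S} = C \left( I_{n} \right)$, which is exactly the claimed uniform density.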


  1. G. Cybenko. (1989). Approximation by Superpositions of a Sigmoidal Function p6. ↩︎