The Mathematical Basis of Deep Learning: Proof of Cybenko's Theorem
Theorem
Let $\sigma$ be a continuous sigmoidal function. Then $$ S := \left\{ G(x) = \sum_{k=1}^{N} \alpha_{k} \sigma \left( y_{k}^{T} x+ \theta_{k} \right) : y_{k} \in \mathbb{R}^{n} \land \alpha_{k} , \theta_{k} \in \mathbb{R} \land N \in \mathbb{N} \right\} $$ is uniformly dense in $C\left( I_{n} \right)$. In other words, for every $f \in C \left( I_{n} \right)$ and $\varepsilon > 0$, there exists a $G \in S$ satisfying the following. $$ \left\| G - f \right\| < \varepsilon $$
- A function that satisfies the following is called a sigmoidal function: $$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$
- $I_{n} := [0,1]^{n}$ is the $n$-dimensional unit cube, the Cartesian product of $n$ copies of the closed unit interval $[0,1]$.
- $C \left( I_{n} \right)$ is the space of continuous functions on $I_{n}$, and $\left\| \cdot \right\|$ denotes the supremum norm on that space.
- $y^{T}$ is the transpose of $y$, and $y^{T} x$ is the inner product of $x$ and $y$, equal to $\left< x, y \right>$.
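- For example, the standard logistic function is a continuous sigmoidal function, since it is continuous and tends to $1$ as $t \to +\infty$ and to $0$ as $t \to -\infty$: $$ \sigma (t) = \frac{1}{1 + e^{-t}} $$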
Why Deep Learning Works
Cybenko’s theorem, commonly known as the universal approximation theorem, provides a theoretical foundation for why deep learning, specifically artificial neural networks with a single hidden layer, works. Proven by Cybenko in 1989, this theorem indirectly supports the many deep learning techniques that have reached state-of-the-art (SOTA) performance in the 2020s. It also implies that, although deep learning methods achieve remarkable results, their theoretical basis is still somewhat weak.
In the equations, $x \in I_{n}$ is the input data, $y \in \mathbb{R}^{n}$ are the weights, and $\theta \in \mathbb{R}$ is the bias, so the sigmoidal function $\sigma$ can be viewed as an activation function. Since activation functions mimic thresholds, the transmission or non-transmission of a signal can be seen as a kind of discrimination, a notion that matters in measure theory. In layman’s terms, Cybenko’s theorem says that for whatever $f$ we want, we can find appropriate weights and biases so that a finite linear combination of discriminatory functions approximates it as closely as we like (see the numerical sketch after the note below).
- This is similar to the relationship between Support Vector Machines and the representer theorem.
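As an informal numerical sketch of this idea (not part of Cybenko's construction), the following Python snippet builds a $G \in S$ of the form $\sum_{k} \alpha_{k} \sigma \left( y_{k} x + \theta_{k} \right)$ on $[0,1]$ using the logistic sigmoid. The target $f(x) = \sin(2\pi x)$, the number of terms $N$, and the least-squares fit of the $\alpha_{k}$ are arbitrary choices made only for illustration.

```python
import numpy as np

# Logistic sigmoid: a continuous sigmoidal function in Cybenko's sense.
def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

# Target continuous function on I_1 = [0, 1] (chosen only for illustration).
def f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
N = 50                               # number of terms in the finite sum
x = np.linspace(0.0, 1.0, 500)       # grid on the unit interval

# Fix weights y_k and biases theta_k so each sigmoid transitions somewhere
# inside [0, 1], then solve for the alpha_k by least squares -- one simple
# way to exhibit a G in the set S.
y = rng.normal(scale=30.0, size=N)
theta = -y * rng.uniform(0.0, 1.0, size=N)
Phi = sigma(np.outer(x, y) + theta)  # Phi[i, k] = sigma(y_k * x_i + theta_k)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)

G = Phi @ alpha                      # G(x) = sum_k alpha_k sigma(y_k x + theta_k)
print("sup-norm error:", np.max(np.abs(G - f(x))))
```

Note that the theorem only guarantees that some choice of $N$, $\alpha_{k}$, $y_{k}$, $\theta_{k}$ achieves any desired accuracy; it says nothing about how to find them, which is what training algorithms attempt in practice.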
In the context of machine learning, $f$ represents the function we truly want to find, capable of performing tasks like distinguishing between a dog and a cat in an image or translating Korean text to English, or even more complex and fascinating functions.
Cybenko’s paper further proves a few additional theorems, such as allowing discontinuities in classification problems or approximation functions. While these won’t be detailed here, understanding the proof of Cybenko’s theorem introduced in this post will make understanding these results straightforward as well.
Proof 1
Strategy: First, familiarize yourself with the proofs of Lemma 1 and Lemma 2, which are long and complex. These proofs require prior knowledge of functional analysis, measure theory, and Banach spaces, making them challenging even for undergraduate math majors. Once these lemmas are proven, Cybenko’s theorem itself is merely a combination of the two: on the measure-theoretic side, continuous sigmoidal functions are good discriminatory functions, and on the functional-analytic side, linear combinations of discriminatory functions can approximate any $f \in C \left( I_{n} \right)$.
Lemma 1: Discrimination of Sigmoidal Functions. A bounded measurable sigmoidal function is a discriminatory function.
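Here, following Cybenko's paper, a function $\sigma$ is called discriminatory if the only finite signed regular Borel measure $\mu$ on $I_{n}$ satisfying the following for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$ is $\mu = 0$: $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$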
A continuous sigmoidal function is measurable, and since it is continuous with finite limits at $\pm \infty$, it is also bounded; hence, by Lemma 1, it is a discriminatory function.
Lemma 2: Uniform Density of Discriminatory Functions. If $\sigma$ is a continuous discriminatory function, $$ \displaystyle S := \left\{ G(x) = \sum_{k=1}^{N} \alpha_{k} \sigma \left( y_{k}^{T} x+ \theta_{k} \right) : y_{k} \in \mathbb{R}^{n} \land \alpha_{k} , \theta_{k} \in \mathbb{R} \land N \in \mathbb{N} \right\} $$ is uniformly dense in $C\left( I_{n} \right)$. In other words, for all $f \in C \left( I_{n} \right)$ and $\varepsilon > 0$, there exists a $G \in S$ that satisfies the following. $$ \left\| G - f \right\| < \varepsilon $$
By Lemma 1, a continuous sigmoidal function is a discriminatory function, so by Lemma 2 the set $S$ is uniformly dense in $C\left( I_{n} \right)$, that is, $\overline{S} = C\left( I_{n} \right)$.
■
G. Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function, p. 6. ↩︎