What is a Sigmoid Function?
Definition
A function $\sigma : \mathbb{R} \to \mathbb{R}$ that satisfies the following is called a sigmoidal function. $$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$
Explanation of the definition
In the definition of a sigmoidal function, the specific limits $0$ and $1$ are not particularly important; what matters is that the function converges to constants as the argument goes to positive or negative infinity. Nor does it matter what values the function takes on any bounded region. In this sense, the logistic function, as well as the S-shaped curves usually called sigmoid functions (rescaled to have these limits where necessary), are all sigmoidal functions.
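As a rough numerical illustration of the definition, the following Python sketch evaluates three candidate functions at large positive and negative arguments. The names `logistic`, `hard_sigmoid`, and `scaled_tanh` are chosen here for illustration and do not come from the text; evaluating at finitely many points only suggests the limits, it does not prove them.

```python
import math

def logistic(t: float) -> float:
    """Logistic function 1 / (1 + e^{-t}), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def hard_sigmoid(t: float) -> float:
    """Piecewise-linear ramp (illustrative): 0 for t <= -1, 1 for t >= 1, linear in between."""
    return min(1.0, max(0.0, 0.5 * (t + 1.0)))

def scaled_tanh(t: float) -> float:
    """(1 + tanh t) / 2, a rescaling of tanh so that the limits become 0 and 1."""
    return 0.5 * (1.0 + math.tanh(t))

# All three tend to 0 on the left and to 1 on the right,
# although they differ in between.
for name, f in [("logistic", logistic),
                ("hard_sigmoid", hard_sigmoid),
                ("scaled_tanh", scaled_tanh)]:
    print(name, [f(t) for t in (-50.0, -5.0, 0.0, 5.0, 50.0)])
```

The three functions disagree on bounded regions, yet each is sigmoidal in the sense of the definition, which is exactly the point of the explanation above.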
The most prominent use of sigmoidal functions is in deep learning. Some readers might undervalue sigmoidal functions because, since the late 2010s, activation functions like ReLU or Mish have shown superior performance. However, sigmoidal functions remain mathematically and computationally important: the mere property of converging to $0$ and $1$ is enough to theoretically guarantee that the technique called artificial neural networks can work. The study of sigmoidal functions leads to the great result known as the universal approximation theorem, which is undoubtedly very important.
Theorem
Discriminatory property of sigmoidal functions: A bounded measurable sigmoidal function is a discriminatory function.
Explanation of the theorem
It is important that the sigmoidal functions we commonly use have the discriminatory property because the concept of a discriminatory function is somewhat more difficult than its practical use might suggest. However, by the above theorem, even without understanding discriminatory functions in depth, we can obtain concrete examples that can be implemented in practice through programming.
Proof 1
Definition of a discriminatory function: A function $\sigma : \mathbb{R} \to \mathbb{R}$ is called a discriminatory function if every $\mu \in M \left( I_{n} \right)$ for which the integral below vanishes for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$ must be the zero measure. $$ \left[ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \quad \forall y \in \mathbb{R}^{n}, \forall \theta \in \mathbb{R} \right] \implies \mu =0 $$
- $M \left( I_{n} \right)$ is the set of signed finite regular Borel measures defined on $I_{n} := [0,1]^{n}$.
Claim
According to the definition of discriminatory functions, it suffices to show that if the sigmoidal function $\sigma$ satisfies $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$ for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$, then $\mu =0$.
Part 1. Definitions of $\sigma_{{\lambda}}$ and $\gamma$
For arbitrary $x \in I_{n}$, $y \in \mathbb{R}^{n}$ and $\theta , \varphi \in \mathbb{R}$, define the function $\sigma_{\lambda} : I_{n} \to \mathbb{R}$ as follows. $$ \sigma_{\lambda} (x) := \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) $$ Then, as $\lambda$ varies, $$ \sigma_{\lambda} \begin{cases} \to 1 & \text{for } y^{T} x + \theta > 0 & \text{as } \lambda \to \infty \\ \to 0 & \text{for } y^{T} x + \theta < 0 & \text{as } \lambda \to \infty \\ = \sigma \left( \varphi \right) & \text{for } y^{T} x + \theta = 0 & \text{for all } \lambda \in \mathbb{R} \end{cases} $$ so if we define a new function $\gamma : I_{n} \to \mathbb{R}$ by $$ \gamma (x) := \begin{cases} 1 & \text{for } y^{T} x + \theta > 0 \\ 0 & \text{for } y^{T} x + \theta < 0 \\ \sigma \left( \varphi \right) & \text{for } y^{T} x + \theta = 0 \end{cases} $$ then when $\lambda \to \infty$, $\sigma_{\lambda}$ converges to $\gamma$ in the sense of pointwise convergence.
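The pointwise convergence of $\sigma_{\lambda}$ to $\gamma$ can be observed numerically. The sketch below assumes the logistic function as one concrete choice of $\sigma$ and picks $y$, $\theta$, $\varphi$, and three sample points in $I_{2}$ arbitrarily; none of these values come from the text.

```python
import math

def logistic(t: float) -> float:
    """One concrete sigmoidal function (an assumption made for this sketch)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def affine(x, y, theta):
    """Compute y^T x + theta."""
    return sum(yi * xi for yi, xi in zip(y, x)) + theta

def sigma_lambda(x, y, theta, phi, lam):
    """sigma(lambda * (y^T x + theta) + phi), as in Part 1."""
    return logistic(lam * affine(x, y, theta) + phi)

def gamma(x, y, theta, phi):
    """The pointwise limit: 1 above the hyperplane, 0 below, sigma(phi) on it."""
    s = affine(x, y, theta)
    if s > 0:
        return 1.0
    if s < 0:
        return 0.0
    return logistic(phi)

y, theta, phi = (1.0, -1.0), 0.0, 0.7
points = [(0.75, 0.25), (0.25, 0.75), (0.5, 0.5)]  # above, below, on the hyperplane

for x in points:
    values = [sigma_lambda(x, y, theta, phi, lam) for lam in (1.0, 10.0, 100.0, 1000.0)]
    print(x, values, "limit:", gamma(x, y, theta, phi))
```

On the hyperplane itself the value stays at $\sigma(\varphi)$ for every $\lambda$, which is why $\gamma$ needs the third case.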
Part 2. When $\mu$ is not a signed measure
Now partition the whole space into three parts with respect to the hyperplane $H^{0}_{y, \theta}$ as follows, where for a set $H \subset \mathbb{R}^{n}$ the value $\mu (H)$ is understood as $\mu \left( H \cap I_{n} \right)$. $$ H^{+}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta > 0 \right\} \\ H^{0}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta = 0 \right\} \\ H^{-}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta < 0 \right\} $$ Since the hypothesis holds for all $y \in \mathbb{R}^{n}$ and all $\theta \in \mathbb{R}$, it also holds with $(y, \theta)$ replaced by $\left( \lambda y, \lambda \theta + \varphi \right)$, so the integral of $\sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right)$ vanishes just as the integral of $\sigma \left( y^{T} x + \theta \right)$ does. Moreover, because $\sigma$ is bounded and $\mu$ is finite, the dominated convergence theorem lets us interchange $\displaystyle \lim_{\lambda \to \infty} \cdot$ and $\displaystyle \int_{I_{n}} \cdot d \mu$, so for all $\varphi \in \mathbb{R}$ $$ \begin{align*} 0 =& \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) \\ =& \lim_{\lambda \to \infty} \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) \\ =& \lim_{\lambda \to \infty} \int_{I_{n}} \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) d \mu (x) \\ =& \int_{I_{n}} \lim_{\lambda \to \infty} \sigma_{\lambda} \left( x \right) d \mu (x) \\ =& \int_{I_{n}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{0}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{-}_{y, \theta}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} 1 d \mu (x) + \int_{H^{0}_{y, \theta}} \sigma (\varphi) d \mu (x) + \int_{H^{-}_{y, \theta}} 0 d \mu (x) \\ =& \mu \left( H^{+}_{y, \theta} \right) + \sigma (\varphi) \mu \left( H^{0}_{y, \theta} \right) \end{align*} $$ If $\mu$ is not a signed measure, that is, if $\mu \ge 0$, choose $\varphi$ large enough that $\sigma (\varphi) > 0$. Both terms on the right-hand side are then nonnegative and sum to $0$, so $\mu \left( H^{+}_{y, \theta} \right) = 0$ and $\mu \left(H^{0}_{y, \theta} \right) = 0$.
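This holds for all $y,\theta$, hence $\mu \left( H^{+}_{y, \theta} \right) = 0$ always. In particular, $y = 0$ and $\theta = 1$ give $H^{+}_{y, \theta} = \mathbb{R}^{n}$, so $\mu \left( I_{n} \right) = 0$, which for a nonnegative measure means $\mu = 0$; the sigmoidal function $\sigma$ is therefore discriminatory in this case. However, if $\mu$ is a signed measure, positive and negative parts can cancel, so neither $\mu \left(H^{0}_{y, \theta}\right) = 0$ nor $\mu = 0$ is guaranteed by this sign argument, and a separate argument is required. If one is willing to sacrifice some mathematical generality and only cares about applications to artificial neural networks, one can stop the proof here without major issues.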
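The limit identity used in Part 2, namely that the integral of $\sigma_{\lambda}$ tends to $\mu \left( H^{+}_{y, \theta} \right) + \sigma (\varphi) \mu \left( H^{0}_{y, \theta} \right)$, can be checked on a toy example. The sketch below assumes a discrete measure $\mu = \sum_{k} w_{k} \delta_{x_{k}}$ with three hand-picked atoms in $I_{2}$ and the logistic function as $\sigma$; these choices are illustrative only. This particular $\mu$ does not satisfy the hypothesis that the integral vanishes, so the sketch illustrates only the interchange of limit and integral.

```python
import math

def logistic(t: float) -> float:
    """One concrete sigmoidal function (an assumption made for this sketch)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def affine(x, y, theta):
    """Compute y^T x + theta."""
    return sum(yi * xi for yi, xi in zip(y, x)) + theta

def integral_sigma_lambda(atoms, y, theta, phi, lam):
    """Integral of sigma_lambda against the discrete measure sum_k w_k * delta_{x_k}."""
    return sum(w * logistic(lam * affine(x, y, theta) + phi) for x, w in atoms)

def limit_value(atoms, y, theta, phi):
    """mu(H+) + sigma(phi) * mu(H0), the limit derived in Part 2."""
    mu_plus = sum(w for x, w in atoms if affine(x, y, theta) > 0)
    mu_zero = sum(w for x, w in atoms if affine(x, y, theta) == 0)
    return mu_plus + logistic(phi) * mu_zero

# Hypothetical atoms (point, weight): one above, one below, one on the hyperplane.
atoms = [((0.75, 0.25), 2.0), ((0.25, 0.75), -1.0), ((0.5, 0.5), 0.5)]
y, theta, phi = (1.0, -1.0), 0.0, 0.7

for lam in (1.0, 10.0, 100.0, 1000.0):
    print(lam, integral_sigma_lambda(atoms, y, theta, phi, lam))
print("limit:", limit_value(atoms, y, theta, phi))
```

Because one weight is negative, the example also shows that the identity itself does not require $\mu$ to be nonnegative; only the sign argument that follows it does.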
Part 3. When $\mu$ is a signed measure
Fix $y$ and, for a bounded measurable function $h : \mathbb{R} \to \mathbb{R}$, define the following linear functional $F \in \left( L^{\infty} (\mathbb{R}) \right)^{ \ast }$. $$ F (h) := \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) $$ Since $F$ is defined by the Lebesgue integral, linearity is guaranteed, and because $\mu \in M \left( I_{n} \right)$ is a finite signed measure, $F$ is also bounded and hence belongs to $\left( L^{\infty} (\mathbb{R}) \right)^{ \ast }$. Recall from Part 2 that $\mu \left( H^{+}_{y, \theta} \right) + \sigma (\varphi) \mu \left( H^{0}_{y, \theta} \right) = 0$ for all $y, \theta, \varphi$, a derivation that did not use the sign of $\mu$. Letting $\varphi \to + \infty$ and $\varphi \to - \infty$ gives $\mu \left( H^{+}_{y, \theta} \right) + \mu \left( H^{0}_{y, \theta} \right) = 0$ and $\mu \left( H^{+}_{y, \theta} \right) = 0$, respectively. Now let $h$ be the indicator function of $[\theta , \infty)$, that is, $h \left( y^{T} x \right) := \begin{cases} 1 & , y^{T}x \ge \theta \\ 0 & , y^{T}x < \theta \end{cases}$. Then $$ \begin{align*} F(h) =& \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) \\ =& \mu \left( H^{+}_{y, -\theta} \right) + \mu \left( H^{0}_{y, -\theta} \right) \\ =& 0 \end{align*} $$ Similarly, if $h$ is the indicator function of $(\theta , \infty)$ then $F(h) = \mu \left( H^{+}_{y, -\theta} \right) = 0$, and by linearity of $F$ it follows that $F(h)=0$ for indicator functions $h$ of any interval. Therefore $F(h) = 0$ holds for all simple functions $h$, and since the set of simple functions is dense in $L^{\infty } ( \mathbb{R} )$, $F = 0$. In other words, $F(h) = 0$ holds for every bounded measurable function $h \in L^{\infty} \left( \mathbb{R} \right)$. We have assumed $\sigma$ is also a bounded measurable sigmoidal function, so $\sigma \in L^{\infty} ( \mathbb{R} )$ and $$ F (\sigma) = \int_{I_{n}} \sigma \left( y^{T} x \right) d \mu (x) = \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$ Now it remains to show that under these conditions $\mu = 0$.
Part 4.
$L^{\infty}\left( \mathbb{R} \right)$ is a vector space of bounded measurable functions, so for $m \in \mathbb{R}^{n}$ define bounded measurable functions $s,c : I_{n} \to \mathbb{R}$ by $$ s(x) := \sin \left( m^{T} x \right) \\ c(x) := \cos \left( m^{T} x \right) $$ These are $h \left( y^{T} x \right)$ with $y = m$ and $h = \sin$ or $h = \cos$. Since the vector space is closed under addition and scalar multiplication, $c + is$ is also a bounded measurable function, with $F$ extended to complex-valued functions by linearity. But $F=0$ held for all bounded measurable functions and for every fixed $y$, so by Euler’s formula $$ \begin{align*} 0 =& F (c + is) \\ =& \int_{I_{n}} \left[ \cos \left( m^{T} x \right) + i \sin \left( m^{T} x \right) \right] d \mu (x) \\ =& \int_{I_{n}} \exp \left( i m^{T} x \right) d \mu (x) \end{align*} $$
Definition of the Fourier transform: The following transform $\mathcal{F}$ is called the Fourier transform. $$ \mathcal{F}f(\xi):=\int f(x)e^{-i \xi x }d x $$
By the definition of the Fourier transform and properties of the Lebesgue integral, the integral above is, up to the sign convention in the exponent, the Fourier transform of $\mu$ evaluated at $m$, so $$ 0 = F (c + is) = \mathcal{F} \mu $$ for every $m \in \mathbb{R}^{n}$. This means the Fourier transform of $\mu$ is identically $0$, hence $\mu = 0$, and summarizing, $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \implies \mu =0 $$ Therefore $\sigma$ is a discriminatory function.
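The last step relies on the fact that a finite signed measure whose Fourier transform vanishes everywhere must be zero. The following sketch evaluates $\int_{I_{n}} \exp \left( i m^{T} x \right) d \mu (x)$ for discrete signed measures; the function name `char_integral` and the sample atoms are assumptions made for illustration. It shows a nonzero measure whose total mass is $0$ but whose transform is nonzero at some $m$, in contrast with a measure whose atoms cancel exactly.

```python
import cmath

def char_integral(atoms, m):
    """Integral of exp(i m^T x) against the discrete measure sum_k w_k * delta_{x_k}."""
    return sum(w * cmath.exp(1j * sum(mi * xi for mi, xi in zip(m, x)))
               for x, w in atoms)

# Hypothetical signed measure delta_a - delta_b with a != b: total mass 0, yet mu != 0.
dipole = [((0.25, 0.5), 1.0), ((0.75, 0.5), -1.0)]

# Atoms at the same point with opposite weights: this really is the zero measure.
cancelled = [((0.5, 0.5), 1.0), ((0.5, 0.5), -1.0)]

for m in [(0.0, 0.0), (1.0, 0.0), (3.0, -2.0)]:
    print(m, abs(char_integral(dipole, m)), abs(char_integral(cancelled, m)))
```

At $m = 0$ the dipole is indistinguishable from the zero measure because only the total mass is seen; other frequencies $m$ separate the two, which is the role the Fourier transform plays in Part 4.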
■
G. Cybenko. (1989). Approximation by Superpositions of a Sigmoidal Function, p. 5.
