What is a Sigmoid Function?
Definition
A function $\sigma : \mathbb{R} \to \mathbb{R}$ is called a Sigmoidal Function if it satisfies the following. $$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$
Explanation of the Definition
In the definition of a sigmoidal function, the specific limits $0$ and $1$ are not really important; what matters is that the function converges to some constant as its argument goes to $+\infty$ or $-\infty$. What values it takes at finite points is likewise not that important. In this sense, not only the logistic function but also, for example, the Heaviside step function and the rescaled hyperbolic tangent $\left( \tanh t + 1 \right)/2$ are sigmoidal functions.
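A minimal numerical sketch of these examples (NumPy; the function names are illustrative choices, not from any source):

```python
import numpy as np

def logistic(t):
    """Logistic function: the classic smooth sigmoidal function."""
    return 1.0 / (1.0 + np.exp(-t))

def rescaled_tanh(t):
    """tanh shifted and scaled so that its limits are 0 and 1."""
    return 0.5 * (np.tanh(t) + 1.0)

def heaviside(t):
    """Step function: discontinuous, yet still sigmoidal by the definition."""
    return np.where(t >= 0, 1.0, 0.0)

# Each tends to 1 as t -> +infinity and to 0 as t -> -infinity.
t = np.array([-50.0, 50.0])
for f in (logistic, rescaled_tanh, heaviside):
    print(f.__name__, f(t))  # approximately [0, 1]
```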
The most notable use of sigmoidal functions is in deep learning. Some readers might underestimate them because activation functions such as ReLU and Mish have shown superior empirical performance since the late 2010s. However, if asked why sigmoidal functions are significant in mathematics and computer science, it is because even in this simple form, where the function merely converges to $0$ or $1$, they theoretically underwrite the technique called artificial neural networks. The study of sigmoidal functions leads to a magnificent result, the universal approximation theorem, which is undeniably a very important outcome.
Theorem
Discriminatory Property of Sigmoidal Functions: Bounded measurable sigmoidal functions are discriminatory.
Explanation of the Theorem
The reason why it is important that commonly used sigmoidal functions are discriminatory is that the concept of a discriminatory function (defined below) is rather abstract compared to its applications. However, according to the theorem above, even without working directly with the definition of discriminatory functions, we can obtain concrete examples that we can actually implement through programming.
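As an illustration of "implementable": the sketch below fits a finite superposition $\sum_{j} \alpha_{j} \sigma \left( y_{j} x + \theta_{j} \right)$ of logistic sigmoids to a continuous target by least squares. The target $\sin (2 \pi x)$, the number of terms, and all names are arbitrary illustrative choices, not part of the theorem.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
N = 50                                     # number of sigmoidal terms
y = rng.normal(scale=10.0, size=N)         # random slopes  y_j
theta = rng.uniform(-10.0, 10.0, size=N)   # random shifts  theta_j

x = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * x)             # continuous function to approximate

# Design matrix Phi[i, j] = sigma(y_j * x_i + theta_j); solve for alpha by least squares.
Phi = logistic(np.outer(x, y) + theta)
alpha, *_ = np.linalg.lstsq(Phi, target, rcond=None)

approx = Phi @ alpha
print("max error:", np.max(np.abs(approx - target)))  # typically small for moderate N
```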
Proof 1
Definition of Discriminatory Functions: A function $\sigma : \mathbb{R} \to \mathbb{R}$ is said to be discriminatory if for every $\mu \in M \left( I_{n} \right)$ the following is satisfied. $$ \left[ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \quad \text{for all } y \in \mathbb{R}^{n} \text{ and } \theta \in \mathbb{R} \right] \implies \mu = 0 $$
- $M \left( I_{n} \right)$ is the space of finite signed regular Borel measures on the $n$-dimensional unit cube $I_{n} = [0,1]^{n}$.
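To make the definition concrete, consider a discrete signed measure $\mu = \sum_{j} w_{j} \delta_{x_{j}}$, for which the integral becomes a finite weighted sum. The sketch below (arbitrary points and weights) shows how a suitable choice of $(y, \theta)$ "detects" a nonzero $\mu$:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

# A nonzero discrete signed measure on I_2: mu = delta_{(0.2, 0.9)} - delta_{(0.8, 0.1)}.
pts = np.array([[0.2, 0.9], [0.8, 0.1]])
w = np.array([1.0, -1.0])

def integral(y, theta):
    """Evaluates  int_{I_n} sigma(y^T x + theta) d mu(x)  for the discrete mu above."""
    return np.sum(w * logistic(pts @ y + theta))

# "Discriminatory" means: if this integral vanished for ALL (y, theta),
# mu would have to be the zero measure.  For this nonzero mu, a suitable
# (y, theta) detects it:
print(integral(np.array([10.0, 0.0]), -5.0))  # about -0.905, clearly nonzero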
Claim
According to the definition of discriminatory functions, it suffices to show that if the sigmoidal function $\sigma$ satisfies $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$ for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$, then $\mu = 0$.
Part 1. Definition of $\sigma_{\lambda}$ and $\gamma$
For fixed $y \in \mathbb{R}^{n}$ and $\theta , \varphi \in \mathbb{R}$, define the function $\sigma_{\lambda} : I_{n} \to \mathbb{R}$ as follows. $$ \sigma_{\lambda} (x) := \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) $$ Then, as $\lambda$ changes, $$ \sigma_{\lambda} \begin{cases} \to 1 & \text{for } y^{T} x + \theta > 0 & \text{as } \lambda \to \infty \\ \to 0 & \text{for } y^{T} x + \theta < 0 & \text{as } \lambda \to \infty \\ = \sigma \left( \varphi \right) & \text{for } y^{T} x + \theta = 0 & \text{for all } \lambda \in \mathbb{R} \end{cases} $$ thus, if we define a new function $\gamma : I_{n} \to \mathbb{R}$ as $$ \gamma (x) := \begin{cases} 1 & \text{for } y^{T} x + \theta > 0 \\ 0 & \text{for } y^{T} x + \theta < 0 \\ \sigma \left( \varphi \right) & \text{for } y^{T} x + \theta = 0 \end{cases} $$ then as $\lambda \to \infty$, $\sigma_{\lambda}$ converges pointwise to $\gamma$.
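The pointwise convergence can be checked numerically; below $\sigma$ is the logistic function, and the points $x \in I_{2}$ and parameters are arbitrary illustrative choices covering each of the three cases:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

y, theta, phi = np.array([2.0, -1.0]), 0.3, 0.7

def sigma_lambda(x, lam):
    """sigma_lambda(x) = sigma(lambda * (y^T x + theta) + phi)."""
    return logistic(lam * (y @ x + theta) + phi)

x_plus = np.array([1.0, 0.5])    # y^T x + theta =  1.8 > 0
x_minus = np.array([0.0, 1.0])   # y^T x + theta = -0.7 < 0
x_zero = np.array([0.0, 0.3])    # y^T x + theta =  0

for lam in (1, 10, 100, 1000):
    print(lam,
          sigma_lambda(x_plus, lam),    # -> 1
          sigma_lambda(x_minus, lam),   # -> 0
          sigma_lambda(x_zero, lam))    # = logistic(phi) for every lambda
```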
Part 2. When $\mu$ is a Nonnegative Measure
Now, let's split the entire space into three parts based on the hyperplane $H^{0}_{y, \theta}$ as follows. $$ H^{+}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta > 0 \right\} \\ H^{0}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta = 0 \right\} \\ H^{-}_{y, \theta} := \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta < 0 \right\} $$ By assumption, $\int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0$ for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$; since $\lambda \left( y^{T} x + \theta \right) + \varphi = \left( \lambda y \right)^{T} x + \left( \lambda \theta + \varphi \right)$ has the same affine form, $\int_{I_{n}} \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) d \mu (x) = 0$ for every $\lambda$ as well. Moreover, $\sigma$ is bounded and $\mu$ is finite, so according to the dominated convergence theorem we can swap the order of $\displaystyle \lim_{\lambda \to \infty} \cdot$ and $\displaystyle \int_{I_{n}} \cdot d \mu$. Therefore, for all $\varphi \in \mathbb{R}$, $$ \begin{align*} 0 =& \lim_{\lambda \to \infty} \int_{I_{n}} \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) d \mu (x) \\ =& \int_{I_{n}} \lim_{\lambda \to \infty} \sigma_{\lambda} \left( x \right) d \mu (x) \\ =& \int_{I_{n}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{0}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{-}_{y, \theta}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} 1 d \mu (x) + \int_{H^{0}_{y, \theta}} \sigma (\varphi) d \mu (x) + \int_{H^{-}_{y, \theta}} 0 d \mu (x) \\ =& \mu \left( H^{+}_{y, \theta} \right) + \sigma (\varphi) \mu \left( H^{0}_{y, \theta} \right) \end{align*} $$ Since this holds for every $\varphi$, letting $\varphi \to -\infty$ (so that $\sigma (\varphi) \to 0$) gives $\mu \left( H^{+}_{y, \theta} \right) = 0$, and letting $\varphi \to +\infty$ (so that $\sigma (\varphi) \to 1$) then gives $\mu \left( H^{0}_{y, \theta} \right) = 0$, for all $y$ and $\theta$. If $\mu$ is a nonnegative measure, this already finishes the proof: taking $y = 0$ and $\theta = 1$ makes $H^{+}_{0, 1} = \mathbb{R}^{n}$, so $\mu \left( I_{n} \right) = 0$ and hence $\mu = 0$; that is, the sigmoidal function $\sigma$ is discriminatory. However, if $\mu$ is a signed measure, its positive and negative parts can cancel, so knowing that every half-space and hyperplane is $\mu$-null does not yet force $\mu = 0$, and a separate argument is needed. If one is only interested in the application to artificial neural networks and is willing to sacrifice some mathematical generality, the proof can be stopped here without big issues.
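The limiting identity above can be sanity-checked with a discrete nonnegative measure $\mu = \sum_{j} w_{j} \delta_{x_{j}}$, replacing the limit with a large $\lambda$; all points and weights below are arbitrary choices:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

# Discrete nonnegative measure on I_2: mass w_j at point x_j.
pts = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5], [0.0, 0.3]])
w = np.array([0.4, 0.3, 0.2, 0.1])

y, theta, phi = np.array([2.0, -1.0]), 0.3, 0.7
z = pts @ y + theta                    # z_j = y^T x_j + theta

lam = 200.0                            # "large" lambda in place of the limit
lhs = np.sum(w * logistic(lam * z + phi))
rhs = w[z > 0].sum() + logistic(phi) * w[z == 0].sum()  # mu(H+) + sigma(phi) mu(H0)
print(lhs, rhs)                        # agree to many decimal places
```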
Part 3. When $\mu$ is a Signed Measure
Fix $y \in \mathbb{R}^{n}$ and define the linear functional $F$ on bounded measurable functions $h : \mathbb{R} \to \mathbb{R}$ as follows. $$ F (h) := \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) $$ $F$ is defined through Lebesgue integration, so it is linear, and since $\mu \in M \left( I_{n} \right)$ is a finite signed measure, $F$ is also bounded and thus belongs to the dual space $\left( L^{\infty} (\mathbb{R}) \right)^{ \ast }$. Now let $h$ be the indicator function of $[ \theta , \infty )$. Then $h \left( y^{T} x \right) = 1$ exactly when $y^{T} x - \theta \geq 0$, so by Part 2, $$ \begin{align*} F(h) =& \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) \\ =& \mu \left( H^{+}_{y, -\theta} \right) + \mu \left( H^{0}_{y, -\theta} \right) \\ =& 0 \end{align*} $$ Similarly, if $h$ is the indicator function of $(\theta , \infty)$, then $F(h) = 0$, and according to the linearity of $F$, it is also $F(h)=0$ for the indicator function of any interval. Thus $F(h) = 0$ for all simple functions $h$, and since the set of simple functions is dense in $L^{\infty } ( \mathbb{R} )$, it follows that $F = 0$. In other words, for all bounded measurable functions $h \in L^{\infty} \left( \mathbb{R} \right)$, it is $F(h) = 0$. In particular, since $\sigma$ is assumed to be a bounded measurable sigmoidal function, $h (t) := \sigma \left( t + \theta \right)$ belongs to $L^{\infty} ( \mathbb{R} )$ for each $\theta$, and $$ F (h) = \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$ which is consistent with the original assumption. Now, we only need to show that $\mu = 0$ under these conditions.
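To see what $F$ computes in concrete terms, here is a check with a generic discrete signed measure (arbitrary points and weights, which need not satisfy the vanishing hypothesis of the proof): for $h$ the indicator of $[\theta, \infty)$, the value $F(h)$ coincides with $\mu \left( H^{+}_{y, -\theta} \right) + \mu \left( H^{0}_{y, -\theta} \right)$.

```python
import numpy as np

# Discrete *signed* measure on I_2: weights may now be negative.
pts = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])
w = np.array([0.5, -0.3, 0.2])
y = np.array([2.0, -1.0])

def F(h):
    """F(h) = int_{I_n} h(y^T x) d mu(x) for the discrete mu above."""
    return np.sum(w * h(pts @ y))

theta = 0.4
indicator = lambda t: (t >= theta).astype(float)  # indicator of [theta, inf)

z = pts @ y
print(F(indicator))          # F(h) for h = 1_{[theta, inf)}
print(w[z >= theta].sum())   # = mu(H+_{y,-theta}) + mu(H0_{y,-theta}), same value
```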
Part 4. The Fourier Transform of $\mu$
$L^{\infty}\left( \mathbb{R} \right)$ is a vector space of bounded measurable functions containing $\sin$ and $\cos$, so for $m \in \mathbb{R}^{n}$, applying the argument of Part 3 with $y = m$ to $h = \sin$ and $h = \cos$ amounts to considering the bounded measurable functions $s, c : I_{n} \to \mathbb{R}$ defined by $$ s(x) := \sin \left( m^{T} x \right) \\ c(x) := \cos \left( m^{T} x \right) $$ Since the vector space is closed under addition and scalar multiplication (allowing complex scalars), $\cos + i \sin$ is also a bounded measurable function. But since $F = 0$ on all bounded measurable functions, according to Euler's formula, $$ \begin{align*} 0 =& F (\cos + i \sin) \\ =& \int_{I_{n}} \left[ \cos \left( m^{T} x \right) + i \sin \left( m^{T} x \right) \right] d \mu (x) \\ =& \int_{I_{n}} \exp \left( i m^{T} x \right) d \mu (x) \end{align*} $$
Definition of the Fourier transform: The following transformation $\mathcal{F}$ is known as the Fourier transform. $$ \mathcal{F}f(\xi):=\int f(x)e^{-i \xi x }d x $$ Analogously, the Fourier transform of a finite measure $\mu$ is defined as $\mathcal{F}\mu(\xi) := \int e^{-i \xi^{T} x} d \mu (x)$.
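For a finite discrete measure, the Fourier transform is just a finite exponential sum, $\mathcal{F} \mu (\xi) = \sum_{j} w_{j} e^{-i \xi^{T} x_{j}}$. The sketch below (arbitrary points and weights) evaluates it and illustrates that it vanishes identically only for the zero measure:

```python
import numpy as np

# Fourier transform of a discrete measure mu = sum_j w_j delta_{x_j}:
# F mu (xi) = sum_j w_j exp(-i xi^T x_j), a finite exponential sum.
pts = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]])

def fourier_of_measure(w, xi):
    return np.sum(w * np.exp(-1j * (pts @ xi)))

xi = np.array([3.0, -2.0])
print(fourier_of_measure(np.array([0.5, -0.3, 0.2]), xi))  # nonzero for this nonzero measure
print(fourier_of_measure(np.zeros(3), xi))                 # 0 for the zero measure
```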
By the definition of the Fourier transform applied to the measure $\mu$, we have $\int_{I_{n}} \exp \left( i m^{T} x \right) d \mu (x) = \mathcal{F} \mu (-m)$, and since this vanishes for every $m \in \mathbb{R}^{n}$, $$ \mathcal{F} \mu = 0 $$ The Fourier transform of $\mu$ is identically $0$, and a finite measure whose Fourier transform vanishes is the zero measure, thus $\mu = 0$. Summarizing, $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \implies \mu =0 $$ Therefore $\sigma$ is a discriminatory function.
■
G. Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals and Systems 2, 303–314; p. 5. ↩︎