

What is a Sigmoid Function?

Definition

A function $\sigma : \mathbb{R} \to \mathbb{R}$ is called a Sigmoidal Function if it satisfies the following.
$$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$
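For instance, the logistic function $\sigma (t) = 1 / (1 + e^{-t})$ satisfies this definition. Below is a minimal numerical sketch, assuming NumPy is available; the function name `logistic` is ours, purely for illustration.

```python
import numpy as np

def logistic(t):
    """The logistic function 1 / (1 + exp(-t)), the textbook sigmoidal function."""
    return 1.0 / (1.0 + np.exp(-t))

# As t -> +infinity the values approach 1; as t -> -infinity they approach 0.
for t in [-100.0, -10.0, 0.0, 10.0, 100.0]:
    print(f"sigma({t:+}) = {logistic(t):.6f}")
```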

Explanation of the Definition

In the definition of a sigmoidal function, whether the limits are $0$ or $1$ is not really important; what matters is that it converges to some constant as $t$ approaches infinity, whether positive or negative. What value it takes at finite points is not that important either. In this sense, not only the logistic function but also sigmoid functions in general can be considered sigmoidal functions, as the sketch below illustrates.
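Here is a hedged sketch of two such examples: a rescaled hyperbolic tangent, and a discontinuous step function whose values at finite points are irrelevant to the definition. The function names are ours, not standard notation.

```python
import numpy as np

# Rescaled hyperbolic tangent: tanh has limits -1 and 1, so (tanh(t) + 1) / 2
# has limits 0 and 1 and is sigmoidal in the sense of the definition.
def tanh_sigmoidal(t):
    return (np.tanh(t) + 1.0) / 2.0

# A hard step: discontinuous at 0, but the limits at infinity still exist,
# so it also qualifies as a sigmoidal function.
def step_sigmoidal(t):
    return np.where(t >= 0.0, 1.0, 0.0)

for f in (tanh_sigmoidal, step_sigmoidal):
    print(f.__name__, f(np.array([-50.0, 50.0])))  # -> [0. 1.] in both cases
```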

The most notable use of sigmoidal functions is in deep learning. Some readers might underestimate sigmoidal functions because activation functions such as ReLU and Mish have shown superior performance in deep learning since the late 2010s. However, if asked why sigmoidal functions matter to mathematics and computer science, it is because even in this simple form, converging merely to $0$ or $1$, they theoretically guarantee that the technique called artificial neural networks can work. The study of sigmoidal functions leads to a magnificent result called the universal approximation theorem, which is undeniably a very important outcome.

Theorem

Discriminatory Property of Sigmoidal Functions: Bounded measurable sigmoidal functions are discriminatory functions.

Explanation of the Theorem

The reason it is important that commonly used sigmoidal functions are discriminatory is that the concept of a discriminatory function is rather abstract compared to its applications. However, according to the theorem above, even without working through the definition of discriminatory functions, we can obtain concrete examples that we can actually implement through programming.

Proof 1

Definition of Discriminatory Functions: A function $\sigma : \mathbb{R} \to \mathbb{R}$ is said to be discriminatory if, for $\mu \in M \left( I_{n} \right)$, the following holds.
$$ \left[ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \quad \forall y \in \mathbb{R}^{n}, \forall \theta \in \mathbb{R} \right] \implies \mu = 0 $$
Here $I_{n} = [0, 1]^{n}$ is the $n$-dimensional unit cube and $M \left( I_{n} \right)$ is the space of finite signed regular Borel measures on $I_{n}$.
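To make the integral in this definition concrete, here is a sketch of $\int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x)$ when $\mu$ is a finite signed combination of point masses, one simple kind of member of $M \left( I_{n} \right)$. The helper name `integral_against_measure` and the sample numbers are purely illustrative.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def integral_against_measure(sigma, y, theta, atoms, weights):
    """Integral of sigma(y^T x + theta) against the discrete signed measure
    mu = sum_k weights[k] * delta_{atoms[k]}, with every atom inside I_n."""
    atoms = np.asarray(atoms)      # shape (k, n): locations of the point masses
    weights = np.asarray(weights)  # signed weights, one per atom
    return np.sum(weights * sigma(atoms @ y + theta))

# Two atoms in I_2 with opposite signs: a small, concrete mu in M(I_2).
atoms, weights = [[0.2, 0.3], [0.7, 0.1]], [1.0, -1.0]
print(integral_against_measure(logistic, np.array([1.0, -1.0]), 0.0, atoms, weights))
```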


Claim

According to the definition of discriminatory functions, it suffices to show that if the sigmoidal function $\sigma$ satisfies
$$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$
for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$, then $\mu = 0$.


Part 1. Definition of σλ\sigma_{{\lambda}} and γ\gamma

Let's define the function $\sigma_{\lambda} : I_{n} \to \mathbb{R}$ for any $x \in I_{n}$, $y \in \mathbb{R}^{n}$, and $\theta, \varphi \in \mathbb{R}$ as follows.
$$ \sigma_{\lambda} (x) := \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) $$
Then, as $\lambda$ changes,
$$ \begin{cases} \sigma_{\lambda} (x) \to 1 & \text{for } y^{T} x + \theta > 0 \text{ as } \lambda \to \infty \\ \sigma_{\lambda} (x) \to 0 & \text{for } y^{T} x + \theta < 0 \text{ as } \lambda \to \infty \\ \sigma_{\lambda} (x) = \sigma ( \varphi ) & \text{for } y^{T} x + \theta = 0 \text{ for all } \lambda \in \mathbb{R} \end{cases} $$
Thus, if we define a new function $\gamma : I_{n} \to \mathbb{R}$ as
$$ \gamma (x) := \begin{cases} 1 & \text{for } y^{T} x + \theta > 0 \\ 0 & \text{for } y^{T} x + \theta < 0 \\ \sigma ( \varphi ) & \text{for } y^{T} x + \theta = 0 \end{cases} $$
then as $\lambda \to \infty$, $\sigma_{\lambda}$ converges pointwise to $\gamma$.
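A quick numerical sketch of this pointwise convergence, with the logistic function standing in for $\sigma$ and hand-picked points on each side of the hyperplane; all names and numbers are illustrative.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

y, theta, phi = np.array([1.0, 1.0]), -0.5, 0.3
# One point on each side of the hyperplane y^T x + theta = 0, and one on it.
points = {"positive": np.array([0.8, 0.4]),
          "negative": np.array([0.1, 0.2]),
          "zero":     np.array([0.25, 0.25])}

for lam in [1.0, 10.0, 100.0, 1000.0]:
    values = {name: round(float(logistic(lam * (y @ x + theta) + phi)), 4)
              for name, x in points.items()}
    print(lam, values)
# "positive" tends to 1, "negative" tends to 0, and "zero" stays at
# logistic(phi) for every lambda -- exactly the pointwise limit gamma.
```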


Part 2. When $\mu$ Is a Positive Measure

Now, let's split the whole space $I_{n}$ into three parts along the hyperplane $H^{0}_{y, \theta}$ as follows.
$$ \begin{align*} H^{+}_{y, \theta} :=& \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta > 0 \right\} \\ H^{0}_{y, \theta} :=& \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta = 0 \right\} \\ H^{-}_{y, \theta} :=& \left\{ x \in \mathbb{R}^{n} : y^{T} x + \theta < 0 \right\} \end{align*} $$
Since the hypothesis holds for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$, and $\lambda \left( y^{T} x + \theta \right) + \varphi = \left( \lambda y \right)^{T} x + \left( \lambda \theta + \varphi \right)$ has exactly the same form as $y^{T} x + \theta$, the integral of $\sigma_{\lambda}$ against $\mu$ also vanishes for every $\lambda$. Moreover, since $\sigma$ is bounded and $\mu$ is finite, the dominated convergence theorem lets us swap the order of $\displaystyle \lim_{\lambda \to \infty} \cdot$ and $\displaystyle \int_{I_{n}} \cdot d \mu$, so for all $\varphi \in \mathbb{R}$,
$$ \begin{align*} 0 =& \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) \\ =& \lim_{\lambda \to \infty} \int_{I_{n}} \sigma \left( \lambda \left( y^{T} x + \theta \right) + \varphi \right) d \mu (x) \\ =& \int_{I_{n}} \lim_{\lambda \to \infty} \sigma_{\lambda} \left( x \right) d \mu (x) \\ =& \int_{I_{n}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{0}_{y, \theta}} \gamma (x) d \mu (x) + \int_{H^{-}_{y, \theta}} \gamma (x) d \mu (x) \\ =& \int_{H^{+}_{y, \theta}} 1 d \mu (x) + \int_{H^{0}_{y, \theta}} \sigma (\varphi) d \mu (x) + \int_{H^{-}_{y, \theta}} 0 d \mu (x) \\ =& \mu \left( H^{+}_{y, \theta} \right) + \sigma (\varphi) \mu \left( H^{0}_{y, \theta} \right) \end{align*} $$
Since this holds for every $\varphi$, letting $\varphi \to - \infty$ gives $\mu \left( H^{+}_{y, \theta} \right) = 0$, and letting $\varphi \to + \infty$ then gives $\mu \left( H^{0}_{y, \theta} \right) = 0$ as well, for all $y$ and $\theta$. If $\mu$ is a positive measure, this already finishes the proof: $I_{n}$ is itself contained in some half-space $H^{+}_{y, \theta}$ (take $y = (1, 0, \dots, 0)$ and $\theta = 1$, for instance), so $\mu \left( I_{n} \right) = 0$, hence $\mu = 0$, and the sigmoidal function $\sigma$ is discriminatory. However, if $\mu$ is a signed measure, its positive and negative parts can cancel, so knowing that $\mu$ vanishes on every half-space and hyperplane does not yet guarantee $\mu = 0$, and a separate proof is needed. If one is only interested in the application to artificial neural networks, somewhat sacrificing mathematical generality, the proof can be stopped here without big issues.
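The danger with a signed measure is cancellation: for a single fixed pair $(y, \theta)$, a nonzero signed measure can make the integral vanish, which is why the quantifier over all $y, \theta$, and the extra work in the next parts, are needed. A hedged sketch with two opposite-sign atoms placed symmetrically (all names and numbers are ours):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def integral(sigma, y, theta, atoms, weights):
    # Integral of sigma(y^T x + theta) against a discrete signed measure.
    return float(np.sum(np.asarray(weights) * sigma(np.asarray(atoms) @ y + theta)))

# mu = delta_a - delta_b: a nonzero signed measure on I_2.
atoms, weights = [[0.3, 0.1], [0.3, 0.9]], [1.0, -1.0]

# For y = (1, 0) both atoms project to 0.3, so the two terms cancel exactly:
print(integral(logistic, np.array([1.0, 0.0]), 0.0, atoms, weights))  # 0.0
# ...but a different direction reveals that mu is not the zero measure:
print(integral(logistic, np.array([0.0, 1.0]), 0.0, atoms, weights))  # nonzero
```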


Part 3. When $\mu$ Is a Signed Measure

Fix $y$ and define the linear functional $F$ for bounded measurable functions $h : \mathbb{R} \to \mathbb{R}$ as follows.
$$ F (h) := \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) $$
$F$ is defined through Lebesgue integration, so it is linear, and since $\mu \in M \left( I_{n} \right)$ is a finite signed measure, $F$ is also bounded and thus belongs to $\left( L^{\infty} (\mathbb{R}) \right)^{\ast}$. Now let $h$ be the indicator function of $[ \theta , \infty )$. Then, by Part 2,
$$ \begin{align*} F(h) =& \int_{I_{n}} h \left( y^{T} x \right) d \mu (x) \\ =& \mu \left( H^{+}_{y, -\theta} \right) + \mu \left( H^{0}_{y, -\theta} \right) \\ =& 0 \end{align*} $$
Similarly, if $h$ is the indicator function of $( \theta , \infty )$, then $F(h) = 0$, and by the linearity of $F$, $F(h) = 0$ for the indicator function $h$ of any interval, since such an indicator is a difference of the two kinds above, for instance $\mathbf{1}_{[\theta_{1}, \theta_{2})} = \mathbf{1}_{[\theta_{1}, \infty)} - \mathbf{1}_{[\theta_{2}, \infty)}$. Thus $F(h) = 0$ for all simple functions $h$, and since the set of simple functions is dense in $L^{\infty} ( \mathbb{R} )$, it follows that $F = 0$. In other words, $F(h) = 0$ for all bounded measurable functions $h \in L^{\infty} \left( \mathbb{R} \right)$. Since $\sigma$ is assumed to be a bounded measurable sigmoidal function, $\sigma \in L^{\infty} ( \mathbb{R} )$, so
$$ F (\sigma) = \int_{I_{n}} \sigma \left( y^{T} x \right) d \mu (x) = 0 $$
and likewise, applying $F$ to the shifted function $t \mapsto \sigma \left( t + \theta \right)$,
$$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$
Now, we only need to show that $\mu = 0$ under these conditions.


Part 4.

$L^{\infty} \left( \mathbb{R} \right)$ is a vector space of bounded measurable functions, so for $m \in \mathbb{R}^{n}$, define the bounded measurable functions $s, c : I_{n} \to \mathbb{R}$ as
$$ \begin{align*} s(x) :=& \sin \left( m^{T} x \right) \\ c(x) :=& \cos \left( m^{T} x \right) \end{align*} $$
Since the vector space is closed under addition and scalar multiplication, $c + is$ is also a bounded measurable function. But $F = 0$ for all bounded measurable functions (taking the fixed direction $y$ to be $m$, which is allowed because $y$ was arbitrary), so according to Euler's formula,
$$ \begin{align*} 0 =& F (c + is) \\ =& \int_{I_{n}} \left[ \cos \left( m^{T} x \right) + i \sin \left( m^{T} x \right) \right] d \mu (x) \\ =& \int_{I_{n}} \exp \left( i m^{T} x \right) d \mu (x) \end{align*} $$

Definition of the Fourier transform: The following transformation $\mathcal{F}$ is known as the Fourier transform.
$$ \mathcal{F} f (\xi) := \int f(x) e^{-i \xi x} d x $$

By the definition of the Fourier transform and the properties of the Lebesgue integral,
$$ 0 = F (c + is) = \mathcal{F} \mu $$
This means that the Fourier transform of $\mu$ is $0$, and therefore $\mu = 0$. Summarizing,
$$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \implies \mu = 0 $$
Therefore, $\sigma$ is a discriminatory function.
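As a closing numerical sketch, the Fourier-transform step can be seen in miniature with a discrete signed measure: $\int \exp \left( i m^{T} x \right) d \mu (x)$ vanishes at every frequency $m$ only when all the weights are zero. The helper name and sample values are illustrative.

```python
import numpy as np

def fourier_of_measure(m, atoms, weights):
    """Evaluate the integral of exp(i m^T x) against a discrete signed measure."""
    return np.sum(np.asarray(weights) * np.exp(1j * (np.asarray(atoms) @ m)))

atoms = [[0.2, 0.3], [0.7, 0.1]]
rng = np.random.default_rng(0)
for weights in ([1.0, -1.0], [0.0, 0.0]):   # a nonzero mu vs. the zero measure
    magnitudes = [abs(fourier_of_measure(rng.normal(size=2), atoms, weights))
                  for _ in range(5)]
    print(weights, [f"{v:.3f}" for v in magnitudes])
# Only the zero measure makes the transform vanish at every frequency m,
# which is the step that forces mu = 0 at the end of the proof.
```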


  1. G. Cybenko. (1989). Approximation by Superpositions of a Sigmoidal Function. p. 5. ↩︎