What is a Discriminatory Function?
Definition
A function $\sigma : \mathbb{R} \to \mathbb{R}$ is called a Discriminatory Function if, for every $\mu \in M \left( I_{n} \right)$, $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \quad \text{for all } y \in \mathbb{R}^{n} \text{ and } \theta \in \mathbb{R} \implies \mu = 0 $$
- $I_{n} := [0,1]^{n}$ is the $n$-dimensional unit cube, which is the Cartesian product of $n$ unit closed intervals $[0,1]$.
- $M \left( I_{n} \right)$ is the space of finite, signed regular Borel measures defined on $I_{n} := [0,1]^{n}$.
- $y^{T}$ is the transpose of the vector $y$, so that $y^{T} x$ equals the inner product $\left< x, y \right>$ of $x$ and $y$.
Explanation of the Definition
To understand the concept of a discriminatory function through the formulas, it may help to start with the contrapositive. $$ \mu \ne 0 \implies \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) \ne 0 \quad \text{for some } y \in \mathbb{R}^{n}, \theta \in \mathbb{R} $$ Simply put, unless $\mu$ is the zero measure, there must be some choice of $y$ and $\theta$ for which the integral of $\sigma \left( y^{T} x + \theta \right)$ over $I_{n}$ does not vanish. Recalling the notion of almost everywhere from measure theory, one can see that it is not enough for $\sigma$ merely to take nonzero values somewhere: what matters is that no nonzero measure can annihilate every function of the form $\sigma \left( y^{T} x + \theta \right)$.
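For instance, a constant function $\sigma \equiv c$ cannot be discriminatory: $$ \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = c \, \mu \left( I_{n} \right) $$ vanishes for every $y$ and $\theta$ whenever $\mu$ has total mass $0$, for example $\mu = \delta_{a} - \delta_{b}$ with $a \ne b$, yet such a $\mu$ is not the zero measure.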
In simpler terms, $\sigma$ plays the role of a function that can distinguish 'yes' from 'no' by satisfying the above property for all $y \in \mathbb{R}^{n}$ and $\theta \in \mathbb{R}$. Although the word 'discriminatory' may carry a somewhat negative connotation in everyday speech, in science and engineering it simply means 'distinguishing'.
Discriminatory functions appear in the Universal Approximation Theorem, which theoretically underpins fields like artificial neural networks, and so they are naturally related to the recently popular deep learning. In the formulas, $x \in I_{n}$ can be read as the input data, $y \in \mathbb{R}^{n}$ as the weights, and $\theta \in \mathbb{R}$ as the bias. Then $\sigma$ is none other than the activation function. Considering that the activation function itself is an imitation of a neuron's threshold, the 'discrimination' of whether a signal is transmitted or not corresponds precisely to the concept of the discriminatory function.
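To make this correspondence concrete, here is a minimal sketch in NumPy (the logistic sigmoid and all names below are illustrative choices, not prescribed by the definition): the sum $\sum_{k} \alpha_{k} \sigma \left( y_{k}^{T} x + \theta_{k} \right)$ that appears in the theorem below is precisely the forward pass of a network with one hidden layer.

```python
import numpy as np

def sigma(t):
    # Logistic sigmoid; any activation of interest could be substituted here.
    return 1.0 / (1.0 + np.exp(-t))

def G(x, Y, theta, alpha):
    # x:     input point in I_n = [0, 1]^n,           shape (n,)
    # Y:     weight vectors y_1, ..., y_N as rows,    shape (N, n)
    # theta: biases theta_1, ..., theta_N,            shape (N,)
    # alpha: outer coefficients alpha_1, ..., alpha_N, shape (N,)
    # Returns sum_k alpha_k * sigma(y_k^T x + theta_k).
    return alpha @ sigma(Y @ x + theta)

# A random single-hidden-layer network with N = 5 units on I_3.
rng = np.random.default_rng(0)
n, N = 3, 5
x = rng.uniform(0.0, 1.0, size=n)
Y = rng.normal(size=(N, n))
theta = rng.normal(size=N)
alpha = rng.normal(size=N)
print(G(x, Y, theta, alpha))
```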
Theorem
If $\sigma$ is a continuous discriminatory function, then $$ S := \left\{ G(x) = \sum_{k=1}^{N} \alpha_{k} \sigma \left( y_{k}^{T} x+ \theta_{k} \right) : y_{k} \in \mathbb{R}^{n} \land \alpha_{k} , \theta_{k} \in \mathbb{R} \land N \in \mathbb{N} \right\} $$ is uniformly dense in $C\left( I_{n} \right)$. In other words, for every $f \in C \left( I_{n} \right)$ and $\varepsilon > 0$, there exists a $G \in S$ that satisfies the following, where $\left\| \cdot \right\|$ denotes the supremum norm on $I_{n}$. $$ \left\| G - f \right\| < \varepsilon $$
Explanation of the Theorem
To put the above theorem in plain words, it means that any continuous function $f$ we desire can be approximated arbitrarily closely by a finite linear combination of the form $\sum_{k} \alpha_{k} \sigma \left( y_{k}^{T} x + \theta_{k} \right)$, that is, by the discriminatory function composed with appropriate weights and biases. In the context of machine learning, $f$ is the function we truly seek, i.e., the function that does the job we want. It could be a function that takes a picture as input and judges whether it is a dog or a cat, one that turns a Korean string into English, or an even more complex and intriguing function.
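As a purely illustrative sketch of what the theorem promises (the target $f(x) = \sin (2 \pi x)$, the logistic sigmoid, and the random choice of $y_{k}, \theta_{k}$ are assumptions made only for this demo), one can fix $N$ random hidden units on $I_{1} = [0,1]$ and fit the outer coefficients $\alpha_{k}$ by least squares:

```python
import numpy as np

def sigma(t):
    # Logistic sigmoid, a continuous sigmoidal function (see the note below).
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical target f in C(I_1): any continuous function on [0, 1] would do.
f = lambda x: np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
N = 50                                   # number of terms in the finite sum
y = rng.normal(scale=10.0, size=N)       # weights y_k (scalars, since n = 1)
theta = rng.normal(scale=10.0, size=N)   # biases theta_k

x = np.linspace(0.0, 1.0, 500)           # dense grid on I_1
Phi = sigma(np.outer(x, y) + theta)      # Phi[i, k] = sigma(y_k * x_i + theta_k)

# Fit alpha_k so that G(x_i) = sum_k alpha_k * Phi[i, k] approximates f(x_i).
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
G = Phi @ alpha

print("sup-norm error on the grid:", np.max(np.abs(G - f(x))))
```

Increasing $N$ drives the error down; the theorem, of course, only asserts that suitable $\alpha_{k}, y_{k}, \theta_{k}$ exist and says nothing about how to find them, which is what training algorithms are for.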
The problem is that, as the definition makes clear, it is hard to know what a discriminatory function actually looks like in practice. Without the explanations given in this post, and perhaps even with them, the definition can feel hopelessly abstract to anyone without a background in measure theory. Fortunately, some well-known functions can be proven to be discriminatory, as noted below, which is what mathematically guarantees that artificial neural networks work.
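The most important example, and the one treated in Cybenko's original paper, is the class of sigmoidal functions: $\sigma$ is called sigmoidal if $$ \sigma (t) \to \begin{cases} 1 & \text{as } t \to + \infty \\ 0 & \text{as } t \to - \infty \end{cases} $$ Cybenko shows that every continuous sigmoidal function, such as the logistic function $1 / \left( 1 + e^{-t} \right)$, is discriminatory, so the theorem above applies to exactly the activation functions used in classical neural networks.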
Proof 1
Corollary 2 of the Hahn-Banach Theorem: Let $X$ be a $\mathbb{C}$-vector space and let $Y \subset S$ both be subspaces of $X$. If $s \in S$ satisfies $d (s, Y) = \delta > 0$, then there exists an $x^{ \ast } \in X^{ \ast }$ that satisfies the following: $$ \left\| x^{ \ast } \right\| \le 1 $$ $$ \begin{align*} x^{ \ast } (s) =& \delta \\ x^{ \ast } (y) =& 0 \qquad , y \in Y \end{align*} $$
It must be shown that $\overline{S} = C \left( I_{n} \right)$. Assume, for contradiction, that $\overline{S} \subsetneq C \left( I_{n} \right)$. Since the closure $\overline{S}$ is then a proper closed subspace of the whole space, by the Hahn-Banach Theorem there exists a nonzero bounded linear functional $L \in \left( C \left( I_{n} \right) \right)^{ \ast }$ such that: $$ L \left( S \right) = L \left( \overline{S} \right) = \left\{ 0 \right\} $$ According to the Riesz Representation Theorem, $L$ can be represented as: $$ L (h) = \int_{I_{n}} h(x) d \mu (x) $$ for some $\mu \in M \left( I_{n} \right)$ and all $h \in C \left( I_{n} \right)$. In particular, $h(x) = \sigma \left( y^{T} x + \theta \right)$ belongs to $\overline{S}$ for every $y$ and $\theta$, so by the property of $L$ above: $$ L (h) = \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 $$ Since $\sigma$ is assumed to be discriminatory, this forces $\mu = 0$, and then the Riesz representation of $L$ gives $L = 0$, contradicting $L \ne 0$. Therefore $\overline{S} = C\left( I_{n} \right)$ must hold.
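In one line, the skeleton of the argument is: $$ \overline{S} \subsetneq C \left( I_{n} \right) \implies \exists L \ne 0 : L \left( \overline{S} \right) = \left\{ 0 \right\} \implies \exists \mu \ne 0 : \int_{I_{n}} \sigma \left( y^{T} x + \theta \right) d \mu (x) = 0 \text{ for all } y, \theta $$ and the last statement is precisely the negation of $\sigma$ being discriminatory.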
■
G. Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function, p. 4. ↩︎