Activation Functions in Deep Learning
Definition
A non-linear function that mimics the threshold behavior of real-life organisms is known as an activation function.
Mathematical Definition
In deep learning, a non-linear scalar function $\sigma : \mathbb{R} \to \mathbb{R}$ is referred to as an activation function.
- Of course, there are exceptions, such as the softmax, that don't fit this definition.
Explanation
On the other hand, a vector-valued function is called a layer.
When an expression or piece of code shows the activation function $\sigma : \mathbb{R} \to \mathbb{R}$ taking a vector as input, it means that $\sigma$ is applied element-wise:
$$ \sigma \left( \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} \right) = \begin{bmatrix} \sigma (x_{1}) \\ \sigma (x_{2}) \\ \vdots \\ \sigma (x_{n}) \end{bmatrix} $$
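For instance, here is a minimal sketch of this element-wise convention, assuming NumPy and using the logistic function as a stand-in for $\sigma$:

```python
import numpy as np

def sigma(x):
    # Logistic function used here as an example scalar activation;
    # NumPy applies it to each component of the input separately.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(sigma(x))  # sigma(x_1), sigma(x_2), sigma(x_3)
```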
Motivation
A threshold is the minimum intensity of stimulus required for an organism to produce a response. Deep learning imitates this by applying an activation function to the computed output of each node before passing it to the next layer. Without this non-linear adjustment, stacking multiple hidden layers would be pointless: a composition of affine maps is itself just another affine map, so the whole network would collapse into a single linear computation. There are various activation functions, and which one is best is truly a case-by-case matter. There is hardly any theory on how performance changes depending on the choice of activation function; in practice, one tries several and sticks with whichever yields good results.
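As a rough illustration of why the non-linearity matters, the following sketch (the shapes and names are just assumptions for the example) shows that two affine layers with no activation in between reduce to a single affine map:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # first layer: R^3 -> R^4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # second layer: R^4 -> R^2
x = rng.normal(size=3)

# Two affine layers applied back to back, with no activation in between ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... collapse into a single affine layer W x + b.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```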
Examples
Step Function
$$u (x) := \begin{cases} 0 & , x < 0 \\ 1 & , x \ge 0 \end{cases} $$
The step function fits the concept of a threshold best, but it is difficult to use in practice because it oversimplifies the computed result. It is helpful to think of the other activation functions as being designed to behave like the step function without actually being one.
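The following sketch (again assuming NumPy) shows what "oversimplifies" means here: every non-negative input is mapped to the same value, so the magnitude of the computed result is lost.

```python
import numpy as np

def step(x):
    # u(x) = 0 for x < 0, 1 for x >= 0, applied element-wise
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.1, 0.0, 0.5, 42.0])
print(step(x))  # [0. 0. 1. 1. 1.] -- 0.5 and 42.0 become indistinguishable
```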
Sigmoid Function
The most famous of the sigmoid functions is perhaps the logistic function $\displaystyle \sigma (x) := {{1} \over { 1 + e^{-x} }}$, which looks like a continuous version of the step function. Although its range is different, $\tanh x$ was used for a similar reason. Recently, these functions have fallen out of favor due to the vanishing gradient problem.
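For reference, the derivative of the logistic function is bounded above by $1/4$; multiplying many such small factors together during backpropagation is one way to see why the gradient shrinks toward zero across many layers.

$$ \sigma ' (x) = \sigma (x) \left( 1 - \sigma (x) \right) \le \frac{1}{4} $$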
ReLU (Rectified Linear Unit) Function
$$\operatorname{ReLU} (x) := \max \left\{ 0 , x \right\}$$
This function was designed to overcome the issues with the sigmoid function. For $x < 0$ the value is killed off entirely, and only for $x \ge 0$ is the input passed on as is, which is exactly the thresholding behavior characteristic of an activation function.
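A minimal sketch of the ReLU and its (sub)gradient, assuming NumPy; the gradient is $1$ wherever the input survives, which is one reason it sidesteps the shrinking gradients of the sigmoid:

```python
import numpy as np

def relu(x):
    # max{0, x}, applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # (Sub)gradient: 0 where the input is killed, 1 where it is passed through
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```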