
Activation Functions in Deep Learning

Definition

A non-linear function that mimics the threshold of real-life organisms is called an activation function.

Mathematical Definition

In deep learning, a non-linear scalar function $\sigma : \mathbb{R}^{n} \to \mathbb{R}$ is referred to as an activation function.

  • Of course, there are exceptions such as the softmax, which does not fit this definition; see the sketch below.
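
For instance, a quick numpy sketch (the function name and sample values here are purely illustrative) shows that the softmax returns a whole vector rather than a single scalar, which is why it falls outside the definition above.

```python
import numpy as np

def softmax(x):
    # Maps a vector in R^n to a vector in R^n whose entries sum to 1,
    # so it is not a scalar-valued function.
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / np.sum(z)

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))        # a length-3 vector, not a single number
print(softmax(x).sum())  # 1.0
```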

Explanation

On the other hand, a vector function is called a layer.

If an expression or a piece of code shows an activation function defined as $\sigma : \mathbb{R} \to \mathbb{R}$ taking a vector as input, it means that the function is applied element-wise:

$$ \sigma \left( \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} \right) = \begin{bmatrix} \sigma (x_{1}) \\ \sigma (x_{2}) \\ \vdots \\ \sigma (x_{n}) \end{bmatrix} $$
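
As a concrete illustration, here is a minimal numpy sketch (using the logistic function as an example; the names are arbitrary) of a scalar activation acting on every entry of a vector.

```python
import numpy as np

def logistic(x):
    # Scalar definition sigma(x) = 1 / (1 + exp(-x)); numpy broadcasting
    # applies it to every entry when x is a vector.
    return 1.0 / (1.0 + np.exp(-x))

v = np.array([-2.0, 0.0, 3.0])
print(logistic(v))  # [sigma(-2), sigma(0), sigma(3)], computed element-wise
```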

Motivation

Threshold refers to the minimum intensity of stimulus required for an organism to produce a response. Deep learning simulates this by applying an activation function to the computation result of each node and passing it to the next layer. Without this nonlinear adjustment, there would be no point in having multiple computations across hidden layers in deep learning.

There are various types of activation functions, and which one is best is truly a case-by-case matter. There is hardly any theory on how performance changes depending on which activation function is used; in practice, one tries different ones and sticks with whichever yields good results.

Examples

Step Function

$$ u (x) := \begin{cases} 0 & , x < 0 \\ 1 & , x \ge 0 \end{cases} $$

The step function matches the concept of a threshold most closely, but it is difficult to use in practice because it oversimplifies the computation result. It helps to think of the other activation functions as functions designed to behave like the step function without actually being one.
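
A minimal numpy sketch of the definition above (the function name is just for illustration):

```python
import numpy as np

def step(x):
    # u(x) = 0 for x < 0 and 1 for x >= 0, applied element-wise.
    return np.where(x >= 0, 1.0, 0.0)

print(step(np.array([-1.5, 0.0, 2.0])))  # [0. 1. 1.]
```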

Sigmoid Function

The most famous among sigmoid functions is probably the logistic function $\displaystyle \sigma (x) := {{1} \over { 1 + e^{-x} }}$, which looks like a continuous version of the step function. Although its range is different, $\tanh x$ was used for a similar reason. Recently, these functions have fallen out of favor due to the vanishing gradient problem.
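
A rough numerical sketch of why the gradient vanishes (assuming the logistic function; the names are illustrative): its derivative $\sigma ' (x) = \sigma (x) \left( 1 - \sigma (x) \right)$ never exceeds $1/4$ and shrinks rapidly as $|x|$ grows, so multiplying many such factors across layers drives gradients toward zero.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), bounded above by 0.25
    s = logistic(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, logistic_grad(x))  # the gradient decays toward 0 as |x| grows
```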

ReLU (Rectified Linear Unit) Function

$$\operatorname{ReLU} (x) := \max \left\{ 0 , x \right\}$$

This function was designed to overcome the shortcomings of the sigmoid function. For $x < 0$ the output is killed off entirely, and only for $x \ge 0$ is the input passed on unchanged, which is exactly the thresholding behavior expected of an activation function.
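
A minimal numpy sketch of the definition (the function name is just for illustration):

```python
import numpy as np

def relu(x):
    # max(0, x) element-wise: negative inputs are zeroed out,
    # non-negative inputs pass through unchanged.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 4.0])))  # [0. 0. 0. 4.]
```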