Loss Functions in Machine Learning

Definition

When an estimate for the data $Y = \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}$ is given as $\widehat{Y} = \begin{bmatrix} \widehat{ y_{1} } \\ \vdots \\ \widehat{y_{n}} \end{bmatrix}$, the scalar function $L : \mathbb{R}^{n} \to [ 0 , \infty )$ that represents the discrepancy between the data and its estimate is called a loss function.

Explanation

The loss function is used as an indicator of how far the estimates obtained through learning deviate from the actual data. The larger this value, the worse the estimate; if it equals $0$, there is "no loss", meaning a perfect estimate. This is not much different from a metric as discussed in mathematics.

Since the term was originally used in economics, $L$ is sometimes referred to as the Cost Function.

Types

The following two are representative loss functions; it is enough to know each well enough to use it where appropriate.

MSE (Mean Squared Error)

$$ L \left( \widehat{Y} \right) := {{1} \over {n}} \sum_{i=1}^{n} ( y_{i} - \widehat{ y_{i} } )^2 $$ MSE is a time-honored loss function that is meaningful when $y_{i} \in \mathbb{R}$.
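
For reference, the formula can be written out directly; the following is a minimal sketch assuming NumPy, with made-up example values:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/n) * sum_i (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Made-up data and estimates
Y = [1.0, 2.0, 3.0]
Y_hat = [1.1, 1.9, 3.2]
print(mse(Y, Y_hat))  # ≈ 0.02
```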

Cross Entropy

$$ L \left( \widehat{Y} \right) := - {{1} \over {n}} \sum_{i=1}^{n} \left< y_{i} , \log \left( \sigma ( \hat{y_{i}} ) \right) \right> $$

Cross entropy is used when $Y$ is categorical data: the labels are one-hot encoded, and the loss is computed from the probabilities that $\widehat{Y}$ assigns to each category. It is primarily used in classification problems.

One-hot encoding simply refers to mapping each class to a standard basis vector. When there are $m$ classes, write the standard basis of $\mathbb{R}^{m}$ as $\beta = \left\{ e_{i} \right\}_{i=1}^{m}$; then the label and its estimate are vectors

$$ y_{i} \in \beta ,\qquad \hat{y}_{i} \in \mathbb{R}^{m} $$

For example, when $Y$ has 3 classes, $y_{i} = [1,0,0]^{t}$ if $y_{i}$ belongs to the 1st class and $y_{i} = [0,0,1]^{t}$ if it belongs to the 3rd class.
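
For illustration, here is a minimal sketch of such an encoding, assuming NumPy and the 1-based class indexing used above (the helper name `one_hot` is just for this example):

```python
import numpy as np

def one_hot(class_index, m):
    """Map a 1-based class index to the standard basis vector e_i of R^m."""
    e = np.zeros(m)
    e[class_index - 1] = 1.0
    return e

print(one_hot(1, 3))  # [1. 0. 0.]
print(one_hot(3, 3))  # [0. 0. 1.]
```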

$\sigma$ is the softmax function, which squashes the components of a given vector into $(0,1)$ so that they satisfy the conditions of a probability distribution, and $\left< \cdot , \cdot \right>$ is the dot product. Since $\sigma ( \hat{y_{i}} ) \in (0,1)$ componentwise, we have $\log \left( \sigma ( \hat{y_{i}} ) \right) < 0$, and since every component of $y_{i}$ is either $0$ or $1$, it is easy to check that $$ L \left( \widehat{Y} \right) = - {{1} \over {n}} \sum_{i=1}^{n} \left< y_{i} , \log \left( \sigma ( \hat{y_{i}} ) \right) \right> \ge 0 $$
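
Concretely, the standard softmax maps a vector $\hat{y}_{i} = ( \hat{y}_{i1} , \cdots , \hat{y}_{im} )$ componentwise to

$$ \sigma ( \hat{y}_{i} )_{j} = \frac{ e^{ \hat{y}_{ij} } }{ \sum_{k=1}^{m} e^{ \hat{y}_{ik} } } $$

so every component lies in $(0,1)$ and the components sum to $1$.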

The individual components of the estimate $\hat{ y_{i} } = ( \hat{ y_{i1} } , \cdots , \hat{y_{ij}} , \cdots , \hat{y_{im}} )$ of $y_{i} = ( y_{i1} , \cdots , y_{ij} , \cdots , y_{im} )$ are larger where the predicted probability is higher and smaller where it is lower. Passing them through the softmax function pushes the likely classes toward $1$ and the unlikely ones toward $0$. If the true component is $1$ but the predicted probability $c$ is small, $c \ll 1$, then the term $- 1 \cdot \log (c)$ becomes very large. Conversely, if the true component is $0$, the term $- 0 \cdot \log(c)$ contributes nothing, no matter how high the predicted probability was. Thus it is easy to see that the cross entropy skyrockets as more mistakes are made.
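
This behaviour can be checked numerically. The sketch below implements the naive definitions above with NumPy rather than any particular library's loss function; the scores and labels are made up for the example, and they show the loss jumping when a confident prediction is wrong:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the components lie in (0, 1) and sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtracting the max improves numerical stability
    return e / e.sum()

def cross_entropy(Y, Y_hat):
    """-(1/n) * sum_i <y_i, log(softmax(y_hat_i))>, with one-hot rows y_i."""
    n = len(Y)
    return -sum(np.dot(y, np.log(softmax(y_hat))) for y, y_hat in zip(Y, Y_hat)) / n

Y = [[1, 0, 0]]               # the true class is the 1st of 3
good = [[5.0, 1.0, 0.0]]      # high score on the correct class
bad = [[0.0, 1.0, 5.0]]       # high score on a wrong class
print(cross_entropy(Y, good))  # ≈ 0.02
print(cross_entropy(Y, bad))   # ≈ 5.03
```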

See Also