

Loss Functions in Machine Learning

Definition

When an estimate for the data $Y = \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}$ is given as $\widehat{Y} = \begin{bmatrix} \widehat{y_{1}} \\ \vdots \\ \widehat{y_{n}} \end{bmatrix}$, the scalar function $L : \mathbb{R}^{n} \to [0, \infty)$ that represents the discrepancy between the data and its estimate is called a loss function.

Explanation

The loss function is used as an indicator of how much the estimated value obtained through learning differs from the actual data. The larger this value, the worse the estimate; if it equals $0$, there is 'no loss', meaning the estimate is perfect. This is not much different from a metric as discussed in mathematics.

Since the term was originally used in economics, $L$ is sometimes referred to as the Cost Function.

Types

The following two are representative types of loss functions, and it’s sufficient to know only as much as needed to use them appropriately.

MSE (Mean Squared Error)

$$ L \left( \widehat{Y} \right) := \frac{1}{n} \sum_{i=1}^{n} \left( y_{i} - \widehat{y_{i}} \right)^{2} $$

MSE is a time-honored loss function, which can be meaningfully used when $y_{i} \in \mathbb{R}$.
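
As a quick illustration, the MSE above can be computed with NumPy as follows; this is only a minimal sketch, and the arrays `Y` and `Y_hat` are made-up examples.

```python
import numpy as np

def mse(Y: np.ndarray, Y_hat: np.ndarray) -> float:
    """Mean squared error between the data Y and its estimate Y_hat."""
    return float(np.mean((Y - Y_hat) ** 2))

# A perfect estimate gives a loss of 0; a worse estimate gives a larger loss.
Y = np.array([1.0, 2.0, 3.0])
print(mse(Y, np.array([1.0, 2.0, 3.0])))  # 0.0
print(mse(Y, np.array([1.5, 2.0, 2.0])))  # 0.4166...
```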

Cross Entropy

$$ L \left( \widehat{Y} \right) := - \frac{1}{n} \sum_{i=1}^{n} \left< y_{i} , \log \left( \sigma ( \widehat{y_{i}} ) \right) \right> $$

When $Y$ is categorical data, cross entropy is computed from the probabilities that $\widehat{Y}$ assigns to each category, and it becomes effective together with what is known as one-hot encoding. It is primarily used in classification problems.

One-hot encoding simply refers to the mapping to the standard basis. When there are $m$ classes, denote the standard basis of $\mathbb{R}^{m}$ by $\beta = \left\{ e_{i} \right\}_{i=1}^{m}$; then $y_{i}$ and $\widehat{y_{i}}$ are represented as vectors with

$$ y_{i} \in \beta, \qquad \widehat{y}_{i} \in \mathbb{R}^{m}. $$

For example, when $Y$ has $3$ classes, $y_{i} = [1,0,0]^{t}$ if $y_{i}$ belongs to the 1st class, and $y_{i} = [0,0,1]^{t}$ if it belongs to the 3rd class.
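
The mapping to the standard basis can be sketched as below; the helper `one_hot` and the 1-based class indices follow the convention of the example above and are otherwise illustrative.

```python
import numpy as np

def one_hot(class_index: int, m: int) -> np.ndarray:
    """Map a 1-based class index to the standard basis vector e_i of R^m."""
    e = np.zeros(m)
    e[class_index - 1] = 1.0
    return e

# With m = 3 classes: class 1 -> [1, 0, 0], class 3 -> [0, 0, 1]
print(one_hot(1, 3))  # [1. 0. 0.]
print(one_hot(3, 3))  # [0. 0. 1.]
```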

$\sigma$ is the softmax function, which maps the given vector to one whose components lie in $[0,1]$ and satisfy the conditions of a probability distribution, and $\left< \cdot , \cdot \right>$ is the dot product. Since every component of $\sigma ( \widehat{y_{i}} )$ lies in $[0,1]$, every component of $\log \left( \sigma ( \widehat{y_{i}} ) \right)$ is non-positive, and since the components of $y_{i}$ are either $0$ or $1$, it is easy to check that

$$ L \left( \widehat{Y} \right) = - \frac{1}{n} \sum_{i=1}^{n} \left< y_{i} , \log \left( \sigma ( \widehat{y_{i}} ) \right) \right> \ge 0 $$
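
The non-negativity argument can be checked numerically with a straightforward softmax implementation; the score vector `y_hat` here is an arbitrary made-up example.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax: every component lies in (0, 1) and the components sum to 1."""
    z = z - np.max(z)        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

y_hat = np.array([2.0, -1.0, 0.5])  # arbitrary score vector
p = softmax(y_hat)
print(p, p.sum())    # components in (0, 1), summing to 1
print(np.log(p))     # every component is negative
# A one-hot y has components 0 or 1, so -<y, log(softmax(y_hat))> >= 0.
```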

The individual components of the estimate $\widehat{y_{i}} = ( \widehat{y_{i1}} , \cdots , \widehat{y_{ij}} , \cdots , \widehat{y_{im}} )$ of $y_{i} = ( y_{i1} , \cdots , y_{ij} , \cdots , y_{im} )$ take higher values the more probable the corresponding class is, and lower values the less probable it is. When these are fed into the softmax function, the likely classes are pushed toward $1$ and the unlikely ones toward $0$. If the actual component is $1$ but it was assigned a low probability $c \ll 1$, then $- 1 \cdot \log (c)$ becomes very large. Conversely, even if the actual component is $0$ and its probability was calculated to be high, $- 0 \cdot \log(c)$ contributes nothing. Thus, it is easy to see that the cross entropy skyrockets the more mistakes are made.
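
Putting the pieces together, here is a minimal sketch of the cross entropy for one-hot labels (the softmax is repeated so the snippet stands alone); the score matrices are made up to show that confident mistakes make the loss skyrocket while confident correct predictions keep it near $0$.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def cross_entropy(Y: np.ndarray, Y_hat: np.ndarray) -> float:
    """-(1/n) * sum_i <y_i, log(softmax(y_hat_i))> for one-hot rows y_i and score rows y_hat_i."""
    n = Y.shape[0]
    log_p = np.log(np.apply_along_axis(softmax, 1, Y_hat))
    return float(-np.sum(Y * log_p) / n)

# True classes as one-hot vectors, m = 3
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

confident_right = np.array([[ 5.0, -2.0, -2.0],
                            [-2.0, -2.0,  5.0]])
confident_wrong = np.array([[-2.0,  5.0, -2.0],
                            [ 5.0, -2.0, -2.0]])

print(cross_entropy(Y, confident_right))  # close to 0
print(cross_entropy(Y, confident_wrong))  # large: mistakes blow the loss up
```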

See Also