
What is an F1 Score in Data Science?

Definition

Let’s assume a model is given for distinguishing Positive $P$ and Negative $N$ in a classification problem. The number of positives correctly identified as positive is referred to as True Positive $TP$, the number of negatives correctly identified as negative is referred to as True Negative $TN$, the number of positives incorrectly identified as negative is referred to as False Negative $FN$, and the number of negatives incorrectly identified as positive is referred to as False Positive $FP$.

Mathematical Definition

The following quantity is referred to as the model’s Precision.
$$ \textrm{Precision} = {{ TP } \over { TP + FP }} $$
The following quantity is referred to as the model’s Recall.
$$ \textrm{Recall} = {{ TP } \over { P }} = {{ TP } \over { TP + FN }} $$
The harmonic mean of precision and recall is referred to as the model’s F1 Score.
$$ F_{1} := {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} $$
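To make the definitions above concrete, here is a minimal sketch in plain Python that counts $TP$, $FP$, $FN$ and computes the three quantities; the label lists are made up purely for illustration.

```python
# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)             # TP / (TP + FP)
recall = tp / (tp + fn)                # TP / P, with P = TP + FN
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75
```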

Description

The $F_{1}$ score is a metric designed to compensate for the shortcomings of precision and recall individually: it imposes a heavier penalty when either one is low, and so assesses whether the model is good overall. It is the most commonly used metric for evaluating models in binary classification problems where the labels are imbalanced, that is, where positives and negatives occur in very different proportions.
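As a concrete illustration, here is a minimal sketch assuming scikit-learn is installed; the arrays are made up, and `zero_division=0` only suppresses the warning raised when no positives are predicted. On imbalanced data, a classifier that predicts ‘negative’ for everything looks excellent by accuracy, while its $F_{1}$ score exposes that it finds no positives at all.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical, heavily imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a degenerate "model" that always predicts negative

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- no positives found
```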

Intuitive Meaning

By itself, the $F_{1}$ score does not have an intuitive interpretation. It may seem arbitrary that the harmonic mean appears, and indeed explanations found everywhere introduce precision and recall first and only then summarize them with the $F_{1}$ score.

If precision represents the ‘power to identify positives accurately’ and recall represents the ‘power to find all the positives’, then the $F_{1}$ score can be said to represent ‘the power to keep precision and recall in balance’. Although not always mathematically guaranteed, precision and recall are empirically in a trade-off relationship, and depending on the choices made it is possible to inflate one metric at the expense of the other. Especially in situations where accuracy is already a problematic metric, one should stay aware of this imbalance when relying on precision or recall.

Mathematical Interpretation

There are many explanations of the $F_{1}$ score, including ones based on the geometric properties of the harmonic mean itself. Personally, however, such explanations only clarify ‘why it is defined that way’ for readers who already know what $F_{1}$ is; they do not easily tell a newcomer ‘how the $F_{1}$ score should be interpreted’. Let us therefore start from the most basic numbers.

Since precision is ‘the fraction of those deemed positive that are actually positive’, it lies between $0$ and $1$, and likewise recall, ‘the fraction of the actual positives that are deemed positive’, also lies between $0$ and $1$. In other words, precision and recall satisfy the following inequalities.
$$ \begin{align*} 1 \le & {{ 1 } \over { \textrm{Precision} }} \\ 1 \le & {{ 1 } \over { \textrm{Recall} }} \\ 2 \le & \textrm{Precision}^{-1} + \textrm{Recall}^{-1} \end{align*} \\ \implies F_{1} = {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} \le 1 $$
Thus the largest value $F_{1}$ can take is $1$, and equality holds only when both precision and recall equal $1$. If either metric approaches $0$, the denominator diverges, so the infimum of $F_{1}$ is $0$; put plainly,
$$ 0 < F_{1} \le 1 $$
so it is a quantity normalized between $0$ and $1$, where being closer to $1 = 100\%$ means a better score. Of course, $F_{1}$ approaching $0$ refers to the case where the two metrics can vary freely and become extremely unbalanced; in reality, as the following theorem shows, $F_{1}$ stays between the two, leaning toward the smaller one.

Upper and Lower Bounds of the Harmonic Mean: The harmonic mean of $a, b > 0$ lies between $a$ and $b$, where $\max$ and $\min$ denote the maximum and minimum, respectively.
$$ \min (a,b) \le H (a,b) \le \max (a,b) $$
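For example, with made-up values, a model with precision $0.9$ and recall $0.1$ has an arithmetic mean of $0.5$, but
$$ F_{1} = {{ 2 \cdot 0.9 \cdot 0.1 } \over { 0.9 + 0.1 }} = 0.18, $$
which indeed lies between the two values while sitting much closer to the smaller one.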

Meanwhile, as mentioned, precision and recall are usually in a trade-off relationship. Suppose, simply, that for some $c \in (0,2)$
$$ \textrm{Precision} + \textrm{Recall} = c $$
meaning that when one side is high the other is correspondingly low, and let us see what happens to $F_{1}$. To explain with concrete numbers, take $c = 1.6$ and compute directly for the most balanced model $A(0.8, 0.8)$, a reasonable model $B(0.7, 0.9)$, and the most unbalanced model $C(0.6, 1.0)$. Although all three have the same performance of $0.8$ by the arithmetic mean of the two metrics, computing each $F_{1}$ yields
$$ \begin{align*} F_{1}(A) =& {{ 2 } \over { 0.8^{-1} + 0.8^{-1} }} = 0.8 \\ F_{1}(B) =& {{ 2 } \over { 0.7^{-1} + 0.9^{-1} }} = 0.7875 \\ F_{1}(C) =& {{ 2 } \over { 0.6^{-1} + 1.0^{-1} }} = 0.75 \end{align*} $$
so the most unbalanced model is rated the lowest.

The reason can be seen in the formula for the $F_{1}$ score,
$$ \begin{align*} F_{1} =& {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} \\ =& 2 {{ \textrm{Precision} \cdot \textrm{Recall} } \over { \textrm{Precision} + \textrm{Recall} }} \end{align*} $$
by looking at its numerator and denominator. Imagine a rectangle with sides of length $\textrm{Precision}$ and $\textrm{Recall}$: twice the denominator is its perimeter, and half the numerator is its area. As is well known, among rectangles with a fixed perimeter the area is largest when width and height are equal, that is, when $\textrm{Precision} = \textrm{Recall}$, and performance drops as the model deviates from this condition.
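A quick numerical check of the three models above; this is a throwaway sketch, and the pairs are just the precision and recall values from the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

# (precision, recall) pairs for models A, B, C -- all with arithmetic mean 0.8
for name, (p, r) in {"A": (0.8, 0.8), "B": (0.7, 0.9), "C": (0.6, 1.0)}.items():
    print(name, round(f1(p, r), 4))  # A 0.8, B 0.7875, C 0.75
```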

Summary

To summarize, the $F_{1}$ score is a useful metric when both precision and recall matter. It rates a model poorly when one is high but the other is low, and within a given trade-off it rates a model highly only when the two are balanced. Improving the model’s performance then means improving the ‘given trade-off’ itself, which makes the $F_{1}$ score a relatively manipulation-resistant and reliable metric.

The downside is that it is hard to explain intuitively. For non-experts, and especially for the general public, precision and recall are already difficult to grasp, let alone the difference between them; explaining their harmonic mean, the $F_{1}$ score, is nearly impossible.

Generalization

$$ F_{\beta} := {{ 1 + \beta^{2} } \over { \textrm{Precision}^{-1} + \beta^{2} \textrm{Recall}^{-1} }} $$
$F_{1}$ is the special case of $F_{\beta}$ with $\beta = 1$, where the two weights coincide. If $\beta = 0$, it reduces exactly to precision, and as $\beta$ increases it is influenced more and more by recall. In this sense, the $F_{1}$ score can be regarded as the metric that treats precision and recall as equally important. $F_{\beta}$ becomes meaningful in scenarios like the one described under ‘Mathematical Interpretation’, when the trade-off between precision and recall is assumed to be not
$$ \textrm{Precision} + \textrm{Recall} = c $$
but rather
$$ \textrm{Precision} + \beta^{2} \textrm{Recall} = c . $$
Even with the same data and the same model, which kind of error is more severe depends greatly on the problem.
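As a quick illustration with invented numbers (the function below simply restates the $F_{\beta}$ formula above; scikit-learn offers the same computation as sklearn.metrics.fbeta_score): for a model with high precision $0.9$ but low recall $0.3$, increasing $\beta$ makes the weak recall weigh more and more heavily.

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with high precision and low recall.
p, r = 0.9, 0.3
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 4))
# 0.5 -> 0.6429 (leans toward precision)
# 1.0 -> 0.45   (plain F1)
# 2.0 -> 0.3462 (leans toward recall, so the low recall dominates)
```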

See Also