What is an F1 Score in Data Science?
Definition
Suppose a model is given that distinguishes the Positive class $P$ from the Negative class $N$ in a classification problem. The number of positives correctly identified as positive is called True Positive $TP$, the number of negatives correctly identified as negative is called True Negative $TN$, the number of positives incorrectly identified as negative is called False Negative $FN$, and the number of negatives incorrectly identified as positive is called False Positive $FP$.
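As a minimal sketch of these definitions, the four counts can be read off directly from paired lists of true and predicted labels; the toy labels below (with 1 for positive and 0 for negative) are purely illustrative.
```python
# Minimal sketch: counting TP, TN, FN, FP from ground-truth and predicted labels.
# The label encoding (1 = positive, 0 = negative) and the toy arrays are illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # model predictions

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # positives found
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # negatives found
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # positives missed
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # negatives mislabelled

print(TP, TN, FN, FP)  # 3 3 1 1
```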
Mathematical Definition
The following quantity is called the model’s Precision. $$ \textrm{Precision} = {{ TP } \over { TP + FP }} $$ The following quantity is called the model’s Recall, where the denominator $P = TP + FN$ is the total number of actual positives. $$ \textrm{Recall} = {{ TP } \over { P }} $$ The harmonic mean of precision and recall is called the model’s F1 Score. $$ F_{1} := {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} $$
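A small sketch of the three formulas, using illustrative counts in place of a real model’s confusion matrix:
```python
# Precision, Recall, and F1 computed directly from confusion counts.
# The counts below are illustrative placeholders, not real results.
TP, FP, FN = 3, 1, 1

precision = TP / (TP + FP)             # 0.75
recall = TP / (TP + FN)                # TP / P, since P = TP + FN; 0.75
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two; 0.75

print(precision, recall, f1)
```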
Description
The $F_{1}$ score is a metric designed to compensate for the weaknesses of precision and recall taken individually: it imposes a heavier penalty when either one is low, and so assesses whether the model is good across the board. It is the most commonly used metric for evaluating models in binary classification problems where the labels are heavily imbalanced, that is, where positives and negatives occur at very different rates.
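The contrast with accuracy is easy to see on a toy imbalanced dataset; the following sketch assumes scikit-learn is available and uses made-up labels.
```python
# Sketch of why F1 is preferred over accuracy on imbalanced labels.
from sklearn.metrics import accuracy_score, f1_score

# 90 negatives, 10 positives; the "model" simply predicts negative for everything.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good despite finding no positives
print(f1_score(y_true, y_pred))        # 0.0 -- (sklearn warns precision is undefined and treats it as 0)
```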
Intuitive Meaning
By itself, the $F_{1}$ score does not offer an intuitive interpretation. The appearance of the harmonic mean may seem arbitrary, and descriptions found everywhere introduce precision and recall first before summarizing with the $F_{1}$ score.
If precision represents the ‘power to identify positives accurately’ and recall represents the ‘power to find all the positives’, then the $F_{1}$ score can be said to represent ‘the power to keep precision and recall in balance’. Although not mathematically guaranteed, precision and recall are empirically in a trade-off relationship, and depending on the choices made it is possible to favour one metric excessively at the expense of the other. In particular, when accuracy is unreliable, one should be aware of such imbalances when relying on precision or recall alone.
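One way to see the trade-off is to sweep the decision threshold applied to a model’s predicted scores; the scores and labels below are invented purely for illustration.
```python
# Sketch of the empirical precision/recall trade-off: raising the decision
# threshold increases precision but lowers recall on this toy data.
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```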
Mathematical Interpretation
There are many explanations of the $F_{1}$ score, including ones that appeal to the geometric properties of the harmonic mean itself. Personally, however, such explanations clarify ‘why it is defined that way’ only for someone who already knows what $F_{1}$ is, rather than telling a newcomer ‘how the $F_{1}$ score should be interpreted’. Let’s start with the most basic numbers.
Since precision is ‘the fraction of those deemed positive that are actually positive’, it ranges between $0$ and $1$, and similarly recall, ‘the fraction of the actual positives that were deemed positive’, also ranges between $0$ and $1$. In other words, precision and recall satisfy the following inequalities. $$ \begin{align*} 1 \le & {{ 1 } \over { \textrm{Precision} }} \\ 1 \le & {{ 1 } \over { \textrm{Recall} }} \\ 2 \le & \textrm{Precision}^{-1} + \textrm{Recall}^{-1} \end{align*} \\ \implies F_{1} = {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} \le 1 $$ Thus, $F_{1}$ attains its largest value $1$, and equality holds only when both precision and recall are equal to $1$. If either of the two metrics approaches $0$, the denominator diverges, so the infimum of $F_{1}$ is $0$; to put it plainly, $$ 0 < F_{1} \le 1 $$ it is a figure normalized between $0$ and $1$, where values closer to $1 = 100\%$ are better. Of course, $F_{1}$ approaches $0$ only when the two metrics vary freely and become extremely unbalanced; in general, by the following theorem, it is pulled toward the smaller of the two.
Upper and Lower Bounds of the Harmonic Mean: The harmonic mean $H(a,b) = 2 / \left( a^{-1} + b^{-1} \right)$ of $a,b > 0$ lies between $a$ and $b$, where $\max$ and $\min$ denote the maximum and minimum values, respectively. $$ \min (a,b) \le H (a,b) \le \max (a,b) $$
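A quick numerical check of this behaviour, with illustrative precision and recall values:
```python
# Illustration that F1, as a harmonic mean, stays between min and max of
# precision and recall and is pulled toward the smaller of the two.
def f1(precision, recall):
    return 2 / (1 / precision + 1 / recall)

print(f1(0.5, 0.5))   # 0.5     -- equal inputs: the harmonic mean equals them
print(f1(0.9, 0.1))   # 0.18    -- far closer to 0.1 than to 0.9
print(f1(1.0, 0.01))  # ~0.0198 -- dominated by the near-zero metric
```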
Meanwhile, as mentioned, precision and recall are usually in a trade-off relationship. Suppose simply that for some $c \in (0,2)$, $$ \textrm{Precision} + \textrm{Recall} = c $$ meaning that when one side is high, the other is correspondingly low, and let’s examine what happens to $F_{1}$. To make this concrete with specific figures, let’s directly compute, for $c = 1.6$, the most balanced model $A(0.8, 0.8)$, a reasonable model $B(0.7, 0.9)$, and the most unbalanced model $C(0.6, 1.0)$. Although all three have the same performance of $0.8$ under the arithmetic mean of the two metrics, computing each $F_{1}$ yields $$ \begin{align*} F_{1}(A) =& {{ 2 } \over { 0.8^{-1} + 0.8^{-1} }} = 0.8 \\ F_{1}(B) =& {{ 2 } \over { 0.7^{-1} + 0.9^{-1} }} = 0.7875 \\ F_{1}(C) =& {{ 2 } \over { 0.6^{-1} + 1.0^{-1} }} = 0.75 \end{align*} $$ showing that the most unbalanced model is rated the lowest.
This behaviour can be read off from the formula of the $F_{1}$ score $$ \begin{align*} F_{1} =& {{ 2 } \over { \textrm{Precision}^{-1} + \textrm{Recall}^{-1} }} \\ =& 2 {{ \textrm{Precision} \cdot \textrm{Recall} } \over { \textrm{Precision} + \textrm{Recall} }} \end{align*} $$ by looking at its numerator and denominator. Imagining a rectangle with sides of lengths $\textrm{Precision}$ and $\textrm{Recall}$, twice the denominator is the perimeter and the numerator is the area. As is well known, a rectangle with a fixed perimeter has the largest area when the width and height are exactly equal, i.e., when $\textrm{Precision} = \textrm{Recall}$, and performance drops as it deviates from this condition.
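The comparison of $A$, $B$ and $C$ above can be reproduced in a few lines; the helper function simply implements the product-over-sum form of the formula.
```python
# Reproducing the A, B, C comparison: same arithmetic mean (0.8),
# different F1 because the harmonic mean penalises imbalance.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

for name, (p, r) in {"A": (0.8, 0.8), "B": (0.7, 0.9), "C": (0.6, 1.0)}.items():
    print(name, (p + r) / 2, round(f1(p, r), 4))
# A 0.8 0.8
# B 0.8 0.7875
# C 0.8 0.75
```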
Summary
To summarize, the $F_{1}$ score is a useful metric when both precision and recall matter. It penalizes a model in which one is high but the other is low, and within a given trade-off it is rated highly only when the two are balanced. Improving the model’s performance means improving that ‘given trade-off’ itself, which makes the $F_{1}$ score a relatively manipulation-resistant and reliable metric.
The downside is that it is hard to explain intuitively. For non-experts, and the general public in particular, precision and recall are already difficult to grasp, let alone the difference between them; explaining their harmonic mean, the $F_{1}$ score, is nearly impossible.
Generalization
$$ F_{\beta} := {{ 1 + \beta^{2} } \over { \textrm{Precision}^{-1} + \beta^{2} \textrm{Recall}^{-1} }} $$ $F_{1}$ is the special case of $F_{\beta}$ with $\beta = 1$, in which the two metrics are weighted equally. If $\beta = 0$, it reduces exactly to precision, and as $\beta$ increases it becomes more influenced by recall. In this sense, the $F_{1}$ score can be regarded as the metric that treats precision and recall with equal importance. $F_{\beta}$ becomes meaningful in scenarios like the one described under ‘Mathematical Interpretation’, not when precision and recall trade off as $$ \textrm{Precision} + \textrm{Recall} = c $$ but when they are assumed to trade off as $$ \beta^{2} \textrm{Precision} + \textrm{Recall} = c $$ Even with the same data and the same model, which kind of error is more severe depends greatly on the problem.
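As a sketch, the weighted harmonic mean can be written directly from the formula above; the precision and recall values and the choices of $\beta$ are illustrative (scikit-learn’s `fbeta_score` computes the same quantity from label arrays).
```python
# Sketch of F_beta as a weighted harmonic mean of precision and recall,
# using the standard convention: beta > 1 emphasises recall, beta < 1 precision.
def f_beta(precision, recall, beta):
    return (1 + beta**2) / (1 / precision + beta**2 / recall)

precision, recall = 0.6, 0.9  # illustrative values
print(f_beta(precision, recall, 1.0))  # plain F1, 0.72
print(f_beta(precision, recall, 2.0))  # F2 ~0.82, pulled toward recall
print(f_beta(precision, recall, 0.5))  # F0.5 ~0.64, pulled toward precision
```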