
What is Recall in Data Science?

Definition

Suppose that in a classification problem distinguishing between Positive $P$ and Negative $N$, a model is given that predicts whether each sample is positive or negative. Call the number of correctly identified positives True Positives $TP$ and the number of correctly identified negatives True Negatives $TN$; call positives incorrectly identified as negative False Negatives $FN$, and negatives incorrectly identified as positive False Positives $FP$.

Mathematical Definition

The following metric is referred to as the Recall of a model. $$ \textrm{Recall} := {{ TP } \over { P }} = {{ TP } \over { TP + FN }} $$
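As a concrete illustration, here is a minimal Python sketch of this formula. The function name `recall` and the toy label arrays are ours, chosen for the example, not part of any particular library.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / P: the fraction of actual positives the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    p = sum(1 for t in y_true if t == positive)
    return tp / p

# Toy example: 4 actual positives, of which the model finds 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(recall(y_true, y_pred))  # 3/4 = 0.75
```

Libraries such as scikit-learn provide the same computation as `sklearn.metrics.recall_score`.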

Explanation

Recall is a metric that indicates the ability of a model to identify all positive cases without omission. Another term for recall is Sensitivity. Depending on the context, sensitivity is more commonly used in statistics, whereas recall is favored in computer science and machine learning.

Situations Where Recall is Important

As with accuracy, no single metric tells the whole story when evaluating models in classification problems. A model claiming “I have high recall” is saying “I missed few positive cases”. Recall is often used as the primary performance measure in situations where Type II errors (false negatives) are the more serious mistake:

  • Information Retrieval: The internet is an ocean of information, and the proportion of documents genuinely relevant to any given query is vanishingly small compared to the whole. If a document relevant to my query is considered positive, a key ability for a search engine is to retrieve as many of the relevant documents as possible from the vast collection.
  • Cancer Diagnosis: Medical advances have made cancer treatable to some extent, but outcomes differ enormously between detecting it early and catching it only in the late stages. Early-stage cancers in particular usually cause no pain or noticeable symptoms, so it is crucial to sensitively detect even minor signs, whether symptoms are present or not. A false positive can be resolved by a prompt follow-up examination, but missing an asymptomatic cancer patient means the next diagnosis will come only after the cancer has progressed far enough to cause symptoms.

Situations Where Recall is Overestimated

Just as accuracy isn’t everything, recall is not infallible either. The formula for recall contains no penalty for False Positives $FP$, so in theory a model can achieve 100% recall on any dataset by simply labeling every sample as positive, as the sketch below demonstrates.
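Here is a hypothetical sketch of that degenerate “always positive” model; the labels are invented for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / P: the fraction of actual positives the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return tp / sum(1 for t in y_true if t == positive)

# A degenerate "model" that labels every sample as positive.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # any ground truth will do
y_pred = [1] * len(y_true)                 # predict positive for everything

print(recall(y_true, y_pred))  # 1.0 -- perfect recall, useless model
```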

For example, suppose a dataset has 1000 samples, 10 of which are positive. If a model A identifies 9 of the 10 positives but also produces 91 false positives, its recall is $$ {{ 9 } \over { 10 }} = 90 \% $$ Despite the 91 false positives, model A, having detected 9 of 10 positives, looks decent in terms of recall. Meanwhile, if another model B identifies the same 9 positives with only 1 false positive, its recall is likewise $$ {{ 9 } \over { 10 }} = 90 \% $$ A and B are thus rated as equal in terms of recall, which may seem unfair to B. Strictly speaking, though, neither measurement is misrepresented: each accurately counts the positives detected, and the interpretation is not incorrect.

If one feels that “B is better than A”, then one has understood how recall can be overestimated. A’s positive predictions are unreliable, leading to the absurd situation of having to search for the real positives among the samples it flagged. In a sense, the complementary concept is Precision, and addressing the shortcomings of both leads to the $F_{1}$ Score. The sketch below compares the two models on all three metrics.
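A minimal sketch of that comparison, computing recall, precision, and $F_{1}$ directly from the confusion counts given above; the helper function `metrics` is ours, invented for the example.

```python
def metrics(tp, fp, fn):
    """Recall, precision, and F1 from confusion counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Model A: 9 true positives, 91 false positives, 1 missed positive.
# Model B: 9 true positives,  1 false positive,  1 missed positive.
for name, tp, fp, fn in [("A", 9, 91, 1), ("B", 9, 1, 1)]:
    r, p, f1 = metrics(tp, fp, fn)
    print(f"Model {name}: recall={r:.2f}, precision={p:.2f}, F1={f1:.2f}")
# Model A: recall=0.90, precision=0.09, F1=0.16
# Model B: recall=0.90, precision=0.90, F1=0.90
```

Identical recall, wildly different precision; the $F_{1}$ score makes the gap between A and B visible.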

See Also