What is p-hacking?
Terminology
Published research results that fail to replicate, or that are later contradicted by subsequent evidence, are not uncommon [1]. Researchers' publication bias can be divided broadly into two types: the file drawer effect and p-hacking. p-hacking refers to practices that distort research findings in the direction of significance: stopping an experiment only once the results become "significant," reporting only a subset of the measured outcomes, or changing the inclusion/exclusion criteria for outliers, among other tactics [2].
Explanation
First, the file drawer effect means that an otherwise identical study is much less likely to be published if it yields statistically non-significant results. This sounds obvious, but consider a crank who studies the relationship between the price of melons and the number of Somali pirates. If that person found any relationship it would be remarkable, but common sense says no relationship exists, and the statistical tests would most likely come out non-significant. The claim that two variables are unrelated is simply not very attractive; even when it is true, it is less likely to be published and therefore gets fewer opportunities for verification. Such uninteresting null results languish in a desk drawer, forgotten, and we end up with a distorted body of knowledge without even realizing it.
Compared with the file drawer effect, p-hacking is more directly a matter of research ethics: the data are not outright fabricated, but researchers distort results by adding or removing a few data points, or by tweaking conditions, until the outcome becomes statistically "significant." At first glance one might object that fabricating data and slightly modifying it amount to the same thing, since both are manipulations, but the reality is not that simple. Even when the final numbers end up identical, the sequence of decisions and the researcher's state of mind can make some actions understandable and others not.
Fundamentally, the p-value in statistical hypothesis testing is a concept fraught with misunderstanding. A lower p-value does not mean a stronger "force" to reject the null hypothesis; the formal decision is simply whether the p-value crosses the significance level $\alpha$, a binary thresholding. Personally, when I first heard about the concept of p-hacking as an undergraduate, it immediately rang true. I had long felt uneasy about the way statistics were being used, suspecting that this could not be the right way to understand data, and it turned out that the weaknesses I worried about were indeed being pointed out in the literature.
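To make the thresholding point concrete, here is a minimal sketch in Python; the one-sample t-test, the simulated data, and the 0.05 threshold are illustrative assumptions of mine, not anything from the text above.

```python
# A minimal sketch: the formal outcome of a hypothesis test is the binary
# comparison "p <= alpha", not a graded measure of evidential strength.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
measurements = rng.normal(loc=0.4, scale=1.0, size=30)  # hypothetical data

result = stats.ttest_1samp(measurements, popmean=0.0)
alpha = 0.05  # the conventional threshold discussed below

reject_null = result.pvalue <= alpha  # the decision is just this comparison
print(f"p = {result.pvalue:.4f}, reject H0 at alpha = {alpha}: {reject_null}")
```

Whether p comes out as 0.049 or 0.051, the decision rule itself only ever reports a yes or a no.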
Why it happens
Now I will imagine several scenarios in which researchers, consciously or unconsciously, fall prey to the temptation of p-hacking. These hypothetical anecdotes are not meant to defend p-hacking but only to help understand how it can occur. Even though I insist this is not a defense, those who have done research may empathize with the imagined researcher’s position.
5.1%
First, the significance level $\alpha = 0.05$ used so widely across statistics is largely a convention adopted without any special justification. In practice, different fields or data characteristics might warrant different values of $\alpha$: astronomy, physics, or computer science might use values orders of magnitude smaller than 0.05 (the five-sigma convention in particle physics corresponds to roughly $\alpha \approx 3 \times 10^{-7}$), medicine or biology dealing with noisy living systems might stick with $\alpha = 0.05$, and psychology or the social sciences could reasonably use $\alpha = 0.1$. Nevertheless, most textbooks and statistical resources take $\alpha = 0.05$ as the default, and researchers are accustomed to this convention.
Our imagined scientist X completed 31 grueling experiments and ran the statistical analysis. He had used up all of his reagents and invested enormous time and effort. The p-value came out at 0.051, which in practical terms he took as supporting his hypothesis; the experiment was a success. But a paper is a paper, and he would have preferred the p-value to be ≤ 0.05. He understood statistics well enough to know that the $\alpha = 0.05$ convention is somewhat arbitrary, but the readers of the paper (and possibly the reviewers) would not see it that way.
He despises those few scholars who, blinded by money and fame, completely fabricate data, and believes himself to be someone who adheres to research ethics. From here on, X faces several situations. Consider whether his actions are right or wrong.
A. Removal
X focused on the last experiment. Since he had just finished it, the memory was fresh and the result was not yet entered into the lab notebook. On a whim he re-ran the analysis excluding the final observation — using only 30 data points — and obtained the ideal result: the p-value fell below 5%.
He reasoned that if he had analyzed the data right after the 30th experiment, he would never have bothered to run the 31st. The 31st was essentially unnecessary and in fact degraded the value of the findings; he had only run it because the reagents happened to be available, so was there any reason to include it now? Going back in time and not running the 31st experiment, or removing the 31st data point now, are equivalent in outcome. Moreover, a sample size of around 30 is conventionally treated as adequate, so omitting one observation would not invite criticism for an insufficient sample. He removed the final data point and submitted the paper; later the paper was highly regarded, and he became a professor.
X now advises his students that running many experiments blindly is not always the best approach and that they should think carefully. What do you think?
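As a rough illustration of where this kind of flexibility leads in aggregate, here is a small simulation sketch. It is my own addition rather than part of the anecdote, and the one-sample t-test, the sample size of 31, and the "drop the last point and re-test" rule are all assumptions chosen for demonstration.

```python
# A rough sketch: under a true null hypothesis, how often does the rule
# "if the full sample is not significant, drop the last observation and
# re-test" end up rejecting H0?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims = 20_000
fixed_rejections = 0     # analyze all 31 points, no second chances
flexible_rejections = 0  # allowed one second try without the 31st point

for _ in range(n_sims):
    data = rng.normal(loc=0.0, scale=1.0, size=31)  # H0 is true: the mean is 0
    p_full = stats.ttest_1samp(data, popmean=0.0).pvalue
    fixed_rejections += p_full <= alpha

    if p_full <= alpha:
        flexible_rejections += 1
    else:
        p_trimmed = stats.ttest_1samp(data[:-1], popmean=0.0).pvalue
        flexible_rejections += p_trimmed <= alpha

print(f"false positive rate, fixed n = 31    : {fixed_rejections / n_sims:.3f}")
print(f"false positive rate, drop if it helps: {flexible_rejections / n_sims:.3f}")
```

Each individual t-test is perfectly valid on its own; it is the data-dependent second chance that lets the procedure reject a true null more often than the nominal 5%.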
B. Addition
X felt that with just a little more effort he could improve the result. He called his supervisor and requested funding to buy more of the scarce, expensive reagent. A month later the reagent arrived, and with the 32nd experiment included the p-value was 5.02%; he thought that with just a bit more, he could reach perfection.
He poured himself into the work, cutting back on sleep to run more experiments. After the 35th experiment, the p-value was 4.98%, just inside the rejection region. Perseverance paid off: by not giving up he obtained a better result. Indeed, persistence is important for a researcher. He still had reagent left, but the sample size was now sufficient. He organized these results and submitted the paper; later it was highly regarded and he became a professor.
X sometimes tells his students this anecdote to emphasize the value of effort. What do you think?
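The practice described here is usually called optional stopping: keep collecting data, test after every new observation, and stop as soon as the result is significant. Here is a rough simulation sketch of it (my own illustration; the t-test, the starting and maximum sample sizes, and the stopping rule are assumptions for demonstration, not details from the story).

```python
# A rough sketch of optional stopping: the analyst starts with 31 observations,
# adds one at a time, tests after each addition, and stops as soon as p <= 0.05.
# H0 is true the whole time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = 0.05
n_sims = 10_000
start_n, max_n = 31, 60
rejections = 0

for _ in range(n_sims):
    data = list(rng.normal(loc=0.0, scale=1.0, size=start_n))  # H0 is true
    while True:
        p = stats.ttest_1samp(data, popmean=0.0).pvalue
        if p <= alpha:
            rejections += 1
            break
        if len(data) >= max_n:
            break
        data.append(rng.normal(loc=0.0, scale=1.0))  # run "one more experiment"

print(f"false positive rate with optional stopping: {rejections / n_sims:.3f}")
```

Because the analyst gets many correlated chances to cross the threshold, the long-run rejection rate under a true null climbs well above 5%; the nominal guarantee only holds when the sample size is fixed before looking at the data.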
C. Outlier
X inspected his data once more. As a skilled experimenter he rarely made mistakes, but being an expert also meant knowing how to scrutinize his own work. He noticed that the measurement from the 14th experiment looked like an outlier, and when he checked his lab notebook he found a note that some minor trouble had occurred during that trial.
He considered whether removing that outlier would violate research ethics; honestly, since the conditions had not been properly controlled that day, discarding that experiment was the correct action. The same would have held even if the original analysis had already produced a p-value below 0.05; it was simply the right thing to do as a scientist. Of course, had the result already been significant he would never have gone back to scrutinize the data in the first place, but in any event he removed the outlier, reanalyzed, and obtained a result below the 5% significance level. He submitted the paper; later it was highly regarded, and he became a professor.
X tells his students to look at their data critically and never to trust the numbers blindly when something may have gone wrong during an experiment. What do you think?
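Excluding a suspect data point can be entirely legitimate; what turns it into p-hacking is doing so only after seeing a disappointing result. Here is a rough simulation sketch of that conditional behavior (my own illustration; the t-test and the "drop the point farthest from the mean" rule are assumptions for demonstration).

```python
# A rough sketch: the "outlier" is removed only after the first analysis comes
# back non-significant. Under a true null, this data-dependent exclusion pushes
# the false positive rate above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05
n_sims = 20_000
rejections = 0

for _ in range(n_sims):
    data = rng.normal(loc=0.0, scale=1.0, size=31)  # H0 is true
    p = stats.ttest_1samp(data, popmean=0.0).pvalue
    if p > alpha:
        # Only now does the analyst "notice" an outlier: drop the point
        # farthest from the sample mean and run the test again.
        worst = np.argmax(np.abs(data - data.mean()))
        p = stats.ttest_1samp(np.delete(data, worst), popmean=0.0).pvalue
    rejections += p <= alpha

print(f"false positive rate with post-hoc outlier removal: {rejections / n_sims:.3f}")
```

Fixing the exclusion criteria before seeing the test result is the defensible version of this; applying them only when the first analysis disappoints is what inflates the error rate.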
1. Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
2. Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. https://doi.org/10.1371/journal.pbio.1002106
