Under Review

On the nineteenth-century origins of significance testing and p-hacking

Keywords: significance testing, p-value, confidence interval, p-hacking, history of statistics, Laplace, Gauss, Fourier, Cournot, Edgeworth, Pearson, Fisher, Neyman

Cite as:

Glenn Shafer (2019). On the nineteenth-century origins of significance testing and p-hacking. RESEARCHERS.ONE, https://www.researchers.one/article/2019-09-22.

Abstract:

This paper examines the development of Laplacean practical certainty from 1810, when Laplace proved his central limit theorem, to 1925, when Ronald A. Fisher published his Statistical Methods for Research Workers.

Although Laplace's explanations of the applications of his theorem were accessible to only a few mathematicians, expositions published by Joseph Fourier in 1826 and 1829 made the simplest applications accessible to many statisticians. Fourier suggested an error probability of 1 in 20,000, but statisticians soon used less exigent standards. Abuses, including p-hacking, helped discredit Laplace's theory in France to the extent that it was practically forgotten there by the end of the 19th century, yet it survived elsewhere and served as the starting point for Karl Pearson's biometry.

The probability that a normally distributed random variable is more than three probable errors from its mean is approximately 5%. When Fisher published his Statistical Methods, three probable errors was a common standard for likely significance. Because he wanted to enable research workers to use distributions other than the normal (the t distributions, for example), Fisher replaced three probable errors with 5%.
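The arithmetic behind this equivalence can be checked with a short sketch. It assumes the usual definition of the probable error as the half-width of the interval that contains a normal variable with probability one half (about 0.6745 standard deviations) and a two-sided tail; the variable and function names are illustrative and do not come from the paper.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Assumed definition: the probable error is the deviation exceeded
# (in absolute value) with probability 1/2, i.e. the 75th percentile
# of the standard normal, about 0.6745 standard deviations.
probable_error = std_normal.inv_cdf(0.75)


def two_sided_tail(k_probable_errors: float) -> float:
    """Probability of falling more than k probable errors from the mean."""
    z = k_probable_errors * probable_error
    return 2 * (1 - std_normal.cdf(z))


# Three probable errors: roughly 0.043, i.e. "approximately 5%".
tail_three_pe = two_sided_tail(3)

# Conversely, an exact two-sided 5% level expressed in probable errors:
# roughly 2.9 probable errors (1.96 standard deviations).
pe_for_five_percent = std_normal.inv_cdf(0.975) / probable_error

print(f"probable error      = {probable_error:.4f} sd")
print(f"P(> 3 p.e.)         = {tail_three_pe:.4f}")
print(f"5% level, in p.e.   = {pe_for_five_percent:.2f}")
```

Under these assumptions, three probable errors correspond to a tail probability a little above 4%, and an exact 5% level corresponds to slightly less than three probable errors, which is consistent with the two standards being treated as roughly interchangeable.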

The use of the word "significant" after Fisher differs from its use by Pearson before 1920. In Pearson's Biometrika, a significant difference was an observed difference that signified a real difference. Biometrika's authors sometimes said that an observed difference was likely or very likely to be significant, but they never said that it was very significant, and they did not have levels of significance. Significance itself was not a matter of degree.

What might this history teach us about proposals to curtail abuses of statistical testing by changing its current vocabulary (p-value, significance, etc.)? The fact that similar abuses arose before this vocabulary was introduced suggests that more substantive changes are needed.