When discussing statistical significance in class, I always begin by highlighting the arbitrary nature of the accepted p-value thresholds. “If you had cancer,” I ask my students, “and a doctor said that there was a 94% chance that a particular treatment would cure you, would you take it?” They would, they assert. But a 94% chance isn’t good enough for social science research. Neither, suggests Valen Johnson, a statistician at Texas A&M, is a 95% chance, or even a 98% chance. As John Timmer explains at Ars Technica, Johnson mathematically links p-values to Bayesian measures of evidence:
The math then allows a direct comparison between the probability values. In his comparison, scientific standards seem pretty weak. The 95 percent certainty corresponds to a Bayesian evidence threshold of between three and five, which Johnson notes is typically considered “positive evidence”—but it falls well below the values considered to be “strong evidence.” It takes 99 percent certainty to get there.
…
Johnson concludes that if we assume that only one-half of the hypotheses should give us a positive result, then “these results suggest that between 17 percent and 25 percent of marginally significant scientific findings are false.” If we assume the proportion of correct hypotheses is larger—which we might, given that scientists are usually pretty clever about the hypotheses they choose to test—then the problem gets even more pronounced. Overall, Johnson’s suggestion is simple: raise the statistical rigor all around. Demand that experiments produce a p value of 0.005 or smaller. And be even pickier about results that we consider highly significant. There is a cost to this, in that you need bigger samples to achieve the higher statistical rigor. In his example, you’d have to double the sample size. That’s no problem if you’re breeding bacteria and fruit flies, but it will add a lot of time and expense if your project involves mice.
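To see where that 17-to-25-percent figure comes from, here is a minimal sketch of the arithmetic in Python. It is an illustration of my own, not Johnson’s actual computation: it takes Timmer’s summary at face value (a result at the conventional threshold corresponds to a Bayes factor of roughly 3 to 5 in favor of the alternative) and assumes, as quoted above, that half of tested hypotheses are true, i.e., prior odds of 1.

```python
# Back-of-the-envelope recreation of the quoted "17 to 25 percent" range.
# Bayes' rule in odds form: posterior odds = prior odds * Bayes factor.

prior_odds = 1.0  # "only one-half of the hypotheses should give us a positive result"

for bayes_factor in (3.0, 5.0):  # range Timmer cites for marginal significance
    posterior_odds = prior_odds * bayes_factor
    prob_true = posterior_odds / (1 + posterior_odds)
    prob_false = 1 - prob_true
    print(f"Bayes factor {bayes_factor:.0f}: "
          f"P(marginally significant finding is false) = {prob_false:.0%}")

# Output: 25% at a Bayes factor of 3 and 17% at a Bayes factor of 5,
# matching the 17-25 percent range in the passage above.
```

With stronger prior odds that the tested hypotheses are true, the same arithmetic pushes the false-finding share of marginal results higher still, which is the point Timmer makes about clever hypothesis selection.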
Or, of course, humans. One implication Timmer notes of raising the significance threshold is that researchers working with small samples would have to consider discussing non-significant results, potentially undermining our blind faith in statistical significance. While that would be nice, in the world we actually live in the more likely outcome is that individual or small-scale research would become even harder to conduct successfully. Good luck getting that NSF grant!