Part I: Framing the debate
There has been much hand-wringing of late over reports that many scientific findings are proving impossible to reproduce – meaning they were probably wrong. Coverage in the news, concern expressed by the President's council of scientific advisors, and a call to action by Francis Collins, the director of the National Institutes of Health (NIH), all suggest that this is a problem the scientific community needs to understand and address.
In a recent issue of the journal Nature, Francis Collins and Lawrence Tabak (the deputy director of the NIH) outline their plan for improving scientific reproducibility, emphasizing the need for better experimental design, statistical analysis, and transparency.
According to Collins and Tabak, the issue of reproducibility is due to a number of complex factors, including “…poor training of researchers in experimental design; increased emphasis on making provocative statements rather than presenting technical details; and publications that do not report basic elements of experimental design.”
The scientific field where this is most problematic, they explain, is preclinical research that utilizes non-human animal models. This is due, in part, to different labs using different strains of animals, a different housing and lab environment that can affect the animal (i.e. noise, light, etc.), and variations in protocols that are never reported in the published article.
Finally, they explain that most failures of reproducibility result from “coincidental findings that happen to reach statistical significance, coupled with publication bias.” Combined, these factors have led to “a troubling frequency of published reports that claim a significant result, but fail to be reproducible.”
After reading this article, I had three basic questions: What is the basis of the “troubling frequency of published reports that claim a significant result, but fail to be reproducible” statement? Exactly how big is the problem? And what is being done to fix it?
In this article, I’ll address the first two questions; and in a follow-up article, I’ll address the third.
Question 1: What is the basis of the “troubling frequency of published reports that claim a significant result, but fail to be reproducible” statement?
Last year, frequent neuroblog contributor Kelly Zalocusky wrote two articles discussing this very issue in neuroscience research (see: “Why most published neuroscience findings are false” and “Why most published neuroscience findings are false, part II: The correspondents strike back”). Both pieces stem from work done by Stanford researcher John Ioannidis, who famously claimed that “most published research findings are false.” And since most research findings are false, then of course, no one is going to be able to reproduce them.
In his 2005 paper of the same name, Ioannidis argues that this high error rate is due to an unusually high percentage of false positives. For example, suppose I was studying the relationship between diet and cancer, and I found that eating 5 carrots a day significantly decreases your risk of developing skin cancer. If, in reality, eating carrots has no effect on skin cancer, positive or negative, then my finding would be a false positive. On the other hand, a false negative would be if I found no effect of eating carrots on developing skin cancer when in fact a relationship does exist.
Occasionally reporting a false positive or false negative is an unavoidable fact of science, and is built into our overall analysis of a given experiment. To limit the incidence of false positives and negatives, we set a threshold for an acceptable error rate, which in most scientific fields is 5%. During statistical analysis, we calculate a p-value: the probability of observing a result at least as extreme as ours if there were actually no real effect. Any result with a p-value ≤0.05 is considered statistically significant – that is, unlikely to have arisen by chance alone.
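To see where that 5% error rate comes from, here is a small simulation sketch (my own illustration, not from the article): we run many experiments in which there is genuinely no effect, and count how often a standard significance test nonetheless crosses the 0.05 threshold.

```python
import numpy as np
from math import erf, sqrt

# Simulate 10,000 experiments in which the null hypothesis is TRUE
# (both groups are drawn from the same distribution), and count how
# often we still declare a "significant" difference at p <= 0.05.
rng = np.random.default_rng(0)
n_experiments = 10_000
n = 50  # samples per group

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)  # same distribution: no real effect
    # Two-sample z-test (normal approximation)
    se = sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    z = (a.mean() - b.mean()) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p-value
    if p <= 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(rate)  # close to 0.05, as the threshold predicts
```

Roughly 5% of these no-effect experiments come out “significant” – exactly the background rate of false positives that the 0.05 threshold tolerates.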
Ioannidis presents a theoretical framework for determining the percentage of false positive findings in the scientific literature, which relies on the assumption that most tested hypotheses are false. So, for instance, if when designing my diet/cancer experiment I decided to test 1,000 hypotheses – e.g., (1) eating carrots decreases the risk of developing cancer; (2) eating jelly beans increases the risk of developing cancer; and so on – the chances are that most of those 1,000 hypotheses will turn out to be false.
If we assume the percentage of true hypotheses is 1%, then out of 1,000 tested hypotheses, 10 would be true and 990 would be false. Because the number of false hypotheses is so high, even with a false positive rate of only 5%, we still end up with more false positive findings (in this case, about 50) than true findings. And because no experiment is perfectly sensitive – typical statistical power is around 80% – only about 8 of the 10 true hypotheses would actually be detected. That means that of all the significant relationships I reported in my study, a full 86% (50 out of 58) would actually be false. (For a more detailed explanation, see: “Why most published neuroscience findings are false”).
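The arithmetic above can be written out in a few lines. The 1,000 hypotheses, 1% true rate, and 5% false positive rate come from the text; the 80% statistical power is an assumption needed to arrive at the 86% figure.

```python
# Ioannidis-style false-discovery arithmetic (sketch).
n_hypotheses = 1000
frac_true = 0.01   # assumed fraction of hypotheses that are true
alpha = 0.05       # accepted false positive rate
power = 0.80       # assumed fraction of true effects actually detected

n_true = n_hypotheses * frac_true      # 10 true hypotheses
n_false = n_hypotheses - n_true        # 990 false hypotheses
false_pos = n_false * alpha            # ~50 false positives
true_pos = n_true * power              # 8 true positives detected

false_discovery_rate = false_pos / (false_pos + true_pos)
print(f"{false_discovery_rate:.0%}")  # → 86%
```

The key driver is the ratio of false to true hypotheses: with 99 false hypotheses for every true one, even a small 5% error rate generates far more false positives than there are true effects to find.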
Question 2: How big is the problem?
One of the weaknesses of this approach is that it is entirely theoretical – Ioannidis never uses real data to support his argument. And new research by scientists Leah Jager and Jeffrey Leek published earlier this year in the journal Biostatistics calls into question the method used by Ioannidis and colleagues, and with it, the magnitude of the problem.
Their basic argument is that the Ioannidis method is biased because it treats all significant results the same by assuming a p-value of 0.05 across all data. In reality, 0.05 is only the cutoff, and in the 5 scientific journals that Jager and Leek looked at, an average of 77% of all reported p-values were well below 0.05.
This is a pretty significant distinction: if, in my cancer study, the actual p-values of all of my findings were a more realistic 0.01, then my false positive rate drops from 86% to 55%.
Jager and Leek also argue that we can’t know, based on what’s published, how many hypotheses were tested before the researchers found one that was significant. Ioannidis’s basic argument is that if a paper reports on 5 hypotheses, they probably tested 25, but there is no evidence to support that claim, and it further biases his analysis.
For example, if we assume the same number of true hypotheses in my cancer study (10), but instead of testing 1,000 hypotheses, I only tested 20, and all of my p-values were 0.01, then my overall false positive rate drops from 86% to 1.2%.
Jager and Leek performed the same analysis using real data, collecting 5,322 reported p-values from the abstracts of 77,430 papers published in 5 medical journals across a 10-year period, and estimate the actual false positive rate to be ~14%. While 14% is still almost three times the acceptable false positive rate of 5%, it’s a lot better than the 70–80% rate that Ioannidis estimates.
Not surprisingly, not everyone agrees with Jager and Leek. The same issue of Biostatistics published a number of responses, in which some researchers criticize their approach as biased because they pulled all their p-values from abstracts, which most likely excludes small effects and non-significant findings. Researchers Andrew Gelman and Keith O’Rourke call Jager and Leek’s attempt to empirically measure the overall false positive rate “hopeless,” and Ioannidis concludes that Jager and Leek’s paper “…exemplifies the dreadful combination of using automated scripts with wrong methods and unreliable data.”
While I agree with some of the criticisms leveled against Jager and Leek, I don’t find their approach “dreadful” or “hopeless.” In fact, I think their attempt to empirically define the scope of the false positive problem is a really important addition to the debate. Whatever the actual false positive rate is, it’s an important issue that the scientific community needs to address. In part II of this article, I’ll talk about some of the measures being taken to improve the quality and reliability of all of our research.
1. Collins, F., and Tabak, L. Policy: NIH plans to enhance reproducibility. Nature 505, 612-613 (2014).
2. Ioannidis, J. Why most published research findings are false. PLOS Medicine, 2(8), e124 (2005).
3. Jager, L., and Leek, J. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1-12 (2014).
4. Gelman, A., and O’Rourke, K. Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics, 15(1), 18-23 (2014).
5. Ioannidis, J. Discussion: Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics, 15(1), 28-36 (2014).