Stanford Professor Dr. John Ioannidis has made some waves over the last few years. His best-known work is a 2005 paper titled "Why most published research findings are false."(1) It turns out that Ioannidis is not one to mince words.
In the May 2013 issue of Nature Reviews Neuroscience, Ioannidis and colleagues specifically tackle the validity of neuroscience studies (2). This recent paper was more graciously titled "Power failure: why small sample size undermines the reliability of neuroscience," but it very easily could have been called "Why most published neuroscience findings are false."
Since these papers outline a pretty damning analysis of statistical reliability in neuroscience (and biomedical research more generally) I thought they were worth a mention here on the Neuroblog.
Ioannidis and colleagues rely on a measure called Positive Predictive Value or PPV, a metric most commonly used to describe medical tests. PPV is the likelihood that, if a test comes back positive, the result in the real world is actually positive. Let's take the case of a throat swab for a strep infection. The doctors take a swipe from the patient's throat, culture it, and the next day come back with results. There are four possibilities.
- The test comes back negative, and the patient is negative (does not have strep). This is known as a "correct rejection".
- The test comes back negative, even though the patient is positive (a "miss" or a "false negative").
- The test comes back positive, even when the patient is negative (a "false alarm" or a "false positive").
- The test correctly detects that a patient has strep throat (a "hit").
In neuroscience research, we hope that every published "positive" finding reflects an actual relationship in the real world (there are no "false alarms"). We know that this is not completely the case. Not every single study ever published will turn out to be true. But Ioannidis makes the argument that these "false alarms" come up much more frequently than we would like to think.
To calculate PPV, you need three other values:
- the threshold of significance, or α, usually set at 0.05.
- the power of the statistical test. If β is the "false negative" rate of a statistical test, power is 1 - β. To give some intuition: if the power of a test is 0.7, and 10 studies are all testing non-null effects, the test will uncover only about 7 of them on average. The main result in Ioannidis's paper is an analysis of neuroscience meta-analyses published in 2011. He finds the median statistical power of the studies in these meta-analyses to be 0.2. More on that later.
- the pre-study odds, or R. R is the prior odds that any given relationship tested in the field is non-null. In other words, if you had a hat full of little slips of paper, one for every single experiment conducted in the field, R is the ratio of slips looking for a relationship that exists in the real world to slips looking for one that does not.
For those who enjoy bar-napkin calculations--those values fit together like this:
$latex PPV = ([1 - \beta] * R) / ([1 - \beta] * R + \alpha) $
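For anyone wondering where that formula comes from, here is a short sketch of the usual derivation. It assumes c relationships are tested in the field (c is only a bookkeeping symbol introduced here, and it cancels out) and that R is the ratio of real to null relationships among them:

$latex \text{real relationships} = c \cdot \frac{R}{R + 1}, \qquad \text{null relationships} = c \cdot \frac{1}{R + 1} $

$latex \text{hits} = (1 - \beta) \cdot c \cdot \frac{R}{R + 1}, \qquad \text{false alarms} = \alpha \cdot c \cdot \frac{1}{R + 1} $

$latex PPV = \frac{\text{hits}}{\text{hits} + \text{false alarms}} = \frac{(1 - \beta) R}{(1 - \beta) R + \alpha} $

The c and the (R + 1) drop out of the ratio, which is why only power, R, and α survive.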
Let's get back to our medical test example for a moment. Say you're working in a population where 1 in 5 people actually has strep, which is odds of 1 to 4 (R = 0.25). The power of your medical test (1 - β) is 0.8, and you want your threshold for significance to be 0.05. Then the test's PPV is (0.8 * 0.25) / (0.8 * 0.25 + 0.05) = 0.8. This means that 80% of the times that the test claims the patient has strep, this claim will actually be true. If, however, the power of the test were only 0.2, as Ioannidis claims it is broadly across neuroscience, then the PPV drops to (0.2 * 0.25) / (0.2 * 0.25 + 0.05) = 0.5. Fully half of the test's positive results are false positives.
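The strep arithmetic is easy to check in a few lines of Python. This is a minimal sketch: the function names `ppv` and `odds_from_prevalence` are made up for illustration, and the second helper is there only because R is an odds, not a proportion.

```python
def ppv(power: float, pre_study_odds: float, alpha: float = 0.05) -> float:
    """Positive predictive value from power (1 - beta), pre-study odds R, and alpha."""
    true_positive_term = power * pre_study_odds
    return true_positive_term / (true_positive_term + alpha)


def odds_from_prevalence(prevalence: float) -> float:
    """Convert a proportion (1 in 5 = 0.2) into odds (0.2 / 0.8 = 0.25)."""
    return prevalence / (1.0 - prevalence)


strep_odds = odds_from_prevalence(1 / 5)    # R = 0.25
print(round(ppv(0.8, strep_odds), 3))       # well-powered test: 0.8
print(round(ppv(0.2, strep_odds), 3))       # low-powered test: 0.5
```

Keeping the odds conversion as its own step makes it harder to plug a proportion into the formula by accident.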
In a clinical setting, epidemiological results frequently give us a reasonable estimate for R. In neuroscience research, this quantity might be wholly unknowable. But, let's start with the intuition of most graduate students in the trenches (ahem...at the benches?)...which is that 90% of experiments we try don't work. And some days, even that feels optimistic. If this intuition is accurate, then only 10% of relationships tested in neuroscience are non-null in the real world, which puts R at 0.1 / 0.9, or about 0.11.
Using that value, and Ioannidis's finding that the median power in neuroscience is only 20%, we learn that the PPV of neuroscience research, as a whole, is (drumroll........) about 30%. (The napkin version: (0.2 * 0.11) / (0.2 * 0.11 + 0.05) ≈ 0.31.)
If our intuitions about our research are true, fellow graduate students, then fully 70% of published positive findings are "false positives". This result furthermore assumes no bias, perfect use of statistics, and a complete lack of "many groups" effect. (The "many groups" effect means that many groups might work on the same question: 19 out of 20 find nothing, and the 1 "lucky" group that finds something actually publishes.) In other words, this estimate is likely to be hugely optimistic.
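If the formula feels too tidy, the same number can be checked by brute force. The sketch below simulates a hypothetical field under the same idealized assumptions (no bias, no "many groups" effect): 10% of tested relationships are real, real effects are detected with probability 0.2, and null effects come up "significant" with probability 0.05. The function name, the number of experiments, and the seed are all arbitrary choices made here.

```python
import random


def simulate_field(n_experiments, prior_true, power, alpha, seed=0):
    """Return the fraction of 'significant' results that reflect a real effect."""
    rng = random.Random(seed)
    hits = 0          # significant result, effect is real
    false_alarms = 0  # significant result, effect is null
    for _ in range(n_experiments):
        effect_is_real = rng.random() < prior_true
        # A real effect is detected with probability `power`;
        # a null effect is "detected" with probability `alpha`.
        p_significant = power if effect_is_real else alpha
        if rng.random() < p_significant:
            if effect_is_real:
                hits += 1
            else:
                false_alarms += 1
    return hits / (hits + false_alarms)


estimate = simulate_field(200_000, prior_true=0.10, power=0.20, alpha=0.05)
print(f"simulated PPV: {estimate:.2f}")
```

With this many simulated experiments the estimate should land close to the formula's value of roughly 0.31, give or take sampling noise.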
If we keep 20% power in our studies, but want a 50/50 shot of published findings actually holding true, the pre-study odds (R) would have to be 0.25, meaning 1 in 5 relationships tested would have to be non-null.
To move PPV up to 75%, the pre-study odds would have to be 0.75: 3 non-null relationships for every 4 null ones, or roughly 43% of all relationships tested.
1 in 10 might be pervasive grad-student pessimism, but more than 4 in 10 is absolutely not the case.
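These "what would R have to be" numbers come from rearranging the PPV formula to solve for R. Here is a minimal sketch with made-up helper names. Since R is an odds rather than a proportion, it prints both.

```python
def required_odds(target_ppv: float, power: float, alpha: float = 0.05) -> float:
    """Pre-study odds R needed to reach target_ppv (the PPV formula solved for R)."""
    return (target_ppv * alpha) / (power * (1.0 - target_ppv))


def odds_to_proportion(odds: float) -> float:
    """Convert odds (real : null) into the proportion of tested relationships that are real."""
    return odds / (1.0 + odds)


for target in (0.50, 0.75):
    r = required_odds(target, power=0.20)
    print(f"PPV {target:.2f}: R = {r:.2f}, proportion non-null = {odds_to_proportion(r):.2f}")
# PPV 0.50: R = 0.25, proportion non-null = 0.20
# PPV 0.75: R = 0.75, proportion non-null = 0.43
```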
So—how can we, the researchers, make this better? Well, the power of our analyses depends on the test we use, the effect size we measure, and our sample size. Since the tests and the effect sizes are unlikely to change, the most direct answer is to increase our sample sizes. I did some coffee-shop-napkin calculations from Ioannidis’s data to find that the median effect size in the studies included in his analysis is 0.51 (Cohen’s d). For those unfamiliar with Cohen’s d—standard intuition is that 0.2 is a “small” effect, 0.5 is a “medium” effect, and 0.8 constitutes a “large” effect. For those who are familiar with Cohen’s d…I apologize for saying that.
Assuming that the median effect size in neuroscience studies remains unchanged at 0.51, let’s do some intuition building about sample sizes. For demonstration’s sake, we’ll use the power tables for a 2-tailed t-test.
To get a power of 0.2, with an effect size of 0.51, the sample size needs to be 12 per group. This fits well with my intuition of sample sizes in (behavioral) neuroscience, and might actually be a little generous.
To bump our power up to 0.5, we would need an n of 31 per group.
A power of 0.8 would require 60 per group.
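For readers without a power table handy, the sketch below gets into the same neighborhood using only Python's standard library. Be warned that it swaps in the textbook normal approximation for a two-group, two-tailed comparison, not the t-based tables used above, so its answers differ from the table values by a few subjects, especially at small sample sizes. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, power: float, alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-tailed, two-group comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    This is an assumption swapped in for the exact t-test calculation.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)  # critical value for the two-tailed test
    z_power = z(power)              # negative when power is below 0.5
    return ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


for target_power in (0.2, 0.5, 0.8):
    print(f"power {target_power}: about {n_per_group(0.51, target_power)} per group")
```

The shape of the result is the useful part: going from a power of 0.2 to a power of 0.8 at the same effect size takes several times as many subjects per group.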
My immediate reaction to these numbers is that they seem huge—especially when every additional data point means an additional animal utilized in research. Ioannidis makes the very clear argument, though, that continuing to conduct low-powered research with little positive predictive value is an even bigger waste. I am happy to take all comers in the comments section, at the Oasis, and/or in a later blog post, but I will not be solving this particular dilemma here.
For those actively in the game, you should know that Nature Publishing Group is working to improve this situation (3). Starting next month, all submitting authors will have to go through a checklist, stating how their sample size was chosen, whether power calculations were done given the estimated effect sizes, and whether the data fit the assumptions of the statistics that are used. On their end, in an effort to increase replicability, NPG will be removing all limits on the length of methods sections. Perhaps other prominent publications would do well to follow suit.
3. The specific announcement detailing changes in submission guidelines; see also the Nature Special on Challenges in Irreproducible Research.