The past week has been all about maths for me. Well, not all about maths. There was quite a bit of coding (PHP is not my friend) and some experiments (I blocked ALL the acetylcholine receptors).

But special tribute must be paid to all the maths.

First, for those who didn't catch it, Neuro PhD Candidate Kelly Zalocusky posted a fabulous discussion on statistical reliability in neuroscience, reviewing recent work by Stanford Professor Dr. John Ioannidis that highlights the lack of statistical power in many published neuroscience articles. I highly recommend you read Kelly's article (found here). And if, once you're done reading Kelly's post, you have the irresistible urge to calculate how large your n needs to be for your data to be statistically reliable, I recommend the book Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences by R. Barker Bausell and Yu-Fang Li. If you are a Stanford University affiliate, Lane Library has a digital copy (catalogue record here). Last Tuesday, I used the power charts in the t-test section to calculate the n I need to reach my target statistical power, given my pilot data.

From using math to study brains, to studying brains that are doing math. Just in, from a group of researchers at Oxford University - Shocks to the Brain Improve Mathematical Abilities. This article initially crossed my internet browser in the form of coverage in Scientific American, as reprinted from Nature. The "shock" in question: transcranial direct current stimulation. The "brain" - the prefrontal cortex. The "math" - arithmetic - "rote memorization of mathematical facts (such as 2 x 17 = 34) and more complicated calculations (for example, 32 – 17 + 5)". The "improvement" - increased response speed - both immediately after stimulation and 6 months later, when Oxford students who had received the stimulation were 28% faster than control compatriots. An in-depth analysis of the findings/protocols/interpretations of this study would require me to write a longer post, so for the present I'll just link you all to the original article, published in Current Biology.

And, to round out our maths trilogy, this morning Gizmodo posted two videos featuring a mathematician explaining math jokes. It's funny. Very funny. Cora Ames, I expect you to integrate this concept into an improv segment. (Maths jokes, Explained)

A few other (non-math-related) links:

Science Seeker Awards - with a special call-out to both Part I and Part II of The Crayolafication of the Brain (Part II won best psych/neuro post)

SfN Careers Youtube Channel highlights alternative career choices - video interviews with Society members whose career paths are not of the traditional academic flavor.

A meta-analysis of the use of literary puns in science article titles. Yeah, we scientists took English Lit in college, too.

Stanford Professor Dr. John Ioannidis has made some waves over the last few years. His best-known work is a 2005 paper titled "Why most published research findings are false."(1) It turns out that Ioannidis is not one to mince words.

In the May 2013 issue of Nature Reviews Neuroscience, Ioannidis and colleagues specifically tackle the validity of neuroscience studies (2). This recent paper was more graciously titled "Power failure: why small sample size undermines the reliability of neuroscience," but it very easily could have been called "Why most published neuroscience findings are false."

Since these papers outline a pretty damning analysis of statistical reliability in neuroscience (and biomedical research more generally), I thought they were worth a mention here on the Neuroblog.

Ioannidis and colleagues rely on a measure called Positive Predictive Value or PPV, a metric most commonly used to describe medical tests. PPV is the likelihood that, if a test comes back positive, the result in the real world is actually positive. Let's take the case of a throat swab for a strep infection. The doctors take a swipe from the patient's throat, culture it, and the next day come back with results. There are four possibilities.

The test comes back negative, and the patient is negative (does not have strep). This is known as a "correct rejection".
The test comes back negative, even though the patient is positive (a "miss" or a "false negative").
The test comes back positive, even when the patient is negative (a "false alarm" or a "false positive").
The test correctly detects that a patient has strep throat (a "hit").
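To make the four outcomes concrete, here is a minimal Python sketch that sorts (test result, true status) pairs into those categories and computes PPV as hits divided by all positive test results. The function names and the patient counts are made up for illustration; they are not data from any real clinic or from the papers discussed here.

```python
from collections import Counter


def classify(test_positive: bool, has_strep: bool) -> str:
    """Name the outcome for one patient, using the terms from the list above."""
    if test_positive and has_strep:
        return "hit"
    if test_positive and not has_strep:
        return "false alarm"
    if not test_positive and has_strep:
        return "miss"
    return "correct rejection"


def ppv_from_outcomes(outcomes) -> float:
    """PPV = hits / (hits + false alarms), i.e. the share of positive tests that are right."""
    counts = Counter(classify(test, truth) for test, truth in outcomes)
    positives = counts["hit"] + counts["false alarm"]
    if positives == 0:
        raise ValueError("no positive test results, so PPV is undefined")
    return counts["hit"] / positives


# Hypothetical clinic of 100 patients, as (test_positive, has_strep) pairs.
patients = (
    [(True, True)] * 16      # hits
    + [(False, True)] * 4    # misses
    + [(True, False)] * 4    # false alarms
    + [(False, False)] * 76  # correct rejections
)

example_ppv = ppv_from_outcomes(patients)
print(f"PPV for the made-up clinic: {example_ppv:.2f}")
```

Note that misses and correct rejections never enter the PPV calculation: PPV only asks how trustworthy a positive result is.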

In neuroscience research, we hope that every published "positive" finding reflects an actual relationship in the real world (there are no "false alarms"). We know that this is not completely the case. Not every single study ever published will turn out to be true. But Ioannidis makes the argument that these "false alarms" come up much more frequently than we would like to think.

To calculate PPV, you need three other values:

the threshold of significance, or α, usually set at 0.05.
the power of the statistical test. If β is the "false negative" rate of a statistical test, power is 1 - β. To give some intuition: if the power of a test is 0.7, and 10 studies are all testing non-null effects, the test will uncover only about 7 of them. The main result in Ioannidis's paper is an analysis of neuroscience meta-analyses published in 2011. He finds the median statistical power of the studies in these meta-analyses to be 0.2. More on that later.
the pre-study odds, or R. R is the prior on any given relationship tested in the field being non-null. In other words, if you had a hat full of little slips of paper, one for every single experiment conducted in the field, and you drew one out, R is the odds that that experiment is looking for a relationship that exists in the real world. Note that R is expressed as odds (the ratio of real relationships to null ones), not as a proportion.

For those who enjoy bar-napkin calculations--those values fit together like this:

$latex PPV = \frac{(1 - \beta) R}{(1 - \beta) R + \alpha} $

Let's get back to our medical test example for a moment. Say you're working in a population where 1 in 5 people actually has strep (odds of 1 to 4, so R = 0.25). The power of your medical test (1 - β) is 0.8, and you want your threshold for significance to be 0.05. Then the test's PPV is (0.8 * 0.25) / (0.8 * 0.25 + 0.05) = 0.8. This means that 80% of the times that the test claims the patient has strep, this claim will actually be true. If, however, the power of the test were only 0.2, as Ioannidis claims it is broadly across neuroscience, then the PPV drops to 50%. Fully half of the test's positive results are false positives.
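For anyone who prefers a script to a bar napkin, here is a minimal Python sketch of the same calculation. The function names are my own, and the only inputs are the numbers from the strep example above.

```python
def odds_from_proportion(proportion: float) -> float:
    """Convert 'p of the population is positive' into odds, positive : negative."""
    return proportion / (1 - proportion)


def ppv(power: float, pre_study_odds: float, alpha: float = 0.05) -> float:
    """PPV = (power * R) / (power * R + alpha), where power = 1 - beta."""
    true_positive_term = power * pre_study_odds
    return true_positive_term / (true_positive_term + alpha)


# 1 in 5 people has strep, which is odds of 1 to 4.
strep_odds = odds_from_proportion(1 / 5)

ppv_strong_test = ppv(power=0.8, pre_study_odds=strep_odds)
ppv_weak_test = ppv(power=0.2, pre_study_odds=strep_odds)

print(f"R = {strep_odds:.2f}")
print(f"PPV at power 0.8: {ppv_strong_test:.2f}")
print(f"PPV at power 0.2: {ppv_weak_test:.2f}")
```

The false-alarm term α stays fixed while the true-positive term shrinks with power, which is why a weaker test drags PPV down even though nothing about the population has changed.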

In a clinical setting, epidemiological results frequently give us a reasonable estimate for R. In neuroscience research, this quantity might be wholly unknowable. But, let's start with the intuition of most graduate students in the trenches (ahem...at the benches?)...which is that 90% of experiments we try don't work. And some days, even that feels optimistic. If this intuition is accurate, then only 10% of relationships tested in neuroscience are non-null in the real world.

Using that value, and Ioannidis's finding that the median power in neuroscience is only 20%, we learn that the PPV of neuroscience research, as a whole, is (drumroll........) roughly 30%.

If our intuitions about our research are true, fellow graduate students, then fully 70% of published positive findings are "false positives". This result furthermore assumes no bias, perfect use of statistics, and a complete lack of "many groups" effect. (The "many groups" effect means that many groups might work on the same question. 19 out of 20 find nothing, and the 1 "lucky" group that finds something actually publishes). Meaning—this estimate is likely to be hugely optimistic.

If we keep 20% power in our studies, but want a 50/50 shot of published findings actually holding true, the pre-study odds (R) would have to be 0.25: 1 in 5 relationships tested would have to be non-null.

To move PPV up to 75%, R would have to be 0.75, meaning roughly 3 in 7 relationships tested in neuroscience would have to be non-null.

1 in 10 might be pervasive grad-student pessimism, but 3 in 7 is absolutely not the case.
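Working backwards from a target PPV just means rearranging the PPV formula for R, which gives R = (PPV * α) / (power * (1 - PPV)). Here is a small Python sketch of that rearrangement, with function names of my own choosing. It also converts R back into a proportion of tested relationships, since R is an odds and the two are easy to mix up.

```python
def required_odds(target_ppv: float, power: float, alpha: float = 0.05) -> float:
    """Pre-study odds R needed to reach target_ppv, from PPV = power*R / (power*R + alpha)."""
    if not 0 < target_ppv < 1:
        raise ValueError("target_ppv must be strictly between 0 and 1")
    return target_ppv * alpha / (power * (1 - target_ppv))


def proportion_from_odds(odds: float) -> float:
    """Convert odds (real : null) into the share of tested relationships that are real."""
    return odds / (1 + odds)


odds_for_half = required_odds(target_ppv=0.50, power=0.2)
odds_for_three_quarters = required_odds(target_ppv=0.75, power=0.2)

for label, odds in (("50%", odds_for_half), ("75%", odds_for_three_quarters)):
    share = proportion_from_odds(odds)
    print(f"PPV {label}: R = {odds:.2f}, i.e. {share:.0%} of tested relationships are real")
```

At 20% power, R = 0.25 corresponds to 1 in 5 tested relationships being real, while R = 0.75 corresponds to about 3 in 7.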

So—how can we, the researchers, make this better? Well, the power of our analyses depends on the test we use, the effect size we measure, and our sample size. Since the tests and the effect sizes are unlikely to change, the most direct answer is to increase our sample sizes. I did some coffee-shop-napkin calculations from Ioannidis’s data to find that the median effect size in the studies included in his analysis is 0.51 (Cohen’s d). For those unfamiliar with Cohen’s d—standard intuition is that 0.2 is a “small” effect, 0.5 is a “medium” effect, and 0.8 constitutes a “large” effect. For those who are familiar with Cohen’s d…I apologize for saying that.

Assuming that the average effect size in neuroscience studies remains unchanged at 0.51, let’s do some intuition building about sample sizes. For demonstration’s sake, we’ll use the power tables for a 2-tailed t-test.

To get a power of 0.2, with an effect size of 0.51, the sample size needs to be 12 per group. This fits well with my intuition of sample sizes in (behavioral) neuroscience, and might actually be a little generous.

To bump our power up to 0.5, we would need an n of 31 per group.

A power of 0.8 would require 60 per group.
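If you don't have power tables handy, the numbers above can be sanity-checked with a few lines of Python. This is only a sketch: it uses a normal approximation to the two-sample, two-tailed t-test rather than the exact noncentral t distribution behind the tables, so its values land near the tabled ones rather than exactly on them (the gap is largest at small n). The function name is my own, and it needs only the standard library (Python 3.8+).

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample test with equal group sizes.

    Normal approximation: the test statistic is treated as normal with mean
    effect_size * sqrt(n_per_group / 2) and unit variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
    shift = effect_size * sqrt(n_per_group / 2)  # expected size of the statistic
    # Probability of landing beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Effect size (Cohen's d) and sample sizes taken from the text above.
for n in (12, 31, 60):
    print(f"n = {n:2d} per group -> approximate power {approx_power(n, 0.51):.2f}")
```

For real study planning, the power tables or a dedicated power-analysis package are the better tools; the point here is just to show how quickly power climbs with n at a fixed effect size.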

My immediate reaction to these numbers is that they seem huge—especially when every additional data point means an additional animal utilized in research. Ioannidis makes the very clear argument, though, that continuing to conduct low-powered research with little positive predictive value is an even bigger waste. I am happy to take all comers in the comments section, at the Oasis, and/or in a later blog post, but I will not be solving this particular dilemma here.

For those actively in the game, you should know that Nature Publishing Group is working to improve this situation (3). Starting next month, all submitting authors will have to go through a checklist, stating how their sample size was chosen, whether power calculations were done given the estimated effect sizes, and whether the data fit the assumptions of the statistics that are used. On their end, in an effort to increase replicability, NPG will be removing all limits on the length of methods sections. Perhaps other prominent publications would do well to follow suit.

Footnotes

1. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124

2. Button et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365-376. doi:10.1038/nrn3475

3. The specific announcement detailing changes in submission guidelines, also the Nature Special on Challenges in Irreproducible Research

Neuroblog

Linky and the Brain: May 20, 2013

Why most published neuroscience findings are false
