


A little over a decade ago, a neuroscientist found "significant activation" in the neural tissue of a dead fish. While it didn't prove the existence of zombie fish, it did point out a huge statistical problem.

Hosted by: Olivia Gordon

Huge thanks go to the following Patreon supporters for helping us keep SciShow free for everyone forever:

Avi Yashchin, Adam Brainard, Greg, Alex Hackman, Sam Lutfi, D.A. Noe, Piya Shedden, KatieMarie Magnone, Scott Satovsky Jr, Charles Southerland, Patrick D. Ashmore, charles george, Kevin Bealer, Chris Peters

 (00:00) to (02:00)


A little over a decade ago, a neuroscientist stopped by a grocery store on his way to his lab to buy a large Atlantic salmon.  The fish was placed in an MRI machine and then it completed what was called an open-ended mentalizing task, where it was asked to determine the emotions that were being experienced by different people in photos.  Yes, this salmon was asked to do that.  The dead one.  From the grocery store.

But that's not the weird part.   The weird part is that researchers found that so-called significant activation occurred in neural tissue in a couple of places in the dead fish.  Turns out, this was a little bit of a stunt.  The researchers weren't studying the mental abilities of dead fish.  They wanted to make a point about statistics and how scientists use them, which is to say, stats can be done wrong.  So wrong that they can make a dead fish seem alive.

A lot of the issues surrounding scientific statistics come from a little something called a p-value.  The 'p' stands for 'probability', and it refers to the probability that you would have gotten the results you did just by chance.  There are lots of other ways to provide statistical support for your conclusion in science, but the p-value is by far the most common.  It's literally what scientists mean when they report that their findings are significant.  But it's also one of the most frequently misused and misunderstood parts of scientific research, and some think it's time to get rid of it altogether.

The p-value was first proposed by a statistician named Ronald Fisher in 1925.   Fisher spent a lot of time thinking about how to determine if the results of a study were really meaningful and at least according to some accounts, his big breakthrough came after a party in the early 1920s.  At this party, there was a fellow scientist named Muriel Bristol and reportedly, she refused a cup of tea from Fisher because he had added the milk after the tea was poured.  She only liked her tea when the milk was added first.

 (02:00) to (04:00)

Fisher didn't believe she could really taste the difference, so he and a colleague designed an experiment to test her assertion.  They made eight cups of tea, half of which were milk first and half of which were tea first.  The order of the cups was random and most importantly, unknown to Bristol, though she was told that there would be four of each cup.  Then, Fisher had her taste each tea one by one and say whether that cup was milk or tea first, and to Fisher's great surprise, she went eight for eight.  She guessed correctly every time which cup was tea first and which was milk first, and that got him thinking.  What are the odds that she got them all right just by guessing?

In other words, if she really couldn't taste the difference, how likely would it be that she got them all right?  He calculated that there are 70 possible orders for the eight cups if there are four of each mix.  Therefore, the probability that she'd guess the right one by luck alone is 1 in 70.  Written mathematically, the value of P is about .014.  That, in a nutshell, is a p-value: the probability that you'd get the result if chance is the only factor.  In other words, there's really no relationship between the two things you're testing, in this case, how tea is mixed versus how it tastes, but you could still wind up with data that suggests there is a relationship.

Of course, the definition of chance varies depending on the experiment, which is why p-values depend a lot on experimental design.  Say Fisher had only made six cups, three of each tea mix.  Then there are only 20 possible orders for the cups, so the odds of getting them all correct is 1 in 20, a p-value of .05.  
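The arithmetic behind both versions of the tea test is just counting arrangements, and it can be sketched in a few lines of Python (a minimal illustration; Fisher worked this out by hand):

```python
from math import comb

# The taster knows there are equal numbers of milk-first and
# tea-first cups, so a complete guess amounts to choosing which
# half of the cups is milk-first. Exactly one of the
# comb(n, n//2) possible choices matches the true arrangement.
def tea_p_value(n_cups):
    return 1 / comb(n_cups, n_cups // 2)

print(round(tea_p_value(8), 3))  # 8 cups: 1/70, about 0.014
print(round(tea_p_value(6), 3))  # 6 cups: 1/20, exactly 0.05
```

This is why experimental design matters so much: dropping from eight cups to six changes the best possible p-value from .014 to .05 before a single sip is taken.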

Fisher went on to describe an entire field of statistics based on this idea, which we now call null hypothesis significance testing.  The null hypothesis refers to the experiment's assumption of what 'by chance' looks like.  Basically, researchers calculate how likely it is that they've gotten the data that they did, even if the effect they're testing for doesn't exist.  Then, if the results are extremely unlikely to occur if the null hypothesis is true, then they can infer that it isn't.
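The logic of testing against a null hypothesis can also be sketched as a simulation (a hypothetical illustration, not Fisher's actual procedure): assume the taster is guessing blindly, and see how often pure chance reproduces her perfect score.

```python
import random

# Under the null hypothesis, the taster's guess is just a random
# arrangement of four "milk" and four "tea" labels. How often does
# it match the true arrangement by luck alone?
def simulate_guessing(trials=200_000, seed=1):
    rng = random.Random(seed)
    cups = ["milk"] * 4 + ["tea"] * 4
    hits = 0
    for _ in range(trials):
        truth = cups[:]
        guess = cups[:]
        rng.shuffle(truth)
        rng.shuffle(guess)
        hits += truth == guess
    return hits / trials

print(simulate_guessing())  # hovers near 1/70, about 0.0143
```

Because the simulated rate of a perfect score under pure guessing is so low, a real perfect score lets you reject the null hypothesis at the usual thresholds.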

 (04:00) to (06:00)

So in statistical speak, with a low enough p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis they had as the explanation for the results.  The question becomes, how low does a p-value have to be before you can reject that null hypothesis?  Well, the standard answer used in science is less than 1 in 20 odds, or a p-value below .05.  The problem is, that's an arbitrary choice.  It also traces back to Fisher's 1925 book, where he said 1 in 20 was "convenient".  A year later, he admitted the cut-off was somewhat subjective, but that .05 was generally his personal preference.  Since then, the .05 threshold has become the gold standard in scientific research.  A p of less than .05 and your results are "significant".  

It's often talked about as determining whether or not an effect is real, but the thing is, a result with a p-value of .049 isn't more true than one with a p-value of .051.  It's just ever-so-slightly less likely to be explained by chance or sampling error.  This is really key to understand.  You're not more right if you get a lower p-value because a p-value says nothing about how correct your alternate hypothesis is.  

Let's bring it back to the tea for a moment.  Bristol aced Fisher's eight cup study by getting them all correct, which, as we noted, has a p-value of .014, solidly below the .05 threshold, but it being unlikely that she randomly guessed doesn't prove she could taste the difference.  See, it tells us nothing about other possible explanations for her success.  Like, if the teas had different colors rather than tastes, or if she secretly saw Fisher pouring each cup.  Also, it still could have been a 1 in 70 fluke, and sometimes, one might even argue often, 1 in 20 is not a good enough threshold to really rule out that a result is a fluke, which brings us back to that seemingly undead fish.

 (06:00) to (08:00)

The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed.  See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume, so for the fish, they took each of these units and compared the data before and after the pictures were shown to the fish.  That means even though they were just looking at one dead fish's brain before and after, they were actually making multiple comparisons, potentially thousands of them. 

The same issue crops up in all sorts of big studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods or behavioral studies where participants fill out surveys with dozens of questions.  In all cases, even though each individual comparison is unlikely, with enough comparisons, you're bound to find some false positives.  There are statistical solutions for this problem, of course, which are simply known as multiple comparison corrections.
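The arithmetic behind that pile-up is simple: if each independent comparison has a 5 percent false-positive rate, the chance that at least one of n comparisons comes up "significant" on pure noise is 1 minus 0.95 to the nth power. A quick sketch:

```python
# Chance of at least one false positive among n independent
# comparisons, each tested at the usual .05 threshold.
def any_false_positive(n):
    return 1 - 0.95 ** n

for n in (1, 10, 100, 1000):
    print(n, round(any_false_positive(n), 3))
# 10 comparisons already give a ~40% chance of a fluke;
# by 100, a false positive is all but guaranteed.
```

At the thousands of voxel comparisons in an MRI analysis, a few "significant" blips in a dead fish's brain are practically inevitable.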

Though they can get fancy, they usually amount to lowering the threshold for P-value significance and to their credit, the researchers who looked at the dead salmon also ran their data with multiple comparison corrections.  When they did, their data was no longer significant, but not everyone uses these corrections and though individual studies might give various reasons for skipping them, one thing that's hard to ignore is that researchers are under a lot of pressure to publish their work and significant results are more likely to get published.  
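The simplest such correction, the Bonferroni correction, just divides the significance threshold by the number of comparisons (a minimal sketch; the salmon researchers used more sophisticated corrections, and the voxel count here is a made-up example):

```python
# Bonferroni correction: to keep the overall false-positive rate
# near alpha, each individual comparison must clear alpha divided
# by the number of comparisons.
def bonferroni_threshold(alpha, n_comparisons):
    return alpha / n_comparisons

# Hypothetical MRI-style analysis with 8,000 voxel comparisons:
print(bonferroni_threshold(0.05, 8000))  # 6.25e-06
```

Under a threshold that strict, the dead fish's "activation" no longer clears the bar, which is exactly what the salmon researchers found.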

This can lead to p-hacking, the practice of analyzing or collecting data until you get significant p-values.  This doesn't have to  be intentional, because researchers make many small choices that lead to different results, like we saw with the six versus eight cups of tea.  This has become such a big issue because, unlike when these statistics were invented, people can now run tests lots of different ways fairly quickly and cheaply and just go with what's most likely to get their work published.
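How easy is it to p-hack by accident? Under a true null hypothesis, p-values are uniformly distributed between 0 and 1, so a simulation can show how often trying many analyses on pure noise turns up something "significant" (a hypothetical sketch, with 20 analyses chosen arbitrarily):

```python
import random

# Each analysis of pure-noise data yields a p-value uniform on
# [0, 1]. If a researcher tries n_analyses different analyses and
# keeps the best, how often does at least one clear p < .05?
def chance_of_false_find(n_analyses=20, trials=100_000, seed=2):
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < 0.05 for _ in range(n_analyses))
        for _ in range(trials)
    )
    return hits / trials

print(round(chance_of_false_find(), 2))  # near 0.64
```

So with just 20 ways of slicing the data, there's roughly a two-in-three chance of finding a publishable-looking result in pure noise.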

Because of all of these issues surrounding p-values, some are arguing that we should get rid of them altogether, and one journal has totally banned them.  Many who say we should ditch the p-value are pushing for an alternate statistical system called Bayesian statistics.

 (08:00) to (10:00)

P-values, by definition, only examine null hypotheses.  The result is then used to infer if the alternative is likely.  Bayesian statistics actually look at the probability of both the null and alternative hypotheses.  What you wind up with is an exact ratio of how likely one explanation is compared to another.  This is called a Bayes factor and this is a much better answer if you want to know how likely you are to be wrong.  
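To make that concrete, here's a hedged sketch of a Bayes factor for the tea experiment. The model is an assumption of mine, simpler than Fisher's actual design: treat the 8 cups as independent judgments, each correct with probability theta, where the null says theta = 0.5 (guessing) and the alternative puts a uniform prior on theta.

```python
# Bayes factor for 8 correct out of 8 cups.
# H0: theta = 0.5, so P(data | H0) = 0.5 ** 8 = 1/256.
# H1: theta ~ Uniform(0, 1), so P(data | H1) is the integral of
#     theta**8 over [0, 1], which equals 1/9.
p_data_given_h0 = 0.5 ** 8
p_data_given_h1 = 1 / 9
bayes_factor = p_data_given_h1 / p_data_given_h0
print(round(bayes_factor, 1))  # 256/9, about 28.4
```

A Bayes factor of about 28 says the data are roughly 28 times more likely under "she can taste it" than under "she's guessing", a direct comparison of the two hypotheses that a p-value alone never gives you.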

This system was around when Fisher came up with p-values, but depending on the data set, calculating Bayes factors can require some serious computing power, power that wasn't available at the time since, you know, it was before computers.  Nowadays, you can have a huge network of computers thousands of miles from you to run calculations while you throw a tea party, but the truth is, replacing p-values with Bayes factors probably won't fix everything.

A loftier solution is to completely separate a study's publishability from its results.  This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method and the journal decides whether to publish before seeing your results.  That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out the way researchers hoped or whether a p-value or Bayes factor was more or less than some arbitrary threshold.  

This sort of idea isn't widely used yet, but it may become more popular as statistical significance meets sharper criticism.  In the end, hopefully, all this controversy surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don't really show, and that will make things more accessible for all of us who want to read and understand science, and keep any more zombie fish from showing up.  Now, before I go make myself a cup of Earl Grey, milk first, of course, I want to give a special shout out to today's President of Space, SR Foxley.

 (10:00) to (10:40)

Thank you so much for your continued support.  Patrons like you give us the freedom to dive deep into complex topics like p-values, so really, we can't thank you enough, and if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at  Cheerio!