crashcourse
The Replication Crisis: Crash Course Statistics #31
YouTube: | https://youtube.com/watch?v=vBzEGSm23y8 |
Previous: | The Industrial Revolution: Crash Course History of Science #21 |
Next: | Metals & Ceramics: Crash Course Engineering #19 |
Categories
Statistics
View count: | 98,702 |
Likes: | 2,222 |
Comments: | 76 |
Duration: | 14:36 |
Uploaded: | 2018-09-26 |
Last sync: | 2024-11-06 23:30 |
Citation
Citation formatting is not guaranteed to be accurate.
MLA Full: | "The Replication Crisis: Crash Course Statistics #31." YouTube, uploaded by CrashCourse, 26 September 2018, www.youtube.com/watch?v=vBzEGSm23y8. |
MLA Inline: | (CrashCourse, 2018) |
APA Full: | CrashCourse. (2018, September 26). The Replication Crisis: Crash Course Statistics #31 [Video]. YouTube. https://youtube.com/watch?v=vBzEGSm23y8 |
APA Inline: | (CrashCourse, 2018) |
Chicago Full: | CrashCourse, "The Replication Crisis: Crash Course Statistics #31," September 26, 2018, YouTube, 14:36, https://youtube.com/watch?v=vBzEGSm23y8.
Replication (re-running studies to confirm results) and reproducibility (the ability to repeat an analysis on data) have come under fire over the past few years. The foundation of science itself is built upon statistical analysis, and yet there is more and more evidence suggesting that possibly even the majority of studies cannot be replicated. This "replication crisis" is likely being caused by a number of factors, which we'll discuss, along with some of the proposed solutions for ensuring that the results we're drawing from scientific studies are reliable.
Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse
Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:
Mark Brouwer, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwardena, SR Foxley, Sam Ferguson, Yasenia Cruz, Eric Koslow, Caleb Weeks, Tim Curwick, D.A. Noe, Shawn Arnold, Ruth Perez, Malcolm Callis, Advait Shinde, William McGraw, Andrei Krishkevich, Rachel Bright, Mayumi Maeda, Kathy & Tim Philip, Eric Kitchen, Ian Dundore, Chris Peters
--
Want to find Crash Course elsewhere on the internet?
Facebook - http://www.facebook.com/YouTubeCrashCourse
Twitter - http://www.twitter.com/TheCrashCourse
Tumblr - http://thecrashcourse.tumblr.com
Support Crash Course on Patreon: http://patreon.com/crashcourse
CC Kids: http://www.youtube.com/crashcoursekids
Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics.
You might have heard that Power Posing affects how powerful you feel and can change hormone levels. If it does, we’d expect to see that effect over and over and over.
Study after study. And it would be pretty disappointing if one study concludes that eating carrots improves your vision, and then after you rushed to sign up for monthly carrot deliveries...5 similar studies find no evidence that munching carrots is good for your eyes. To make sure that an experimental result is sound, we want to replicate the findings.
Results need to be confirmed. Which is why replication--re-running studies to confirm results--and reproducible analysis--the ability for other scientists to repeat the analyses you did on your data--are essential in science. These issues affect basically every field, from Artificial Intelligence research to social science.
INTRO A few years ago, scientists at a biotech company called Amgen decided to try to replicate more than 50 big-deal cancer treatment studies. These were studies that had been published in respected journals. And the Amgen scientists were only able to replicate the original results 11 percent of the time.
In another reproducibility study...a group of 270 scientists re-ran 100 psychology studies that had been published in 2008 in top-notch journals. Fewer than half of the published results were replicated. Stanford researcher Dr.
John Ioannidis has claimed that “false findings may be the majority or even the vast majority of published research claims”. The journal Nature published a survey a few years back and asked researchers if they thought there was a reproducibility crisis in science. 52% called it a “significant crisis”; another 38% called it a “slight crisis”. And 90% of researchers thinking they have some size of crisis on their hands is a big deal.
The “replicability crisis” has been used in political debates to undermine scientific research. Political activists, especially those that hold opinions that run counter to scientific research, have jumped on the problem of replicability as a way to discredit science more broadly. And when a medical study winds up with invalid conclusions, researchers could head down the wrong path, people could get misguided treatments based on faulty conclusions, they could even get sicker...and a whole lot of money could be wasted researching and providing those treatments.
So, what’s causing science’s replication problem? There are a lot of answers. Some of them involve unscrupulous researchers--researchers that are more concerned with attention and publishing and splashy headlines than good science.
Here we’re talking about fraud. Falsified data. Intentional p-hacking.
Statistical evildoers. But many reasons scientific studies aren’t replicable are less nefarious. One issue related to replication--re-doing studies--is reproducibility of the analyses in a paper.
There’s not always one prescribed way of analyzing a data set. A researcher named Brian Nosek and his team invited 29 groups of researchers to analyze the same data set--and attempt to answer whether or not soccer referees give more red cards to dark-skinned players than light-skinned ones. Seems simple enough.
These researchers were all working with the SAME data--but they ran different tests. Some used linear regressions. Some went with Bayesian models.
And it’s not just the models that the researchers could have differed on. You also have freedom to exclude different outliers, or look at different groups. Twenty of the groups found a statistically significant relationship between skin color and red cards.
Nine groups didn’t. The point, say the researchers, is that no one analysis is gonna find THE answer, THE singular truth. When researchers aren’t clear about how they analyzed their data--from which data points they excluded to the exact model they ran--it can make it hard for someone to reproduce their results, even if they had the same data.
Good papers will have detailed descriptions of researchers’ methods. When you replicate a study, you usually know what model the researcher used, or you can ask.
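As a toy sketch of that kind of analytic flexibility (made-up numbers, not the actual red-card data), here is how a single defensible choice--keeping or dropping one extreme data point--can move the same data set from one side of the significance threshold to the other:

```python
# A toy illustration of analytic flexibility (made-up numbers, not the real red-card data):
# the same data, with one defensible choice changed, can flip which side of p = 0.05 you land on.
import numpy as np
from scipy.stats import pearsonr

# Ten made-up observations, with one extreme point an analyst might reasonably exclude.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20.0])
y = np.array([2.1, 2.9, 2.2, 3.2, 2.4, 3.0, 2.6, 3.3, 2.5, 9.0])

r_all, p_all = pearsonr(x, y)              # analysis 1: keep every data point
r_trim, p_trim = pearsonr(x[:-1], y[:-1])  # analysis 2: drop the extreme point as an outlier

print(f"All points kept:  r = {r_all:.2f}, p = {p_all:.4f}")   # strongly significant with these numbers
print(f"Outlier dropped:  r = {r_trim:.2f}, p = {p_trim:.4f}") # not significant with these numbers
# Neither choice is "wrong" on its face -- which is why methods sections need to spell them out.
```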
But if scientists aren’t clear or consistent about this, it just puts another roadblock in the way of good replication. There are other reasons for the replicability crisis. Some researchers and the folks who report on scientific research don’t fully understand p-values.
They make claims that statistical evidence doesn’t support. Back in 2016, the American Statistical Association released a statement meant to help researchers understand and use P values better. It was reportedly the first time the 170-plus-year-old organization made this type of explicit recommendation.
Among the guidelines the Statistical Association published: “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” And “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.” P-values need to be understood in context. A significant result doesn’t mean we ought to all rush out and change what we’re doing. But if you like carrots, by all means keep eating them.
Another reason science produces results that can’t be reproduced is that published studies have a bias toward overestimating effects--in part because they got published on the strength of a low p-value. Some studies look promising and then aren’t reproducible because they were based on a fluke. When the study is repeated, the fluke doesn’t repeat itself.
The website FiveThirtyEight offers up this explanation: Say you were looking at the relationship between height and college major. You gather up your data--including a class of math majors with a few exceptionally tall kids and a class of philosophy majors with an unusually short student. When you compare the averages--ha ha!
Look at that! Math majors are taller than philosophy majors. You have statistically significant results, but when you repeat the study those differences disappear. There’s regression to the mean, which gives you a more accurate picture of pretty similar average heights for each major, and nothing all that interesting to write about.
Except a correction to your first paper. Small sample sizes also get blamed. The fewer subjects in a study, the more likely you are to get skewed and unreplicable results.
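To see how those small-sample flukes behave, here is a minimal simulation sketch, assuming made-up heights rather than any real data: both groups are drawn from the same distribution, so every "significant" difference is a fluke, and a fluke rarely shows up again when the study is re-run.

```python
# A rough simulation (not from the video) of how small-sample flukes fail to replicate.
# Both "majors" have the SAME true average height, so any significant difference is a fluke.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_per_group = 15          # small classes
true_mean, sd = 170, 10   # same height distribution for both majors (cm)

fluke_count, replicated = 0, 0
for _ in range(10_000):
    math_majors = rng.normal(true_mean, sd, n_per_group)
    philosophy_majors = rng.normal(true_mean, sd, n_per_group)
    if ttest_ind(math_majors, philosophy_majors).pvalue < 0.05:
        fluke_count += 1
        # "Replicate" the significant study with fresh samples of the same size.
        rep_math = rng.normal(true_mean, sd, n_per_group)
        rep_phil = rng.normal(true_mean, sd, n_per_group)
        if ttest_ind(rep_math, rep_phil).pvalue < 0.05:
            replicated += 1

print(f"Fluke 'discoveries': {fluke_count} of 10000 (about 5%, as the 0.05 threshold predicts)")
print(f"Of those, 'replicated' on a second try: {replicated} (again only about 5% of the flukes)")
```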
DFTBAQ, my friends. Even when results make sense to you--DFTBAQ. So where can researchers start improving the process--to help solve this reproducibility crisis?
For one, researchers argue they need to do a whole lot more replication. Replication allows us to weed out false significant effects: the flukes and the “too good to be true” effects that unfortunately make great headlines. We need to get rid of the idea that one significant test is solid proof of anything.
It isn’t. In fact, we need to get rid of the idea that one significant test is even great evidence of anything. But replication is expensive.
And it’s not as sexy as making a new discovery. It doesn’t attract the same media attention, institutional acclaim, or funders. Who wants to say “I found the effect that my colleague found yesterday!”?
So, say researchers, we’ve gotta come up with ways to change those incentives. We need to find more funding for replication studies and change the way we all view the value of replication. Some people call for more publication of “null results”--those that DON’T support the hypothesis.
This would allow quality research to be published, even if it didn’t show an effect, making p-hacking a little less enticing, since you could still get null results published. Some researchers argue another way to help correct the reproducibility crisis is by reconsidering the standard p-value cutoff of .05 for statistical significance. Is it stringent enough?
Or should researchers move it? In 2017, a group of more than 70 researchers co-authored a paper calling for a change in the default p-value threshold from .05 to .005. They wrote: “This simple step would immediately improve the reproducibility of scientific research in many fields.” Calling results with a p-value of less than .05 statistically significant, they argue, results in a high rate of false positives...even when that research is done correctly.
Let’s just look at one area of research we’ve talked about before: Social Priming -- the idea that certain actions or conditions can affect the way you behave. One famous case of social priming is a study where subjects who were exposed to words related to old age--like Florida, bingo, grey, or retired--walked more slowly after exposure than those who were shown neutral words. But recently, many researchers have expressed concerns that some of these social priming results may not hold up.
To see why that might happen, imagine that many experiments were done with many different priming mechanisms and outcome variables. And we’re making this data up here, but let’s say that out of 1000 studies done, about 10%, or 100, ended up with real effects of social priming. This is a table that displays how often our studies resulted in true positives, false positives, true negatives, and false negatives.
The top row shows the 900 studies where social priming DIDN’T work. Because we used a threshold of 0.05, 5% of those 900 studies will still be statistically significant even though there was no effect. Those 45 are our false positives.
That leaves 855 studies where social priming didn't work and we caught it. Those are our true negatives. The next row contains the 100 studies where social priming DID work.
In those studies, there were actual effects of social priming. There you can see our true positives (60) and false negatives (40). So what does that mean?
Well, remember, statistical power is the ability to detect real effects. Sometimes we can fail to get a significant result, even if an effect of a certain size is real. One estimate suggests that most psychology studies have an average of 60% power.
So, that 60 on our table represents the 60 studies where a real effect was detected as statistically significant. The other 40 weren’t caught, giving us false negatives. Using our table, we can look at the percent of significant results that come from studies with no effect.
Our “False Alarms”. Our False Discovery Rate is 45 divided by 105, or 42.9%. That means of all the significant effects that were recorded and published in our thought experiment, a bit less than HALF of them are false positives.
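Here is that arithmetic as a short sketch, using the made-up numbers from the thought experiment (1000 studies, 10% with real effects, a 0.05 threshold, 60% power); the second call applies the stricter 0.005 threshold discussed earlier, holding power fixed as a simplification.

```python
# The thought-experiment arithmetic from above, with the made-up numbers:
# 1000 studies, 10% with real effects, alpha = 0.05, 60% power.
def false_discovery_rate(n_studies, true_effect_rate, alpha, power):
    real = n_studies * true_effect_rate          # studies with a real effect
    null = n_studies - real                      # studies with no real effect
    false_pos = null * alpha                     # nulls that still come out significant
    true_pos = real * power                      # real effects we actually detect
    return false_pos / (false_pos + true_pos)    # share of "discoveries" that are false

print(false_discovery_rate(1000, 0.10, alpha=0.05, power=0.60))   # 45 / 105  ≈ 0.429
# Holding power fixed at 60% (a simplification), a stricter threshold cuts false discoveries sharply:
print(false_discovery_rate(1000, 0.10, alpha=0.005, power=0.60))  # 4.5 / 64.5 ≈ 0.070
```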
Which shows, as we mentioned before, that having a statistically significant effect doesn’t make it REAL. All else being equal, if we had changed the p-value threshold from .05 to .005, we would have way fewer false positives. To make the work of reproducibility easier, there are also pushes underway to encourage researchers to share their data more widely.
In the United Kingdom, for example, many research funders expect researchers will make publicly funded research data available--recognizing the data as a public good. Academic journals also play a role in the conversation around reproducibility--many of the most prestigious journals have adopted guidelines and policies that put more emphasis on reproducibility and transparency. In part, to help boost public trust in science and the scientific process.
Let’s go back to power posing before we finish today. Really get that blood flowing. Confidence building!
A study on power posing was published in Psychological Science back in 2010 that showed that power posing could change hormone levels and boost confidence. A TED talk about power posing was viewed more than 40 million times. Want a raise?
Respect and awe from your friends and family and enemies? Power pose. Or not.
After power-posing went mainstream, other researchers tried to replicate the study--with a larger sample--and didn’t come up with the same results. Other researchers found significant problems with the original study and came to the conclusion that, quote, “the existing evidence is too weak to justify a search for moderators or to advocate for people to engage in power posing to better their lives.” Power-posing got labeled pseudo-science. And then in 2018 the original author published a response to some of the critiques about power-posing...with an analysis that suggested the poses could help people feel more confident and powerful.
Now, the newest paper doesn’t seem to address all of the critiques of the Power Posing study, but it comes to the conclusion that researchers shouldn’t give up on research about the effects of Power Posing quite yet. No. No.
These are not power poses. I’m just trying to find something that indicates confusion. This back and forth of the power posing debate does make it harder to know what’s likely to be true.
But it also shows the VALUE of replication and even the reproducibility crisis in research. Science is a push and pull of ideas--researchers are constantly iterating and expanding on ideas that came before. They refine results.
Build on other people’s findings. Replication is an essential part of the path to scientific progress and real breakthroughs. The reproducibility crisis means more people are taking the replication step of the process seriously.
Replication has helped us accomplish some pretty important things. Like helping change people’s minds about whether smoking caused increases in lung cancer, even though researchers could never do a Randomized Controlled Trial to demonstrate causation. Evidence piled up, and now smoking rates are incredibly low.
No single study is gonna show us the way the world REALLY is, but that study and the studies that follow it that do and don’t find the same relationships will get us closer and closer. And one day maybe we’ll know--with more certainty--whether or not we ought to be putting our hands on our hips and doing the Wonder Woman before a big job interview. Thanks for watching, I’ll see you next time.