There is a lot of excitement around the field of Big Data, but today we want to take a moment to look at some of the problems it creates. From questions of bias and transparency to privacy and security concerns, there is still a lot to be done to manage these problems as Big Data plays a bigger role in our lives.

Special thanks to Dr. Sameer Singh, the University of Washington, and the University of California Irvine for the content provided in this video.

Crash Course is on Patreon! You can support us directly by signing up at

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Sam Buck, Mark Brouwer, James Hughes, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwardena, SR Foxley, Sam Ferguson, Yasenia Cruz, Eric Koslow, Caleb Weeks, Tim Curwick, D.A. Noe, Shawn Arnold, Malcolm Callis, Advait Shinde, William McGraw, Andrei Krishkevich, Rachel Bright, Mayumi Maeda, Kathy & Tim Philip, Jirat, Ian Dundore

Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics.

In the last episode, we talked about the value of Big Data. But, as Big Data (and the statistics we do with it) permeates more areas of our lives, there are also new problems that come up.

How can we learn from useful data while still keeping it safe and private? When there’s SO much data that we have to rely on algorithms to manage it, can we trust those algorithms? Today, we’re going to have a discussion about the potential downsides of Big Data in our lives and some possible solutions.

INTRO

Let’s start with a Thought Bubble. This story comes out of a collaboration between the University of Washington and the University of California Irvine. The team wanted to create an algorithm that could take an image and determine whether it was of a husky or a wolf.

To do that, they trained the algorithm with a bunch of pictures. These images are a great example of “BIG DATA” that we CAN wrap our heads around. Pictures are generally made up of millions of tiny pixels.

And each pixel is made up of three colors, red, green, and blue. So three values per pixel is a lot of data. The algorithm ended up doing pretty well.

The team may have anticipated that it would recognize the animals’ different body types, facial features, or body placement. But it turned out that it wasn’t focusing on the animals’ appearances at all. It was mainly looking at snow.

In the data used to create the algorithm, the research team inadvertently included many photos of wolves in the snow. Huskies were often pictured without any snow around. The algorithm picked up on that and glommed onto it as an easy way to tell if something was a husky or wolf.

Once the researchers learned this, they did an experiment, feeding the algorithm new images that had been digitally altered. According to one of them, Dr. Sameer Singh, “When we hid the wolf in the image and sent it across, the network would still predict that it was a wolf, but when we hid the snow, it would not be able to predict that it was a wolf anymore.” The algorithm just learned from what it was given.

And based on the data it was trained with, snow meant the image was way more likely to be a wolf. Even more than adorable wolf-i-ness. Thanks, Thought Bubble.
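To make the snow shortcut concrete, here is a minimal sketch in Python. It is not the researchers' network: the photos are reduced to two made-up yes/no features, the training set is invented for illustration, and the "algorithm" simply keeps whichever single feature best separates the training labels. The names `TRAIN`, `FEATURES`, `pick_best_feature`, and `predict` are hypothetical.

```python
# Toy training set, invented for illustration: each photo is reduced to
# two yes/no features plus its true label.
# Every wolf photo has snow; no husky photo does.
TRAIN = [
    # (has_snow, wolf_like_face, label)
    (True,  True,  "wolf"),
    (True,  True,  "wolf"),
    (True,  False, "wolf"),
    (True,  True,  "wolf"),
    (False, False, "husky"),
    (False, True,  "husky"),
    (False, False, "husky"),
    (False, False, "husky"),
]

# Column index of each feature in a row.
FEATURES = {"has_snow": 0, "wolf_like_face": 1}


def accuracy_of(feature_index, rows):
    """Training accuracy of the rule: feature is True -> wolf, else husky."""
    correct = sum(
        1 for row in rows
        if ("wolf" if row[feature_index] else "husky") == row[2]
    )
    return correct / len(rows)


def pick_best_feature(rows):
    """Keep the single feature with the highest training accuracy."""
    return max(FEATURES, key=lambda name: accuracy_of(FEATURES[name], rows))


def predict(feature_name, photo):
    """Classify a (has_snow, wolf_like_face) photo using one feature."""
    return "wolf" if photo[FEATURES[feature_name]] else "husky"


best = pick_best_feature(TRAIN)

# A husky photographed in the snow: snowy background, no wolf-like face.
husky_in_snow = (True, False)
print(best, predict(best, husky_in_snow))
```

On this toy data the snow feature separates the training photos perfectly, so it wins, and a husky standing in snow gets labeled a wolf. The rule is only as good as the photos it was given.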

That algorithm brings us to our first concern with Big Data: bias. The defining characteristic of “Big Data” is that it’s big. It’s too big for a lot of the usual programs we use to look at data. In fact, it’s sometimes even too big for us to comprehend. And when huge amounts of data are used to create algorithms, we can inadvertently introduce bias. Like the wolf and snow problem. Other algorithms could do similar things, but with higher stakes. Ones used to determine mortgage and insurance rates, or assess the risk someone will do something illegal in the future, might pick up on things like race or other minority statuses.

And this is real. Judges in the U.S. use risk assessment programs while making sentencing decisions. A commonly used one is COMPAS, which was created by the company Equivant. It basically gives a score of how likely a person is to commit another crime within two years. In 2016, ProPublica published an investigation of COMPAS. They looked at the scores of 7,000 people who had been arrested in Broward County, Florida. The scores were compared with whether those people actually ended up committing crimes again within two years.

In addition to other concerning revelations, ProPublica found, “The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants. [And] white defendants were mislabeled as low risk more often than black defendants.” Equivant, then called Northpointe, did disagree with these findings. But, the company’s founder, Tim Brennan, also claimed that in order to make scores as accurate as possible, certain factors had to be included that could correlate with race. ProPublica cited the examples “poverty, joblessness and social marginalization.” So we can’t consider ourselves “safe” just because we think that our data is neutral.
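The kind of comparison described here, how often each group is wrongly flagged as high risk, comes down to a false positive rate computed separately per group. Below is a minimal sketch of that calculation in Python. The records are invented for illustration and are not ProPublica's data. The group labels "A" and "B" and the function name `false_positive_rates` are hypothetical.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per-group false positive rate.

    records: iterable of (group, flagged_high_risk, reoffended) tuples.
    A false positive is someone flagged as high risk who did NOT reoffend,
    so the rate is computed only among people who did not reoffend.
    """
    flagged = defaultdict(int)
    did_not_reoffend = defaultdict(int)
    for group, flagged_high_risk, reoffended in records:
        if not reoffended:
            did_not_reoffend[group] += 1
            if flagged_high_risk:
                flagged[group] += 1
    return {g: flagged[g] / n for g, n in did_not_reoffend.items()}


# Invented records, for illustration only.
records = (
    [("A", True, False)] * 4     # group A: flagged, did not reoffend
    + [("A", False, False)] * 6  # group A: not flagged, did not reoffend
    + [("A", True, True)] * 5    # group A: flagged, did reoffend
    + [("B", True, False)] * 2   # group B: flagged, did not reoffend
    + [("B", False, False)] * 8  # group B: not flagged, did not reoffend
    + [("B", True, True)] * 5    # group B: flagged, did reoffend
)

rates = false_positive_rates(records)
print(rates)
```

In this made-up example, group A is wrongly flagged at twice the rate of group B, even though the score catches every reoffender in both groups. Overall accuracy alone would hide that gap.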

We should also look for ways to make sure that the data we use to create our algorithm is as representative as possible. If we wanted to build an algorithm that predicted the success of CEOs, and we ONLY gave it examples of males who succeeded and females who failed, then our algorithm would have a bias. We have to supply it with good, unbiased data. Males who succeeded and failed, and females who succeeded and failed.

In the tech world, you’ll often hear the phrase “Garbage In, Garbage Out,” which means that bad input will lead to bad output. You can’t put biased data into an algorithm and expect an unbiased output.

It can be hard to determine what kind of data will lead to biased decision making, especially considering most of these algorithms are proprietary. In the Equivant example I mentioned earlier, the company wouldn’t reveal the details of the algorithm used for COMPAS to ProPublica for that exact reason. It’s also hard to figure out exactly what an algorithm is doing from the time we give it raw data to the time it gives us an output, or decision.

With the methods we’ve talked about in this series, like regression, it’s easy to see which variables they consider important. But other Big Data methods, like neural networks, are often way less forthcoming with the “reasoning” behind their outputs. While we can’t always tell what algorithms are doing, some researchers have made other algorithms that can act as a sort of translator to turn the complex calculations of another algorithm into something humans can understand. The more humans can understand what an algorithm is doing, the more opportunities we have to recognize biased data and the resulting decisions.

Some believe that search and social media websites that use algorithms to affect your experience based on your data should be required to release more information about that algorithm -- how it works and what it’s doing. That’s called algorithmic transparency.

Privacy is another big concern in the Age of Big Data. There’s all kinds of personal data about you that you might not want people to know. There are your entertainment choices, like what you’re reading and how many times you’ve watched Fuller House or listened to “I Like It”. And your school or work -- emails, cloud services, web browsing. Even your basic information like your location, your step count, or heart rate is tracked by your various smart devices. Companies like 23andMe might even have your genetic code. Maybe you spend a lot of time at a place with security cameras. There are a lot of questions when thinking about privacy: Who has access to all that information? What are they doing with it? Who are they sharing it with? And what assumptions are they making about us with the data they have?

In 2018, the European Union implemented a law -- the General Data Protection Regulation, or GDPR for short -- that addresses a lot of the privacy concerns people have with the use of Big Data. It requires companies that deal with Big Data to be more transparent about what they’re collecting and who can see it. And it might be one of the reasons you got a LOT of emails about updated Privacy Policies… back in May of 2018.

The U.S. has the Children’s Online Privacy Protection Act, which went into effect in 2000. It’s intended to protect the privacy of children under the age of thirteen. The Act basically requires websites and apps to get parental approval for the personal information they might collect from kids. And using that information for targeted ads is not allowed.

In 2018, a study of about 6,000 children’s apps was published in the journal Proceedings on Privacy Enhancing Technologies. It found that about 57% of them were “potentially violating COPPA.” Examples of violations included “sharing of personal information without applying reasonable security measures,” “potential sharing [of] persistent identifiers with third parties for prohibited purposes,” and “[sharing] location or contact information without consent.” Later that year, the attorney general of New Mexico filed a lawsuit against an app maker for violating COPPA.

Privacy laws have been around for a long time all over the world. But as they pertain to Big Data, a lot of this stuff is new, and we’re still figuring it out.

At the same time, when universities, hospitals, and other organizations share data, we learn a lot. It can be useful. A health organization’s survey on risky behaviors, like drug use, could have incredibly valuable results for researchers and policy makers. So, we can try to make it so that data can’t be easily connected to the specific person it came from. The obvious first step is to not include people’s names, or other unique, identifying personal information. But that may not be enough. If someone has a rare disease, simply knowing the city where they live might be enough to figure out who they are.

One option to combat this issue is to make sure that there are at least 2 subjects that have the same characteristics. This is called k-anonymity. K is the number of subjects who share the exact same characteristics. If there are two people with that disease from that city, we have 2-anonymity because there are 2 subjects with the same characteristics. In our dataset, these two subjects are indistinguishable from each other, which helps keep the data private. And the larger k is, on average, the better.
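Checking k for a dataset can be sketched in a few lines of Python: group the records by the characteristics that could identify someone, then take the size of the smallest group. The records, the column names `city` and `condition`, and the function name `k_anonymity` are all made up for illustration. Treating those two columns as the identifying characteristics is an assumption of this sketch.

```python
from collections import Counter


def k_anonymity(rows, identifying_columns):
    """Return k: the size of the smallest group of rows that share the
    exact same values in every identifying column (0 for an empty dataset).
    """
    groups = Counter(
        tuple(row[column] for column in identifying_columns) for row in rows
    )
    return min(groups.values()) if groups else 0


# Invented survey records, with names already removed.
survey = [
    {"city": "Springfield", "condition": "rare disease"},
    {"city": "Springfield", "condition": "rare disease"},
    {"city": "Springfield", "condition": "flu"},
    {"city": "Springfield", "condition": "flu"},
    {"city": "Springfield", "condition": "flu"},
]

k = k_anonymity(survey, ["city", "condition"])
print(k)

# Adding one person with a unique combination drops k to 1:
# that person can be singled out.
k_with_outlier = k_anonymity(
    survey + [{"city": "Shelbyville", "condition": "rare disease"}],
    ["city", "condition"],
)
print(k_with_outlier)
```

Here the smallest group is the two Springfield residents with the rare disease, so the dataset is 2-anonymous. The dataset's k is set by its most exposed group, not by the average group.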

Outside of medical research, there are debates about what companies should be expected to keep private. DNA companies, for example. In 2018, Joseph James DeAngelo was arrested as the suspected Golden State Killer. Investigators found DeAngelo because they had DNA from a crime scene, which they uploaded to a public, online genealogy database called GEDmatch. The database doesn’t collect DNA, but lets people upload profiles. So, the investigators were able to connect DeAngelo’s DNA with other relatives and figure out who he was from there. Even though he hadn’t uploaded anything to the site personally, the information his relatives had submitted was enough. GEDmatch does have rules about whose DNA you can upload to the site, like you can upload your own or someone else’s with permission. Their site policy also currently states that DNA can be uploaded if it was “obtained and authorized by law enforcement to either: (1) identify a perpetrator of a violent crime against another individual; or (2) identify remains of a deceased individual.” And the revelation that cases could get solved this way has led to questions of how private companies with DNA databases should be keeping their data. Currently, we don’t know how often cases get solved like this. Although a spokesperson for 23andMe told the New York Times that they’ve received “a handful of inquiries over the course of 11 years” from law enforcement. He claimed that data was never handed over.

Criminal investigations aside, it’s regular practice for both 23andMe and AncestryDNA to share data with medical researchers. Though participants can opt in or out. And in 2018, it was announced that the pharmaceutical company GlaxoSmithKline invested $300 million in 23andMe for drug development with the company’s resources. We have privacy laws in the U.S., but a lot of this is still ambiguous.

When it comes to your personal privacy, the best thing you can do is try to be as informed as possible about what’s happening to your data when you put it out there.

And all the data out there means there’s a lot of information that can be stolen. And yes, better technology DOES allow for more protections, like encryption, but it also exposes our data to wider-scale breaches. Hackers are after your personal information, information that can be used to set up lines of credit, like when Equifax was hacked in 2017. They also want your photos (remember the iCloud hack in 2014), your indiscretions (Ashley Madison), your email addresses (Yahoo was hacked back in 2013 and 2014), and your business files (remember the Sony studios hack after “The Interview”). Hackers have no qualms about cutting off your access to play FIFA, like when the PlayStation Network shut down after an attack in 2011.

Companies and institutions like these that collect our data have a responsibility to protect it. But just how much responsibility, and what happens when they don’t? We’re still figuring that out. We don’t want to let our excitement about Big Data outpace our caution. We don’t want to be like the scientists in Jurassic Park: so preoccupied with whether we could, and not stopping to think about whether we should. As a society we need to think about and implement solutions to the problems Big Data creates. We want to use it for good, not not-good. Thanks for watching. I'll see you next time.