


Data mining recently made big news with the Cambridge Analytica scandal, but it is not just for ads and politics. It can help doctors spot fatal infections and it can even predict massacres in the Congo.

Hosted by: Stefan Chin

Dooblydoo thanks go to the following Patreon supporters: Lazarus G, Sam Lutfi, Nicholas Smith, D.A. Noe, سلطان الخليفي, Piya Shedden, KatieMarie Magnone, Scott Satovsky Jr, Charles Southerland, Patrick D. Ashmore, Tim Curwick, charles george, Kevin Bealer, Chris Peters

[♪ INTRO ].

Data. The word is everywhere these days.

Every company is dying to tell you about its big data, data analytics, data privacy, data warehouse, data lake, data data data data. At the center of the data mania is data mining—the practice of sifting through all those piles of information for insights. Data mining recently made big news with the Cambridge Analytica scandal.

The political consultancy reportedly sucked up data about millions of Facebook users without their knowledge, then used it to profile and sway voters in the US, UK, and elsewhere. And similar techniques let companies like Amazon, Facebook, and Google work out what we want to see or buy—sometimes with shocking accuracy. It’s a little creepy.

It’s not just ads and politics, either. Data mining allows airlines to predict who’s going to miss a flight; it tells big-box stores who’s pregnant; it helps doctors spot fatal infections; and it’s even enabled cell phone companies to predict massacres in the Congo. The power of data mining and the hype surrounding it can make it sound like a magic wand—one that will either save your business or sink democracy.

Of course, data mining doesn’t really involve any unicorn hair or phoenix tail feathers. It’s just applied statistics, searching lots of data points for patterns that humans might not spot. Those patterns are based not on human intuition, but on whatever the data suggests, so sometimes they can seem incredibly subtle or even alien.

But there’s no more magic in data mining than there is in a weather forecast. In fact, data mining is a lot like meteorology. Meteorologists aim for two things: first, they want to describe patterns in the weather—to boil down its massive complexity into a few numbers and equations.

And second, they want to predict Tuesday’s weather. That’s the whole point. Similarly, Spotify’s data scientists might be interested in describing medieval rock fans, recognizing them as a group distinct from nerdcore or freak folk fans — yes, that's a real sub-genre.

Ultimately, though, what’s most important to companies like Spotify is predicting what each person wants to listen to. The key with data mining is that it achieves description and prediction not through careful study by experts, but by analyzing large amounts of data. In Spotify’s case, that might mean scanning for patterns in genre labels, acoustic attributes, Internet reviews, and anything else about each track, plus the age, location, friend group, and other scraps of information about each user.

Data mining is more about spotting patterns than explaining them. Of course, the words “pattern” and “data” can mean just about anything.

There are no clear definitions for data mining, data science, or big data, and they’re sometimes used interchangeably with each other or with machine learning. That’s why it’s so easy to slap these buzzwords onto any project for instant venture capital karma. That being said, a few types of techniques consistently earn the “data mining” label.

The most broadly applicable one is classification, where you try to categorize things. For example, Target famously realized as early as 2002 that they could guess who was pregnant and send them baby-related coupons. That’s a textbook classification problem: Target needed to assign each customer to one of two categories: either “probably pregnant” or “probably not pregnant.” Classification typically works in several stages.

First, each example, or instance, has to be broken down into a collection of numerical attributes, or features. For a store like Target, an instance might be your mom 7 months before you were born. The features would be things like “How many bottles of unscented lotion did she buy in the last three months? How about in the quarter before that?” And likewise for zinc supplements, Asian pears, and every other product in the inventory.

The store would also need labels for some chunk of the data: the ground truth about whether those customers were pregnant. Target got those labels from baby registries and due dates customers had shared.

Once the data’s all lined up, it’s time for training. That’s where the system tries to tease out patterns from all the labeled examples. Learning to classify is such a basic, common need that dozens of algorithms, the mathematical procedures computer programs follow, have been devised for it.

Which algorithm works best depends on all kinds of factors, like how many categories there are and how different features are connected to each other. But many classification algorithms are similar in that they treat each feature as a drop of evidence for one category or the other. The features get weights indicating how strongly they boost or weaken someone’s chances of falling into the “yes” category — that they are pregnant, for example.

Those weights are what the system learns during training. Basically, it’s figuring out how informative each attribute is. Finally, to classify instances the system hasn’t seen before, it puts together all the weighted contributions, and maybe stuffs the resulting number through a bit of mathematical machinery to slide it up or down.

If the result is negative, that instance goes in the “no” bucket. If it’s positive—load up the crib coupons! Each individual feature doesn’t tell you much.
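Here is a minimal sketch of that weigh-the-evidence idea, assuming a perceptron, which is one of the simplest classifiers of this kind. It is not Target's actual method, and the shoppers, features, and labels are all invented for illustration.

```python
# Hypothetical shoppers: [unscented lotion, zinc supplements, beer] bought last quarter.
# Labels: 1 = pregnant (say, known from a registry), 0 = not. All numbers are invented.
TRAIN = [
    ([3, 2, 0], 1),
    ([2, 1, 0], 1),
    ([4, 3, 1], 1),
    ([0, 0, 3], 0),
    ([1, 0, 4], 0),
    ([0, 1, 2], 0),
]

def score(weights, bias, features):
    """Add up the weighted evidence from every feature."""
    return bias + sum(w * x for w, x in zip(weights, features))

def classify(weights, bias, features):
    """A positive score goes in the 'yes' bucket; anything else goes in 'no'."""
    return 1 if score(weights, bias, features) > 0 else 0

def train(examples, epochs=20, step=0.1):
    """Perceptron training: nudge the weights whenever a guess is wrong."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - classify(weights, bias, features)
            for i, x in enumerate(features):
                weights[i] += step * error * x
            bias += step * error
    return weights, bias

weights, bias = train(TRAIN)
print(weights)                              # the learned weight for each feature
print(classify(weights, bias, [3, 1, 0]))   # lotion and zinc, no beer
print(classify(weights, bias, [0, 0, 5]))   # beer only
```

In this toy data, lotion and zinc end up with positive weights and beer with a negative one, so each purchase pushes the score toward "yes" or "no".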

In fact, many turn out to be irrelevant. But together they can be really powerful. Target’s approach worked so well that when one customer complained that his teenage daughter was getting coupons for baby clothes, he ended up apologizing to Target.

Turned out the company knew about his daughter’s pregnancy before he did! Classification is useful any time you want to tell one group of things from another. Insurance companies use it to guess which elderly patients will die soon so that they can start end-of-life counseling.

Doctors use it to check whether premature babies are developing dangerous infections, since the classifier can put together subtle disease indicators before humans would notice any signs. I could spend all day listing uses for classification, but it’s far from the only type of data mining. One close cousin is known as regression.

And no, that doesn’t mean deciding you like Limp Bizkit again. In regression, instead of predicting a category, the goal is to predict a number. Take Target again.

They wanted to know not just whether each customer was pregnant, but when to send each coupon. So they managed to estimate due dates, too. That’s a regression question—how many weeks until the customer gives birth.

Regression often depends on dozens or even thousands of variables—the features that describe each example. It finds an equation or curve to fit the data points, telling you how high you’d expect the curve to be given any arbitrary input. Or in this case, how far away you’d expect the customer’s due date to be.
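To make the curve-fitting idea concrete, here is a small sketch of ordinary least squares with a single feature. Real systems use many features at once, and the purchase counts and week numbers below are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Invented data: prenatal-vitamin purchases so far vs. weeks until the due date.
purchases = [1, 2, 3, 4, 5]
weeks_left = [32, 27, 24, 19, 16]

slope, intercept = fit_line(purchases, weeks_left)

def predict_weeks(purchase_count):
    """Read the fitted line at a new input to estimate weeks remaining."""
    return intercept + slope * purchase_count

print(slope, intercept)
print(predict_weeks(6))
```

The fitted slope is negative here: in this invented data, each extra purchase is associated with a due date about four weeks closer.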

Like in classification, many regression techniques give each feature a weight, then combine the positive and negative contributions from the weighted features to get an estimate. And, also like classification, regression is used everywhere. One of the better-known examples is Google Flu Trends.

In 2008, it began publishing real-time estimates of how many people had the flu based on searches for words like “fever” and “cough.” Regression is also part of predictive policing software: programs that look at historical data to guess how likely a crime is to occur in each area.

The third major data mining technique is clustering. As the name suggests, the goal here is to group data points in a way that helps with the analysis.

In the marketing world, clustering emerged in the 1980s—well before data mining—with the work of a market researcher named Howard Moskowitz. He struck gold when he realized there wasn’t one best pasta sauce. Consumers showed three distinct types of preferences—and the previously unrecognized group that craved extra-chunky turned out to be worth millions.

Clustering is often used to analyze market segmentation like this, but to understand how the techniques work, let’s take a different example: eBay. On eBay, you can get millions of products, from antiques to zip ties. Even within a single category, like electronics, the selection is overwhelming.

So eBay organizes things into subcategories. But it’s a pain for humans to trawl through all the electronics, identify subcategories, and assign every product to a subcategory. Instead, the company can use clustering to automatically group the products.

Again, each product first has to be broken down into numerical features, like how many times “printer” appears in the description, or who manufactured it. The simplest clustering method is to guess how many distinct subcategories there should be. Then you randomly lump items together into that many clusters, and keep shifting items between groups to make each cluster tighter.
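That lump-then-shift procedure is commonly known as k-means. Here is a rough sketch of it, with two assumptions worth flagging: the starting lumps are assigned round-robin rather than randomly so the result is repeatable, and the listings and word counts are invented.

```python
# Invented listings: (times "printer" appears, times "camera" appears) in the description.
ITEMS = [(5, 0), (4, 1), (6, 0), (0, 5), (1, 4), (0, 6)]

def sq_dist(p, q):
    """Squared straight-line distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_rounds=100):
    # Start from an arbitrary lumping: item i goes to cluster i % k.
    assignment = [i % k for i in range(len(points))]
    for _ in range(max_rounds):
        centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:              # if a cluster emptied out, reseed it with one point
                members = [points[c]]
            centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        # Shift every item to whichever cluster center is now closest.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # nothing moved, so the clusters are settled
            break
        assignment = new_assignment
    return assignment

print(kmeans(ITEMS, 2))
```

With these made-up counts, the three printer-heavy listings settle into one cluster and the three camera-heavy listings into the other.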

In the end, similar products end up settling into clusters together. But we don’t have to stop there! The blue and silver versions of the same camera don’t really deserve separate listings; they’re variants of the same product.

So in addition to subcategories, it would be nice to find listings to merge. Sites like eBay can do both simultaneously with a technique called hierarchical clustering. Rather than a single set of categories, hierarchical clustering produces a sort of taxonomic tree.

For example, it might find that cameras are much more like each other than like TVs. But within cameras, the DSLRs and point-and-shoots each get their own subgroup, albeit slightly less distinct ones. And within those are many different models, each with a few variants.

Companies like Cambridge Analytica use these techniques to look for groups of voters who will respond to the same kinds of advertising, and Spotify can use them to guess who will like similar music.
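For the curious, here is one way the tree-building step of hierarchical clustering might look. This sketch assumes a bottom-up approach that keeps merging the two closest groups, measuring closeness by their nearest members. The listing names and feature values are invented, and nothing here is known to be what eBay actually runs.

```python
from itertools import combinations
from math import dist

# Invented listings with two made-up numerical features each.
LISTINGS = {
    "DSLR blue": (9.0, 1.0),
    "DSLR silver": (9.1, 1.0),
    "point-and-shoot": (7.0, 1.5),
    "TV 40in": (1.0, 9.0),
    "TV 42in": (1.2, 9.3),
}

def build_tree(items):
    """Merge the two closest groups until one tree is left (nested tuples of names)."""
    # Each cluster is (tree so far, list of member points).
    clusters = [(name, [point]) for name, point in items.items()]
    while len(clusters) > 1:
        # Pick the pair of clusters whose nearest members are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(p, q)
                for p in clusters[pair[0]][1]
                for q in clusters[pair[1]][1]
            ),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

tree = build_tree(LISTINGS)
print(tree)
```

The two DSLR color variants merge first, the point-and-shoot joins them later, and the TVs only link up with the cameras at the very top of the tree.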

The fourth staple of data mining is anomaly detection. It’s basically a special case of classification—identifying instances that are unusual or worrisome. The IRS uses anomaly detection to spot likely tax evaders, and credit card companies use it to flag transactions that don’t fit your usual buying habits.

It also helps industries with heavy-duty equipment. For instance, power companies and airlines can see when a generator or jet engine is starting to vibrate differently than usual. Some anomalies can be detected just by looking for deviations from averages.
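As a quick sketch of the deviation-from-average approach: flag any reading that sits more than a couple of standard deviations from the mean. The vibration readings and the cutoff of 2 are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(readings, threshold=2.0):
    """Return the readings more than `threshold` standard deviations from the mean."""
    mu = mean(readings)
    sigma = stdev(readings)
    return [r for r in readings if abs(r - mu) > threshold * sigma]

# Invented vibration readings from a hypothetical engine sensor.
vibration = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 3.0]
print(flag_anomalies(vibration))
```

Only the last reading stands out here. One caveat is that a big outlier also inflates the average and the spread it is being compared against, which is one reason for the fancier techniques.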

Fancier techniques include looking for instances that don’t match any cluster, or comparing instances with the closest other examples to see if their feature values are far off.

Finally, association learning reveals which birds are of a feather. The idea is to look through, say, millions of grocery store purchases to see what gets bought together and when.

A classic example is the Osco drug store chain, which once found that many customers bought beer and diapers together on Friday evenings. Contrary to popular legend, the store never acted on this profound insight, but stores regularly use observations like this to optimize their floor layouts and inventory. For instance, Walmart discovered that shoppers buy lots of Pop-Tarts immediately before hurricanes, so it started to stock up.
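The simplest version of this is just counting which pairs of items show up in the same basket. Here is a toy sketch with invented shopping baskets; real association mining also accounts for how common each item is on its own.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def pair_counts(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so the same two items always produce the same pair key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(BASKETS)
top_pair, times = counts.most_common(1)[0]
print(top_pair, times)
```

In this made-up data, beer and diapers turn up together in three of the five baskets, more than any other pair.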

Association learning has broader applications, too. Celtel, an African cell phone company, realized it could spot impending massacres in the Congo when everyone nearby started buying prepaid phone cards. The five strategies we’ve covered (classification, regression, clustering, anomaly detection, and association learning) form the backbone of data mining.

What makes them so powerful is that they offer standard mathematical tools you can use for everything from curating Facebook feeds to optimizing store layouts. But that ease of use can also lead people astray. Data mining is just one step in the process of extracting knowledge from data—and it’s all too easy to whip out an algorithm without carefully selecting the data, massaging it into the right form, and considering how to interpret the results.

Remember Google Flu Trends? It shut down after a few years, but not because the algorithm was broken. Search auto completion had totally thrown off the data, and engineers had given it too much leeway to interpret seasonal words like “snow” as evidence of the flu.

Then there are the queasy social implications of sharing data in the first place, and of letting companies form such an intimate understanding of our behavior. In other words … the creep factor. So as powerful as it is, the math of data mining is just the beginning.

Sometimes the hardest part is all the messy human stuff. Thanks for watching this episode of SciShow! If you’re interested in the ways companies can use psychology to learn even more about you from your data, you can check out our video about that over on the SciShow Psych channel. [ ♪OUTRO ].