


Last episode we talked about some basic file formats, but what we didn’t talk about is compression. Often files are way too large to be easily stored on hard drives or transferred over the Internet. The solution, unsurprisingly, is to make them smaller. Today, we’re going to talk about lossless compression, which gives you exactly the same data when reassembled, as well as lossy compression, which uses the limitations of human perception to remove less important data. From listening to music and sharing photos to talking on the phone and even streaming this video right now, the ways we use the Internet and our computing devices just wouldn’t be possible without the help of compression.

 (00:00) to (02:00)

 Introduction (00:00)

Hi, I'm Carrie Anne, and welcome to Crash Course Computer Science. Last episode we talked about files, bundles of data stored on a computer that are formatted and arranged to encode information, like text, sound, or images. We even discussed some basic file formats like text (.txt), wave (.wav), and bitmap (.bmp). While these formats are still used today, their simplicity also means they're not very efficient.

Ideally, we want files to be as small as possible, so we can store lots of them without filling up our hard drives and also transmit them more quickly. Nothing is more frustrating than waiting for an email attachment to download (ugh!). The answer is compression, which literally squeezes data into a smaller size.

To do this we have to encode data using fewer bits than the original representation. That might sound like magic, but it's actually computer science.

 Uncompressed Images (00:58)

Let's return to our old friend from last episode, Mr. Pac Man. This image is 4 pixels by 4 pixels. As we discussed, image data is typically stored as a list of pixel values.

To know where rows end, image files have metadata, which defines properties like dimensions. But to keep it simple today, we're not going to worry about it. Each pixel's color is a combination of three additive primary colors: red, green, and blue. We store each of these values in one byte, giving us a range of 0 to 255 for each color.

If you mix full intensity red, green, and blue, that's 255 for all three values, you get the color white. If you mix full intensity red and green, but no blue at 0, you get yellow. We have 16 pixels in our image and each of those needs three bytes of color data.

That means this image's data will consume 48 bytes of storage, but we can compress the data and pack it into fewer than 48 bytes.
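The arithmetic above can be checked with a short Python sketch. The pixel order in `LAYOUT` is an assumption: the transcript doesn't spell out the full image, so this is a layout reconstructed to match the counts mentioned in this episode. The names `COLORS` and `LAYOUT` are just for illustration.

```python
# RGB values, one byte per channel (0 to 255).
COLORS = {
    "W": (255, 255, 255),  # white: full red, green, and blue
    "Y": (255, 255, 0),    # yellow: full red and green, no blue
    "B": (0, 0, 0),        # black: nothing at all
}

# Assumed 4x4 layout, read row by row (hypothetical reconstruction).
LAYOUT = "WYBY" "YYYY" "YYWW" "WYYY"

# Flatten every pixel into its three color bytes.
raw = bytes(channel for letter in LAYOUT for channel in COLORS[letter])

print(len(LAYOUT))  # 16 pixels
print(len(raw))     # 48 bytes
```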

 Run-Length Encoding (01:55)

One way to compress data is to reduce repeated or redundant information. The most straightforward way to do this is called run-length encoding. This takes advantage of the fact that there are often runs of identical values in files. For example, in our Pac Man image there are seven yellow pixels in a row.

Instead of encoding the redundant data (yellow pixel, yellow pixel, yellow pixel, and so on), we can just state that there are seven yellow pixels in a row by inserting an extra byte that identifies the length of the run, like so, and then we can eliminate the redundant data behind it. To ensure that computers don't get confused about which bytes are run lengths and which bytes represent color, we have to be consistent in how we apply this scheme, so we need to preface every pixel with its run length. In some cases this actually adds data, but on the whole, we've dramatically reduced the number of bytes we need to encode this image. We're now at 24 bytes, down from 48. That's 50% smaller, a huge saving.
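Here's a minimal sketch of that scheme in Python: one run-length byte followed by three color bytes for every run. The pixel order in `LAYOUT` is an assumed reconstruction that matches the counts in this episode, and `rle_encode` is a hypothetical helper name. Because the length lives in a single byte, this sketch simply rejects runs longer than 255 rather than splitting them.

```python
from itertools import groupby

COLORS = {"W": (255, 255, 255), "Y": (255, 255, 0), "B": (0, 0, 0)}
LAYOUT = "WYBYYYYYYYWWWYYY"  # assumed pixel order, row by row


def rle_encode(pixels):
    """Encode pixels as repeated (run length byte, R, G, B) groups."""
    out = bytearray()
    for color, group in groupby(pixels):
        run_length = len(list(group))
        if run_length > 255:
            raise ValueError("run does not fit in one byte")
        out.append(run_length)  # the extra byte that says how long the run is
        out.extend(color)       # the color, stored once instead of many times
    return bytes(out)


pixels = [COLORS[letter] for letter in LAYOUT]
encoded = rle_encode(pixels)
print(len(encoded))  # 24
```

Note that the three single pixels at the start each cost four bytes instead of three, which is the "actually adds data" case mentioned above.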

 Lossless compression (02:40)

Also note that we've not lost any data. We can easily expand this back to the original form without any degradation.

A compression technique that has this characteristic is called lossless compression, because we don't lose anything. The decompressed data is identical to the original data before compression, bit for bit.
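To see the "bit for bit" claim in action, here's a small Python sketch that expands a run-length encoded stream and compares it with the uncompressed bytes. The 24-byte stream in `ENCODED` and the pixel order in `LAYOUT` are assumptions reconstructed from the counts in this episode, and `rle_decode` is a hypothetical helper name.

```python
COLORS = {"W": (255, 255, 255), "Y": (255, 255, 0), "B": (0, 0, 0)}
LAYOUT = "WYBYYYYYYYWWWYYY"  # assumed pixel order, row by row

# Assumed encoding: a run-length byte followed by three color bytes.
ENCODED = bytes([
    1, 255, 255, 255,  # 1 white
    1, 255, 255, 0,    # 1 yellow
    1, 0, 0, 0,        # 1 black
    7, 255, 255, 0,    # 7 yellow
    3, 255, 255, 255,  # 3 white
    3, 255, 255, 0,    # 3 yellow
])


def rle_decode(data, pixel_size=3):
    """Expand each (run length, color) group back into repeated pixels."""
    out = bytearray()
    step = 1 + pixel_size
    for i in range(0, len(data), step):
        run_length = data[i]
        out += data[i + 1:i + step] * run_length
    return bytes(out)


original = bytes(channel for letter in LAYOUT for channel in COLORS[letter])
decoded = rle_decode(ENCODED)
print(decoded == original)  # True
```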

 (02:00) to (04:00)

 Dictionary Compression (2:56)

Let's take a look at another type of lossless compression, where blocks of data are replaced by more compact representations. This is sort of like Don't Forget to be Awesome being replaced by DFTBA. To do this, we need a dictionary that stores the mapping from codes to data.

Let's see how this works for our example. We can view our image as not just a string of individual pixels but also as blocks of data. For simplicity, we're going to use pixel pairs, which are six bytes long each, but blocks can be any size.

In our example, there are only four pairings, white-yellow, black-yellow, yellow-yellow, and white-white. Those are the data blocks in our dictionary we want to generate compact codes for.

What's interesting is that these blocks occur at different frequencies. There are four yellow-yellow pairs, two white-yellow pairs, and one each of black-yellow and white-white.
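Counting those blocks is easy to sketch in Python. Each letter stands in for one three-byte pixel, so a two-letter string represents a six-byte pair. The pixel order in `LAYOUT` is an assumed reconstruction that matches the counts given here.

```python
from collections import Counter

LAYOUT = "WYBYYYYYYYWWWYYY"  # assumed pixel order: W = white, Y = yellow, B = black

# Split the 16 pixels into 8 non-overlapping pairs.
pairs = [LAYOUT[i:i + 2] for i in range(0, len(LAYOUT), 2)]
frequencies = Counter(pairs)

for block, count in frequencies.most_common():
    print(block, count)
```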

Because yellow-yellow is the most common block, we want that to be replaced by the most compact representation. On the other hand, black-yellow and white-white can be replaced by something longer because those blocks are infrequent.

 (04:00) to (06:00)

 Huffman Trees (3:51)

One method for generating efficient codes is building a Huffman tree, invented by David Huffman while he was a student at MIT in the 1950s. His algorithm goes like this:

First, you lay out all the possible blocks and their frequencies. At every round, you select the two with the lowest frequencies. Here, that's black-yellow and white-white, each with a frequency of one.

You combine these into a little tree, which has a combined frequency of two, so we record that. And now, one step of the algorithm is done.

Now we repeat the process. This time we have three things to choose from. Just like before, we select the two with the lowest frequency, put them into a little tree, and record the new total frequency of all the subitems.

Ok, we're almost done. This time, it is easy to select the two items with the lowest frequency because there are only two things left to pick. We combine these into a tree and now we're done.

Our tree looks like this, and it has a very cool property. It's arranged by frequency, with less common items lower down.
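The rounds described above can be sketched in Python with a priority queue. This is one possible way to write it, not the only one: `build_huffman_tree` is a hypothetical name, leaves are plain strings, and each little tree is a `(left, right)` tuple. The tie-breaking counter is an assumption added so that items with equal frequency come out in insertion order.

```python
import heapq
from itertools import count


def build_huffman_tree(frequencies):
    """Repeatedly merge the two least frequent items into a little tree."""
    tiebreak = count()  # keeps ordering stable when frequencies are equal
    heap = [(freq, next(tiebreak), block) for block, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_a, _, left = heapq.heappop(heap)   # lowest frequency
        freq_b, _, right = heapq.heappop(heap)  # second lowest
        # Record the combined frequency on the new little tree.
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), (left, right)))
    return heap[0][2]


frequencies = {"YY": 4, "WY": 2, "BY": 1, "WW": 1}
tree = build_huffman_tree(frequencies)
print(tree)  # ('YY', ('WY', ('BY', 'WW')))
```

The nested tuples show the property mentioned above: the rare blocks, black-yellow and white-white, end up deepest in the tree.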

 Dictionary Creation (4:43)

So now we have a tree, but you may be wondering how this gets us to a dictionary. Well, we use our frequency sorted tree to generate the codes we need by labeling each branch with a 0 or a 1, like so.

With this, we can write out our code dictionary. Yellow-yellow is encoded as just a single 0, white-yellow is encoded as 10, black-yellow is 110, and finally white-white is 111.

The really cool thing about these code words is that there's no way to have conflicting codes, because each path down the tree is unique. This means our codes are prefix-free, that is, no code starts with another complete code.
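As a rough sketch, here's how the codes could be read off a tree in Python, along with a check that no code starts with another complete code. The tree is written out as nested `(left, right)` tuples, and labeling left branches 0 and right branches 1 is an assumption chosen to match the codes listed above. `assign_codes` and `is_prefix_free` are hypothetical helper names.

```python
from itertools import permutations

# Leaves are pixel pairs; each tuple is a (left, right) branch.
TREE = ("YY", ("WY", ("BY", "WW")))


def assign_codes(node, prefix=""):
    """Walk the tree, adding 0 for a left branch and 1 for a right branch."""
    if isinstance(node, str):
        return {node: prefix or "0"}  # a one-leaf tree still needs one bit
    left, right = node
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes


def is_prefix_free(codes):
    """True if no code word is the start of a different code word."""
    words = list(codes.values())
    return not any(b.startswith(a) for a, b in permutations(words, 2))


codes = assign_codes(TREE)
print(codes)                  # {'YY': '0', 'WY': '10', 'BY': '110', 'WW': '111'}
print(is_prefix_free(codes))  # True
```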

 Image Compression (5:15)

Now let's return to our image data and compress it. Our first pixel pair, white-yellow, is replaced by the bits 10. The next pair is black-yellow, which is replaced by 110. Next is yellow-yellow, with the incredibly compact substitution of just 0, and this process repeats for the rest of the image.

So instead of 48 bytes of image data, this process has encoded it into 14 bits, not bytes, bits. That's less than two bytes of data, but don't break out the champagne quite yet. This data is meaningless unless we also save our code dictionary, so we'll need to append it to the front of the image data like this.
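The substitution can be sketched in Python using a string of "0" and "1" characters to stand in for real bits. The pixel order in `LAYOUT` is an assumed reconstruction that matches the first three pairs and the counts given in this episode, and `decode` is a hypothetical helper showing that the dictionary is all you need to get the pairs back.

```python
CODES = {"YY": "0", "WY": "10", "BY": "110", "WW": "111"}
LAYOUT = "WYBYYYYYYYWWWYYY"  # assumed pixel order: W = white, Y = yellow, B = black

# Replace each pixel pair with its code word.
pairs = [LAYOUT[i:i + 2] for i in range(0, len(LAYOUT), 2)]
bits = "".join(CODES[pair] for pair in pairs)
print(bits, len(bits))  # 10110000111100 14


def decode(bits, codes):
    """Read bits one at a time until they match a complete code word."""
    lookup = {code: block for block, code in codes.items()}
    blocks, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # unambiguous because the codes are prefix-free
            blocks.append(lookup[current])
            current = ""
    return "".join(blocks)


print(decode(bits, CODES) == LAYOUT)  # True
```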

Now including the dictionary, our image data is 30 bytes long. That's still a significant improvement over 48 bytes.

 (06:00) to (08:00)

 (08:00) to (10:00)

 (10:00) to (12:00)

 (12:00) to (12:48)
