Even though proteins are fundamental to life, it’s hard to predict what they look like. But two independent groups announced that they’d cracked it, and it’s all thanks to some seriously clever artificial intelligence.

Hosted by: Hank Green

Huge thanks go to the following Patreon supporters for helping us keep SciShow free for everyone forever:

Chris Peters, Matt Curls, Kevin Bealer, Jeffrey Mckishen, Jacob, Christopher R Boucher, Nazara, charles george, Christoph Schwanke, Ash, Silas Emrys, Eric Jensen, Adam, Brainard, Piya Shedden, Alex Hackman, James Knight, GrowingViolet, Sam Lutfi, Alisa Sherbow, Jason A Saslow, Dr. Melvin Sanicas, Melida Williams


[♪ INTRO].

There is an unsolved mystery at the heart of biology that’s been slowing progress in medicine for half a century. Whether you’re a biochemist trying to understand life, or a drug designer trying to save lives, you may have run into the protein folding problem.

That is, despite the fact that proteins are fundamental to life, it’s really hard to predict what they look like. But on the 15th of July 2021, two independent groups announced that they’d cracked it, and it’s all thanks to some seriously clever artificial intelligence. Which is exciting, because it could eventually lead to breakthroughs in the fights against cancer and COVID-19, and even toxic waste.

See, proteins are the building blocks of life. Everything in your body, and everything in organisms all the way down to viruses, is made either from or by proteins. Proteins transport oxygen in your blood, they digest foods, copy DNA, fight infections, build structures... it’s not just the things you eat to gain muscles here!

In fact, DNA, which is the code that makes you you, boils down to a series of instructions for making proteins. And proteins are made from a set of twenty building block molecules called amino acids. If proteins are like words, amino acids are the alphabet they’re made from.
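The idea that DNA spells out proteins can be sketched in a few lines of Python. This is a simplified illustration, not how a cell or any real bioinformatics tool does it: it assumes the input is the coding strand of DNA, skips the RNA step entirely, and uses the standard genetic code written as one-letter amino acid symbols. The names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# Each letter is one amino acid; "*" marks a stop signal.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-letter DNA "word" (codon) to its amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read DNA three letters at a time and return the amino acid string."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein chain ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```

Getting this string of letters is the easy part. The hard part, which the rest of the episode is about, is working out what shape that string folds into.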

When your body makes proteins, it reads instructions from your DNA to make long strings of amino acids, which fold in on themselves in particular ways, and settle into specific shapes. That shape determines how the protein works, because proteins need to fit together with other proteins like puzzle pieces, or latch onto specific molecules. And that makes proteins different from something like DNA, where knowing the sequence is the same as knowing what it does.

For a protein, the shape matters too. The neat thing, though, is that the way the amino acid string folds is determined by the sequence. If you know that, you should be able to work out the protein’s final shape or shapes.

And it’s easy enough to figure out what the sequence should be, because that’s given to you by the DNA -- and we can read the genetic code. But working out the shape is way harder. A protein can be made of anywhere from fifty to two thousand amino acids.

And each amino acid has a slightly different chemical structure, adding to the complexity. Individual parts of the amino acids can interact with all the other nearby amino acids, and even some of the faraway ones, pushing the folding in seemingly random directions. It still ends up making a useful structure that is the same every time, but it is very hard to predict how it gets there.

So even though we know the amino acid sequences of billions of proteins, we’re still stuck stumbling in the dark when it comes to working out their shapes. One way we currently study protein structure is using x-ray crystallography. A solution that contains proteins is slowly evaporated so that the proteins left behind form crystals.

X-rays are fired at the crystals to get images, and those images are assembled into a 3D model. Another method uses nuclear magnetic resonance imaging, the same tech used in hospitals to image the human body. These processes are time-consuming: they can take days or even years depending on the protein.

But in 2020, a team from the London-based, Google-owned company DeepMind made a stunning announcement. They claimed that their new AI algorithm, AlphaFold2, could predict the folded shape of proteins from amino acid sequences about as well as experimental methods.

DeepMind had previously made a name for itself by making AIs that can beat world champions in chess, go, shogi, and even StarCraft 2. But the game-playing AIs were just a warm-up for the real challenge of doing science.

The proof of their claim came from the results of a competition held in 2020 called CASP14. CASP is a competition held once every two years for people trying to solve the protein folding problem with computers. In the competition, teams are given the amino acid sequences of about 100 proteins with known structures.

They’ve been worked out with experiments, like x-ray crystallography, but the structures haven’t been revealed publicly. The teams then predict what the structure of the protein will look like after folding, and independent judges compare the predictions to the experiments. Before 2020, no team came close to experimental accuracy.

Not even DeepMind themselves when they entered in 2018 with their original iteration of AlphaFold. But 2020 was different, and not just because the competition involved proteins from a new virus called SARS-CoV-2! In CASP14, AlphaFold2 cracked the problem wide open: two-thirds of their predictions were about as good as experiments.

And it’s not like the experiments were perfect, either. For some predictions where AlphaFold2 disagreed with the experiments, it wasn’t actually clear which of the two was closer to the real structure, since both predictions and experiments always come with a certain margin of error. Many of AlphaFold2’s predictions were remarkably precise.

For some, the margin of error was the size of a typical atom. In other words, they guessed the exact spot where each of the thousands of individual atoms was placed in the overall structure. It does not get much better than that!
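One common way to put a number on the gap between a predicted structure and a measured one is root-mean-square deviation, or RMSD: the typical distance between where the prediction puts each atom and where the experiment found it. The sketch below is a simplified illustration with made-up coordinates. It assumes the two structures have already been lined up on top of each other, and it is not the scoring that the CASP judges use, which relies on more elaborate measures.

```python
import math


def rmsd(predicted, measured):
    """Root-mean-square deviation between two lists of (x, y, z) atom positions.

    Assumes both lists hold the same atoms in the same order and that the
    structures are already superimposed.
    """
    if len(predicted) != len(measured):
        raise ValueError("both structures need the same number of atoms")
    squared = sum(math.dist(p, m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))


# Hypothetical coordinates, in ångströms (one ångström is roughly atom-sized).
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
measured = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.5, 1.0)]

print(rmsd(predicted, measured))
```

Here every atom is off by exactly one ångström, so the RMSD comes out to 1.0, which is the "size of a typical atom" kind of error described above.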

When the results were announced, scientists were stunned by the breakthrough. Some had thought they would never see the problem solved in their lifetimes, and called it a ‘holy grail’ of sorts. Basically, AlphaFold2 and the DeepMind team had kind of ‘solved’ the protein folding problem.

But only kind of. The best was yet to come. DeepMind gave a short presentation about their algorithm, which was seen by a group from the University of Washington, Seattle, who then applied what they saw to their own work on the problem.

Working with an international team of collaborators, they developed their own algorithm, which they called RoseTTAFold. They combined the DeepMind team’s ideas with some of their own, and made a program that was actually better than AlphaFold in a few ways. Finally, on the same day in July 2021, both DeepMind and Seattle teams published their full methods, and made the code free for academics to use.

In a day, we went from having no good way to predict protein structure on a computer to having two different, highly-accurate options. And there are already plenty of ideas for how to use them. Firstly, DeepMind have published a paper with predictions for the structure of almost all human proteins, but there’s more, too.

Scientists are always trying to design drugs that stop proteins from binding to other proteins. Which is useful if those proteins are made by a cancerous cell, or, y’know, the virus that causes COVID-19. That process often relies on being able to modify and design new proteins.

Being able to see the protein you’re designing on a computer without needing to go out and make it is hugely useful. It could potentially streamline the process from taking years to taking days. And there’s more exotic ideas for new proteins out there, too.

Proteins in our bodies break down harmful chemicals and make useful byproducts all the time, so imagine making an artificial protein that can break down toxic waste, or produce biofuels. So, that’s what they did.

Now the question is: how did the DeepMind and Seattle teams bring home this holy grail of biochemistry? If you can’t tell, I, personally, am very excited about this because it’s what I worked on in research and it was so hard and I’m so excited to see it get easier. Both methods use what’s called deep learning and neural networks.

A neural network contains layers of interconnected nodes, or artificial ‘neurons’ made of data, which use algorithms to mimic the way biological neurons in our brains send signals back and forth to one another. Deep learning is a subset of machine learning, and it’s basically a neural network with at least three layers of nodes: an input layer, one or more "hidden" intermediary layers, and an output layer. The idea is that with all these different layers working together, the program can learn to recognize and identify objects in a dataset.

A classic example of how this works is training a neural network to look at an image and work out whether it includes a cat. You could think of each node in a neural network as its own tiny mathematical model, using an equation to predict an outcome. This prediction will be the output of the node, and this data will be sent along as the input data to another node in the next layer of the neural network.
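To make the "tiny mathematical model" idea concrete, here is a minimal sketch of a node and a layer in plain Python. Everything in it is an assumption for illustration: the weights are arbitrary numbers rather than learned values, the three "pixels" stand in for a whole image, and real networks like the ones in this story are vastly larger and built differently.

```python
import math


def node(inputs, weights, bias):
    """One artificial neuron: a weighted sum squashed into the range 0 to 1."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid function


def layer(inputs, weight_rows, biases):
    """A layer is a group of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Input layer: three made-up pixel brightness values.
pixels = [0.2, 0.9, 0.4]

# Hidden layer: two nodes, each with its own (arbitrary) weights.
hidden = layer(pixels, [[0.5, -0.3, 0.8], [-0.6, 0.1, 0.4]], [0.0, 0.1])

# Output layer: one node that reads the hidden layer's outputs.
cat_score = node(hidden, [1.2, -0.7], -0.2)

print(cat_score)  # a number between 0 and 1, read as "how cat-like is this?"
```

The output of each node in one layer becomes an input to the nodes in the next, which is all that "sent along to the next layer" means here.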

This process repeats over and over again. In the cat image example, you could use a set of correctly labeled training data to help “train” the algorithm. Like, say, thousands of images, some with cats, some without, each labeled as containing a cat or not.

The algorithm makes its predictions and checks for errors against the known dataset. Then, it can use the data from the errors to adjust its predictions. And over time, it can gradually become more accurate.
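That predict, check, adjust loop can be sketched with a single neuron and a toy dataset. This is a minimal illustration, assuming one made-up numeric feature per example instead of a real image and a label of 1 for "cat" and 0 for "no cat". The adjustment rule is plain gradient descent, swapped in here as the simplest stand-in; the function names, learning rate, and number of passes are all arbitrary choices for the example.

```python
import math


def predict(x, w, b):
    """Single-neuron prediction: the probability that the example is a cat."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(examples, passes=2000, learning_rate=0.5):
    """Repeatedly predict, measure the error, and nudge the weights."""
    w, b = 0.0, 0.0
    for _ in range(passes):
        for x, label in examples:
            error = predict(x, w, b) - label  # how wrong was the guess?
            w -= learning_rate * error * x    # adjust to shrink the error
            b -= learning_rate * error
    return w, b


# Hypothetical labeled training data: (feature value, 1 = cat / 0 = no cat).
examples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

w, b = train(examples)
for x, label in examples:
    print(x, label, round(predict(x, w, b), 2))
```

Before training, every prediction is exactly 0.5, a coin flip. After training, the examples labeled 0 get scores below 0.5 and the ones labeled 1 get scores above it.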

Eventually, you could give it testing data, asking it to identify cats in images it hasn’t been trained on, and checking if it’s right. So, applying this to protein-folding, the training and testing data consisted of correctly folded proteins. The algorithms train themselves by looking at these proteins and trying to fold the proteins themselves to match what they see.

Now, for the DeepMind and Seattle teams, that training and testing data came from the Protein Data Bank, which has over 180,000 protein structures and their amino acid sequences. Now at the basic level, most of the CASP14 entries also used a deep learning approach, but the DeepMind and Seattle teams had some clever tricks up their sleeves. See, this is a more complex task than looking at cat pictures.

The teams had to do more than just put in previously-folded proteins and hope the algorithm learned how to fold more. So when the researchers built their algorithms, they needed to make adjustments specific to the world of protein folding, which they needed deep biochemistry expertise to do. For example, biochemists can predict on a more general level if an amino acid sequence is likely to form a structure like a coil or a flat sheet.

And they can use that knowledge to guide the neural network to its predictions. In fact, AlphaFold uses multiple neural networks that feed into each other in two stages. AlphaFold starts with a network that reads and folds the amino acid sequence and adjusts how far apart pairs of amino acids are in the overall structure.

And that’s followed by a structure model network that reads the whole thing, creates a 3D structure, and makes adjustments at the end. One of RoseTTAFold’s innovations was to add a simultaneous third neural network: one that tracks where the amino acids are in 3D space as the structure folds, alongside the 1D and 2D information. On top of that, protein-folding researchers have started to look at evolutionary history of closely-related proteins to work out their structures.

If you know how closely related two proteins are, and you know the structure of one of them, you can make good guesses about the structure of the other. The two teams both took advantage of this idea. RoseTTAFold isn’t quite as accurate as AlphaFold, but it works using way less computing power and time, working out in minutes what AlphaFold needs hours to do.

Because the Seattle team didn’t have the immense computing power of Google at their disposal to do the calculations. But RoseTTAFold can actually do things AlphaFold can’t do yet. For instance, AlphaFold struggles with the crucial but difficult subject of protein complexes, where multiple proteins interact and their structures depend on which specific proteins are present.

Many proteins require partners to function, so understanding protein complexes is part of understanding proteins. But RoseTTAFold can handle proteins where the amino acid chain is broken into more than one piece, and that same logic can be used to study interactions of different proteins with each other, which is needed to analyze complexes. So both methods have interesting things to bring to the table.

However, no one here is envisioning replacing the biologists who study protein structures. I mean, the AIs wouldn’t have gotten far without the massive databases of protein structures people have built up over decades. And the AI is only useful because it will let humans do more stuff.

Like quickly designing and testing artificial proteins. With that power, biochemists will create exotic proteins we can only dream of, and biologists will be able to explore the history and nature of life like never before. With all of this potential, the future of biology is starting to take shape.

Thank you for watching this episode of SciShow, which was brought to you as always with the help of our amazing patrons. We would not be able to make ten minute long videos about the overlap between chemical biology and computer science without you, and we appreciate the heck out of your support. If you’d like to get involved making things like this possible, it’s so easy to do that.

You just have to go to [♪ OUTRO].