# crashcourse

Training Neural Networks: Crash Course AI #4

### Statistics

| Statistic | Value |
| --- | --- |
| View count | 153,645 |
| Likes | 4,206 |
| Dislikes | 50 |
| Comments | 134 |
| Duration | 12:29 |
| Uploaded | 2019-08-30 |
| Last sync | 2023-01-01 00:00 |

Today we’re going to talk about how neurons in a neural network learn by getting their math adjusted, called backpropagation, and how we can optimize networks by finding the best combinations of weights to minimize error. Then we’ll send John Green Bot into the metaphorical jungle to find where this error is the smallest, known as the global optimal solution, compared to just where it is relatively small, called local optimal solutions, and we'll discuss some strategies we can use to help neural networks find these optimized solutions more quickly.

Crash Course is produced in association with PBS Digital Studios

https://www.youtube.com/pbsdigitalstudios

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Eric Prestemon, Sam Buck, Mark Brouwer, Timothy J Kwist, Brian Thomas Gossett, Haxiang N/A Liu, Jonathan Zbikowski, Siobhan Sabino, Zach Van Stanley, Bob Doye, Jennifer Killen, Nathan Catchings, Brandon Westmoreland, dorsey, Indika Siriwardena, Kenneth F Penttinen, Trevin Beattie, Erika & Alexa Saur, Justin Zingsheim, Jessica Wode, Tom Trval, Jason Saslow, Nathan Taylor, Khaled El Shalakany, SR Foxley, Sam Ferguson, Yasenia Cruz, Eric Koslow, Caleb Weeks, Tim Curwick, David Noe, Shawn Arnold, William McGraw, Andrei Krishkevich, Rachel Bright, Jirat, Ian Dundore

--

Want to find Crash Course elsewhere on the internet?

Facebook - http://www.facebook.com/YouTubeCrashCourse

Twitter - http://www.twitter.com/TheCrashCourse

Tumblr - http://thecrashcourse.tumblr.com

Support Crash Course on Patreon: http://patreon.com/crashcourse

CC Kids: http://www.youtube.com/crashcoursekids

#CrashCourse #ArtificialIntelligence #MachineLearning


Hey, I’m Jabril and welcome to Crash Course AI!

One way to make an artificial brain is by creating a neural network, which can have millions of neurons and billions (or trillions) of connections between them. Nowadays, some neural networks are fast and big enough to do some tasks even better than humans can, like for example playing chess or predicting the weather!

But as we’ve talked about in Crash Course AI, neural networks don’t just work on their own. They need to learn to solve problems by making mistakes. Sounds kind of like us, right?

INTRO

Neural networks handle mistakes using an algorithm called backpropagation, which makes sure all the neurons that contributed to an error get their math adjusted. We’ll unpack this a bit later. Neural networks have two main parts: the architecture and the weights. The architecture includes neurons and their connections.

And the weights are numbers that fine-tune how the neurons do their math to get an output. So if a neural network makes a mistake, this often means that the weights aren’t adjusted correctly and we need to update them so they make better predictions next time. The task of finding the best weights for a neural network architecture is called optimization.

And the best way to understand some basic principles of optimization is with an example, with the help of my pal John Green-bot. Say that I manage a swimming pool, and I want to predict how many people will come next week, so that I can schedule enough lifeguards. A simple way to do this is by graphing some data points, like the number of swimmers and the temperature in Fahrenheit for every day over the past few weeks.

Then, we can look for a pattern in that graph to make predictions. A way computers do this is with an optimization strategy called linear regression. We start by drawing a random straight line on the graph, which kind of fits the data points.

To optimize though, we need to know how incorrect this guess is. So we calculate the distance between the line and each of the data points, add it all up, and that gives us the error. We’re quantifying how big of a mistake we made.

The goal of linear regression is to adjust the line to make the error as small as possible. We want the line to fit the training data as much as it can. The result is called the line of best fit.
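As a rough sketch of the idea above, the Python below measures the error of a guessed line and then computes a line of best fit. The data points are made up for illustration, and the least-squares formula is an assumption: the episode only says to add up the distances, while this common variant minimizes the squared distances.

```python
# Hypothetical data, made up for illustration: (temperature in °F, swimmers that day).
data = [(60, 22), (70, 43), (80, 72), (90, 93)]


def total_error(slope, intercept, points):
    """Add up the vertical distance between the line and every data point."""
    return sum(abs((slope * x + intercept) - y) for x, y in points)


def line_of_best_fit(points):
    """Least-squares fit (an assumption: it minimizes the squared distances)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


guess_error = total_error(1.0, 0.0, data)      # an arbitrary first guess at the line
slope, intercept = line_of_best_fit(data)
fitted_error = total_error(slope, intercept, data)
cold_day_prediction = slope * 40 + intercept   # prediction for a 40 °F day
```

With these invented numbers the fitted line has a much smaller error than the first guess, and its prediction for a 40 °F day comes out negative, which is the kind of result a straight line cannot avoid.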

We can use this straight line to predict how many swimmers will show up for any temperature, but parts of it defy logic. For example, super cold days have a negative number, while dangerously hot days have way more people than the pool can handle. To get more accurate results, we might want to consider more than two features, for example adding the humidity, which would turn our 2D graph into 3D.

And our line of best fit would be more like a plane of best fit. But if we added a fourth feature, like whether it’s raining or not, suddenly we can’t visualize this anymore. So as we consider more features, we add more dimensions to the graph, the optimization problem gets trickier, and fitting the training data is tougher.

This is where neural networks come in handy. Basically, by connecting together many simple neurons with weights, a neural network can learn to solve complicated problems, where the line of best fit becomes a weird multi-dimensional function. Let’s give John Green-bot an untrained neural network.

To stick with the same example, the input layer of this neural network takes features like temperature, humidity, rain, and so on. And the output layer predicts the number of swimmers that will come to the pool. We’re not going to worry about designing the architecture of John Green-bot’s neural network right now.

Let’s just focus on the weights. He’ll start, as always, by setting the weights to random numbers, like the random line on the graph we drew earlier. Only this time, it’s not just one random line.

Because we have lots of inputs, it’s lots of lines that are combined to make one big, messy function. Overall, this neural network’s function resembles some weird multi-dimensional shape that we don’t really have a name for. To train this neural network, we’ll start by giving John Green-bot a bunch of measurements from the past 10 days at the swimming pool, because these are the days where we also know the output attendance.

We’ll start with one day, where it was 80 degrees Fahrenheit, 65% humidity, and not raining (which we’ll represent with 0). The neurons will do their thing by multiplying those features by the weights, adding the results together, and passing information to the hidden layers until the output neuron has an answer. What do you think, John Green-bot?

John Green-bot: 145 people were at the pool!

Just like before, there is a difference between the neural network’s output and the actual swimming pool attendance -- which was recorded as 100 people. Because we just have one output neuron, that difference of 45 people is the error.
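To make the multiply-and-add step concrete, here is a minimal sketch of a forward pass through a tiny network. Everything about the network is an assumption: the episode does not give John Green-bot's architecture or weights, so the two hidden neurons, the ReLU activation, and the weights below are hypothetical, picked only so that the toy network reproduces the 145 from the example.

```python
def relu(value):
    """A common activation function (an assumption; the episode does not name one)."""
    return max(0.0, value)


def neuron(inputs, weights, bias=0.0):
    """Multiply each input by its weight and add the results together."""
    return sum(i * w for i, w in zip(inputs, weights)) + bias


features = [80.0, 65.0, 0.0]  # temperature (°F), humidity (%), raining (0 = no)

# Hypothetical weights, chosen only so this toy network outputs 145.
hidden_weights = [[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]]
output_weights = [1.0, 1.0]

hidden = [relu(neuron(features, weights)) for weights in hidden_weights]
prediction = neuron(hidden, output_weights)   # the output neuron's answer
actual = 100.0                                # recorded attendance
error = prediction - actual                   # how far off the guess was
```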

Pretty simple. In some neural networks though, the output layer may have a lot of neurons. So the difference between the predicted answer and the correct answer is more than just one number.

In these cases, the error is represented by what’s known as a loss function. Moving forward, we need to adjust the neural network’s weights so that the next time we give John Green-bot similar inputs, his math and final output will be more accurate. Basically, we need John Green-bot to learn from his mistakes, a lot like when we pushed a button to supervise his learning when he had the perceptron program.
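One common way to boil several output errors down to a single number is mean squared error. The sketch below is only an illustration: the episode does not say which loss function is used, and the three-output network (morning, afternoon, and evening swimmers) and its numbers are hypothetical.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and correct answers."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical network with three output neurons: morning, afternoon, evening swimmers.
predicted = [40.0, 70.0, 35.0]
actual = [30.0, 50.0, 20.0]
loss = mean_squared_error(predicted, actual)   # (100 + 400 + 225) / 3

# With a single output neuron, the same function still works.
single_output_loss = mean_squared_error([145.0], [100.0])   # 45 squared
```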

But this is trickier because of how complicated neural networks are. To help neural networks learn, scientists and mathematicians came up with an algorithm called backpropagation of the error, or just backpropagation. The basic goal is to look at the loss function and then assign blame to neurons back in the previous layers of the network.

Some neurons’ calculations may have been more to blame for the error than others, so their weights will be adjusted more. This information is fed backwards, which is where the idea of backpropagation comes from. So for example, the error from our output neuron would go back a layer and adjust the weights that get applied to our hidden layer neuron outputs.

And the error from our hidden layer neurons would go back a layer and adjust the weights that get applied to our features. Remember: our goal is to find the best combination of weights to get the lowest error. To explain the logic behind optimization with a metaphor, let’s send John Green-bot on a metaphorical journey through the Thought Bubble.
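Before the metaphor, the backward flow of blame described above can be sketched for a very small network. This is a minimal illustration under stated assumptions, not the episode's own implementation: two inputs, two sigmoid hidden neurons, one linear output neuron, a squared-error loss, and arbitrary starting weights. The inputs are rescaled (temperature 0.80, humidity 0.65, target 1.0 meaning one hundred swimmers) purely to keep the numbers small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> sigmoid hidden layer -> one linear output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, learning_rate=0.1):
    """One round of blame assignment and weight updates for a single example."""
    hidden, output = forward(x, w_hidden, w_out)
    error = output - target
    # Output layer: each weight is blamed in proportion to the hidden output it scaled.
    grad_out = [error * h for h in hidden]
    # Hidden layer: the error flows back through the old output weights
    # and through the slope of the sigmoid, h * (1 - h).
    hidden_blame = [error * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[blame * xi for xi in x] for blame in hidden_blame]
    new_w_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    new_w_hidden = [[w - learning_rate * g for w, g in zip(row, grads)]
                    for row, grads in zip(w_hidden, grad_hidden)]
    return new_w_hidden, new_w_out, 0.5 * error ** 2


# Hypothetical, rescaled training example and arbitrary starting weights.
x, target = [0.80, 0.65], 1.0
w_hidden = [[0.5, -0.3], [0.2, 0.4]]
w_out = [0.3, -0.1]

losses = []
for _ in range(200):
    w_hidden, w_out, loss = backprop_step(x, target, w_hidden, w_out)
    losses.append(loss)
```

In this sketch the loss recorded in `losses` shrinks as the updates repeat, because weights that contributed more to the error receive larger adjustments.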

Let’s imagine that weights in our neural network are like latitude and longitude coordinates on a map. And the error of our neural network is the altitude -- lower is better. John Green-bot the explorer is on a quest to find the lowest point in the deepest valley.

The latitude and longitude of that lowest point -- where the error is the smallest -- are the weights of the neural network’s global optimal solution. But John Green-bot has no idea where this valley actually is. By randomly setting the initial weights of our neural network, we’re basically dumping him in the middle of the jungle.

All he knows is his current latitude, longitude, and altitude. Maybe we got lucky and he’s on the side of the deepest valley. But he could also be at the top of the highest mountain far away.

The only way to know is to explore! Because the jungle is so dense, it’s hard to see very far. The best John Green-bot can do is look around and make a guess.

He notices that he can descend down a little by moving northeast, so he takes a step down and updates his latitude and longitude. From this new position, he looks around and picks another step that decreases his altitude a little more. And then another… and another.

With every brave step, he updates his coordinates and decreases his altitude. Eventually, John Green-bot looks around and finds that he can’t go down anymore. He celebrates, because it seems like he found the lowest point in the deepest valley!

Or... so he thinks. If we look at the whole map, we can see that John Green-bot only found the bottom of a small gorge when he ran out of “down.” It’s way better than where he started, but it’s definitely not the lowest point of the deepest valley. So he just found a local optimal solution, where the weights make the error relatively small, but not the smallest it could be.
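The explorer's walk can be sketched as gradient descent on a landscape with a single weight. The error function below is hypothetical, chosen only because it has one shallow valley and one deep valley; `slope` is its derivative, which tells the explorer which way is downhill.

```python
def error(w):
    """Hypothetical one-weight error landscape with two valleys."""
    return w**4 - 3 * w**2 + w


def slope(w):
    """Derivative of error(w): which way is downhill, and how steep it is."""
    return 4 * w**3 - 6 * w + 1


def descend(w, learning_rate=0.01, steps=1000):
    """Repeatedly take a small step downhill from the starting weight."""
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


stuck = descend(2.0)     # settles near w = 1.13, the small gorge
lowest = descend(-2.0)   # settles near w = -1.30, the deepest valley
```

Both runs stop where there is no more "down", but only the second one ends at the lower altitude. Which valley the explorer finds depends entirely on where he was dropped.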

Sorry, buddy. Thanks, Thought Bubble. Backpropagation and learning always involve lots of little steps, and optimization is tricky with any neural network.

If we go back to our example of optimization as exploring a metaphorical map, we’re never quite sure if we’re headed in the right direction or if we’ve reached the lowest valley with the smallest error -- again that’s the global optimal solution. But tricks have been discovered to help us better navigate. For example, when we drop an explorer somewhere on the map, they could be really far from the lowest valley, with a giant mountain range in the way.

So it might be a good idea to try different random starting points to be sure that the neural network isn’t getting stuck at a locally optimal solution. Or instead of restarting over and over again, we could have a team of explorers that start from different locations and explore the jungle simultaneously. This strategy of exploring different solutions at the same time on the same neural network is especially useful when you have a giant computer with lots of processors.
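A minimal sketch of the random-restart idea follows, again on a hypothetical one-weight error landscape with a shallow valley and a deep one. The seed, the number of explorers, and the starting range are all arbitrary assumptions. The runs here happen one after another, but they are independent of each other, so in principle they could be spread across processors.

```python
import random


def error(w):
    """Hypothetical one-weight error landscape with two valleys."""
    return w**4 - 3 * w**2 + w


def slope(w):
    """Derivative of error(w)."""
    return 4 * w**3 - 6 * w + 1


def descend(w, learning_rate=0.01, steps=1000):
    """Walk downhill from one starting point until the steps run out."""
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


rng = random.Random(0)  # fixed seed so the sketch is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]   # ten explorers
finishes = [descend(start) for start in starts]
best = min(finishes, key=error)   # keep whichever explorer ended up lowest
```

Some explorers end in the shallow valley and some in the deep one; keeping the lowest finish is what protects against an unlucky starting point.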

And we could even adjust the explorer’s step size, so that they can step right over small hills as they try to find and descend into a valley. This step size is called the learning rate, and it’s how much the neuron weights get adjusted every time backpropagation happens. We’re always looking for more creative ways to explore solutions, try different combinations of weights, and minimize the loss function as we train neural networks.
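To illustrate the effect of step size, the sketch below starts two explorers at the same spot on a hypothetical one-weight error landscape and changes only the learning rate. The function and the two rates are assumptions picked for illustration; on a different landscape the same rates could behave very differently, and a step that is too large can overshoot every valley.

```python
def error(w):
    """Hypothetical one-weight error landscape with two valleys."""
    return w**4 - 3 * w**2 + w


def slope(w):
    """Derivative of error(w)."""
    return 4 * w**3 - 6 * w + 1


def descend(w, learning_rate, steps=1000):
    """Gradient descent: the learning rate scales every step."""
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w


start = 2.0
small_steps = descend(start, learning_rate=0.01)   # creeps into the nearby shallow valley
bigger_steps = descend(start, learning_rate=0.1)   # first step clears the hill between valleys
```

With the smaller rate the explorer settles near w = 1.13, while the larger rate carries him over the hill and he settles near w = -1.30, where the error is lower.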

But even if we use a bunch of training data and backpropagation to find the global optimal solution… we’re still only halfway done. The other half of training an AI is checking whether the system can answer new questions. It’s easy to solve a problem we’ve seen before, like taking a test after studying the answer key.

We may get an A, but we didn’t actually learn much. To really test what we’ve learned, we need to solve problems we haven’t seen before. Same goes for neural networks.

This whole time, John Green-bot has been training his neural network with swimming pool data. His neural network has dozens of features like temperature, humidity, rain, day of the week, and wind speed… but also grass length, number of butterflies around the pool, and the average GPA of the lifeguards. More data can be better for finding patterns and accuracy, as long as the computer can handle it!

Over time, backpropagation will adjust the neuron weights, so that neural network’s output matches the training data. Remember, that’s called fitting to the training data, and with this complicated neural network, we’re looking for a multi-dimensional function. And sometimes, backpropagation is too good at making a neural network fit to certain data.

See, there are lots of coincidental relationships in big datasets. For example, the divorce rate in Maine may be correlated with U.S. margarine consumption, or skiing revenue may be correlated with the number of people dying by getting trapped in their bed sheets.

Neural networks are really good at finding these kinds of relationships. And it can be a big problem, because if we give a neural network some new data that doesn’t adhere to these silly correlations, then it will probably make some strange errors. That’s a danger known as overfitting.
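Here is a small, deliberately exaggerated sketch of that danger, using simple line fitting instead of a neural network. All of the numbers are invented: in the hypothetical training days the butterfly count happens to track attendance perfectly, while on the new days it does not.

```python
def fit_line(xs, ys):
    """Least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_abs_error(line, xs, ys):
    """Average distance between the line's predictions and the real values."""
    slope, intercept = line
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def column(rows, index):
    return [row[index] for row in rows]


# Hypothetical days, made up for illustration: (temperature °F, butterflies, swimmers).
train = [(60, 2, 20), (70, 5, 50), (80, 7, 70), (90, 9, 90)]
new_days = [(65, 8, 35), (75, 1, 60), (85, 4, 80)]

temp_line = fit_line(column(train, 0), column(train, 2))
butterfly_line = fit_line(column(train, 1), column(train, 2))

train_temp = mean_abs_error(temp_line, column(train, 0), column(train, 2))
train_butterfly = mean_abs_error(butterfly_line, column(train, 1), column(train, 2))
new_temp = mean_abs_error(temp_line, column(new_days, 0), column(new_days, 2))
new_butterfly = mean_abs_error(butterfly_line, column(new_days, 1), column(new_days, 2))
```

On the training days the butterfly model looks better than the temperature model, but on the new days its error is far larger. Judging a model only by how well it fits the training data would have picked the coincidence.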

The easiest way to prevent overfitting is to keep the neural network simple. If we retrain John Green-bot’s swimming pool program /without/ data like grass length and number of butterflies, and we observe that our accuracy doesn’t change, then ignoring those features is best. So training a neural network isn’t just a bunch of math!

We need to consider how to best represent our various problems as features in AI systems, and to think carefully about what mistakes these programs might make. Next time, we’ll jump into our very first lab of the course, where we’ll apply all this knowledge and build a neural network together. Crash Course AI is produced in association with PBS Digital Studios.

If you want to help keep Crash Course free for everyone, forever, you can join our community on Patreon. And if you want to learn more about the math of k-means clustering, check out this video from Crash Course Statistics.
