Reinforcement learning is particularly useful in situations where we want to train AIs to have certain skills we don’t fully understand ourselves. Unlike some of the techniques we’ve discussed so far, reinforcement learning generally only looks at how an AI performs a task AFTER it has completed it. And once an AI completes that task, figuring out when and how to reward it, called credit assignment, is one of the hardest parts of reinforcement learning. So today, we’re going to explore these ideas, introduce a ton of new terms like value, policy, agent, environment, actions, and states, and we’ll show you how we can use strategies like exploration and exploitation to train John Green Bot to find things more efficiently next time.

Crash Course AI is produced in association with PBS Digital Studios:

Crash Course is on Patreon! You can support us directly by signing up at

Thanks to the following patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Eric Prestemon, Sam Buck, Mark Brouwer, Indika Siriwardena, Avi Yashchin, Timothy J Kwist, Brian Thomas Gossett, Haixiang N/A Liu, Jonathan Zbikowski, Siobhan Sabino, Zach Van Stanley, Jennifer Killen, Nathan Catchings, Brandon Westmoreland, dorsey, Kenneth F Penttinen, Trevin Beattie, Erika & Alexa Saur, Justin Zingsheim, Jessica Wode, Tom Trval, Jason Saslow, Nathan Taylor, Khaled El Shalakany, SR Foxley, Sam Ferguson, Yasenia Cruz, Eric Koslow, Caleb Weeks, Tim Curwick, David Noe, Shawn Arnold, William McGraw, Andrei Krishkevich, Rachel Bright, Jirat, Ian Dundore

#CrashCourse #ArtificialIntelligence #MachineLearning
Hey, I’m Jabril and welcome to Crash Course AI.

Say I want to get a cookie from a jar that’s on a tall shelf. There isn’t one “right way” to get the cookies.

Maybe I find a ladder, use a lasso, or build a complicated system of pulleys. These could all be brilliant or terrible ideas, but if something works, I get the sweet taste of victory... and I learn that doing that same thing could get me another cookie in the future. We learn lots of things by trial-and-error, and this kind of “learning by doing” to achieve complicated goals is called Reinforcement Learning.

INTRO

So far, we’ve talked about two types of learning in Crash Course AI: Supervised Learning, where a teacher gives an AI answers to learn from, and Unsupervised Learning, where an AI tries to find patterns in the world. Reinforcement Learning is particularly useful for situations where we want to train AIs to have certain skills we don’t fully understand ourselves. For example, I’m pretty good at walking, but trying to explain the process of walking is kind of difficult.

What angle should your femur be relative to your foot? And should you move it with an average angular velocity of… yeah, never mind… it’s really difficult. With reinforcement learning, we can train AIs to perform complicated tasks.

But unlike other techniques, we only have to tell them at the very end of the task if they succeeded, and then ask them to tell us how they did it. (We’re going to focus on this general case, but sometimes this feedback could come earlier.) So if we want an AI to learn to walk, we give them a reward if they’re both standing up and moving forward, and then figure out what steps they took to get to that point. The longer the AI stands up and moves forward, the longer it’s walking, and the more reward it gets.

So you can kind of see how the key to reinforcement learning is just trial-and-error, again and again. For humans, a reward might be a cookie or the joy of winning a board game. But for an AI system, a reward is just a small positive signal that basically tells it “good job” and “do that again”!

Google DeepMind got some pretty impressive results when they used reinforcement learning to teach virtual AI systems to walk, jump, and even duck under obstacles. It looks kinda silly, but works pretty well! Other researchers have even helped real-life robots learn to walk.

So seeing the end result is pretty fun and can help us understand the goals of reinforcement learning. But to really understand how reinforcement learning works, we have to learn new language to talk about these AI and what they’re doing. Similar to previous episodes, we have an AI (or Agent) as our loyal subject that’s going to learn.

An agent makes predictions or performs Actions, like moving a tiny bit forward, or picking the next best move in a game. And it performs actions based on its current inputs, which we call the State. In supervised learning, after /each/ action, we would have a training label that tells our AI whether it did the right thing or not.

We can’t do that here with reinforcement learning, because we don’t know what the “right thing” actually is until it’s completely done with the task. This difference actually highlights one of the hardest parts of reinforcement learning called credit assignment. It’s hard to know which actions helped us get to the reward (and should get credit) and which actions slowed down our AI when we don’t pause to think after every action.

So the agent ends up interacting with its Environment for a while, whether that’s a game board, a virtual maze, or a real-life kitchen. And the agent takes many actions until it gets a Reward, which we give out when it wins a game or gets that cookie jar from that really tall shelf. Then, every time the agent wins (or succeeds at its task), we can look back on the actions it took and slowly figure out which game states were helpful and which weren’t.
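To make the vocabulary concrete, here is a minimal sketch of the agent-environment loop described above. Everything in it is an assumption made for illustration: the `Corridor` class, its `reset` and `step` methods, and the random-acting agent are hypothetical and are not shown in the episode. The point is only that the reward shows up at the very end, while the list of (state, action) pairs is kept so credit can be assigned afterwards.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: cells 0..length-1, reward only at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1 if done else 0  # feedback arrives only when the task is finished
        return self.state, reward, done


def run_episode(env, rng):
    """Let an untrained agent act at random until it reaches the reward."""
    state = env.reset()
    history = []  # (state, action) pairs, saved for later credit assignment
    reward, done = 0, False
    while not done:
        action = rng.choice([-1, 1])
        history.append((state, action))
        state, reward, done = env.step(action)
    return history, reward


history, reward = run_episode(Corridor(), random.Random(0))
print(len(history), "actions taken, final reward:", reward)
```

The agent gets no signal about any single step here; all it has to work with afterwards is `history` and the single final `reward`.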

During this reflection, we’re assigning Value to those different game states and deciding on a Policy for which actions work best. We need Values and Policies to get anything done in reinforcement learning. Let’s say I see some food in the kitchen: a box, a small bag, and a plate with a donut.

So my brain can assign each of these a value, a numerical yummy-ness value. The box probably has 6 donuts in it, the bag probably has 2, and the plate just has 1… so the values I assign are 6, 2, and 1. Now that I’ve assigned each of them a value, I can decide on a policy to plan what action to take!

The simplest policy is to go to the highest value (that box of possibly 6 donuts). But I can’t see inside of it, and that could be a box of bagels, so it’s high reward but high risk. Another policy could be low reward but low risk, going with the plate with 1 guaranteed delicious donut.

Personally, I’d pick a middle-ground policy, and go for the bag because I have a better chance of guessing that there are donuts inside than the box, and a value of 1 donut isn’t enough. That’s a lot of vocab, so let’s see these concepts in action to help us remember everything. Our example is going to focus on a mathematical framework that could be used with different underlying machine learning techniques.
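The donut example above can be sketched in a few lines. The values 6, 2, and 1 come from the example; the chances that each container really holds donuts are made-up numbers chosen purely for illustration, and weighing value by chance is just one possible way to express a "middle-ground" policy.

```python
# Values from the donut example: how many donuts each option might hold.
values = {"box": 6, "bag": 2, "plate": 1}

# ASSUMPTION: invented chances that each option really contains donuts.
chance_of_donuts = {"box": 0.2, "bag": 0.8, "plate": 1.0}


def highest_value_policy(values):
    """High reward, high risk: go for the biggest possible payoff."""
    return max(values, key=values.get)


def lowest_risk_policy(chance):
    """Low reward, low risk: go for the surest thing."""
    return max(chance, key=chance.get)


def middle_ground_policy(values, chance):
    """Weigh each value by how likely it is to be real (an assumed rule)."""
    return max(values, key=lambda option: values[option] * chance[option])


print(highest_value_policy(values))                    # box
print(lowest_risk_policy(chance_of_donuts))            # plate
print(middle_ground_policy(values, chance_of_donuts))  # bag
```

With different assumed chances, the middle-ground policy could just as easily pick the box, which is why the numbers matter as much as the rule.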

Let’s say John-Green-bot wants to go to the charging station to recharge his batteries. In this example, John-Green-bot is a brand new Agent, and the room is the Environment he needs to learn about. From where he is now in the room, he has four possible Actions: moving up, down, left, or right.

And his State is a couple of different inputs: where he is, where he came from, and what he sees. For this example, we’ll assume John-Green-bot can see the whole room. So when he moves up (or any direction), his state changes.

But he doesn’t know yet if moving up was a good idea, because he hasn’t reached a goal. So go on, John-Green-bot... explore! He found the battery, so he got a Reward (that little plus one).

Now, we can look back at the path he took and give all the cells he walked through a Value -- specifically, a higher value for those near the goal, and lower for those farther away. These higher and lower values help with the trial-and-error of reinforcement learning, and they give our agent more information about better actions to take when he tries again! So if we put John-Green-bot back at the start, he’ll want to decide on a Policy that maximizes reward.
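One simple way to give cells near the goal a higher value than cells farther away is to shrink the reward by a constant factor for every step of distance. The sketch below assumes that rule; the `discount` of 0.9, the function name `assign_values`, and the example path are all hypothetical and are not taken from the episode.

```python
def assign_values(path, reward=1.0, discount=0.9):
    """Give each cell on a successful path a value that shrinks with distance from the goal."""
    values = {}
    last = len(path) - 1
    for index, cell in enumerate(path):
        steps_from_goal = last - index
        value = reward * discount ** steps_from_goal
        # If the path crosses a cell twice, keep the visit closest to the goal.
        values[cell] = max(values.get(cell, 0.0), value)
    return values


# A made-up path of (row, column) cells ending at the battery.
path = [(2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
values = assign_values(path)
for cell in path:
    print(cell, round(values[cell], 3))
```

The goal cell keeps the full reward, and each step back along the path is worth a little less, which gives the agent a gradient to follow on its next attempt.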

Since he already knows a path to the battery, he’ll walk along that path, and he’s guaranteed another +1. But that’s… too easy. And kind of boring if John-Green-bot just takes the same long and winding path every time.

So another important concept in reinforcement learning is the trade-off between exploitation and exploration. Now that John-Green-bot knows one way to get to the battery, he could just exploit this knowledge by always taking the same 10 actions. It’s not a terrible idea -- he knows he won’t get lost and he’ll definitely get a reward.

But this 10-action path is also pretty inefficient, and there are probably more efficient paths out there. So exploitation may not be the best strategy. It’s usually worth trying lots of different actions to see what happens, which is a strategy called exploration.
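A common way to mix the two strategies is an epsilon-greedy rule: with a small probability the agent explores by picking a random action, and otherwise it exploits the action it currently values most. The episode does not name this rule, so treat the sketch below as one possible approach; `choose_action`, `epsilon`, and the example action values are assumptions.

```python
import random


def choose_action(action_values, epsilon, rng):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))  # exploration: try anything
    return max(action_values, key=action_values.get)  # exploitation: best so far


rng = random.Random(42)
# Made-up value estimates for the four moves from one cell.
action_values = {"up": 0.2, "down": 0.0, "left": 0.1, "right": 0.7}

always_exploit = [choose_action(action_values, 0.0, rng) for _ in range(20)]
mixed = [choose_action(action_values, 0.5, rng) for _ in range(200)]
print(set(always_exploit))  # only the best-known action
print(sorted(set(mixed)))   # a mix of actions
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade one off against the other.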

Every new path John-Green-bot takes will give him a bit more data about the best way to get a reward. So let’s let John-Green-bot explore for 100 actions, and after he completes a path, we’ll update the values of the cells he’s been to. Now we can look at all these new values!

During exploration, John-Green-bot found a short-cut, so now he knows a path that only takes 4 actions to get to the goal. This means our new policy (which always chooses the best value for the next action) will take John-Green-bot down this faster path to the target. That’s much better than before, but we paid a cost, because during those 100 actions of exploration, he took some paths that were even /more/ inefficient than the first 10-action try and only got a total of 6 points.

If John-Green-bot had just exploited his knowledge of the first path he took for those 100 actions, he could have made it to the battery 10 times and gotten 10 points. So you could say that exploration was a waste of time. BUT if we started a new competition between the new John-Green-bot (who knows a 4-action path) and his younger, more foolish self (who knows a 10-action path), over 100 actions, the new John-Green-bot would be able to get 25 points because his path is much faster.
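The point totals above are just division, which the sketch below spells out. It assumes each trip to the battery earns 1 point and that John-Green-bot is put back at the start for free after each trip; `points_from_exploiting` is a hypothetical helper name.

```python
def points_from_exploiting(action_budget, path_length):
    """Points earned by repeating one known path, at 1 point per completed trip."""
    return action_budget // path_length


old_bot = points_from_exploiting(100, 10)  # knows only the 10-action path
new_bot = points_from_exploiting(100, 4)   # knows the 4-action shortcut
points_while_exploring = 6                 # figure quoted in the example

print(old_bot, new_bot)                  # 10 25
print(old_bot - points_while_exploring)  # 4 points given up while exploring
print(new_bot - old_bot)                 # 15 extra points per 100 actions afterwards
```

So the 4 points lost during exploration are recovered within the next 100 actions of exploiting the shorter path.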

His reinforcement learning helped! So should we explore more to try and find an even better path? Or should we just use exploitation right away to collect more points?

In many reinforcement learning problems, we need a balance of exploitation and exploration, and people are actively researching this trade-off. These kinds of problems can get even more complicated if we add different kinds of rewards, like a +1 battery and a +3 bigger battery. Or there could even be Negative Rewards that John-Green-Bot needs to learn to avoid, like this black hole.

If we let John-Green-Bot explore this new environment using reinforcement learning, sometimes he falls into the black hole. So the cells will end up having different values than the earlier environment, and there could be a different best policy. Plus, the whole environment could change in many of these problems.
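Here is a rough sketch of how an agent could learn cell values by trial and error in a grid with both a battery and a black hole. It uses tabular Q-learning with epsilon-greedy exploration, a standard technique that the episode does not name, so treat it as one possible approach. The 3-by-4 grid layout, the -1 penalty, and every parameter value are assumptions chosen for illustration.

```python
import random

ROWS, COLS = 3, 4
START, BATTERY, BLACK_HOLE = (2, 0), (0, 3), (1, 3)  # assumed layout
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Move one cell (walls block movement) and return (next_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), ROWS - 1)
    col = min(max(state[1] + d_col, 0), COLS - 1)
    next_state = (row, col)
    if next_state == BATTERY:
        return next_state, 1.0, True
    if next_state == BLACK_HOLE:
        return next_state, -1.0, True  # negative reward
    return next_state, 0.0, False


def train(episodes=3000, epsilon=0.3, learning_rate=0.5, discount=0.9, seed=0):
    """Learn a value for every (cell, action) pair by repeated trial and error."""
    rng = random.Random(seed)
    q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        state = START
        for _ in range(50):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))        # explore
            else:
                action = max(q[state], key=q[state].get)  # exploit
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[next_state].values())
            target = reward + discount * future
            # Q-learning update: nudge the estimate toward the target
            q[state][action] += learning_rate * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the policy that always picks the highest-valued action."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(q[state], key=q[state].get)
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q = train()
path = greedy_path(q)
print(path)
```

During training the agent does fall into the black hole sometimes, and those falls are what push the values of the neighboring moves down so the final policy steers around it.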

If we have an AI in our car helping us drive home, the same road will have different people, bicycles, cars, and black holes on it every day. There might even be construction that completely reroutes us. This is where reinforcement learning problems get more fun, but much harder.

When John-Green-bot was learning how to navigate on that small grid, cells closer to the battery had higher values than those far away. But for many problems, we’ll want to use a value function to think about what we’ve done so far, and decide on the next move using math. For example, in this situation where an AI is helping us drive home, if we’re optimizing safety and we see the brake lights of the car in front of us, it’s probably time to slow down, but if we saw a bag of donuts in the street, we would want to stop.

So reinforcement learning is a powerful tool that’s been around for decades, but a lot of problems need a ton of data and a ton of time to solve. There have been really impressive results recently thanks to deep reinforcement learning on large-scale computing. These systems can explore massive environments and a huge number of states, leading to results like AIs learning to play games.

At the core of a lot of these problems are discrete symbols, like a command for forward or the squares on a game board, so how to reason and plan in these spaces is a key part of AI. Next week, we’ll dive into symbolic AI and how it’s a powerful tool for systems we use every day. See you then.

Crash Course AI is produced in association with PBS Digital Studios. If you want to help keep Crash Course free for everyone, forever, you can join our community on Patreon. And if you want to learn other approaches to control robot behavior, check out this video on Crash Course Computer Science.