crashcourse
Web Search: Crash Course AI #17
YouTube: | https://youtube.com/watch?v=PnFwdCGmVG0 |
Previous: | Financing Options for Small Businesses: Crash Course Entrepreneurship #16 |
Next: | Migration: Crash Course European History #29 |
Categories
Statistics
View count: | 83,088 |
Likes: | 1,788 |
Comments: | 60 |
Duration: | 11:15 |
Uploaded: | 2019-12-06 |
Last sync: | 2024-10-22 18:00 |
Citation
Citation formatting is not guaranteed to be accurate. | |
MLA Full: | "Web Search: Crash Course AI #17." YouTube, uploaded by CrashCourse, 6 December 2019, www.youtube.com/watch?v=PnFwdCGmVG0. |
MLA Inline: | (CrashCourse, 2019) |
APA Full: | CrashCourse. (2019, December 6). Web Search: Crash Course AI #17 [Video]. YouTube. https://youtube.com/watch?v=PnFwdCGmVG0 |
APA Inline: | (CrashCourse, 2019) |
Chicago Full: | CrashCourse, "Web Search: Crash Course AI #17.", December 6, 2019, YouTube, 11:15, https://youtube.com/watch?v=PnFwdCGmVG0. |
Today we’re going to talk about search engines, which are just AI systems that try to help us find what we’re looking for. Search engines can be the sort that serve up a list of results, like during a Google or Bing search, using web crawlers, an inverted index, and measuring stuff like click through and bounce back to figure out what you want to see. They can also be the kind that give you answers, like when you ask Siri or Alexa a question, relying on knowledge bases. Admittedly, these systems aren’t perfect so next week we’ll talk about bias in AI systems like this.
Crash Course is produced in association with PBS Digital Studios: https://www.youtube.com/pbsdigitalstudios
Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse
Thanks to the following patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:
Eric Prestemon, Sam Buck, Mark Brouwer, Efrain R. Pedroza, Matthew Curls, Indika Siriwardena, Avi Yashchin, Timothy J Kwist, Brian Thomas Gossett, Haixiang N/A Liu, Jonathan Zbikowski, Siobhan Sabino, Jennifer Killen, Nathan Catchings, Brandon Westmoreland, dorsey, Kenneth F Penttinen, Trevin Beattie, Erika & Alexa Saur, Justin Zingsheim, Jessica Wode, Tom Trval, Jason Saslow, Nathan Taylor, Khaled El Shalakany, SR Foxley, Yasenia Cruz, Eric Koslow, Caleb Weeks, Tim Curwick, DAVID NOE, Shawn Arnold, William McGraw, Andrei Krishkevich, Rachel Bright, Jirat, Ian Dundore
--
Want to find Crash Course elsewhere on the internet?
Facebook - http://www.facebook.com/YouTubeCrashCourse
Twitter - http://www.twitter.com/TheCrashCourse
Tumblr - http://thecrashcourse.tumblr.com
Support Crash Course on Patreon: http://patreon.com/crashcourse
CC Kids: http://www.youtube.com/crashcoursekids
#CrashCourse #MachineLearning #ArtificialIntelligence
Hi, I’m Jabril and welcome to Crash Course AI!
There used to be a time when a group of friends at dinner could ask a question like “is a hot dog a sandwich?” and it would turn into a basic shouting match with lots of gesturing and hypothetical examples. But now, we have access to a LOT of human knowledge in the palm of our hands… so our friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none of them have a connected bun like hot dogs (disappointed).
Search engines are a huge part of modern life. They help us access information, find directions to places, shop, and participate in sandwich arguments. But how does Google find answers to questions?
How are Siri and Alexa so smart but also easily stumped? How did IBM’s Watson beat the best Jeopardy players in the world? Well, search engines are just AI systems that are getting better and better at helping us find what we’re looking for.
INTRO. When we talk about search engines, we typically think about the AI systems online, like Google, Bing, DuckDuckGo, and Ask Jeeves.
But the basic ideas behind non-AI search engines have existed for centuries. Essentially, search engines gather data, create organization systems to sort that data, and find results to a question. For example, when you needed an answer to a question and couldn’t search online, you could go to the library!
Libraries gather data in the form of books and newspapers that are stacked neatly on the shelves. Librarians have organization systems to help you find what you’re looking for. Knowing that magazines are on shelves by the water fountain, while kids books are on the second floor is a kind of organization.
Plus, fiction books are sorted by the author’s last name, while nonfiction has the Dewey Decimal System, and so on. Once you (or the librarian) have the resources you need, you’ll be able to find results to your question! Now, rather than looking through books, web search engines look through all the data on the World Wide Web, aka “the Web”.
And instead of asking a human librarian where to find information, we ask an AI like John-Green-bot instead.
Jabril: Oh, John-Green-bot? [John-Green-bot makes dial-up beeps] Alright, John-Green-bot, you’re all set. We’re going to need that later. And just so we’re clear, we’re using “Web” throughout this video even though it might sound a little old-fashioned.
That’s because the Internet and the Web are not the same thing. The Internet is a collection of computers that send messages to each other. Video services like Netflix that play on your TV, for example, use the Internet, not the Web.
The Web, on the other hand, is part of the Internet and uses the Internet’s connections to send documents and other content in a format that can be displayed by a browser like Chrome or Safari. As with most AI systems, the first step is to gather lots of data. To gather data on the Web, we can use a computer program called a Web crawler, which systematically finds and downloads Web pages.
This is a HUGE task and happens before the search engine AI can take any questions. It starts on some Web page that we pick, called a seed, and downloads that page and finds all its links. Then, the crawler downloads each of the linked Web pages and finds their links, and so on... until we’ve crawled the whole Web.
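The crawl described above can be sketched in a few lines. This is a minimal sketch, not how any real search engine crawls: instead of downloading real pages, it walks a made-up, in-memory "Web" (the `TOY_WEB` dictionary and its page names are assumptions for illustration), and `fetch_links` stands in for "download the page and find its links".

```python
from collections import deque

# Hypothetical toy "Web": each page name maps to the pages it links to.
TOY_WEB = {
    "genghis-khan": ["marco-polo", "mongols"],
    "marco-polo": ["genghis-khan", "marco-polo-2"],
    "mongols": ["genghis-khan"],
    "marco-polo-2": ["water-polo"],
    "water-polo": [],
}


def crawl(seed, fetch_links):
    """Visit every page reachable from the seed, breadth-first."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)              # stands in for downloading the page
        for link in fetch_links(page):  # find all of the page's links
            if link not in seen:        # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


pages = crawl("genghis-khan", lambda page: TOY_WEB.get(page, []))
print(pages)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the Genghis Khan and Marco Polo pages do here.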
After we have collected all the data, the AI’s next step is to organize it by building an index, which is a kind of lookup system. The kind that’s used for organizing Web pages is called an inverted index, which is like the index in the back of a textbook. For each word, it lists all of the Web pages that contain that word.
Usually, the Web pages are represented by ID numbers so we don’t have a long, messy list of URLs. Let’s say page 0 is the seed, which happens to be a page about Genghis Khan.
It has a lot of words on it like “the, mongol, Khan, Genghis, who, and is”. In this inverted index, page 1 is about Marco Polo, but it mentions the word “Genghis” along with words like “the, Marco, Polo, who, are, and is.” Page 2 is about the Mongols, page 3 is a different webpage about Marco Polo, and page 4 is about Water Polo. So, let’s say we type “Who is Genghis Khan?” into a search engine.
Our AI can use this inverted index to find results, which in this case are links to Web pages. The AI will look at the words “who”, “is”, “Genghis”, and “Khan” and use the inverted index to find relevant pages.
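Here is a rough sketch of building and querying an inverted index. The page IDs follow the example above, but the page text is invented for illustration, so the set of matching pages need not be the same as the one in the episode. The function names are also assumptions.

```python
from collections import defaultdict

# Page IDs follow the example; the text on each page is made up.
PAGES = {
    0: "who is Genghis Khan the Mongol",
    1: "who is Marco Polo and who are the people he met after Genghis",
    2: "Genghis Khan led the Mongols",
    3: "Marco Polo travelled to China",
    4: "water polo rules and teams",
}


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    words = (word.strip("?.,!").lower() for word in text.split())
    return [word for word in words if word]


def build_inverted_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the IDs of pages containing at least one query word."""
    matches = set()
    for word in tokenize(query):
        matches |= index.get(word, set())
    return sorted(matches)


index = build_inverted_index(PAGES)
results = search(index, "Who is Genghis Khan?")
print(results)
```

Looking up a word is a single dictionary access, no matter how many pages were crawled, which is why the index is built ahead of time rather than scanning every page per question.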
Our AI might find that Web pages zero, one, two and five have at least one of the words from the question “who is Genghis Khan?” When Siri says “I found this for you,” the AI is just returning a list of Web pages that contain the same terms as the question. Except… most search engines include one more step. There are millions of pages online that contain the same terms.
So it’s important for search engines to rank Web pages, so that the top result is more likely to be relevant than the tenth result or the hundredth. Of course, Google and Bing don’t hire “supervisors” to grade each possible question and answer to help their AI systems learn from training data. That would take forever, and they wouldn’t be able to keep up with all the new content that gets created every day.
Really, regular users like us do this training for free all the time. Every time we use a search engine, our behavior tells the AI whether or not the results answered our question. For example, if we type “who is Genghis Khan” into a search engine and click on a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find Genghis Khan isn’t ANYWHERE in that movie.
So we’ll bounce back to the search results, and try again until we find a page that answers our question. A bounce indicates a bad result.
But if we click on a Wikipedia article about Genghis Khan and stay for a while reading, that’s a click through, which probably means that we found what we were looking for… so that indicates a good result. Human behavior like bounces and click throughs give AI systems the training data they need to learn how to rank search results and better answer our questions. Data from the Web and data from how we use the Web helps make better and better search engines.
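To make the idea of learning from bounces and click throughs concrete, here is a deliberately simple sketch. It is an assumption-laden toy, not how Google or Bing rank results: it scores each page by the smoothed fraction of visits that were click throughs. The class name, method names, and page names are all hypothetical.

```python
from collections import Counter


class ClickFeedbackRanker:
    """Toy ranker: pages that people stay on score higher than pages they bounce from."""

    def __init__(self):
        self.click_throughs = Counter()
        self.bounces = Counter()

    def record(self, page_id, stayed):
        """Log one visit: stayed=True is a click through, False is a bounce."""
        if stayed:
            self.click_throughs[page_id] += 1
        else:
            self.bounces[page_id] += 1

    def score(self, page_id):
        good = self.click_throughs[page_id]
        bad = self.bounces[page_id]
        # Add-one smoothing: a page with no visits scores 0.5 instead of dividing by zero.
        return (good + 1) / (good + bad + 2)

    def rank(self, page_ids):
        """Order pages from highest score to lowest."""
        return sorted(page_ids, key=self.score, reverse=True)


ranker = ClickFeedbackRanker()
for _ in range(3):
    ranker.record("wikipedia-genghis-khan", stayed=True)
for _ in range(2):
    ranker.record("wrath-of-khan", stayed=False)

ranking = ranker.rank(["wrath-of-khan", "brand-new-page", "wikipedia-genghis-khan"])
print(ranking)
```

A real system would learn a ranking model from many signals, but the feedback loop is the same: user behavior becomes the training data.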
Now, sometimes we ask our smart devices questions and we want actual answers… not links to Web pages. When I say “OK Google, what’s the weather like in Indianapolis?” I don’t want to scroll through results.
For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases, which you might remember from our video about Symbolic AI. A knowledge base encodes information about the universe as relationships between objects like "chocolate donut" and "John Green Bot wears polo".
One of the main problems with knowledge bases is that it’s really hard to write down all of the facts in the universe, especially common sense things that humans take for granted but computers need to be told. Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University.
In 2010, they created a huge knowledge base called the Never Ending Language Learner, or NELL, which was able to extract hundreds of thousands of facts from random Web pages. The way it works is really clever, so let’s go to the Thought Bubble to see how.
NELL starts with some facts provided by a human, for example, the genre of music that Mozart plays is classical. Which was represented like this: (Mozart, musicGenre, Classical).
Similarly, (Jimi Hendrix, plays, Guitar).
And (Darth Vader, hasChild, Luke Skywalker). Then, NELL gets to work and reads through each Web page one-by-one for words mentioned in those facts.
Maybe it finds the text “Mozart plays the piano.” NELL doesn’t know much about these symbols, but this text matches the same pattern as one of the facts provided by a human, specifically, the “plays” relationship. So NELL learns a new object: Piano. And a new fact: (Mozart, plays, Piano).
By searching over the entire Web, NELL can learn lots of facts based on just the three original ones that humans gave it! Some facts might appear hundreds or thousands of times online, like (Lenny Kravitz, hasChild, Zoë Kravitz).
But NELL might also find facts that are mentioned SOMEWHERE online and extract them as potentially true. Like, for example, (Darth Vader, plays, Kloo Horn).
We just don’t know! Just like how we look for multiple sources when writing a paper, NELL uses repetition and multiple sources to build confidence that the facts it’s finding are actually true.
To consider other relationships, NELL uses the highly confident facts it learned and searches through the Web again. Only this time, NELL is looking for new relationships. Maybe it finds the text “Darth Vader cuts off Luke Skywalker’s hand,” and NELL learns a new (very specific) relationship: cutsOffHand.
Over and over again, NELL will use known relationships to find new objects, and known objects to find new relationships -- creating a huge knowledge base. Thanks, Thought Bubble! AI systems can use huge knowledge bases, like this one extracted by NELL, to answer our questions directly.
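The first half of that loop, using a known relationship to find new objects and counting mentions to build confidence, might be sketched like this. It is a heavily simplified illustration and not NELL's actual algorithm: the regular expression, the example sentences, and the two-mention threshold are all assumptions.

```python
import re
from collections import Counter

# The three seed facts from the example, as (subject, relation, object) triples.
SEED_FACTS = {
    ("Mozart", "musicGenre", "Classical"),
    ("Jimi Hendrix", "plays", "Guitar"),
    ("Darth Vader", "hasChild", "Luke Skywalker"),
}

# Hypothetical text pattern for the one relationship this toy learner can read.
PATTERNS = {
    "plays": re.compile(r"(?P<subj>.+?) plays (?:the )?(?P<obj>.+?)\.?"),
}

# Made-up sentences standing in for text found on different Web pages.
SENTENCES = [
    "Mozart plays the piano.",
    "Mozart plays the piano",
    "Darth Vader plays the kloo horn.",
]


def extract_facts(sentences, patterns):
    """Count how often each candidate fact is mentioned."""
    counts = Counter()
    for sentence in sentences:
        for relation, pattern in patterns.items():
            match = pattern.fullmatch(sentence.strip())
            if match:
                fact = (match["subj"], relation, match["obj"].title())
                counts[fact] += 1
    return counts


def confident_facts(counts, min_mentions=2):
    """Keep only facts that were mentioned at least min_mentions times."""
    return {fact for fact, mentions in counts.items() if mentions >= min_mentions}


counts = extract_facts(SENTENCES, PATTERNS)
knowledge_base = SEED_FACTS | confident_facts(counts)
print(sorted(knowledge_base))
```

With these sentences, the Mozart fact is mentioned twice and joins the knowledge base, while the kloo horn fact is mentioned only once and stays a low-confidence candidate.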
Instead of using the words from our questions to search through an inverted index, an AI like Siri can reformulate our questions into incomplete facts and then look for matches in a knowledge base.
Jabril: Hey, John-Green-bot…
John-Green-bot: Yes, Jabril?
Jabril: “Who wrote The Bluest Eye?” His AI could then reformulate that question into an incomplete fact, replacing “who” with a question mark. If John-Green-bot extracted that information earlier, he can find matches in his knowledge base and return the most confident result.
John-Green-bot: Toni Morrison wrote The Bluest Eye!
Jabril: Hey! Thanks, John-Green-bot! Different words are categorized differently, so an AI like John-Green-bot can tell the difference between questions asking “who” and “when” and “where.” But that gets more complicated, so we’re not going to dive into the details here. If you want to learn more, you can read about part-of-speech tagging systems.
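As a rough sketch of that reformulation step, the code below turns a "Who ...?" question into an incomplete fact and returns the most confident match. It assumes a tiny knowledge base with invented confidence values, including one deliberately wrong low-confidence entry, and it only handles questions shaped like "Who <relation> <object>?". Real systems use part-of-speech tagging and much more.

```python
# (subject, relation, object, confidence); the confidence values are invented.
KNOWLEDGE_BASE = [
    ("Toni Morrison", "wrote", "The Bluest Eye", 0.98),
    ("Toni Morrison", "wrote", "Beloved", 0.97),
    ("John-Green-bot", "wrote", "The Bluest Eye", 0.05),  # a noisy, wrong extraction
]


def reformulate(question):
    """Turn 'Who wrote X?' into the incomplete fact (None, 'wrote', 'X')."""
    words = question.rstrip("?").split()
    if len(words) < 3 or words[0].lower() != "who":
        raise ValueError("this sketch only handles 'Who <relation> <object>?'")
    return (None, words[1], " ".join(words[2:]))


def answer(question, knowledge_base):
    """Fill in the missing subject with the most confident matching fact."""
    _, relation, obj = reformulate(question)
    matches = [
        fact for fact in knowledge_base
        if fact[1] == relation and fact[2] == obj
    ]
    if not matches:
        return None  # the knowledge base has nothing on this question
    best = max(matches, key=lambda fact: fact[3])
    return best[0]


print(answer("Who wrote The Bluest Eye?", KNOWLEDGE_BASE))
```

The `None` in the reformulated fact plays the role of the question mark: it marks the slot the knowledge base has to fill.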
Using all these strategies, search engines have become really good at answering common questions. But questions like “How many trees are in Ohio?” or “How many hotdogs are eaten in the South Sandwich Islands annually?” still stump most AI systems, because not enough people ask them and AI hasn’t learned how to answer them well yet. It’s also important to watch out for search engine answers to questions like “Who invented the time machine?” because AI systems have a tough time with nuance and incomplete data.
Sorry, Doc Brown. And a big, sort of hidden, problem is that search engine AI systems are influenced by any biases in data online. For example, if I ask Google for images of “nurses,” it will mostly show pictures of female nurses.
So next time, we’ll talk about how an algorithm can be biased, where bias comes from, and what we can do to address bias in AI. I’ll see ya then. Crash Course AI is produced in association with PBS Digital Studios!
If you want to help keep all Crash Course free for everybody, forever, you can join our community on Patreon. And if you want to learn more about the history of the World Wide Web, check out this episode of Crash Course Computer Science.