This episode overviews some of the fundamental concepts of natural language processing including stemming, n-grams, part of speech tagging, and the bag of words approach.
Data Skeptic
[MINI] Natural Language Processing
(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - Well, welcome to yet another episode of the Data Skeptic Podcast mini-episodes. As always, I'm here with my wife and co-host Linda. - Hello. - Thank you for joining me, Linda. - Thank you, Kyle. - So tell me something. When I say the topic, natural language processing, what does that mean to you? - To me, I just think of someone who's thinking about language naturally. (laughs) - Doesn't it make you think of computers? - Yeah, I mean, processing sounds like a computer chip to me. - Well, yeah, that's what we call the field. And actually, when I was in grad school, people were trying to push for this term computational linguistics, which I actually preferred. But it seems like natural language processing or NLP is what's stuck. And that's the area of computer science, AI, data science, however you wanna spin it. It's the focus of how we get computers to recognize the language that you and I speak. In this case, being English, but it doesn't matter the language. It's just human language, the way we communicate with each other. - So an example is on an iPhone or an Android when we're talking to Google Now or Siri. - Right, yeah, exactly. - We say directions home or something like that. - Well, there's two parts actually. So first there's like speech signal processing. Can they turn our voice into words? And then there's, well, what do these words actually mean? And that's the part that I think of as natural language processing. Once you have words, what do those words mean? It's just like Alan Turing said. Language is probably amongst the most computationally difficult activities that one can work on. Yet we're starting to kind of see, and maybe even for perhaps a decade now or more, we're seeing that machines are understanding language. How can that be when it's so hard of a problem? - You're asking me? I don't know. Maybe their chips have gotten faster. - Oh, that's actually interesting. We could talk about Moore's Law more sometime, but it's not so, well, is it about the speed of the chips? This is a really interesting question, almost for another episode, but I would say it's not so much the speed of the chips, but the size of the data sets that people have been able to work on. So let's define natural language processing. It's when people are trying to use computers to understand natural language. And the general process for doing that is about training an algorithm to recognize things. So just like how we talked about a couple of shows ago, Markov chains can link words together, like I and love go together probably more often than I and potato. A lot of the learning of those things, or at least the successful technologies we have today, is based upon algorithms that were well-trained on large data sets. But again, I'm kind of maybe moving a little bit away from some of the fundamentals I wanted to talk about. So let's think about how a computer tries to first understand an English sentence. How do you break down a sentence when you hear one or read a sentence in a book? - I mean, I don't know. I think I'm kind of older. I'm not like a five-year-old kid. So, you know, once you understand the blocks, which are the words, well first you have to understand the letters, and then you want to see the letters and you could see the words. And once you understand the words, then you could see the sentence.
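A rough sketch of that word-linking idea in Python (the tiny corpus here is invented for illustration, not anything from the show): count how often each pair of adjacent words occurs, and "I love" turns out to be far more common than "I potato."

```python
# Minimal sketch of the "which words follow which" idea: count how often each
# adjacent word pair appears in a toy corpus (the transitions a Markov chain
# would be trained on).
from collections import Counter

corpus = "i love my parrot . i love running . i ran home to my parrot ."
tokens = corpus.split()

# Count adjacent word pairs.
pair_counts = Counter(zip(tokens, tokens[1:]))

print(pair_counts[("i", "love")])    # 2 -- a common transition in this corpus
print(pair_counts[("i", "potato")])  # 0 -- never observed
```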
So then when you see the sentence, hopefully you're seeing the message. - That's actually pretty much right on with where I wanted to go with this. So you identified the first thing, which is recognizing the words. So if we kind of separate a sentence by spaces and maybe punctuation, we can sort of separate out all the words. And how do you feel about the words run, running, and ran? - Well, there are different verb forms. - Yep, true. They are different verb forms of the same idea though, right? - I mean, it's the same word conjugated differently. - Correct. And it does have a different meaning, but when we look at statistical approaches to natural language processing, it's all about the frequency with which you can observe different situations. If there's a document that's about running, you would want the words running and ran and run all to kind of be considered the same thing so that they can be matched together. So a major first step in doing natural language processing is dividing up sentences into what we call tokens. So tokens aren't just words, but they're often words with the endings cut off, so that run and running are considered the same thing, because from a linguistic perspective, even though running captures the tense, they're the same on some fundamental level for comparison's sake. So you might use a stemming algorithm. Something that Google is well known for is looking at what they call n-grams. Have you heard of that before? - No. - If I say, again, figuring like all the stemming stuff is done, there's parrot, that's a single token word, but Amazon parrot is a different type of parrot, correct? - Yes. - So if we saw a document about Amazon parrots and another one about parrots, the document about Amazon parrots is probably more relevant to a query about Amazon parrots. So maybe it would be useful to have your algorithm consider those two things simultaneously as the same idea. Amazon dash parrot is considered as if it was one word. - You mean they have to be next to each other? - If they're next to each other, that implies to us that maybe they're important and related, perhaps the same concept. - So you're saying your search algorithm would recognize that or no? - So what I'm saying is that this is a consideration when doing natural language processing: the concept of an n-gram. What I described was a bigram, 'cause there's two, but you could have trigrams and quadgrams and however deep you want. Don't think of words individually, but think of sequences of words as concepts. So Amazon parrot is a bigram of two words. Amazon parrot whistling is a trigram, it has three words, and if I saw those three linked, I would wanna see wherever that came from, 'cause they're probably making cute bird noises, right? - Maybe. - All right, tokens, roots, stemming, n-grams. There's another important concept that gets used a lot in natural language processing and that's part of speech tagging. Now, I bet you have a guess what that is, right? 'Cause you actually have a lot more liberal arts kind of background than I do. As far as grasp of the English language, I think you win. - Parts of speech tagging? I don't know what that means. - So do you know about parts of speech? - I don't know. - Adverbs, nouns, adjectives. - Oh, is that a part of the speech? - Yeah. - Oh, okay. - So there's this other tool we use a lot called a part of speech or POS tagger. And what that software does is it can look at a sentence and figure out what type of words are in that sentence.
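Before the tagger example, here is a rough Python sketch of the tokenizing, stemming, and n-gram ideas above, using NLTK as one common toolkit (assuming the package and its tokenizer data are installed; the sentence is a toy example, not from the episode):

```python
# A minimal sketch, assuming NLTK is installed.
import nltk
from nltk.stem import PorterStemmer
from nltk.util import ngrams

# nltk.download("punkt")  # uncomment on first run to fetch the tokenizer data

sentence = "I love running with my Amazon parrot"

# Tokenize: split the sentence into individual word tokens.
tokens = nltk.word_tokenize(sentence.lower())

# Stem: cut suffixes so "running" reduces to "run"
# (irregular forms like "ran" would need a lemmatizer instead).
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# ['i', 'love', 'run', 'with', 'my', 'amazon', 'parrot']

# n-grams: look at adjacent pairs, so "amazon parrot" can be treated as one concept.
print(list(ngrams(tokens, 2)))
# includes ('amazon', 'parrot')
```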
As an example, like, let's have the sentence, Yoshi ate the peanut. Yoshi is some sort of proper noun. Ate is the verb. The is, I don't know, what is the? Is that like a preposition? - Yeah. - Oh, cool, I got it right. - That's what I think. I'm not very good at grammar, as a footnote. - Me either. Even though I've worked on a lot of these algorithms, I know them sort of just from like a code and statistical level. And then peanut would be another noun. So you could potentially, with this sort of software, label each word for what it does in the sentence. Now why would you do that? Because verbs and nouns are generally more interesting than prepositions, and it can help you identify not only what the sentence is about, but maybe the relationships described. Like English tends to be subject verb object, right? Yoshi ate peanut. Linda married Kyle, right? Where I guess German is the opposite, it's object verb subject. But if you're looking at English sentences and you've extracted the part of speech tags, you can sometimes extract more meaning from that sentence because you know what the verb is and what it's describing. Now that's a bit more advanced than a lot of what most people do actually when they talk about natural language processing. When you look at parts of speech and grammars, that's almost a bit fancy and high-end. A lot of people use a technique called the bag of words approach. Now what would you guess that means, having heard it? - Random. - Well, not random, kind of a bag you have. - Well, it reminds me of the grocery store when you go in and you're like, yeah, mystery bag of apples on sale. - You would never buy the mystery bag, that's not a Linda thing to do. - Well, no one's gonna open the bag and look at all the apples, they would just buy it. That's too much work. - All right, well, let's think about those apples. What different kinds of apples could be in there? - Bruised apples. - And how do you like them apples? - They're not good. - I meant like Gala, Honeycrisp. If you're gonna buy that bag of mixed apples, it doesn't matter, right? 'Cause you get it home and you take it out and you have what you have. In language, the order actually matters. Yoshi ate the peanut. That's a very different sentence than peanut ate the Yoshi. - Oh yeah. - You can't just toss the words all into a big soup bowl and assume that it all works out. But actually, that's what most people do when they do natural language processing. They use this thing called the bag of words approach. Let's say you had a bunch of books. Every book in the library you scanned, you had it all in a text file. And you wanted to find out how similar two books were. And presumably like an Isaac Asimov book is more similar to an Arthur C. Clarke book than Isaac Asimov books are similar to, what did you just read, the Hunger Games book? - I think that was like four years ago. - Yeah, I said I remembered. But I think we could agree Isaac Asimov, more similar to Arthur C. Clarke than it is to whoever wrote Hunger Games. So how would you detect a similarity like that? Well, a lot of success has been had by just comparing these frequency counts and word counts. The word computer appears this often in these books, along with a bunch of other words, and then seeing what matches most closely in terms of having similarities, versus what's very different, in that the counts of the words they use to describe things are very different. Are you with me right now? - You mean the average count of their words?
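Two rough Python sketches of the ideas above, not from the episode itself. First, part of speech tagging on the Yoshi sentence, using NLTK's off-the-shelf tagger (assuming the package and its tagger data are installed); for what it's worth, the tagger labels "the" as a determiner (DT) rather than a preposition.

```python
# A minimal POS tagging sketch, assuming NLTK and its tagger data are available.
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run

print(nltk.pos_tag(nltk.word_tokenize("Yoshi ate the peanut")))
# roughly: [('Yoshi', 'NNP'), ('ate', 'VBD'), ('the', 'DT'), ('peanut', 'NN')]
# NNP = proper noun, VBD = past-tense verb, DT = determiner, NN = noun
```

Second, the bag of words comparison: throw away word order, turn each document into a vector of raw word counts, and compare the vectors. The three one-line "books" below are invented stand-ins, not actual text from any author.

```python
# A minimal bag-of-words similarity sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the robot computer calculated the answer with its positronic brain",  # stand-in for Asimov
    "the ship computer guided the astronauts toward the monolith",         # stand-in for Clarke
    "she drew her bow and aimed an arrow across the arena",                # stand-in for Hunger Games
]

# Each document becomes a vector of word counts; order is ignored.
counts = CountVectorizer().fit_transform(docs)

# Cosine similarity of those count vectors: the two sci-fi stand-ins end up
# closer to each other than either is to the third document.
print(cosine_similarity(counts))
```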
- The whole idea that you could just jumble all the words into a bag and just see how frequently they appear, and that that could be a meaningful thing to do. - So you're asking me if I'm with you? Yeah, I've read about that. - Oh yeah, what have you read? - They said it's like a fingerprint. - Of the words you use? - Yeah, like authors. - Oh yeah, actually there's a lot of cool work and I'm trying to get this one guy on the show who did this cool thing with Agatha Christie. So maybe we'll have him in the next year. But yeah, people's writing patterns show up statistically and that's very much in the domain of natural language processing. So it's hard to do mini episodes on this because there are so many different aspects of this and little tools and methodologies to cover, but some of the very basic introductory ones that I hope people get from what we talked about are how to tokenize something and stem it, take away the conjugation and get to its base word. And the reason you would do that is so that you could compare words and have run treated the same as running. That n-grams are a powerful technique, looking at how words appear together versus like two words that are in the same sentence but very far apart. POS or part of speech tags are a useful piece of metadata that might give you some insight into the meaning of a sentence or the context of a document. And that, yeah, I've been saying document a lot without even explaining what that means. Do you follow me when I say document? - What do you mean? - Well, it could be anything, a tweet, a book. Document is just a completed work of text and we use that a lot in natural language processing. Also a corpus is a collection of documents, generally from like the same source or some relevancy or something like that. I guess in conclusion, we're very far from solving this natural language processing problem. There is a lot to be said for the human understanding of language and the currently inferior computer understanding of language, but some of these techniques are what people use to explore the ways in which computers can understand language. And an interesting use case of that will appear in an upcoming interview, which is why I wanted to squeeze this mini episode in. So can I quiz you on these topics? - No. - Why not? - I don't know. - What's stemming? - Stemming. It's things connected to other words. - That's pretty much right. What about POS tagging? - Parts of speech tagging. - What does it mean? - Whether something is a noun, an adverb, a proposition. I mean, preposition, sorry. - How about n-grams, the hardest one we did? - That's groups of words that mean something. - Bag of words approach? - That's what you were saying, people just take all the words together, right? - Yep, and that's pretty much good enough in a lot of cases. Why, you really picked up on a lot of this. I didn't get into grammars at all but I should probably save that for another day, so. - Yes, please. (laughs) - Well, once again, thank you for joining me, Linda. - Thank you. - This is gonna be a great mini episode. (upbeat music)
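A footnote on the "fingerprint" remark above: one simple way people look at writing patterns statistically is to compare how often different texts use common function words. The sketch below is only an illustration of that general idea, not the Agatha Christie analysis mentioned in the episode, and the snippets are invented stand-ins for author text.

```python
# A minimal stylometry-flavored sketch: relative frequencies of a few common
# function words as a crude "fingerprint" of a text. Snippets are made up.
from collections import Counter

FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that"]

def function_word_profile(text: str) -> list[float]:
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    # Relative frequency of each function word in the text.
    return [counts[w] / total for w in FUNCTION_WORDS]

author_a = "the robot turned to the scientist and spoke of the three laws"
author_b = "she ran to the woods and hid in the hollow of a tree that night"

print(function_word_profile(author_a))
print(function_word_profile(author_b))
```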