Archive FM

Data Skeptic

[MINI] Term Frequency - Inverse Document Frequency

Duration:
10m
Broadcast on:
11 Dec 2015
Audio Format:
other

Today's topic is term frequency inverse document frequency, which is a statistic for estimating the importance of words and phrases in a set of documents.

[MUSIC] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism, usually in ten minutes or less. Our topic for today is term frequency - inverse document frequency, or TF-IDF. So Linda, I want to talk about a fun metric today. Like many good statistics, it's actually very interesting and informative to unpack it and ask sort of like why it exists and what problem it was trying to solve. So we're going to do that with this information retrieval statistic called TF-IDF. What if you wanted to assign an importance score to every word in a document? Can you imagine a scenario where that would be something useful? Assigning importance to words? In a document, yeah. So a document could be like an email or a book or a tweet or just about anything, but where you'd want to know, like, what are the important words in that document? Well, I think we're working a little bit out of order, but I think the answer is, I assume, like, search engines and such, you know, use that to search a document. Or, you know, when you're searching your email, you're like, I'm trying to find this document, and hopefully it's scanned the words in the document. Search absolutely takes advantage of TF-IDF in some form.

Now, how do you think we might arrive at the most important words or phrases in a document? What's, like, an indicator of a word being important? I mean, obviously it has to be in the document, that's a fair one, but how do you know which are more and less important words? I mean, it doesn't mean it's an important word, but you could talk about how often it's mentioned. Yeah, that's exactly right. The first idea is TF, term frequency. For example, if the phrase "lilac-crowned" appeared in a document many times, what do you think that document might be about? About Yoshi. Or her species. There are other Amazons besides her that it could be about, right? Yeah, well, if you just think of how often we say Yoshi, that must mean she's really important to us. Yeah, definitely. Now, what about in all those documents that say "lilac-crowned"? How many times do you think that phrase would appear on a page that's about lilac-crowned Amazons? I don't know. It depends how long the page is. Let's say it's like, if you printed it out, it would be like three pages long. I don't know, maybe like 60 times. 60 times? That's a lot. I was actually going to say less, but yeah, you might be right.

Do you think that's the most common word on the whole page? Word or phrase? Well, no. You were saying like connector words, right? Yeah, like "the", "of", "on", and "in", stuff like that. Are they called connector words? What do they call them? I'm going to get in trouble here. Some listeners are going to write in and correct me for sure. Yeah, there's some grammar person who should tell us what it's called. I would guess it's a conjunctive part of speech, but even aside from part of speech, there could be other nouns that are going to be more common than the phrase "lilac-crowned" while actually not necessarily being all that informative about what the page is about. So you have to kind of balance the first part that you picked up on, how common a word is on the page, like the more the better, theoretically. But that's not a perfect solution, right? Because then on every page, the most important word would be "the". So what would be, like, maybe a way we could counteract that? Well, I don't know. What is it?
Well, maybe you just want to penalize the score for how many times common words appear, or rather penalize the score of that word. So you'd want the score of "the" to be really low, right, on most pages. Why? Because most pages are not about "the". I mean, maybe on the Wikipedia page about the band "The", it would be important there, but elsewhere "the" is not an important word. I don't know. If I'm reading it, I'd like to read "the". So it's not an important word in determining what the page is about. Like, not that you take it out, but that when you say "what are the important words on a web page", the word "the" is almost never among them. Oh, yeah, probably not. Right. What's something we know about "the" that maybe we could penalize for? How many pages contain the word "the", would you estimate, or what percentage of pages? All of them. Almost 100% of websites would say "the". So maybe we could penalize it by how many documents it appears in, because the more documents a term appears in, the less overall important it probably is. The less likely it is to tell you which page it came from. Versus, how many pages do you think say "Linda and Kyle's wedding" on them? Web pages? Yeah. Only our web page. I mean, maybe somebody put on their blog that they went to. No. You don't think so? Nope. We had some high-profile people at our wedding. No bloggers. Yeah, so that's a rare phrase. It appeared only a few times, maybe on our site. It's not like we just repeated it a million times over, but it appears so infrequently on the Internet that it carries quite a bit of information. It has a low term frequency, but also a high inverse document frequency, so it doesn't get penalized as much.

And the reason we call it inverse document frequency is because it's sort of like the inverse. It's one divided by a number. Do you know what happens if you have this equation, one divided by n? What happens to that as n gets bigger? Well, it means less. Yeah, it shrinks. Smaller percentage. It gets closer and closer to zero. So the bigger the n, the smaller the value. So that's what happens when you take the inverse of something. The intuition I want you to remember is that as the number goes up, it shrinks. So the more documents on which a term appears, the lower this part of the score should go. You're actually going to need to scale it, so generally you want to take the log of this. It would be like the log of N, which is the total number of documents, divided by the number of documents that have that term. And that's sort of the generalized form. There are some variants of that and some reasons why you would tweak it, but that's the basic idea. So term frequency multiplied by the inverse document frequency balances out this cool feature you pointed out, that a word appearing a lot of times in a document means it's important, with this idea that if it appears on too many documents, it's not important.

Now, TF-IDF is not a perfect metric as I've just described it. Sometimes you need to normalize one or both of those. Sometimes you need to take the log of one or both of those. So your mileage may vary, and I would recommend people, if they're interested in that part, go look up a legitimate resource. That's not what we do on mini-episodes here. We try to give you just the high-level understanding of what the topics are. So how did this come about? What's the history of this method? Good question. It came out of the information retrieval community, because they were trying to score documents in the exact way I'm describing. They'd like to know, per word or per phrase, what is the importance of that word or phrase to the document.
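To make the arithmetic just described concrete, here is a minimal sketch in Python of the basic score: raw term count multiplied by the log of the total number of documents over the number of documents containing the term. The function name, the toy corpus, and the tokenization are made up for illustration; as noted above, real implementations vary (smoothing, normalization, log-scaled term frequency).

```python
import math

def tf_idf(term, document, corpus):
    """Score `term` in `document` as raw count * log(N / document frequency)."""
    tf = document.count(term)                      # term frequency: raw count in this document
    n_docs = len(corpus)                           # N, the total number of documents
    df = sum(1 for doc in corpus if term in doc)   # number of documents containing the term
    if df == 0:
        return 0.0                                 # term never seen; nothing to score
    return tf * math.log(n_docs / df)              # high when frequent here but rare elsewhere

# Toy corpus: each "document" is just a list of lowercase tokens.
corpus = [
    "the lilac crowned amazon is a parrot the parrot talks".split(),
    "the stock market rallied and the traders cheered".split(),
    "the recipe calls for the oven to be preheated".split(),
]

print(tf_idf("the", corpus[0], corpus))     # 0.0 -- "the" appears in every document
print(tf_idf("parrot", corpus[0], corpus))  # positive -- frequent here, absent elsewhere
```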
That way, if you go and do a search query or something like that, you can easily come to the pages that have the highest term frequency - inverse document frequency for the words in your query. Or if you had a page you liked and you said "what are similar pages", you can look at the vectors of those two things and use cosine similarity, which we talked about a long time ago, which is the idea of, like, what is the distance between their two vectors, if their vectors are composed of all these weights on all the words.

Maybe we could talk about one of the ways in which term frequency - inverse document frequency is sort of insufficient. Think back to the early, I guess, like, mid 90s, when the web was all new. When did you get on the internet, Linda? No. I don't know. I'd have to calculate. The 90s sometime. Well, can you guess at the year, or guess at the grade or anything? I don't want to think about that. Well, anyway, do you remember how easy it was to find what you were looking for back then? Well, there wasn't that much out there, so no. And what often happened when you would wind up at a page? Would it be exactly what you thought it would be? No, people were really bad also with keywords and stuff. Yeah, so there were a number of problems. The search engines themselves and the algorithms they implemented weren't as sophisticated, but also there wasn't as much policing or penalization for manipulations.

So let's talk about one of the ways that TF-IDF can be easily manipulated. Could you maybe in your mind repeat back what those two pieces are, or roughly how you calculate them? Calculate what? TF and IDF. Term frequency. I guess you just count how many times it appears on the page. And then IDF. That one I'm going to need some help on. It would be like the log of N, which is the total number of documents, divided by the number of documents that have that term. What if you have a web page that just scrolls forever? Well, that would be one page. Although search engines have to... Oh, this is another good point, another problem with TF-IDF. If you have documents of mixed length, it's not necessarily a sufficient statistic for comparing them. You might need to do some normalization. So let's say you had a web page and you really wanted to be ranked highly on search engines, right? Get into that number one spot so people find you. Can you control the number of other websites that mention the word "games"? No. Can you control the number of times the word "games" appears on your page? Yeah, I mean, that's what people did in the early search engine days. They would just repeat the same word like a bazillion times. Yeah, super annoying. And then they would try and trick it and put it in different colors. So eventually, search engines started penalizing those pages. But that's a failure of TF-IDF, in that it's manipulable. If someone knows you're using it, they can go in and try and cheat the system. That doesn't mean that it's inherently not good. I guess I would call it necessary but not sufficient. And another example, the one you kind of mentioned earlier, is if the documents are of different lengths, the comparisons of the weights are not necessarily going to pan out in a nice normalized way. But there are some workarounds for that.
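As mentioned above, once every page has a vector of TF-IDF weights, "what are similar pages" can be answered with cosine similarity between those vectors. Here is a minimal sketch along those lines; the weight values and page names are hypothetical, and the vectors are represented as plain term-to-weight dictionaries for simplicity.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                     # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)     # 1.0 means same direction, 0.0 means no shared terms

# Hypothetical TF-IDF weight vectors for two pages.
page_a = {"parrot": 2.2, "lilac": 1.1, "crowned": 1.1}
page_b = {"parrot": 1.1, "feathers": 0.7}
print(cosine_similarity(page_a, page_b))  # closer to 1.0 means more similar pages
```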
If you're going to go implement TF-IDF, do some reading about it, you know, look at the Wikipedia page and some of the common implementations, because you do need to adapt it a little bit to the data you're working on. But I think we've covered intuitively what it is. Thank you for joining me as always, Linda. Thank you. And until next time, I want to remind everyone to keep thinking skeptically of and with data. Good night. Good night. For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.