Archive FM

Data Skeptic

[MINI] The Elbow Method

Duration: 15m
Broadcast on: 18 Mar 2016
Audio Format: other

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k to select for a given problem?

This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closest sales in its area.

(upbeat music) Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is the elbow method. - So picking up on our last mini-episode, Linda, where we tried to apply multiple regression to estimate the value of a home, since we're in the market to buy a home, huh? - We sure are. - What do you think the most important thing about a home really is? Or the realtors might say the top three most important things? - You mean in terms of determining the value of the house? - Yeah. - Well, the thing is everyone has different values, but I could tell you my items. Well, I want our investment to appreciate. - Yep. - I want the culture of the neighborhood to be somewhat acceptable or doable with our lifestyle. - Yep. - The location matters, safety. I want it to be safe so we could live our lives. This isn't as important to me as long as it's within an hour commute, but I know you want it to be within commuting distance from work. - Right. - And you defined it as 30 minutes. - Yeah. - Which is not very fun. - Very fun. My little joke earlier was that the top three things for realtors are location, location, location. And you picked location and neighborhood. And it's neighborhood I kind of want to focus on today, because a house in one neighborhood could be identical to one in another neighborhood, but that doesn't make them of the same value now, does it? - If we want to walk somewhere, does it have restaurants, grocery stores, parks, or anything that we like to do? Like if there's a pool or a gym nearby. Is there a CVS or a drugstore? - And you'd be happy with just any old restaurants? - I want them to taste good and be safe. - What do you mean, be safe? - I don't want the restaurant to have a history of, like, shootouts or anything. - What kind of restaurants are you eating at? - I don't know. Exactly, my point exactly, I don't need those restaurants. - I thought you were going to say, like, I want a restaurant with a Michelin star. You're settling for one where there hasn't been a shooting? - Yes. - Okay, cool, good to know. So it's actually going to be a little bit difficult, even in the age of data, to get all the numbers of good restaurants in an area. Like maybe I could look at the Yelp API and, I don't know, hope it's complete enough and hope that their star ratings are good, but there's a lot, and then what does walking distance mean, right? Like maybe there are sidewalks in some places, or it's safer in some places than others. A neighborhood is a complex thing to model. So I'd really like to just have a way to approximate it, or know something about the neighborhood by proxy. That way I could assume that, like, all right, the price of one home is probably relatively close to the houses around it. Wouldn't that make some sense? - Within reason, some houses are bigger than others. - Yeah. - And some of them are in worse condition, like for example, a termite infestation. How does that impact the value? Or the plumbing is messed up and they need completely new plumbing, what does that mean to the value? - Yeah, so those are issues with the house itself that we would want to figure out in more of a regression type point of view. What I was looking at here was how do I get a sense of what the neighborhood means, you know, like that one block is better than another block.
And for that, I'm going to use an algorithm we talked about a long time ago called k-nearest neighbors, where if you want to approximate something, you look at the average of its K closest houses. So then maybe the market price of a neighborhood kind of summarizes the value of that neighborhood. We don't have to worry about all the features, like if there's good restaurants or other things we care about. - But I do care about those things. - Since I can't easily measure those, we'd like to have a way of sort of approximating the cost or the price of the house. - So you want us to exclude it from this conversation? - Well, rather than exclude it, can we measure it by proxy? Can we know that, like, oh, this neighborhood has very expensive houses, that means people must like all the restaurants around here, they must think this is a nice place to live. - I mean, that's generally the trend in LA, yeah. - Yeah. So that's kind of the intuition I'm going to rely on. Because really what I'm trying to do also is predict the fair market value, which is different from our private valuation. Like what if we found a house that was pretty cheap, and it was walking distance to your 10 favorite restaurants in the whole world? Those wouldn't necessarily be someone else's favorite restaurants. So maybe we would value it more than the market would. - Yeah, but it would only be valuable for maybe a couple thousand or a couple hundred, really depends. - Yeah, that sort of proximity stuff, well, I don't know, that can influence it a lot, like maybe on a block by block basis. That's in fact a good way we can tie this into the topic. Moving two blocks away, you know, being two blocks further from all these attractions, that's not going to change much. But on the scale of, like, maybe miles, neighborhoods change quite a bit. - Yeah. - We don't want to look too far away to find comparisons. So this K in the k-nearest neighbors, what you want to do is pick a number K and say, like, well, I want the K closest items, and use those as my basis of comparison. Let's talk about the type of K we might want to pick. What if we pick a really, really big K? We want to compare the value of a certain home to the thousand nearest homes to it. - A thousand? - Yeah. - So if you look at each neighborhood in LA, the average density, I think, per mile is like 10,000 persons. So that's pretty high. - Yeah. - And so I don't recommend that. Actually, I don't think that's right. One mile, 10,000, is that really right? - I actually have a great idea. - But anyways, there was an average density of LA. And so if you're including, let's say, a thousand houses, that means something like 40,000 persons, and that's really too high, 'cause the neighborhoods in LA are so dense and they change so frequently. - All right. So you're telling me I can't pick a really large K. What if I just go the other direction and pick a small K? I'm going to take the two closest houses and I'm just going to use those as my price comparisons. - Well, there's not enough data, but I have a question. Can you do like a bullseye treatment, where there's tier one houses, where you're checking within like a one block radius, then there's a tier two, where it's, okay, within a five block radius, and there's tier three, within a one mile radius? - So that's interesting. And I'm not going to say I'm not going to do it, but it's not the topic of this episode.
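To make the k-nearest-neighbors idea concrete, here is a minimal Python sketch of what Kyle is describing: estimate a house's price as the average price of its K geographically closest sales. The coordinates and prices below are hypothetical, not from his actual LA dataset, and straight-line distance on latitude/longitude is a rough stand-in for real travel distance.

    import numpy as np

    def knn_price_estimate(target, latlons, prices, k=3):
        # Distances from the target location to every recorded sale.
        dists = np.linalg.norm(
            np.asarray(latlons, dtype=float) - np.asarray(target, dtype=float), axis=1)
        nearest = np.argsort(dists)[:k]  # indices of the k closest sales
        return float(np.asarray(prices, dtype=float)[nearest].mean())

    # Hypothetical sales: three nearby, one far away across town.
    sales = [(34.050, -118.250), (34.060, -118.240),
             (34.050, -118.260), (34.200, -118.400)]
    prices = [650_000, 700_000, 675_000, 1_200_000]
    print(knn_price_estimate((34.051, -118.251), sales, prices, k=3))  # 675000.0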
Actually, you know, that is a nice method that data scientists don't talk about enough. - 'Cause I kind of want to see a spread of, like, further and further out, how it varies, and how then from there we pick the most accurate one. - Maybe that's why we don't talk about it much, because when you have this so-called bullseye method, what you end up doing is providing a lot of numbers, and people are like, cool, let me look at the numbers, and it doesn't really drive you to a decision. You know, you need a summary statistic that gives you the information you need to empower a choice. So maybe just one number, it's easier to be like, that's the number, that's my, you know, arithmetic mean, I'm going to run off that. Now, what's interesting about the bullseye is it kind of captures the gradient, right? - Yeah. - Yeah. - So you could see how the price is shifting away from you or closer, like a heat map of, hmm, we're kind of in a cheap area, or we're in an expensive area. It gives you an idea of that neighborhood's story. - Well, the challenge becomes, then, how do you set the parameters of the bullseye? - Well, it could be on population density, maybe. - Well, it could, but now it's getting to be a complicated thing here, and ultimately some new parameters are going to show up, and you'd like to minimize the number of parameters you have to pick. And in classic K, there's basically these two main K algorithms. There's k-nearest neighbors that we're sort of talking about, and there's k-means clustering that we talked about a long, long time ago also. And in both of those, even though they're both named K, the K's don't really mean the same thing, they're just a parameter one has to pick. Why are you smiling? - 'Cause I was just thinking, my name starts with L, and I'm like, why don't they call it L? - 'Cause L looks too much like a one when you write it sloppy. It's confusing. - Anyway, moving on. - You don't want too many free parameters, but how do you pick the right parameter? So there's this technique called the elbow method, and I actually thought I was going to get really deep into the elbow method, and I did some work I'm going to share with you in a moment, and what I'm doing is not technically the elbow method. So let's just talk about the elbow method real quick. The elbow method, if we think of clustering, is about how many clusters you're trying to make. So as that goes up, typically you have this monotonically increasing curve where, if you add a cluster, you get better accuracy, but there's a trade-off. So you want to look for the point where, if I increase this further, I don't get a lot of return on that. But the elbow method is also a very soft sort of method. I haven't found any papers that have a rigorous definition of how to do it, that use, like, the moments of the function or anything like that. It's just sort of like, well, go eyeball it, look at the visual. Another way that people like to do this, whose background we won't get into, is by saying you take the square root of the number of features and set that to your K. That's a bit more of an unsupervised type method, and that does have some theoretical underpinning from what I understand, but I'm not prepared to talk through all the details of that now. Just a minute ago, we agreed that you don't want a K of 1,000, right? If we're going to look at the k-nearest neighbors for a house. And you also said two was too few, right? How many do you think you would compare it to? - Like 100.
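For the clustering flavor of the elbow method described here, a minimal sketch, assuming scikit-learn is available: sweep K, record the within-cluster sum of squares (k-means "inertia", the decreasing counterpart of the percent-of-variance-explained curve discussed later in the episode), and eyeball where the improvement levels off. The synthetic data is hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical 2-D data drawn around four loose centers.
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                   for c in [(0, 0), (5, 5), (0, 5), (5, 0)]])

    # Inertia falls as K grows; the "elbow" is where the drop levels off.
    for k in range(1, 10):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(model.inertia_, 1))
    # There is no rigorous rule here: you plot (or print) these values
    # and eyeball the bend, which for this data should land near K = 4.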
- 100, well, that might even get a couple blocks away. Well, maybe that's not too bad. So I didn't do up to 100, and I know-- - What, lacking in data, Data King? - Yes, I am lacking in data. We're going to talk about that at the end here. But this is what I had off of the sample I did. If anyone wants to see this graph, they can go to the show notes. This is a graph. Linda, can you describe it? - It is a bar chart, bar graph. - Yep. - Where it looks like X is the value of the house. - X is the K value I chose, from one to about 25, I think, and the Y is the price error. - Oh, okay, so the Y is the price and the X is the K value. - So what happens as I increase K? - Well, the price goes up. - Well, the error goes up. That's the dollars of error that-- - Oh, that's dollars of error. You really should label those. - I did. - Where is the Y label? - Well, it's in the title, it's a mean squared error. This compares the average error I make. So what the error is, is I take a house, and then I say, okay, find me the K nearest houses, and see how much its price differs from those. And the sum of the squared errors is the value I report there. - So higher means more error. - Yes, you'd like to minimize that. You wanna pick a K so that you get the minimum amount of error. You properly balance out having a good sample, but not so many that it stops being representative. - So from looking at this graph, what is your recommendation? - Well, by this data set, three. K equals three. - Three houses? - Yeah, that's the minimum point. - Well, okay, why so few? - I don't have every house. I have some sprinkling of houses that I was able to find with this unofficial API that's on one of the City of L.A.'s websites. I have about 5,000 houses in my data set, so you can imagine they're probably spread out, or they are spread out. And so I don't have a lot of neighbors. It's not like I have every house on the block. So this is more for demonstration purposes. - So I would just say this method does not help us, 'cause that error is like the value of the house. That does not sound like this is helping us. - It was not. And the reason it's not is because I have a limited data set. - You never told me, I don't know. - Haven't you been listening to these last couple episodes? I've been complaining. - Well, complaining again. - There is not ready access to high-quality, sufficient, complete real estate transaction data, as far as I can tell, anywhere in the United States, despite this being publicly available information. So for those of you who haven't heard yet, I've kicked off this little community data science project. If you want to get involved, or if you just want to lurk, you can go over to dataskeptic.com and click on home sales, and you can email me from there to get on our Slack channel. So on the Slack channel, myself and anyone who wants to be involved, we're going to try and solve this problem and liberate as much of this data as we can, so that people like me can do analysis on it. - Well, that's disappointing. It sounds like you did a lot of work, and then the results were kind of-- - Should I put a sad trombone in here? - Wah-wah. - Yeah, sorry I interrupted you. - I don't know, I just thought you could have started the episode by saying that. (laughs) - What, that it didn't help? - Yeah. - Well, you know, this is actually a common story in data science.
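The evaluation Kyle describes, choosing K by the error it produces rather than by eyeballing a curve, can be sketched as a leave-one-out sweep: for each house, predict its price from its K nearest other houses and average the squared errors. The coordinates and prices below are hypothetical stand-ins for his roughly 5,000-house sample.

    import numpy as np

    def mse_for_k(latlons, prices, k):
        """Leave-one-out: predict each house's price from its k nearest
        other houses and return the mean squared error (dollars^2)."""
        latlons = np.asarray(latlons, dtype=float)
        prices = np.asarray(prices, dtype=float)
        errs = []
        for i in range(len(prices)):
            dists = np.linalg.norm(latlons - latlons[i], axis=1)
            dists[i] = np.inf                 # a house can't be its own comp
            nearest = np.argsort(dists)[:k]
            errs.append((prices[nearest].mean() - prices[i]) ** 2)
        return float(np.mean(errs))

    rng = np.random.default_rng(1)
    latlons = rng.uniform(0.0, 1.0, size=(200, 2))
    # Hypothetical prices that drift smoothly with location, plus noise.
    prices = 500_000 + 400_000 * latlons[:, 0] + rng.normal(0.0, 30_000.0, 200)

    for k in range(1, 26):                    # sweep k like the bar chart
        print(k, round(mse_for_k(latlons, prices, k)))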
A lot of times you'll chase something like this and get to the end and realize that there was an assumption that didn't hold up, or a methodological problem. And in my case, I have a good method, and I would have used the elbow method, but I'm limited on data. So hopefully my little home sales project can help us fix some of that, but let's wrap up by revisiting the elbow method just a little bit. So let's just go a boring route and look at the image of the elbow plot that one can find on Wikipedia. What do you see here, Linda? - The Y axis says percent of variance explained, and it goes from zero to 100%, and then the X axis says number of clusters, which is from one to nine. - Yeah, so we like it when variance is explained. Like we talked about last week in multiple regression, the more variance I can explain, the more I understand why a house costs what it costs. So in this case, it's a similar idea. And what do you see happening as the clusters, as the K, increase? - Seems to be leveling off. - Yeah, it's leveling off, exactly. That means that as you increase K, you don't get as much return on your investment. So you wanna find someplace where there's an even trade-off, and you can tell which one the author of this plot has chosen, right? 'Cause he circled it, or she circled it. - He or she circled number four. - The elbow method is all about looking at a plot like this that is increasing but leveling off, and picking what might appear to be a good, like, joint, just like your elbow bending, where you don't get as much return on your investment. So not all plots have these. I would even argue this canonical example doesn't have a clear explanation as to why that's the elbow, or the knee, as it's sometimes called. That's what the elbow method is. And in principle, I find it to be not that analytical, but the basic idea is you look at your plot and you pick a point that seems justifiable based on it being a turning point in the series. But regardless, the elbow method is quite popular. You should at least know about it, even if, like me, you're not that impressed with it. So that's why we did this mini-episode. Join us in our Slack channel. And until next time, I want to thank you, Linda, for being here. - Thank you. - And remind everyone to keep thinking skeptically of and with data. - For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)