Archive FM

Data Skeptic

[MINI] R-squared

Duration:
13m
Broadcast on:
04 Mar 2016
Audio Format:
other

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy, it's a measure of the power of one's model.
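As a rough sketch of that definition in code (the sale prices and model predictions below are invented for illustration, not from the episode), R-squared can be computed as one minus the unexplained variance over the total variance:

# R-squared: explained variance over total variance, or equivalently
# 1 - (unexplained variance / total variance). Toy numbers only.
prices      = [310, 450, 520, 610, 700]   # observed sale prices (thousands)
predictions = [330, 430, 500, 640, 680]   # what some hypothetical model predicted

mean_price = sum(prices) / len(prices)
total_var       = sum((y - mean_price) ** 2 for y in prices)
unexplained_var = sum((y - p) ** 2 for y, p in zip(prices, predictions))

r_squared = 1 - unexplained_var / total_var
print(r_squared)  # close to 1.0: the model explains most of the variance; close to 0: very little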

(upbeat music) - Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism. Our topic for today is R squared. (upbeat music) - Linda, you're aware that I'm building this model, right? We talked about multiple regression last week. I'm hoping to do what? Or do you even know? What do you think I'm up to? - You should summarize what you're hoping to do, 'cause I don't know. I'm just waiting for us to enter a number in so we could buy a house. - All right. - That is what I'm waiting for. - So then the new objective of these mini episodes is to help explain what I'm doing to you and why. We wanna be able to know a good price for a house, right? - Yes, we wanna know what's a good investment. - Why is that information valuable to us? - Because we are spending a lot of money and we could lose a lot of money. - If we choose badly. - We overpay. - So overpaying is obviously really bad. What about underpaying? That's pretty bad too. - That's good, in our favor. - Yeah, good for us, bad for the seller. So we would definitely wanna buy quickly, maybe, if we thought we found some sucker. - Yeah, well the odds are it's probably competitive. - Yeah, so in fact that's what an economist will say, that's how markets are supposed to work, right? Even if someone is not very smart and they would undersell, there would be more buyers and the market would correct. I don't wanna get into economics, but that's kind of the argument there. We would like, for our benefit, to have a good understanding, like, okay, what does this house actually seem to be worth? And I want us to help build a model that tells us that, so that when we make an offer we do so intelligently and we don't risk losing our money, or maybe we even find that we got a deal 'cause we were smart, who knows? But at the least, I wanna pay the fair price, right? - Yeah, I just wanna add a footnote that there is a cost to doing all this, which is time, right? We could be losing out on all these houses, all these deals, the price of houses can go up. So I just wanna know how much you think we're gonna save. I hope it's worth the time that we're wasting by not doing anything. - Well, what if we're enjoying that time doing a data science project? - Enjoyment? I don't think house buying should be an enjoyment. This is a job that we should get done. - All right, so you wanna accelerate the process? - Well, there's a risk if we go down and model everything. There's a risk that housing prices can go up. So by the time you do your little analysis, you could be like, hmm, those nice housing prices went up $20,000. All right, is it worth $20,000? I just wanna throw it out there. - So I feel like you're making an excellent point, but it's not a podcast point. It's a Linda-and-Kyle-talking-after-the-show point. - Yeah, Kyle wants to take this offline, but I just wanna call out as a project manager that you guys always wanna do your projects and you're not thinking about what the overall consequence is. - All right, so that may be fair. And let's actually talk about that afterwards, 'cause this is a mini episode. I don't want this to go on for two hours. Or do you-- - Yeah, no. - I'm down for that. I'm trying to look out for your time. - So we can talk about this offline. - Okay, assuming I'm doing this model, and I came to you and I told you, like, hey Linda, every house that's sold in these neighborhoods in the last two years, I've been able to predict the price exactly.
- I don't know, that sounds amazing to me. - It sounds almost unbelievable, right? - I don't know what's believable or not. - Well, how could I precisely predict the price? I mean, you could get in the neighborhood, naturally, right? Get close to a good prediction. But to know exactly, down to the dollar, what people paid for it? Like, I didn't know, like, if there were multiple bidders, for example; that would make a price go up. You don't have that information anywhere, that's not recorded. - So what you're telling me is that it's not possible. - Not to get it exactly. You can't build a model like that, 'cause that would be an overfit model. You have to get a model that's as close as is plausible. Does that, do you know what I mean by plausible? - Believable. - Yeah. So how about this? What if I said, hey Linda, every time I want to price a house, I go and I get a bunch of 10-sided dice, and I throw them on the table and arrange them in order, and I say that's what the price is. - I mean, to me, it doesn't sound linked to the real world. - That's right. So in fact, there's absolutely no relationship between those dice rolls and the price of the house. So we'd like to have some measurement that would tell us when things are like that, or even very close to it. Maybe score it as zero, being like, this measurement has zero value. But if the measurement's perfect, maybe we'd call that, like, 100% or 1.0 value. Okay, so it's our ability to assess how good of a model we built. That's what the R squared statistic does. It's the ratio of the explained variance to the total variance. So every system has variance, right? Not every house sells for the same price. They all sell for different prices. And there's a certain number of reasons that explain why they sell for different prices. Some of those are obvious things, like how many bedrooms they have, or if there's a pool. But there's also non-obvious things, like, are the walls covered in gross-looking stucco, and that would drive the price down. And you don't necessarily know that about a property. You don't even know things like whether there was more than one bidder, or if a person just was in love with the architecture; you can never learn all these things. All of the things put together are the total variance of a system. But you're able to observe some of them. I can find out data about a house. And the more data I find, the better a model I can build. And if I build a really, really good model, then I would say I explain a lot of the variance. So my model tells you a price, and it gets it pretty close because it knows all these factors, like the number of bedrooms, bathrooms, square footage, neighborhood, stuff like that. So you would like your model to be able to really describe the data you have available. But it can't describe it perfectly in every system. So like in buying a house, you might not know certain things. Like I was saying, you might not know that, oh, the people that bought it fell in love with the architecture, so they were willing to pay an extra 10 or 15,000 or whatever, because it had just the right beams or whatever the way these people like it, maybe where they grew up or something like that. So the total variance is all the things that would make a house arrive at a price. And if your model is good, you can learn some of those things and you can explain some of the price. The ratio of what you can explain to the total variance is called the R squared, or coefficient of determination.
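To make the bedrooms, bathrooms, and square footage idea concrete, here is a minimal sketch of the kind of multiple regression Kyle describes, written in Python with scikit-learn; the features, prices, and column choices are assumptions invented for illustration, not real listings or the actual model from the show.

# Fit a toy multiple regression on a few observable house features, then ask
# how much of the price variance those features explain (R-squared).
from sklearn.linear_model import LinearRegression

# columns: bedrooms, bathrooms, square feet (all invented)
X = [[2, 1, 900], [3, 2, 1400], [3, 2, 1600], [4, 3, 2200], [5, 3, 2600]]
y = [310_000, 425_000, 455_000, 600_000, 690_000]  # sale prices (invented)

model = LinearRegression().fit(X, y)
# score() returns R-squared: the share of price variance the features explain.
print(model.score(X, y))

One caveat: scoring on the same houses the model was fit to flatters it; the "overfit model" warned about above would score a perfect 1.0 on its own training data while telling you little about the next house on the market.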
You like baked goods, don't you? - Yeah. - What are some of your favorites? - Pastries, anything with a flaky... so, like, flaky, buttery. - What if I told you, hey Linda, I have a secret formula for how to make a good pastry: you take the temperature it's baked at, the amount of time you bake it, and the amount of cinnamon it has in it. And with those three pieces of information, I know how good it's going to be. - Do you mean just the flaky shell? How good it's going to be, or the whole end result? - Because if you took the amount of baking time, what you cooked it at, and whatever, those are scientific things. The pastry chef has to follow it to a T to make the exact perfect thing. That is part of the variance of the system. I'm not able to observe if this person is a skilled pastry chef and if they executed their craft completely. - I mean, I don't really know. I'm not a pastry chef, but listen, I think people, there, you have certain techniques for how you're supposed to fold in the butter, the flour, whether it should be room temperature or not. There's things where you have to beat cream or egg whites till they become stiff. - Yeah, yeah, good point. - All of these things matter. - You're exactly right. And this is what I would call unexplainable variance. Maybe a master pastry chef could observe someone and they could score it, but mostly we just can't know. Like, did they fold the egg whites really well, or only quickly and lazily or something? That's going to affect the output of the cake. And since we can't observe it, it doesn't help explain the final result for us. It's part of the unobservable variation. So this R squared, therefore, it's not a measure of how accurate your model is. It's not about, oh, your model is really perfect, it got 1.0. It's about how much of the variance your model explains, and a good value for R squared really depends on the situation. If you're trying to model something where you don't have all the information, like, you know, baking a cake or buying a house, you don't expect the perfect model, but it has to be good enough that it's useful. And that's what R squared is all about. It's essentially an evaluation for that. - So what's a good R squared number? - It depends on the problem. You want me to give you the number for the real estate problem, don't you? For a good R squared there, I'd really have to get back to you on the answer, but I'm leaning towards, like, 0.7. That would mean I could explain 70% of the variance, at least. Hopefully higher, probably higher actually. - That doesn't sound very good, actually, 70%. - I know it's not in the 90s. There's just some variables we won't be able to observe. - So we're waiting for a model that's, let's say, 80%. And it has an R squared value of 80%. - That would be good, yeah. - So that means 80% of them are within a range of accuracy, and that's acceptable. - Sort of. What you just described is not exactly what it is, but the answer to the question you're trying to ask is yes. - So what is R squared? - R squared is the percentage of the variance that your model explains. - Okay.
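A small sketch of that last distinction, using the same formula as the sketch near the top and, again, invented numbers: R-squared is the share of the price variance a model explains, which is not the same thing as the share of houses it predicts within some tolerance.

# R-squared vs. "fraction of predictions within 10%": two different statistics.
actual    = [300, 400, 500, 600, 700]   # observed sale prices (thousands, invented)
predicted = [350, 390, 480, 640, 660]   # a hypothetical model's predictions

mean_price = sum(actual) / len(actual)
ss_total    = sum((y - mean_price) ** 2 for y in actual)
ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))

r_squared = 1 - ss_residual / ss_total
within_10_percent = sum(abs(y - p) / y <= 0.10 for y, p in zip(actual, predicted)) / len(actual)

print(r_squared)          # share of price variance the model explains
print(within_10_percent)  # share of houses predicted within 10%, a different statistic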
So, quick correction. Listener Bob, and Bob's a friend of mine from LA, wrote in about when I claimed that in Kernighan and Ritchie's C book, the index entry for recursion also lists the page the entry itself appears on, which is quite a good computer science joke. Bob corrected me; he said he has the first edition of that book, so good for you, collector, and it does not actually appear there. My copy did not make the move out to LA from Chicago, so I could not confirm, but I'm certain I saw it. So I must have had a second or later edition. Let's assume I had the second, in which case, in Kernighan and Ritchie's defense, I would say maybe the first edition is the base case. I don't know, just putting that out there. But in any event, that's my second most favorite correction. Do you know what the most corrected thing on the podcast is, Linda? - No. - By, like, a long shot, this is 10 times more than anything else, the correction I've gotten. - What? - That one time when I was talking about The Devil Wears Prada, I called one of the main actresses Meryl Streep, and it's actually Glenn Close, or vice versa, I still don't know which one. - Oh, interesting. - But in my defense, pull up two pictures of these women. They look startlingly similar. - Oh, yeah, now that you said that. - Right. - Wow, I'm sad I didn't catch that. (laughing) - Until next time, Bob, and any other corrections, feel free to write in. Join us in our Slack channel, and until next time, I wanna thank you, Linda, for being here. - Thank you. - And remind everyone to keep thinking skeptically of and with data. - So, three quick announcements before I let you go. Berkeley, California, Thursday, March 10th, 2016: I'm going to be speaking in Berkeley at the La Pena Community Center. The title of my talk is A Skeptic's Perspective on Artificial Intelligence. I'm gonna be hosted by the Bay Area Skeptics, so please come out and see me there if you're around. It starts at 7:30, and the details are at dataskeptic.com or at the Bay Area Skeptics site. The very next day, Mountain View, California, Friday, March 11th at 6:30: Silicon Valley Data Science has been kind enough to host me. I'm going to be giving a talk called Clustering: Beyond K-Means. It's a very data-centric talk; the Thursday talk is a very skeptically oriented talk. Come see me at both, one, or neither. For the second event, the Friday event at Silicon Valley Data Science, there is a ticketing system on Eventbrite; I know that's filling up fast. I think there should still be tickets left as this airs, but head on over, the link is at dataskeptic.com, and go sign up and grab yourself a ticket. The Thursday event is unticketed; I think it's sort of first come, first served on seating. So I hope to see anybody from the Bay Area come out there. Please say hello if you do. Lastly, as I did last week, I wanted to give a shout out to one of my favorite podcasts, Relatively Prime. In fact, I gotta reach out to Sam and see if he'll come on again next week for a final announcement while his Kickstarter's going. Data Skeptic does not accept any sponsorships or advertising at this time, and this is not really an advertisement; I just love the show and I want to help support it. I gave my contribution and we're not there yet, so I hope some of my listeners, who I think might really enjoy a good show about the mathematical domain, go check out Relatively Prime if you haven't already, and please support that Kickstarter, because I want it to succeed, and that's why I'm talking about it here. And since I don't, at this time, accept any support for Data Skeptic, maybe if you would be inclined to do so, turn your attention towards Relatively Prime, or maybe one of the other great data-related podcasts that does accept donations and whatnot. Anyway, until next time, I hope to see you guys out in the Bay Area next week. Other than that, we'll see you next Friday. - For more on this episode, visit dataskeptic.com.
If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)