Data Skeptic

[MINI] Multiple Regression

Duration:
18m
Broadcast on:
19 Feb 2016
Audio Format:
other

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.

Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.

The site Redfin graciously allows users to download a CSV of the results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.
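If you want to follow along with your own extract, a minimal sketch of loading such a CSV with the Python standard library might look like the following. The column names (`beds`, `baths`, `sqft`, `list_price`, `last_sale_price`) and the sample rows are assumptions made up for illustration; they are not Redfin's actual export format, so adjust them to whatever headers your download contains.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded extract. The column
# names and values are invented for illustration, not taken from Redfin.
SAMPLE_CSV = """beds,baths,sqft,list_price,last_sale_price
2,1,1100,600000,615000
3,2,1500,750000,740000
1,1,,480000,500000
"""


def to_number(text):
    """Convert a CSV cell to float, or None when the cell is blank."""
    text = text.strip()
    return float(text) if text else None


def load_listings(handle):
    """Read listings and drop any row that has a missing numeric field."""
    rows = []
    for record in csv.DictReader(handle):
        parsed = {key: to_number(value) for key, value in record.items()}
        if all(value is not None for value in parsed.values()):
            rows.append(parsed)
    return rows


# With a real file, open it instead of wrapping a string in StringIO.
listings = load_listings(io.StringIO(SAMPLE_CSV))
print(len(listings), "usable listings")
```

Dropping incomplete rows is only one way to handle missing values; it keeps the sketch short but throws data away.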

(upbeat music) - Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism, usually in 10 minutes or less. - Our topic for today is multiple regressions. - So, Linda, we're hoping to buy something, huh? - We are trying to buy a house. - Yeah. - So everyone out there, please don't buy the house we wanna buy. - Or you could solicit them to sell us a really nice house. - Oh yeah, if you have a great house in Mid City. - Yeah, Mid City, Los Angeles. That's your favorite neighborhood or no? - It's good for our budget. - And like for where we work and where we hang out and stuff like that. - Yeah. - What do we do, just flip a coin, throw a dart at a map, buy any old house? - Well, I've lived in LA for almost a decade. So that's a long time. So I'm familiar with all the neighborhoods, and we work on the West side, and then we're familiar with what neighborhoods we hang out in during weekends. And then we're also kind of trying to see which neighborhoods are up and coming. And then based on that, we also had our loan pre-approved. So now we have a budget. - Yep, so we got a budget. How do we know how to make most use of it? - Yeah, so we take into account where we live, where we hang out, what our budget can buy us and where. So we've narrowed it down that, at least for me, I don't want a condo, I want a house. - Eventually, we're going to open houses, we're seeing places, we're going to find something we're going to want to make an offer on, right? - I hope so, end of this year. - How much should we offer to pay for the house? - Well, that's what you're working on, Kyle. - But if you didn't have me, how would you decide what to pay for the house or what to offer for it? - Well, I would look at the neighborhood, the crime rates, I would take into account also where we live, the potential of the neighborhood in the future, how big it is, the property, the land. - How many bedrooms, how many bathrooms.
- But we would also take into account what the last price was, what the last buyer paid for it. - Well, why is that important? - Well, because if a month ago someone paid half, then they flipped it and now say it's worth twice as much. - Well, yeah, what if they put in, what if they doubled the value in their improvements? Or what if they found gold in the basement? The house is worth its current value, they shouldn't worry about what somebody else got it for. - You don't think so? - Good on them. - So even if like a week ago they got it for this amount and then they're trying to sell it for twice as much. - Yeah, I mean, I can see what you're saying, you're like, oh, it's not fair, but I actually find this argument really annoying and I often don't get along with realtors because of this, 'cause the house is worth what it's currently worth. I don't care, what if they paid too much for it and they're gonna lose money by selling it to me? - Well, all I wanna know is like, what's the market price? So I mean-- - Exactly, yeah, the market price. - The market price, I'm hoping, didn't change so much in a month-- - Okay, so that's fair, yeah. You could always say that maybe that person, whoever they bought it from, like just had an emergency and they sold it at a discount and that person was in the right place at the right time. They don't need to share the profit of being lucky with us. - Oh, I don't have a problem if they wanna sell it for less. (laughs) - Right. - It's the higher price that I want them to justify. - Yeah, and in general real estate appreciates, so we're probably expecting to pay more than the last transaction, so there's also a question of how much more. So there's lots of really interesting data science problems in this whole topic. We could do some time series analysis, we could do some geographic analysis, we could even get into auction theory, talk about one-sided incomplete equilibrium auctions and all types of good stuff like that.
So I actually think the next series of mini-episodes are all gonna be around this housing topic, but for today, let's talk about multiple regressions and what that is. - Do you remember the formula from your math days? Y equals mx plus b. - No. - You never had the TI calculator, where you had to put in the y equals mx plus b? - I did have a TI calculator. - What color was it? - Dark gray. - 83. - TI 83, yeah well. (laughs) - You would've had an 83, you were a little younger than me, I had a TI 82. - Wait, now that I think about it, could it have been an 86? - Maybe, I don't remember, I was too old for that. The 81 was garbage, I remember the 82. It might've been an 85, actually, now that I think about it. I'd have to look it up. - Anyway, those were graphing calculators, so you could look at different mathematical functions. So y equals mx plus b, this is a formula, a simple formula for a linear equation. We have a bunch of data, right? So maybe we have things we know about a house, let's just say its square footage, right? And we know what it last sold for. Do you think there's a relationship there? - Between what it last sold for and square footage? - Yeah. - I mean, I was thinking about that at some point, more rooms doesn't equal more price. - No, that's true, it's probably not linear. - Or it doesn't come back in reward for the same cost. - Yeah, or like one room could be like useful, and another could be like an awkward J shape, that's just like a weird workaround or something. - Sure. - But, you know, so that's what we would call a little bit of variance in the data, that two houses that are the same size, let's say 1500 square feet, one could just be laid out way better than the other one. You know, one could have like no hallways, while the other one has a bunch of hallways, which are kind of like wasted space. So it's not true that we can break it down like, square footage, you know, it's $5 a square foot in this city and it's $6 in some other city.
There's always a variance, right? Because it's how they use the footage, where they are on the street, there's all these other values that we have to consider too. - Well, design matters. - Yeah, design matters. If it were as simple as just square footage and you paid a certain dollar amount for that, then you might use the formula y equals mx plus b, also if it was linear, where m is the slope, that is, how many dollars per square foot you'd pay, and b is the y-intercept, which is the base price of any house. Every house costs at least a certain amount of money. What do you think that certain amount of money is, as a base price? No way? - In all neighborhoods? - I guess so, yeah. - That's really quite a range. - Yeah, it is. So, we know that it's more than just square footage, though, that we want to account for. We want to take into account how many bedrooms, how many bathrooms, whether or not there's parking, what else? What are the major important features? Like you said, neighborhood, that one's a little abstract. - Are there any problems? Are there termites? What's the plumbing? - Yeah, last inspection kind of stuff. - Is the house falling apart? - Right, yeah, how much work does it need? - What's the foundation? - Yeah, so all these things go into valuing a house. Now, we can't necessarily get all these parameters, can we? Not without maybe going there and doing an inspection, but we're not going to do an inspection on every house in LA. So hopefully, we can get a large data set, and we can run multiple regressions to find out the relationship between the variables we can observe and the price of the house. So I started down that path. So quick announcement, actually, I'm gonna talk more about this at the end, but I'm actually very unhappy with the access I have to transactional real estate data. So I'm starting a little project. It's a community project. Everyone's invited to participate.
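The y = mx + b model described above, with square footage as x and sale price as y, can be fit by ordinary least squares. What follows is a small sketch using only the Python standard library; the `fit_line` helper and the five made-up houses are illustrative assumptions, not data or code from the episode's notebook. The toy prices were generated from a $200,000 base price plus $300 per square foot, so the fit should recover roughly those two numbers.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


# Made-up houses: price = 200000 + 300 * sqft exactly, with no noise.
sqft = [1000, 1200, 1500, 1800, 2000]
price = [500000, 560000, 650000, 740000, 800000]

m, b = fit_line(sqft, price)
print("dollars per square foot:", round(m, 2))
print("base price:", round(b, 2))
```

Real listings will not sit exactly on a line, which is the variance discussed above; the fit then gives the line that minimizes the squared vertical distances to the points.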
We're gonna try and liberate some of this data and do interesting work on data science along the way. And I've kicked that off tonight. We're gonna do a visual part. It's an audio podcast. But I'll rely on Linda here to describe a notebook I'm about to show her. Okay, so here's what I did, Linda. I went over to Redfin. And if anyone wants to see what I'm doing, all this is available in the show notes. You can follow along. But also, if you just wanna listen, that should be okay too. One of the first things I looked at, so when you get a new data set, you wanna do data exploration. You just wanna find out what you can learn from it. Anything unexpected. Look at the summary statistics, stuff like that. So here, I've plotted out, this is just more of a curiosity. I wanted to know, what is the ratio of the last sale price to the listing price? In other words, are people offering more or less than what the asking price is on the market? - Oh, well, if you're gonna offer to buy, you'd probably offer less. Well, actually, that's not true. If it's really competitive, they might offer more to try and win the bid. It depends if other people are bidding. - There are a lot of variables. So I was just like, well, what happens in our neighborhood? So I got some data for where we live. This is what I found out. Can you describe, or first of all, do you know how to read this? - Nope. - It's binned by like 1% groups. So I divided the sale price by the asking price. So if it's one, that means people offered exactly what they asked. Someone said, please buy my house for 300,000. Someone said, sure, here's 300,000. If someone said, my house is on sale for 300,000, and somebody else says, that's nice, I'm only gonna pay you 200,000, that would be about .67, right? That's 2/3 of the asking price. What's more common, to offer more or less? - I don't know, they seem almost equal. - Yeah, it is pretty evenly distributed.
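The sale-to-asking ratio and the 1% grouping described above can be sketched as follows. The `ratio_bins` helper and the four example transactions are illustrative assumptions, not the notebook's data. Each bin is labeled by the whole percent of the asking price that the sale reached, rounding down, and integer arithmetic is used so that a sale at exactly the asking price always lands in bin 100.

```python
from collections import Counter


def ratio_bins(pairs):
    """Count transactions per 1% bin of sale price / asking price.

    `pairs` holds (sale_price, asking_price) tuples in whole dollars.
    Bin 100 means the sale was between 100% and 101% of asking.
    """
    counts = Counter()
    for sale, asking in pairs:
        percent = (sale * 100) // asking  # floor of the ratio, in percent
        counts[percent] += 1
    return counts


# Made-up transactions, including the two examples from the conversation.
transactions = [
    (300000, 300000),  # paid exactly the asking price
    (200000, 300000),  # paid two thirds of the asking price
    (315000, 300000),  # paid 5% over asking
    (606000, 600000),  # paid 1% over asking
]

bins = ratio_bins(transactions)
for percent in sorted(bins):
    print(f"{percent}% of asking: {bins[percent]} sale(s)")
```

Plotting these counts as a bar chart gives the kind of histogram being described; bins far from 100 are where to look for outliers.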
But going above and beyond, there must be some really competitive houses, because there's some where they offer significantly more. But in terms of submitting a bid, people just don't submit a bid lower than a certain percentage. - Because they're like, oh, they're gonna turn me down. - Yeah, but there are some outliers here. So we might have to account for those in our data, or do something, because outliers can skew our results. So that's partially why we're looking at stuff like this. OK, next we move on, and this is the most important thing. Before you do any regression, you explore your features. You understand your data. So in this next graph that you're gonna describe in a minute, I want to know, OK, how does the number of bedrooms affect the sale price? So I did a scatter plot. Looking at this, what can you tell me? Have you learned anything about how much the value of two bedrooms is over one bedroom? - Well, it looks like the average price of a one bedroom, for what neighborhood? - This was Mid City. - OK, for Mid City it's 500k for one bedroom. And then two bedroom, it looks like around 650. It's the average price. - Yeah, so an extra $150k. - But they have quite a range. - Yeah, yeah, quite a range. - We're going for two bedroom, 25% of the houses in Mid City in our price range have two bedrooms. - Yeah, it's a good way of looking at it. Like, this line is kind of like our off-limits. So could we get a three bedroom place? - I mean, we can, but it looks like it's only-- it looks like it's only, I don't know, 1% with three bedroom places. - So the minority of cases and the cheapest of transactions. So maybe those are, I don't know, not nice places or something. Also, you should know, we're going to come back and talk about whether or not my data set is good in a little bit. Because that's actually the main theme of the end of this podcast. But let's move on. Bathrooms. On the left includes all of like the half baths, three quarter baths. - How are half bathrooms accounted for?
I'm confused. - So I just took whatever was in the listing. So all of the two baths got grouped together to make the median plots. And all of the two and a half baths in this column got grouped together and so on and so forth. So what have you learned about the price of an additional bathroom based on this plot? - I mean, I'm having problems reading this. It's all over the map. - Yeah, that's my point. It is all over the map. You can't really read it. So something's funky because intuitively we know having more bathrooms must add value to the home. Maybe not a lot. If you've got nine bathrooms, you add the 10th. That's not necessarily like a smart investment per se. But going from one bathroom to two bathrooms should up the price of the house, wouldn't you say? - I mean, it depends how the bathroom is done, but sure. - Sure, but on average. And we don't see that here. So something's fishy. So I actually have to delve more into this issue and find out what's going on. But what I noticed at first, my instinct was, all right, let me just take what's called the floor of the bathrooms, round it down. I'm not going to count the half baths. Now they're worth something, but I'm just going to ignore them to try and get a better picture. And that's what you see on the right. - What does that mean, zero bathrooms? House goes for almost a million. - Yeah, so somehow there's one transaction that apparently had no bathrooms, but sold for exceptionally high. That is what we might call an outlier. That's probably inaccurate. Maybe a data quality issue. Even an event venue has a bathroom. So something's fishy about that. I'm also going to throw out bathrooms four and above, because I don't know who has six bathrooms. We're certainly not buying that house. - Event venue space. - Maybe, but let's look at the common situations of one and two and maybe three bathrooms. What it's telling you though is the relative value of the second bath.
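The bathroom cleanup described above (take the floor of the bathroom count, set aside the implausible listings, then compare medians) can be sketched like this. The function name, the `max_baths=3` cutoff, and the sample listings are assumptions for illustration; in particular, dropping the zero-bathroom listing is this sketch's choice for handling what the conversation calls a probable data quality issue.

```python
import math
from collections import defaultdict
from statistics import median


def median_price_by_whole_baths(listings, max_baths=3):
    """Group (baths, price) pairs by floor(baths) and return median prices.

    Half and three-quarter baths are rounded down. Listings with zero
    whole baths, or more than `max_baths`, are dropped as likely outliers.
    """
    groups = defaultdict(list)
    for baths, price in listings:
        whole = math.floor(baths)
        if 1 <= whole <= max_baths:
            groups[whole].append(price)
    return {count: median(prices) for count, prices in sorted(groups.items())}


# Made-up listings: (bathrooms as listed, last sale price in dollars).
sample = [
    (1, 500000),
    (1.5, 520000),
    (1, 480000),
    (2, 560000),
    (2.5, 570000),
    (0, 950000),    # suspicious zero-bath sale, dropped
    (6, 2000000),   # six bathrooms, dropped
]

medians = median_price_by_whole_baths(sample)
print(medians)
```

The difference between the medians for two baths and one bath is a rough, unadjusted estimate of what a second bathroom is worth; it ignores everything else that differs between those homes.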
Second bathrooms appear to be worth about $50,000 in the data set I have. - That's average. - On average, that's right. See, a lot of this now is about backing into averages. We want to find a model that, on average, is descriptive of the statistics. We're going to use multiple regression to do that. And what multiple regression is doing is saying, OK, let's gather together all these features. Things like the square footage, the number of bathrooms, the number of bedrooms, anything I can have. And let's evaluate how much each of those contributes to the price of the house. If you can think of, as we thought of a scatter plot and drawing a line through it, what if you had a multidimensional scatter plot and you wanted to draw a line through it? That's what multiple regression is trying to do. And then each of the coefficients gives you some information about how much each of those things contributes to the overall price of the house. So here's a scatter plot in two dimensions, for example. Can you describe what you see in the blue dots and the red line? - OK, it looks like the x-axis says square footage. And the y says the last sale price. - Right. And what's the general relationship between square footage and last sale price? - I mean, you drew a line through it, so it sounds like you want me to say it's linear, but it just seems like there's a lot of data clumped around. There's just a max. - That's true. - So at some point, it doesn't matter that much. So I'm not sure if that line is really useful. - Well, that's the best fit we have under a linear system. So you're right, that it's good to be skeptical of whether that's a good model or not. We're just going to throw everything at the wall to start with, and then we're going to tweak and refine this and look at our diagnostics. And that's what we're going to do in future episodes. So there's way more to cover here than we can do in one mini episode. What I want to introduce today is the concept of multiple regressions.
We'll talk about multicollinearity. We'll talk about diagnostics and all these sorts of good things in upcoming weeks, where we hopefully will build this little model together and be able to say what the contributing factors are that we should evaluate when we're looking for houses. - So I thought you were going to talk about whether or not the data is accurate or these are just summaries, whether we should even be looking at these graphs. - Yes, that's a good point. So I pulled a random sample-- well, not even a random sample. It's probably a biased sample. I just got a data extract from a site called Redfin to do this. - Yes. - Now, it's only ones in their system, and it's only recent. What I actually want to do is look at all transactions over the last maybe five or six years. Now, I'm having trouble getting to those. You know why? - Nope. - Because despite these things being public record, there's no easy access to a database of these sorts of transactions. This is free public information, at least in the United States. But it's inaccessible, in my opinion, even though it's public and out there. So I'm putting a little project together because I've been struggling for a little while trying to assemble this sort of data and not being successful. Kind of a call to action here to all the Data Skeptic listeners. Anyone serious about the data science side of things who wants to participate in this? We're going to make a community effort here. We're going to build some software. We're going to assemble data sets, maybe open source the database, who knows what's going to happen. But if you're interested in getting involved, whether you're a student looking for a project or maybe just someone who wants to lurk on the site or even an advanced data scientist with an interest in real estate. If you go to dataskeptic.com and you click on the link at the top that says home sales, there's instructions there for how you can join our Slack channel.
So anyone who gets on Slack, you like Slack, right? - Nah, it's okay. - What do you prefer, HipChat? - No, only because I don't like the name. (laughs) - Well, yeah, anyone can get on our Slack channel. I'm putting stuff up there almost every day. I'm going to share a lot of analyses there in advance of the show. And I'm going to go into more detail there than we do on the show, because of course we can show visualizations and talk about things in more detail. So anyone interested in participating in a fun little project, an open community type project, please come join us. Or if you just want to lurk, that's okay too. But I hope we can, as a community, get together a nice, concise way for data scientists to have access to home sales data. So we can teach people lessons about time series analysis and geographic analysis and classification and all types of stuff on a real world data set. And this is a new project I'm launching as of this episode. And I have something to ask you, Linda. - Okay. - Will you be the project manager of the project, of the Data Skeptic home sales project? - I don't know what the project manager does. You're going to have to onboard me. - Well, we don't know either. Okay, I'll onboard you. That sounds awesome. Yeah, let's do that. - You're just asking me to do something that I have no background in. - Well, it'll be an adventure. And we're going to report on it in future mini episodes as we thematically explore the topic of buying a house over the next few weeks, months, who knows? - Well, it's an adventure. So anyone who could help us, and hopefully it helps you buy a house. - Yep, hopefully. - Join the cause. - Yeah. Last announcement, Berkeley, California. I'm going to be giving a talk Thursday, March 10th, that's 2016, in Berkeley, California at the La Peña Cultural Center at 3105 Shattuck Avenue, Berkeley. Starts at 7:15 on March 10th. The title of my talk is A Skeptic's Perspective on Artificial Intelligence.
So I hope anyone in the area, the whole Bay Area in general, wants to come out. Please come see me there. More details at dataskeptic.com. And also, since we're going to be up there, we're going to stay for the whole weekend. We're looking for stuff to do. I'd love to speak somewhere else the next day. That'd be Friday the 11th. So if anyone has a venue, I actually have a, it's not a skeptical talk. It's purely a data science talk on clustering that I'm looking for a venue for. So if anyone's interested in hosting me, I would love to come speak for your organization. And yeah, hit me up at kyle@dataskeptic.com. And until next time, I want to remind everybody to keep thinking skeptically of and with data. - For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)