Archive FM

Data Skeptic

[MINI] Ordinary Least Squares Regression

Duration: 18m
Broadcast on: 06 Mar 2015

This episode explores Ordinary Least Squares (OLS), a method for finding a line of good fit that describes a given dataset.

[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking and data science. Welcome to another episode of the Data Skeptic Podcast mini episode series. I'm joined, as always, by my wife and co-host, Linda. I'm Linda, hello. Thanks again for joining me, Linda. So, before we get started on our topic today, I have a fantastic pun that was shared with me by a recent co-worker of mine who's just joined the company, and has also been a listener. Do you recall last time when we talked about the K-means clustering? Vaguely. And I commented on how you could track Yoshi's, shall we say, output on the piece of paper that's on the floor of her cage. Yeah. Our bird Yoshi. Did you ever take a picture? No, I didn't take the picture. I totally failed there. Oh, I'll still take the picture. But the point is, we can't necessarily track her in altitude, like how high she is, or in time. It's not covering all the dimensions, just, you know, length and width. So that's kind of what we refer to in mathematics as a projection. Might you have thought to call that perhaps a poo-jection? No, I wouldn't. And I wish I had, because that's quite clever, actually. So, shout out to my buddy Tristan for thinking of that. I mean, with puns like that, people might think this was the MonsterTalk podcast. So maybe it was good I didn't think of it. But in any event, we are going to talk about a process called linear regression. Do you have any concept of what that means? Well, linear is a line. Yep. And regression. I mean, I never came up with a specific definition, but I always think of it as backtracking. Oh, that's interesting. Why do you say backtracking? I kind of like that, but tell me more. I don't know, that's what regression is to me, backtracking. Yeah, and in a manner of speaking, it's finding a way to fit a model to a series of points you have from the past.
And then, presumably, you hope that model is descriptive both of the past and of the future. There's a common problem many people have: you have some data, right? And you'd like to know what describes the underlying data. Now, data can be really complicated, so sometimes this isn't easy. But once in a while, you get to a situation where the data is a little bit more clear. For example, do you remember the formula from your early mathematics days, y equals mx plus b? Nope. Really? You don't remember y equals mx plus b? No, what does it describe? Do you remember that m was the slope and b is the y intercept? I don't know if we defined it the same way. Yeah, that was meant about b. Is m slope, I don't know. I don't remember. Well, that's the way it was taught to me, and I think other people know that. But let's go back to the basics. Well, you're older, though. By three years. There were massive advancements in teaching technology. Well, we had different teachers. Yeah, that's true. But anyway, so let's put some context around this. You've been getting back on your kick of making ice cream lately, right? Well, I just made one batch of ice cream. Yeah, tell me about it. It is butter pecan ice cream, and for those of you who don't know, there really is butter in butter pecan ice cream. Yeah, so actually you have an overdue request by three. In fact, three listeners have all reached out about wanting the healthy cornbread recipe. I have to. I never wrote it down. I just do it. Well, then maybe we need to make like a YouTube video or something. Well, I mean, I could just make it and I'd have to record what I'm doing and then write it down and then. You want to do that next weekend? I don't know. We have time. We're already going to make coleslaw. What about this ice cream recipe? Tell me more about that. Ice cream. I just got it from this website called TastingTable.com, where they talk about food. So anyone who's interested in food can read about it. 
And a lot of chefs chime in, so I love that. Give us some hints about the recipe. You've mentioned there was a whole cup of butter in it. There's an entire cup of butter and you have to boil the butter until it becomes nutty. Nutty, not in a psychological way, but when you sniff it, it smells like nut. That's actually what I just assumed you meant. I wouldn't have thought the other nutty. I don't know. It was a joke. For the lady that made the healthy cornbread recipe, it's surprising to hear you didn't cut back on the butter that went in. Well, I was worried. What if I cut back? It wouldn't turn out like butter pecan. What's the point of making butter pecan? You cut back a little bit on the sugar, didn't you? Oh, I mean, not really. I put in as many scoops as it required, but I didn't pack it down. So with brown sugar, you're supposed to pack the sugar now and pack it down. So that's the only difference there. Well, I have a confession to make. Earlier when I told you that this batch was savory, that was my way of saying that it lacks the sweetness I associate with ice cream. Oh, no, you want more sugar? Are you mad at me? Why didn't you tell me when I asked you? I don't know. I feel like if I say it on air, you can't get mad at me maybe. I don't know. Well, I think my ice cream is still good. Yeah, no, it's still good. It's just unusual to my taste buds. But it does remind me of two things. You could change how much butter you put in and you could change how much sugar you put in. If you want to, you could experiment: as you put in more sugar, for example, what would happen? It would taste sweeter. Uh-huh. The overall taste would get sweeter and sweeter, the more sugar you put in, until I presume at some point it's so sweet, it just tastes purely like sugar. I mean, that assumes you keep adding sugar, but sure. 
Yeah, I mean, let's say we want to make an experiment out of this. If we did like 20 or 30 different batches and you controlled how much sugar went into each one, you might be able to kind of find out the relationship between sugar and reported sweetness, right? Sure. Let's think about the sugar thing for a second. Do you think you could accurately control how much sugar goes in? Yeah, you just measure it and dump it in. Yeah, so if you have a good measuring cup, you should have no problem. What if you wanted to ask people how sweet they found the output to be? You know, ask different people off the street or your friends or relatives, whatever. You know, do you think this ice cream is more or less sweet than average, or maybe like on a scale of one to 10, how sweet is this? How precise do you think all those answers would be? Probably not precise. All right, it's kind of noisy data, wouldn't you say? Yeah. You can control the independent variable. That is, how much sugar goes in, you can, you know, with very high accuracy control how much sugar you might put in your recipe. But the output variable, the dependent variable, the degree to which people say it's sweet, that's kind of more random, right? It's a little fuzzier, it's hard to say. You know, you ask some people, and their taste buds or their individual preferences will kind of put some natural variation in the results you get. If you surveyed like a hundred people at each varying level of sweetness, what do you think their answers would look like? I mean, I guess we could ask them about scale. Scale zero to 10, how sweet is this? And let's say you did like 10 recipes between the minimal sweetness and the maximal sweetness and discretized it evenly in between. Do you think there would be a relationship between how much sugar you put in and the average reported number of sweetness that your tasters say? Yeah, probably. Do you think that would be a linear relationship? 
That is to say, if you double the sweetness, you would get double the response or something like, you know, proportionate to that? I mean, I feel like you would think that, but probably not. Okay, so that's a really good point. We don't actually know if that's true, like maybe at a certain point you add a little bit more sweetness and suddenly the overall presence of sweetness explodes and people report that it's like way more, you know, that could happen. So we don't know for sure this is a linear relationship. And that's an important question when you do a regression problem because you have to make sure your underlying model is correct. But for the sake of argument here, let's say that we know it is linear, that if you double the sweetness, you would have some proportional, you know, increase in the reported sweetness that your tasters say. Or in other words, maybe a way to put it is, if you do the minimal sweetness and the maximal sweetness and the halfway point, the average that people report should basically be halfway between min and max. No, why would it be halfway? Well, if it was linear, it would be halfway. Oh, it would be linear. So let's say for the minimal sweetness, people on average say it's like a five out of 10 and for the maximal sweetness, people say it's 10 out of 10, then if you went halfway between the minimal sweetness and the maximal sweetness, people would probably say it's 7.5 on average, you know, plus or minus a little bit, something like that. That would make us think it's linear. So I want to talk specifically about linear trends like that because number one, they're the easiest ones to do, but also in most problems when you're trying to regress something and find out the underlying model, even if it's not linear, what you first do is you transform your data. 
So like, let's say, you know, you thought it was quadratic, so like everything was squared, you could take the square root, and then that's linear, and you can do a regression on the transformed version of the data. So it always kind of comes back to a linear regression. So let's say you want to understand the relationship between amount of sugar put in and the reported sweetness people say they taste when they taste it. So the sugar you put in, you can control, you can say that there's very little error there, and that's an important part of the technique we're going to discuss, which is ordinary least squares. But the output, you know, it's up to people and people have a lot of variance. So maybe if you were trying to figure out the appropriate amount of sweetness for your caramel pecan ice cream, you could run an experiment like this. This is butter pecan. Sorry. Butter pecan, my mistake. You could get all these measurements, right? If you shared your ice cream with enough people. Measurements of sweetness. Of people's perception of how much they like the ice cream compared to the amount of sugar you put into it. Yeah, sure. Alright, so picture this. Let's say you made 10 different batches, each with a different amount of sugar, and then you took each batch and you served it to 20 people. So you have 200 data points, and everybody reported back on a scale of like 1 to 10, how sweet they found that ice cream to be. So for example, the batch you made today, I would probably report as like a 2 out of 10 in terms of sweetness. It's sweet and it's delicious, but it's not sweet the way the caramel ice cream was that you made the other day. I would then be on the low end and I would be a 2 because I reported a 2, but maybe someone else would eat that and say it's like, oh, it's like a 4. Probably no one would say this batch was a 10 in terms of sweetness. So you'd have these 200 answers that are all sort of dots on a plot. 
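The remark about transforming quadratic data can be made concrete with a small sketch. It assumes a hypothetical, noise-free relationship y = (2x + 1)^2; those numbers and the helper name `steps` are made up for illustration and do not come from the episode. Equal steps in x produce growing steps in y, but constant steps in the square root of y, which is what a straight line looks like.

```python
import math

def steps(values):
    """Differences between consecutive values; constant steps mean a straight line."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical quadratic data (an assumption for illustration): y = (2x + 1)^2.
xs = [1, 2, 3, 4, 5]
ys = [(2 * x + 1) ** 2 for x in xs]

# Raw values: the steps keep growing, so y is not linear in x.
raw_steps = steps(ys)

# After a square-root transform the steps are constant, so sqrt(y) is linear in x.
root_steps = steps([math.sqrt(y) for y in ys])

print(raw_steps)   # [16, 24, 32, 40]
print(root_steps)  # [2.0, 2.0, 2.0, 2.0]
```

With real, noisy ratings the transformed steps would only be roughly constant, which is why a regression is then run on the transformed values rather than reading the line off directly.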
And if you graph them, you would probably find that they slope upward and to the right, meaning that the more sugar you put in, the more people tend to say it's sweet, but not everyone will agree on every trial. Yeah, you don't want to trust any one person's word for it. And you also don't want to play connect the dots where you draw a line that goes through every single point because you'll end up with this incredibly complicated line that, you know, really overfits the data. You want to discover the underlying linear relationship between sugar in the recipe and perceived sweetness. And the technique most people use for that is something called ordinary least squares. Can you picture all those points kind of starting down to the left and they go up and to the right? Sure. What if I drew a line that started at the top left corner and went to the bottom right corner? You think that would be a good fit? No. Right. It would kind of be almost diagonal to the data you actually have. Now, I could also do my best to draw a line through the middle and that would probably look like a good fit. But the question is, what's the actual optimal best possible fit that matches that data? In order to define something like that, you have to define something called a goodness of fit function. That is to say, how well the line you draw actually matches the data. And there are many reasons and different ways to do this, but a common one is called ordinary least squares. And the idea here is you go through every person who gave you feedback on ice cream. And you say, what does my line predict this person would say based on just the sugar that was put in? And you subtract their answer from the predicted answer. How do you subtract their answer? So they said two, you subtract two. So let's just assume for a second that you can put in between one and ten units of sugar. And of course, people can report that it's between one and ten sweetness. 
And just to make this easy, let's say it's one to one. That you think that for one unit of sugar, people should report one point of sweetness. Maybe that's not true, but let's just go with that to make this simple, right? If that's your model, then if you put in five units of sugar, you expect people on average to say it's sweetness level five. But of course, some people will say six, some people will say four, and so on and so forth. But you would expect the average to match. So if you go through every single rating you got, and let's say you predicted a five for the batch with five units of sugar, but a particular person said, oh, it's a six. Well, then six minus five is one, squared is one. So they're one unit of difference from what you predicted. So basically, you think of it this way, you're going to make a prediction, draw a line through the data, and then say, how well does that line actually describe my data? So go through every data point and say, well, if I follow the model, I would predict this number, but I actually got a different number. So what's the difference, or at least the absolute difference? And if you sum up all the absolute difference, that's a pretty good measure of how good of a line you drew. And the line which minimizes the average distance from all those points is considered under ordinary least squares to be the best fit for the data. And there's a couple of assumptions here. You want to assume that you can pretty much always observe the independent variable, the amount of sugar in our case, and you can control that. And, you know, within a couple of grains and a couple of nanograms or whatever, you could really do that. But you can't control the natural variation in the answers people give, can you? Nope. Oh, there's another problem I failed to mention. What would you do if you were very serious about doing this experiment, and you wanted to ask, what did I say, 200 people? 
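The scoring step described in this passage can be sketched in a few lines. The ratings below are invented purely for illustration, and `sum_squared_error` is a hypothetical helper name. One point worth keeping straight: as the name suggests, ordinary least squares scores a candidate line by the sum of the squared differences (six minus five is one, squared is one), not by the plain absolute differences.

```python
def sum_squared_error(slope, intercept, points):
    """Sum of squared gaps between each rating and the line's prediction."""
    total = 0.0
    for sugar, rating in points:
        predicted = slope * sugar + intercept
        total += (rating - predicted) ** 2
    return total

# Hypothetical (units of sugar, reported sweetness) pairs, made up for this sketch.
ratings = [(2, 2), (2, 3), (5, 4), (5, 6), (8, 7), (8, 9)]

# The one-to-one model: one unit of sugar predicts one point of sweetness.
one_to_one = sum_squared_error(1.0, 0.0, ratings)

# A line running from the top left to the bottom right of the plot.
downhill = sum_squared_error(-1.0, 11.0, ratings)

print(one_to_one)  # 5.0
print(downhill)    # 141.0
```

The lower score for the one-to-one line matches the intuition that it fits these points far better than the downhill line. OLS picks, out of all possible lines, the one with the lowest such score.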
You wanted to ask 20 for each of your 10 different increments of sugar. How would you collect the data? How does one collect data? I mean, for example, would you maybe throw a party and invite everybody over and ask them to taste the ice cream? I guess, just for convenience. No. If you invite everybody over for a party, and you ask them to try all the ice cream, it'll be gone. Oh, no. That's a third problem. Okay. Let's say my two problems, then we'll discuss your problem. So my problem, first of all, is that it won't be independent. Yeah, Yoshi thinks it's funny. It won't be independent. If everyone's in a room enjoying the ice cream, certain people will say their opinions, and that might affect other people's perceptions. Well, they'd have to write it down. Without discussing it. Yeah, you need to double-blind this thing. Just having a random party would not be independent enough. So you have to control the experiment that way. Otherwise, ordinary least squares will probably give you a bad answer. The other thing that's important is if you just let everybody taste every ice cream, you haven't controlled for, like, sweetness centering. Which is to say, if the first bowl I eat from is very sweet, my taste buds might not be receptive to tasting everything I eat after that. So you almost really need 200 independent trials of people who are tasting just one ice cream in a vacuum. So that their prior experience doesn't influence their answer. Well, anyway, let's wrap it up on ordinary least squares. This is a methodology, a simple one that's easy to execute, and a good introduction to linear regression, that tries to describe the relationship between an independent variable that you can observe very well. And a dependent variable, which you don't have perfect observability of, that there's some noise in your measurement of it. 
And you would like to have the best possible model to describe the relationship between those two variables. In our case, how much sugar you put in and the perceived sweetness of the ice cream. Well, I still would like to note that my ice cream is still very good. I agree very much. Well, thanks again for joining me, Linda. It's always a pleasure to have you here. Thank you. And are we going to see any ice cream recipes, let alone the cornbread recipe? Yeah, I'll get you the link. I just used the Tasting Table recipe. Okay, I'll put that on the website in the show notes. And please note that Kyle thinks you need to add more sugar. And I want to give a shout out before we go, as I've been doing the last couple of weeks on the mini episodes, to the other fine data science podcasts that are all out there as well. This week, shout out to Dr. Richard Golden and his show Learning Machines 101. You can find it on iTunes and Stitcher and all the typical places. So that's my show recommendation for this week. And we will catch you next time. Thanks for joining me, Linda. Thank you. Bye. (upbeat music)
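For readers who want to try the ice cream experiment without making 200 bowls, here is a minimal sketch that simulates it and fits a line with the standard closed-form least-squares formulas for a single predictor. Everything specific is an assumption made for illustration: the "true" slope of 0.8 and intercept of 1.0, the Gaussian noise standing in for taster variation, the fixed random seed, and the helper names `fit_ols` and `sse`. The simulated ratings are not rounded or clipped to a 1 to 10 scale, to keep the sketch short.

```python
import random

def fit_ols(xs, ys):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sse(slope, intercept, xs, ys):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Assumed "true" relationship, chosen only for this simulation.
TRUE_SLOPE, TRUE_INTERCEPT = 0.8, 1.0
rng = random.Random(0)  # fixed seed so the run is repeatable

# 10 batches (1 to 10 units of sugar), each rated by 20 independent tasters.
sugar, sweetness = [], []
for batch in range(1, 11):
    for _ in range(20):
        noise = rng.gauss(0, 1)  # stand-in for natural variation between tasters
        sugar.append(batch)
        sweetness.append(TRUE_SLOPE * batch + TRUE_INTERCEPT + noise)

slope, intercept = fit_ols(sugar, sweetness)
print(f"fitted line: sweetness = {slope:.2f} * sugar + {intercept:.2f}")
print("SSE of fitted line:", round(sse(slope, intercept, sugar, sweetness), 2))
print("SSE of true line:  ", round(sse(TRUE_SLOPE, TRUE_INTERCEPT, sugar, sweetness), 2))
```

The fitted line should land close to the assumed one, and its sum of squared errors is never larger than that of any other line on the same data, including the line that generated it. The sketch also bakes in the assumptions from the conversation: sugar is measured without error, and each rating comes from an independent taster.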