A discussion of the expected number of cars at a stoplight frames today's discussion of the bias-variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely generalize well from training to testing data, but will have a very high bias, since its simplicity can prevent it from capturing the relationship between the covariates and the output. As a model grows more and more complex, it may capture more of the underlying structure, but the risk that it overfits the training data and therefore does not generalize (has high variance) increases. The tradeoff between minimizing variance and minimizing bias is an ongoing challenge for data scientists, and an important discussion for skeptics around how much we should trust models.
Data Skeptic
[MINI] Bias Variance Tradeoff
(upbeat music) - Data Skeptic mini episodes provide high-level descriptions of key concepts related to data science and skepticism, usually in 10 minutes or less. - Our topic for today is the bias variance trade-off. - So Linda, we live in a dense automobile traffic city, right? - We live in LA. - Near a couple of major streets, we're not far from streets like Venice and Santa Monica Boulevard. They have a lot of commuters on them, a lot of general traffic. - Oh yes, all of LA, they say driving through LA is like driving through a small country. Actually, I bet if we were to compare city sizes versus countries, I bet LA's near the top. - Top like 100 maybe, yeah, yeah, yeah. - That's pretty dense. - All right, someone's almost definitely gonna fact check that, so let's not double down on it, but maybe something like that. It is quite dense. Well, you bike to work, right? But you still see cars stopped at stop lights, and during rush hour. Let's just pick an intersection, what's a good one, like Sepulveda and Venice. How many car lengths do you think you'd be back on average when you get caught at the light? - What do you consider being stuck at the light? Like when it's bumper-to-bumper traffic? - You have a red light and as a result of that, you come to a complete stop. - The back could go back really far. - Yeah, it could, it could definitely. - The line's limited then by the length till the next red light. - Right, yeah, let's hope we're not gridlocked, but it could be all the way back in the worst-case scenario. So it has a min and a max. - But it varies. - How much does it vary by time of day? - Oh, a lot, rush hour. - You also have some theories about different days of the week being worse than others. - Thursday and Friday are the worst, but I think Thursday is the absolute, like on average, worst. And then when Obama comes to town, we call that Obama-lock or something. - Obama is the worst for traffic when he comes to town. 
- Well, the first time apparently he came to town, he just took a car everywhere, which resulted in streets being shut down. Then the following times after that, he figured it out, like, oh, I'm screwing up traffic. So I guess they picked a helicopter. Thank goodness. - Yeah. - For most of it. - Yeah, so that's an unpredictable, I guess you can predict it. You can find out when he's coming to town, and on days like that, it'll be much worse. But let's set that aside. Say, at an intersection closer to our house that you could walk to, you regularly counted the number of cars waiting and took accurate measurements, and then you plotted that, where on the x-axis it's the time of day and on the y-axis it's the number of cars waiting. - Well, when you say count cars, do you mean like in all directions, like if it's an intersection with four lights? - Oh, good point. Let's just say the one. - Which direction? - Whichever one's closest to you, consistently. Let's say maybe always headed west. - During rush hour, like, you know, 9 a.m. and between 5 and 7 p.m., there's gonna be a higher number of cars. - Yes, you'd start to see a trend where there's little humps near rush hour, right? - But then the min is, obviously, zero cars are waiting at the light, and the max is, obviously, whatever amount, there's always gonna be a ceiling. - And what do you think the afternoon looks like compared to the middle of the night? - Afternoon is probably heavier 'cause in the middle of the night, I assume people are sleeping. - Yeah. - Unless there's a block party. - Yeah, I'm usually the first person in line and it's like two or three in the morning, something like that. So we'd like to have maybe a model that describes the traffic. Would that be useful? - What's a model? - A model is a description of what you think generates the traffic, some function that says, at this time of day you'll get this much traffic. - Generates the traffic? - Mm-hmm. 
- Doesn't "generate the traffic" mean, like, people are motivated to leave their house? What you're saying is the expected number of cars at any given point in time, that's not generating. - Well, the origin. - No, the model generates a value, and it's trying to describe nature, right? Whatever is naturally happening in the world. And as George Box said, all models are wrong, but some are useful. So if you picture that kind of roller coaster of when traffic is high and low, and let's say I said you could draw a straight line somewhere on that page. How representative would that straight line be of the traffic you're measuring? - And it can't change direction then. - That's correct, 'cause it's linear. - Well then it's not accurate. - Yeah, that's pretty terrible. - Only half of it is accurate, maybe, at max. - Yeah, the best you could do is kind of put it in the middle and be like, traffic is on average this. And then you'll be very wrong during the day. Like you'll be under-predicting during the day and over-predicting at night. Okay, so let's say I gave you a little bit more room and I said, okay Linda, you can draw more than a straight line. You can draw a parabola. How would you draw your parabola? - Oh, it'd look like a U. - Would it be a U like a smiley face or like a frown? - Oh, a smiley-face U. - Where would the edges of the cheeks or dimples or whatever that's called go? - 9 a.m., and the other one's around five or six or seven. - If it's a parabola, and it's a smiley face, that would mean it starts, as you go earlier than 9 a.m., it would go up, right? - Smiley face, it dips down. - It starts at nine and then goes down and then goes back up at five o'clock rush hour. - But what happens before nine and after five? - I don't know, smiley faces don't run off the charts like that. You asked me if it was a smiley face. - I thought that was a clever way of not getting into convex and concave. - Okay. 
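The straight-line baseline discussed here, a flat-ish fit that under-predicts rush hour and over-predicts the middle of the night, can be sketched in a few lines. Everything below is synthetic and illustrative: the two rush-hour humps, the noise level, and the counts are assumptions, not real measurements.

```python
import numpy as np

# Synthetic "cars waiting" counts: two rush-hour humps (around 9 a.m.
# and 6 p.m.) plus noise -- an illustrative stand-in, not real LA data.
rng = np.random.default_rng(0)
hours = np.arange(0, 24, 0.5)
true_traffic = (20 * np.exp(-((hours - 9) ** 2) / 4)
                + 25 * np.exp(-((hours - 18) ** 2) / 4))
observed = true_traffic + rng.normal(0, 2, size=hours.shape)

# A degree-1 (straight line) fit can only hover near the overall mean.
slope, intercept = np.polyfit(hours, observed, deg=1)
linear_pred = slope * hours + intercept

# It under-predicts rush hour and over-predicts the middle of the night.
rush_gap = observed[hours == 9][0] - linear_pred[hours == 9][0]
night_gap = linear_pred[hours < 5].mean() - observed[hours < 5].mean()
print(f"under-prediction at 9 a.m.: about {rush_gap:.0f} cars")
print(f"average over-prediction before 5 a.m.: about {night_gap:.0f} cars")
```

The exact numbers depend on the made-up data; the point is the sign of the errors: a line has no way to bend toward the humps.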
- Well, I don't know, what do you want me to say? - Well, I was looking for a frowny face. - I didn't think about it. - That a wide, wide parabola would start low in the early morning, go up a little bit, and then come back down. So it'd actually have its peak maybe around noon, which wouldn't adequately capture either of the rush hours, but it would get closer than the linear model, right? We'd make less errors. - Yeah. - Do you know what I mean when I say we'd make less errors? - Well, the dots on the smiley face are more aligned. - Yeah, that's right. There's less variance between what the model predicts and what's actually observed, okay? Now let's also, for a second, hypothetically say you and I quit our jobs and we took all these massive millions of dollars we make off Data Skeptic and we decided to dedicate our money and our lives to modeling traffic in Los Angeles. How accurate do you think our models would be 10 years later? - Well, first of all, that hypothetical is not there. So I just want to ask everyone to help support us, 'cause we are not making millions off of this. - Yes, you can now buy pins and stickers at dataskeptic.com. We have a new store. So thank you for reminding me, Linda. - And that's mainly just for you guys, 'cause we're not making money off of that. - No, not really. But yeah, I just want people to share the show and show off the swag. - Okay, so go back to your question, what is it? - If, 10 years into this grand experiment of dedicating our lives and our wealth to trying to model the LA traffic problem, how accurate of models do you think we would get? - Oh, I think we could predict traffic very well. - Pretty well, probably, if that was the kind of investment that we made. - Oh, yeah. - But would we be perfect? Would we always know precisely, with no mistakes, the number of cars stopped at every intersection in a given period of time? - Oh, no. - We'd never get it perfect. - You don't know when Obama comes to town? 
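The jump from line to frowny-face parabola can be made concrete. Same caveat as before: the traffic data are synthetic, so the numbers are illustrative; what the sketch shows is that the degree-2 fit opens downward, peaks between the rush hours, and leaves smaller errors than the line.

```python
import numpy as np

# Same illustrative setup: synthetic two-hump traffic counts plus noise.
rng = np.random.default_rng(1)
hours = np.arange(0, 24, 0.5)
observed = (20 * np.exp(-((hours - 9) ** 2) / 4)
            + 25 * np.exp(-((hours - 18) ** 2) / 4)
            + rng.normal(0, 2, size=hours.shape))

line = np.polyfit(hours, observed, deg=1)
parabola = np.polyfit(hours, observed, deg=2)

mse_line = np.mean((np.polyval(line, hours) - observed) ** 2)
mse_parabola = np.mean((np.polyval(parabola, hours) - observed) ** 2)

# The parabola is a "frowny face": negative leading coefficient, with
# its peak landing between the 9 a.m. and evening humps.
a, b, _ = parabola
peak_hour = -b / (2 * a)
print(f"opens downward: {a < 0}, peak near hour {peak_hour:.1f}")
print(f"MSE of line: {mse_line:.1f}, MSE of parabola: {mse_parabola:.1f}")
```

Neither model captures the two humps, but the parabola's lower mean squared error is exactly the "less errors" being described.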
- Well, that I figure we could figure out in 10 years. - You don't know when there's a flood and it shuts down one street. - Yep, that's unpredictable. - And then there's road construction one day. What other crazy things have happened that have delayed traffic? - There's something here we call the irreducible error. Even with the best model, you can never be perfect, okay? So we usually describe that with an epsilon. We say some of the model's error is just error we'll never explain. But we'd like to explain as much of the error as we can. So we went from the linear model to the parabola model and it got better, right? What if we allowed higher-order polynomials, which effectively would let us have more humps? With enough polynomial terms, you could kind of make it like a camel's back, right? Two humps? - Why does it have to relate to any of the model shapes? Why does traffic? - Yeah, why can't it just be, like, you line up the dots and you're like, there you go. - Well, you'd like to have a model that describes the data. So like a function or formula, a mathematical formula, rather than being like, man, it is what it is. But, well, you actually raised an interesting point though. What if you just played connect-the-dots and you drew a line that looped around and ended up hitting every single dot? You'd have an error of zero. That's what's called overfitting your data. And in a problem like this, you wanna train a model based on some historical data. But then later on, you wanna see how that model performs on real data, or testing data, or holdout data, or whatever the case may be. If you overfit a model, that means it really well describes your training data, but doesn't describe anything else but the training data. - So if we're talking about our one traffic intersection, are you saying you're trying to take that and apply it to all traffic? - No, no, let's just talk, we wanna have one model. Just keep it simple, for just one intersection. 
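The connect-the-dots trap can be demonstrated directly: a polynomial with as many coefficients as there are measurements passes through every point, so its training error is essentially zero, yet it says nothing reliable about measurements it never saw. Again, the measurement times and counts below are made up for illustration.

```python
import numpy as np

def signal(h):
    # Made-up "true" traffic curve: rush-hour humps at 9 a.m. and 6 p.m.
    return (20 * np.exp(-((h - 9) ** 2) / 4)
            + 25 * np.exp(-((h - 18) ** 2) / 4))

rng = np.random.default_rng(2)
train_hours = np.array([1.0, 4.0, 7.0, 10.0, 13.0, 16.0, 19.0, 22.0])
train_counts = signal(train_hours) + rng.normal(0, 2, size=train_hours.shape)

# Degree 7 through 8 points is "connect the dots": exact interpolation.
dots = np.polynomial.Polynomial.fit(train_hours, train_counts, deg=7)
train_rmse = np.sqrt(np.mean((dots(train_hours) - train_counts) ** 2))

# On measurements the model never saw, the error is no longer zero.
test_hours = np.array([2.5, 5.5, 8.5, 11.5, 14.5, 17.5, 20.5])
test_counts = signal(test_hours) + rng.normal(0, 2, size=test_hours.shape)
test_rmse = np.sqrt(np.mean((dots(test_hours) - test_counts) ** 2))

print(f"training RMSE: {train_rmse:.1e} (essentially zero)")
print(f"held-out RMSE: {test_rmse:.1f}")
```

Zero training error here is a symptom, not an achievement: the curve has memorized the dots, noise included.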
- Oh, one model per intersection. - Yeah, and let's only concern ourselves with the nearest major intersection to our house, and say we went out there at random times of day and took all these measurements. If we built a model that was 100% consistent with all the measurements, that would be overfitting, 'cause we know there's supposed to be some irreducible error. So if we go from linear to parabola to higher-order polynomial models, we can start to have this more curvy model that looks like a camel's back, perhaps. And might that be a better description of traffic, like a big hump in the nine a.m. range and a hump in the five p.m. range? - Yeah. - Maybe of different heights? - Yeah. - And dipping down late into the evening. With every degree of polynomial you introduce, you can actually get a tighter fit, but the model you came up with fails to really learn the actual relationship. It just learns how to describe the training data. What you'd prefer is a model that captures the actual underlying relationship and can be extended to describe other data points you haven't observed yet. - Well, how can you test that? - Well, two ways. You could either take a bunch of measurements and then set some aside for testing, and train on the rest, and then see how well it performs on the data the model didn't get to see. Or you can build your model on all your data and then go get new measurements and see how those stack up. Now, the new measurements won't be predicted perfectly, right? 'Cause there's always some irreducible error. But we'd like to build a model that adequately balances how much we overfit and underfit. Or in other words, how much bias and how much variance it has. And that's the bias-variance trade-off. - So how do you measure this trade-off? - A good thing to do is to plot it out. 
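The first of the two tests described, set some measurements aside and train on the rest, can be sketched as a holdout split. The degrees chosen here (6 as "modest," 15 as "too flexible") and the synthetic data are assumptions for illustration, not anything from the episode.

```python
import numpy as np

# Synthetic measurements at one intersection, as in the discussion.
rng = np.random.default_rng(3)
hours = np.arange(0, 24, 0.5)
counts = (20 * np.exp(-((hours - 9) ** 2) / 4)
          + 25 * np.exp(-((hours - 18) ** 2) / 4)
          + rng.normal(0, 2, size=hours.shape))

# "Set some aside for testing and train on the rest."
is_train = np.arange(hours.size) % 3 == 0   # every third reading
is_test = ~is_train

def holdout_rmse(degree):
    model = np.polynomial.Polynomial.fit(hours[is_train], counts[is_train], degree)
    train = np.sqrt(np.mean((model(hours[is_train]) - counts[is_train]) ** 2))
    test = np.sqrt(np.mean((model(hours[is_test]) - counts[is_test]) ** 2))
    return train, test

modest_train, modest_test = holdout_rmse(6)
flexible_train, flexible_test = holdout_rmse(15)

# The flexible model wins on data it has seen and loses on data it hasn't.
print(f"degree 6:  train RMSE {modest_train:.2f}, test RMSE {modest_test:.2f}")
print(f"degree 15: train RMSE {flexible_train:.2f}, test RMSE {flexible_test:.2f}")
```

The held-out score is the one that matters, because it is the only one the overfit model can't flatter itself on.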
You can see that as you increase the complexity of the model, like maybe, strictly in the polynomial fits we've been talking about, how many terms there are, then as that goes up, the bias will always go down, down, down, down, because you're always able to fit the data better with more complex models. But the variance will go up, so the error on held-out data goes down, down, down a little bit to a point where you have a model that's like the right size for your data, and then that error will start to go back up. And that's the case where you're starting to overfit. Ideally, you might wanna find some balance between bias and variance, but depending on your application, you might prefer one or the other and fine-tune it in a certain direction. So what would you do if we had this good model of predicting the number of people waiting at the stoplight? - Isn't that what Google does? - I don't know if they model exactly what we're discussing. This is just kind of my fictitious example, but yes, Google does try and measure the throughput of all segments in their map graph. - So they do it in their own way. So I think I use Google Maps, so-- - And that's what you use now. You're not a Waze person. - Waze is owned by Google. - Yeah, but they're very fundamentally different. Well, not fundamentally different, but they're rather different systems. They often produce different results. - Yeah, well, if anyone from Google's listening, Google and Waze have a problem where in LA, they tell you to turn left too many times. - There's no hope in turning left. - Yeah, there's no hope, there's no light, and it's a giant, giant road that is end-to-end busy and dangerous, so I would like to say, Google, you did not think that through. - Well, give them a chance. You know what the state of mapping was like 15 years ago? - Oh, I used a paper map. - Yeah, and that's long gone. So much improvement has been made. Presumably, back then, they had high-variance, low-bias models, because this is a complex problem they were only starting to solve. 
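The plot described here can be traced numerically: sweep the polynomial degree, and training error falls monotonically while held-out error traces a U, bottoming out at a right-sized model and climbing as overfitting sets in. The data are synthetic and the degree range is an arbitrary illustrative choice.

```python
import numpy as np

# Synthetic traffic counts; half the readings are held out for testing.
rng = np.random.default_rng(4)
hours = np.arange(0, 24, 0.5)
counts = (20 * np.exp(-((hours - 9) ** 2) / 4)
          + 25 * np.exp(-((hours - 18) ** 2) / 4)
          + rng.normal(0, 2, size=hours.shape))
is_train = np.arange(hours.size) % 2 == 0
is_test = ~is_train

degrees = range(1, 24)   # 24 training points, so degree 23 interpolates
train_err, test_err = [], []
for d in degrees:
    model = np.polynomial.Polynomial.fit(hours[is_train], counts[is_train], d)
    train_err.append(np.sqrt(np.mean((model(hours[is_train]) - counts[is_train]) ** 2)))
    test_err.append(np.sqrt(np.mean((model(hours[is_test]) - counts[is_test]) ** 2)))

best_degree = list(degrees)[int(np.argmin(test_err))]
print(f"degree with lowest held-out error: {best_degree}")
print(f"at the maximum degree, train RMSE {train_err[-1]:.1e} "
      f"but test RMSE {test_err[-1]:.1f}")
```

Plotting `train_err` and `test_err` against `degrees` reproduces the curve from the conversation: the gap between the two lines is the overfitting.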
But now, with improvements in computational power and data that you can track and all this good stuff, they've probably reduced the variance, and I would guess the bias as well, and we might even be at the point where they have to worry about the bias-variance trade-off in traffic measurements, because it would be easy to overfit this data. But you want a model that is not only descriptive of the data you have, but also predictive of the data you don't yet have. Well, anyways, thank you as always for joining me, Linda. - Thank you. - And until next time, I'd just like to remind everyone to keep thinking skeptically of and with data. Goodnight, Linda. - Goodnight, Kyle. (upbeat music) - For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)