Archive FM

Data Skeptic

[MINI] The Curse of Dimensionality

Duration:
10m
Broadcast on:
26 Jun 2015
Audio Format:
other

More features are not always better! With an increasing number of features to consider, machine learning algorithms suffer from the curse of dimensionality: the space of examples they must cover grows wider while the coverage of that space becomes ever sparser. This episode explores a real-life example of this as Kyle and Linhda discuss their thoughts on purchasing a home.

The curse of dimensionality was defined by Richard Bellman and applies in several slightly nuanced cases. This mini-episode discusses how it applies to machine learning.

This episode does not, however, discuss a slightly different version of the curse of dimensionality which appears in decision-theoretic situations. Consider the game of chess. One must think ahead several moves in order to execute a successful strategy. However, thinking ahead another move requires considering every possible move of every piece one controls, and every possible response one's opponent may take. The space of possible future states of the board grows exponentially with the horizon one wants to look ahead to. This is present in the notably useful Bellman equation.
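For reference, one standard statement of the Bellman equation for a discounted Markov decision process is shown below (the notation is the conventional textbook one, not something quoted in the episode):

$$V(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]$$

The sum over successor states $s'$ is the part that blows up: each additional move of lookahead multiplies the number of branches that must be evaluated, which is the exponential growth described above.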

[MUSIC] >> The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. >> Well, welcome to another mini episode of the Data Skeptic Podcast. I'm joined this week by my wife and co-host Linda. >> Howdy. >> To talk about the curse of dimensionality. So, Linda, this comes up in a couple of places, and actually, I learned about it from its original author, a guy by the name of Richard Bellman, who developed the Bellman equation, which is a really good topic for another day. And that equation is actually where the curse of dimensionality comes from. That's about sequential planning and exploring different actions and the exponential growth in gameplay. But we're going to talk about it more in the machine learning context. So I thought we'd start out with a really simple example to help illustrate the concept. And we'll start by talking about buying gas for your car. Now, what's important when you're buying gas? >> For me, location and price. >> Okay, two variables, or I guess location is technically two variables, latitude and longitude, plus price. So, are there different areas you prefer? Like would you drive 100 miles to save a nickel on gas? >> No. >> No, so all these things kind of mesh together for you to decide on what's an optimal gas station. Is that kind of how you look at it? >> I like to get the ones on the right side of the street, so I don't have to turn left, because of the wait to turn left. And it also depends which direction I'm headed in after I exit the gas station. >> Yeah, so all right, hypothetical. Let's say you're driving down the street and you're in desperate need of gas. There are two gas stations, one block apart, both on the right. One is charging $4 a gallon, the other is charging $4.05. Which do you go to? >> The cheaper one. >> What if $4.05 is on the right and $4 is on the left? You'll have to make a left-hand turn to save a nickel. What do you do? >> If it's rush hour in LA? >> Let's say it is. >> Then no, I'm not going to turn left. Especially if I'm going to continue heading in the direction that I was already headed in. >> Okay, and you are. What if the gas station that's convenient is $4.05, as we said, but the one on the left side is not $4, it's $3, so it's $1.05 cheaper? >> It depends how low I am running on gas. >> Yeah. >> Oh, then no. >> Okay, $2.50 then. >> No, if I'm on E, I'm going to turn right. I refuse to turn left. >> At what cost would you turn left to save? >> Well, it would have to be like 20 or something. >> 20 cents a gallon? >> No, $20 difference. >> Oh, for the whole fill-up? >> No, for one gallon. >> Oh, that's not possible. So then, yeah, you would never do it. >> No, I wouldn't do it. >> Okay, interesting. So, this is actually an experiment in price elasticity, which is not the topic I wanted to cover. That's fine. The point I wanted to make is that your gas choice is relatively simple. There aren't that many factors you take into account. And each of those factors, sometimes called features, sometimes called dimensions, is what you want to explore to find answers to machine learning problems. Sometimes those are optimization problems, sometimes they're classifications, sometimes they're regressions. So, let's talk about a much more complicated one. How about buying a home? What's important to you in buying a home or in considering buying a home?
>> Well, Kyle and I have been looking at homes recently, but for us, in LA, there's price, there's location, the condition it's in, how big it is, but also, beyond just the square footage, how it's laid out, whether they make good use of it, whether the sunlight streaming in looks good at varying times of day, do they have outdoor space, a patio, how big it is, blah, blah, blah, and how many bedrooms and bathrooms, obviously. >> Yeah, so those are all very important. We could probably think of a few more. Let's just for the moment simplify the problem. Let's say we look at six houses. All right, I'm going to throw some numbers at you. The first three have one bedroom. Their prices are $100,000, 105, and 110. Then we look at three two-bedroom houses. The prices are 150, 155, and 160. From those numbers, what do you think is the value, the cost, of a second bedroom? >> What were the prices again? >> For one bedrooms, they were $100,000, 105, and 110. For two bedrooms, they were 150, 155, and 160. >> It sounds like it's about 40 to 50 more for a second bedroom. >> Yeah, I gave you a trivial example to kind of force the hand. So in this situation, if all that mattered was number of bedrooms and price, you can regress out the cost of that second bedroom; it's obvious. But as you said, there are dozens, if not potentially hundreds, of dimensions we could be looking at. Like we want a place that's going to be good for our bird Yoshi, right? >> Yeah. >> What are some of the features she would appreciate? >> She likes looking out the window. >> She doesn't like a lot of crows, though. I know that much. >> Sure, but I don't think we're going to consider how many crows are in the area. >> She's an important member of this family; her opinion counts. >> She lives inside, away from crows. >> That's true, but they still make her uncomfortable. >> Yeah, not a factor in mine. >> Anyway, so as one might guess, I'm doing a data project in our search. I've crawled the entirety of the county's tax records. I know what's sold in the last couple of years and what it sold for. And I have some feature sets on these houses, but I'm running into the curse of dimensionality because I'm trying to evaluate what is the cost of all these additional features. So, for example, covered parking is important, right? I mean, we don't need it; one of us parks on the street right now, and the other doesn't have covered parking, but maybe it'd be nice if we had a two-car garage: our cars would be more secure and they'd be covered at night and all that. What should we be willing to pay, do you think, for covered parking? How much extra? >> It depends how much street parking is available. >> Oh, geez, another variable I didn't consider yet. Wow, and there are so many dimensions we can look at with this. In order to evaluate the fair market value of a house, we have to consider all of these. In order to do that with machine learning, you have to have enough examples that the algorithm you're using can figure out how much each of those features contributes to the price. Algorithms don't necessarily scale very well with the number of dimensions. Some do and some don't, but even when they scale well, there are certain features that are rare, and you might not have enough training examples. How many houses have we looked at that had a spiral staircase, for example? >> I think one. >> One, exactly one. It's not a common feature, is it? >> Not in the ones we've looked at.
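As an aside for readers following along, the two-bedroom premium in Kyle's toy example falls straight out of a one-variable regression. A minimal sketch in Python (scikit-learn assumed; the six prices are the ones quoted above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Six houses: bedroom count is the only feature, price is the target.
bedrooms = np.array([[1], [1], [1], [2], [2], [2]])
prices = np.array([100_000, 105_000, 110_000,   # one-bedroom homes
                   150_000, 155_000, 160_000])  # two-bedroom homes

model = LinearRegression().fit(bedrooms, prices)
print(model.coef_[0])  # ~50,000: the implied cost of the second bedroom
```

With one feature and six examples, the coefficient is simply the $50,000 gap between the group averages; the trouble described next begins when dozens of additional columns compete for the same handful of examples.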
So right now, if I tried to run an algorithm and I used the feature 'has spiral staircase', it would pretty much either ignore that feature, which means it's not necessarily doing its job if that's an important contributor to price, or it would overfit that feature and say a spiral staircase means exactly this price, which is the price of the one property that has the spiral staircase. When you have too many dimensions and not enough examples showing the richness of your full feature space, the curse of dimensionality can creep in and keep your algorithm from properly figuring out what each of those things contributes to the overall objective function, in this case the price. In a case like this, you actually want more data, not more features, and this is a common mistake I see people make when they're doing machine learning. They think, oh, let's give it more and more features in our training examples because the more the merrier, it'll maybe use them in different ways, but that actually makes the problem more sparse, because now you have all these other variables and not necessarily enough examples, and you can overfit the problem. I know I'll get some emails for saying this, a similar thing I've said before that isn't always perfectly true, but usually more data is better than more features, because it gives you a richer training set to derive information from. So what about the exterior of a house, right? It could be stucco, it could be siding, it could be brick; do you have any preferences over those sorts of things? >> Depends on execution. >> Yeah, so it's really how good it looks. >> Yeah, I mean the condition also depends on the style of the house, like if it's a Spanish house then all those things look good, but then if it's Spanish and you mix some style in a weird way, like, I don't know what to say. >> Yeah, so there's a bunch of dimensions to describe the exterior of a home, like the material, the style as you say, all these things, and we probably won't see that many examples to learn the value. Like, have we looked at any Spanish houses? >> Yeah, I don't think so, have we? I think we've only looked at one, so I don't even remember. Even condos, like there was the one condo where the building was sort of Spanish-y. >> At least one. Yeah, so all we have to go by is one example. That's not enough to tell us how much the Spanish architectural elements could influence the price. There are a couple of ways around the curse of dimensionality. The most common one people talk about is dimensionality reduction. Instead of talking, for example, about the exterior in terms of the materials and the cost of the upkeep and how well it fits with the architecture, maybe we could just look at the exterior and classify it as neutral, ugly, or attractive. Do you think you could do that classification based on your own opinion? >> You mean the entirety of the exterior? >> Summing it up, yeah. >> Sure. >> So then you've just done dimensionality reduction. You've taken a complex multi-dimensional space about materials and colors and architecture and reduced it to one categorical dimension of cardinality three. That just reduced our dimensionality quite a bit. Good job, Linda. You're improving our machine learning problem. There are a couple of methods of dimensionality reduction, things like PCA and K-means clustering, which we've talked about before, and even K-nearest neighbors. We'll get into some of those in future mini-episodes.
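Linda's neutral/ugly/attractive label is a hand-made form of dimensionality reduction; PCA, mentioned above, is one automated analogue. A minimal sketch, using entirely hypothetical exterior scores (the columns and values are illustrative, not data from the episode):

```python
import numpy as np
from sklearn.decomposition import PCA

# Each row is a house; the columns are correlated exterior descriptors,
# e.g. paint condition, curb appeal, how well the style hangs together.
exterior = np.array([
    [0.9, 0.8, 0.9],
    [0.2, 0.3, 0.1],
    [0.5, 0.6, 0.5],
    [0.8, 0.9, 0.7],
    [0.1, 0.2, 0.2],
])

pca = PCA(n_components=1)
exterior_score = pca.fit_transform(exterior)  # one summary number per house
print(pca.explained_variance_ratio_)          # share of variation retained
```

K-means clustering, also mentioned above, gets at the same idea by assigning each house to one of a few clusters, much like the three-way label Linda produced by eye.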
So we've talked about how many dimensions there are in a home-buying process. What are the most important, I don't know, top four to you? >> Location. Layout, like how the space feels. >> Could we call that architecture, or do you mean something different? >> I guess we could say architecture. >> You got one more? >> I don't know. >> Other than your top three then, you want my top three? >> Sure. >> Ability to get Fios and possibly a T1 installed. Not too hot on the interior, I don't want to live in an oven. And location. So we agree on location. >> So not too hot means you want AC or what? >> Either AC or like good windows so that you can get cross ventilation. >> Interesting. >> Anyway, perhaps I'll share more on the pricing model I'm building as we get deeper into this. >> Okay, I'll take a look. >> Well, thanks for joining me, Linda, on both our journey of the podcast and our house hunting. And thank you for humoring me as we explore the exponentially increasing number of dimensions that go into a decision like this. >> Thank you. (upbeat music)