Archive FM

Data Skeptic

[MINI] k-Nearest Neighbors

Duration:
8m
Broadcast on:
24 Jul 2015
Audio Format:
other

This episode explores the k-nearest neighbors algorithm, an unsupervised, non-parametric method that can be used for both classification and regression. The basic concept is that it leverages some distance function on your dataset to find the $k$ closest other observations in the dataset and averages them to impute an unknown value or label an unlabelled data point.
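As a rough illustration of that description, here is a minimal sketch of k-nearest-neighbors imputation in Python with NumPy. The function name, the Euclidean distance choice, and the array shapes are assumptions for illustration, not anything specified in the episode.

```python
import numpy as np

def knn_impute(query, X, y, k=3):
    """Estimate an unknown value for `query` by mean-averaging the known
    values y of the k rows of X closest to it under Euclidean distance."""
    dists = np.linalg.norm(X - query, axis=1)  # distance to every known observation
    nearest = np.argsort(dists)[:k]            # indices of the k closest observations
    return y[nearest].mean()                   # impute with their average

# For classification you would instead take a majority vote over the k labels.
```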

[music] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Today's topic is the K-nearest Neighbors Algorithm. So this is an unsupervised learning algorithm. Do you remember when we talked about those? No. Unsupervised is the case where you either can't or don't want to provide training data. So let's say you had some, like, group of, I don't know, bacteria, and you wanted to say which ones will survive or not survive. Survive/not survive is a label, so you'd have to go gather a bunch of data for the machine learning algorithms to use. In unsupervised learning, you don't have to do that. They're going to just kind of figure out and give you a solution based on some parameter. In the case of K-nearest neighbors, it's the K. So the general idea is that if there's some piece of information you'd like to know about a data point, averaging, either by mean or median, over the K closest other cases is a really good way of imputing those missing values.

So I think maybe a practical example would help here. And maybe we could talk about the number of steps people walk. What do you think? Sure. Isn't that a useful number? Do you track the number of steps you walk? My phone does, yes. How many are you usually walking? I don't walk that much. How come it's always ringing bells and giving you awards and stuff? That's only on the weekends. But yeah, on a day to day... Well, you get an award at like 11 a.m. It's crazy. That's because we hike before 11. All right. But yeah, walking, I think people have been trying to increase their fitness level by using pedometers, just by monitoring how active they are and how much they move. Or those wristband fitness trackers too. Just to be clear, we're talking about things like Fitbits, right? Not those hologram bracelets, which are total bunk. I don't know those, but yeah, Fitbits. Yeah, Fitbits, or like I have the Basis watch, stuff like that. If you have those tools, like I have my watch, I can actually look back for a couple years now and see how many steps it estimates that I took. Can you do the same? I don't know about years, I don't have that much data, but I can look back like a week or two ago. All right, because your phone's only been tracking. You've only been doing that for a while. Yeah.

Now, what about someone who hasn't been tracking on their phone or on a measurement device? What can we do for them? Well, I think you're telling me we could use this algorithm. Well done, Linda. Yes. If there's a new data point and we want to estimate the number of steps taken and there's no measurement, they're not using their phone or some other measurement device, how do we get at it? Well, the basic idea of K nearest neighbors is: find the K, and you'll have to pick a K on your own. Find a smart one, but usually it's kind of a low, perhaps single-digit number for a lot of applications. What are the data points you do know that are very similar to that unknown data point? And then if you average all the similar ones, you'll probably get a close answer.

So when we say nearest neighbors and we're talking about walking, it obviously is alluding to thinking of, like, a community, which actually kind of works pretty well in this analogy. If there's a person you want to know how much they walk, looking at how much their neighbors walk isn't like a 100% accurate way to do it by any stretch of the imagination, but it's a pretty good estimate, I would say. What do you think?
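To make the step-count example concrete, here is a tiny worked sketch in Python. The ages, step counts, and the choice of age as the single similarity dimension are all made up for illustration.

```python
# Hypothetical neighbors: (age, average daily steps) -- made-up numbers for illustration.
known = [(34, 7200), (36, 6800), (41, 5900), (58, 4100), (62, 3800)]

target_age = 38   # person with no tracker whose step count we want to impute
k = 3

# Sort the known people by how close their age is to the target, keep the k nearest.
nearest = sorted(known, key=lambda p: abs(p[0] - target_age))[:k]

# Impute by averaging the step counts of those k neighbors.
estimate = sum(steps for _, steps in nearest) / k
print(nearest, estimate)
```

With k = 3, the three closest ages (36, 41, and 34) get averaged, giving an estimate of roughly 6,633 steps per day for the untracked person.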
Yeah, sure. They have these things called neighborhood walkability. Yeah, that definitely plays in. I'm sure that encourages or discourages you to walk around. Or like if there's stores and cafes and restaurants in walking distance, probably people are walking more there. Or things are within walking distance and parking is very limited. Uh-huh, yeah. Or if there's a bus stop and people think, "Well, there's all these reasons neighborhood-wise." So when we think neighbor, and this is where the analogy is not perfect, we think of latitude and longitude distance, right? Like our neighbors are the people kind of on our block, we would say. But someone can also be a neighbor in dimensions that aren't geographic. Like income is another dimension. And it seems reasonable that someone's income could affect how much they walk, wouldn't you say? I don't know. Yeah, I mean, I feel like if you're like a construction worker, you probably walk more. Or if you're a postal person. Oh yeah, they walk all the time. So I don't know where their salaries fall versus someone who sits at their desk all day. Actually, it would be really interesting to see dollars of income versus steps per day, because there's actually a lot more variance there than you'd think, because there are high-paying jobs where you do walk a lot. Like a doctor, right? Or a lawyer who's just going to pace around the courtroom, making a lot of money and also walking a lot. But then we kind of have this impression of most people making pretty good salaries as sitting at a desk, like you said. So it's not a perfect dimension, but it's an interesting one.

Another one would be like age. Do you think age correlates with steps taken? Well, I think the older you get, you're probably going to get lazy. You're going to alienate all my elderly listeners. Who are you calling elderly? All these ones that you just said don't walk anymore. I don't know. It's a good question. You know, people love listening to podcasts while having afternoon walks. Or is it the opposite: the older you get, the more steps you take? Because maybe with other exercise, like running, you'd probably be like, "Oh, I can't run anymore," so you choose to walk more. I mean, I guess, are we going to distinguish between walking or running, or count them the same? Let's just go with walking for this example.

There's any number of dimensions we can look at. The sort of literal latitude and longitude, like, "Yeah, your neighborhood probably affects that." And people of similar lifestyles also kind of live in similar areas. But we can also look at dimensions of age and income and even profession, these sorts of things. So if there's a new person, they don't track their step taking, but you'd like to estimate it, the thing that you can do is find some similarity metric that compares the data point that is unknown to the ones that you do know. Find the K most similar for your point of interest and average, in some way, the metric over all those other points to get your estimate, your imputed value for your missing data point.

So is that it, you average? Typically, I mean, you can, depending on your application, do some fancy stuff here; probably the second most common thing is to take the median. But there's other stuff you could do. But I think, yeah, the mean average is probably the most common thing you do with K nearest neighbors. So all you're doing is averaging. Yeah, but you're strategically averaging from some points that you've been very clever about selecting.
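Since the discussion here is about combining several dimensions (age, income, and so on) under a similarity metric and then averaging, here is one possible sketch using scikit-learn's KNeighborsRegressor. The feature choices, the toy numbers, and the use of standard scaling and distance weighting are assumptions for illustration, not anything prescribed in the episode.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: [age, annual income in $1000s] -> average daily steps.
X = np.array([[34, 52], [36, 61], [41, 75], [58, 90], [62, 48]])
y = np.array([7200, 6800, 5900, 4100, 3800])

# Standardizing matters here: age and income live on very different scales,
# so without it the distance metric would be dominated by the income column.
model = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=3, weights="distance"),  # closer neighbors count more
)
model.fit(X, y)

# Impute daily steps for someone with no tracker from their 3 most similar people.
print(model.predict(np.array([[38, 65]])))
```

The weights="distance" option is one way to get the weighting idea mentioned in the episode, where really close neighbors influence the average more than neighbors that are only kind of close.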
Okay, wow. I think if I ever use this at work, I'm going to throw out this term, K nearest neighbors. "Let's use K nearest neighbors." They'll be like, "What do you mean?" I'll be like, "Average." Yeah, pretty much.

So I like this algorithm a lot because I love unsupervised learning, and we don't always talk about too many unsupervised approaches. It's actually really easy to implement, although there are already great libraries out there. And it works pretty well and can be calculated kind of on the fly. There's not a lot of pre-calculation that has to be done to get this implemented. This requires some notion of proximity or neighborness, if you will, but that's usually pretty easy to define and doesn't require a lot of other criteria. Also, in some implementations, the neighbors can be weighted, so maybe people that are really close could influence the average more than people that are only kind of close, depending on how sparse your data set is. But it's a really useful algorithm, especially in high-dimensional space, where it might be hard to otherwise divide up all those data points.

I want an example of high-dimensional space. Good question. So let's think about the walking. We have, you know, like I keep saying, latitude and longitude, age, income. Oh, we could count what kind of car they drive, because, you know, people with certain types of cars, I guess, would just drive more, while others would be like, "Oh, I'm going to walk instead of taking the car." How many minutes a week they spend bicycling. There's all these dimensions of people that are somewhat predictive of how much they're going to walk. So you can use this for what? So every potential dimension you look at puts you in a higher and higher dimensional space, because there could be interesting little clusters somewhere in there that are very hard to visualize.

Nice. Well, thanks again for joining me, Linda. Until next time, I want to remind everyone to keep being skeptical of and with data. Good night. [MUSIC PLAYING]