Archive FM

Data Skeptic

[MINI] k-means clustering

Duration:
14m
Broadcast on:
20 Feb 2015
Audio Format:
other

The k-means clustering algorithm computes a deterministic cluster label for each point in an n-dimensional dataset, given a chosen number of clusters "k".  This mini-episode explores how the biological processes of Yoshi, our lilac-crowned amazon, might be a useful way of measuring where she sits when there are no humans around.  Listen to find out how!

[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. So welcome to another mini-episode of the Data Skeptic Podcast. I'm joined as always on the mini-episodes by my wife and co-host Linda. Hello! Thanks for joining me once again, Linda. Thank you. So we're going to talk about something called the k-means clustering algorithm today. It's a technique that's the classic example of unsupervised learning. Have you heard of the difference between supervised and unsupervised learning? Mmm, I've heard the words. What does that mean? Supervised learning is where you're trying to train a machine learning algorithm to find a result. Typically, that's either a classification problem, like you want it to label something, "Oh, that's a vegetable, that's a fruit," kind of thing, or a regression problem, where you want it to look at data and fit a function that you can then make predictions with, or at least explain the old data with. And the reason it's called supervised is because when you give it training examples, you start out by saying, "Here are a hundred examples that I've already told you what they are." So if you wanted something to learn whether it was a fruit or a vegetable, you'd say, "Here are a hundred examples," and then you would give it some new examples it's never seen in order to test it. But with unsupervised learning, you're not going to give it any guidance. You're going to say: go out and try to find a way to classify or regress the data without any intuition as to where it goes. So, actually, I'll go out on a limb here and say that there's significantly more supervised learning in the world than unsupervised learning, although I'm eager for a listener to give me some reason to doubt that statement. But we call k-means clustering unsupervised because its goal is to take a bunch of data points and put them into k groups, where you pick the k, and then be able to classify new data points as to which group they fit into. So you haven't told it how to make the groups, you've just said, "I want k number of groups."
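To make that concrete for anyone following along at home, here's a minimal sketch in Python of what running k-means looks like, using scikit-learn's KMeans. The coordinates are invented stand-ins for marks on the cage paper, not real measurements:

    # Minimal k-means sketch. Each row is a made-up (x, y) position,
    # in inches, of one mark on the paper at the bottom of the cage.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([
        [2.0, 3.1], [2.2, 2.9], [1.9, 3.0],      # near the food bowl
        [10.5, 4.0], [10.8, 4.2], [10.4, 3.9],   # a favorite perch
        [6.0, 11.0], [5.8, 11.3], [6.2, 10.9],   # another perch
    ])

    # No labels are given; we only declare how many groups to find.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # which cluster each mark fell into
    print(kmeans.cluster_centers_)  # the three centroids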
So I thought of a good example that could make this a little bit easier to understand, I think, especially in the absence of a visual explanation. Picture this scenario, Linda. You have to go away on a very important business trip. I drop you off at the airport, and you're going to be gone for two weeks. I come home and I immediately clean Yoshi's cage. Yoshi is our bird, in case any listeners are wondering. And I also move around all her perches. So she has a cage that's, what is it, like six feet tall, and then it's like two foot by three foot on the cross section. Yeah. There are a lot of different spots she likes to go on the cage. Some of them are because there are food bowls there, and some are just her favorite spots and her favorite perches. So we kind of know where she goes, mostly because we see where she sleeps and where she is on the webcam and stuff like that. Let's say that I moved all that around so you don't know where her favorite spots are anymore. And you come back two weeks later from your business trip. Would you be able to tell where she's been sitting? Yeah, based on where she pooped. That's right. So describe for me what the paper looks like, especially after two weeks of use. Well, there's tons of food, droppings, and just general mess. Would you say that there might even be one bit of something, be it bird poop or a seed shell, such that every square inch might have something in it? No, they're all clustered around one area, because we have her food bowl in one area, and then she likes to sit in her two to three spots. So how would you decide that there were two or three spots and not, like, 30 different spots she could sit in? Just based on the clusters of her poop. So you can eyeball it and be like, okay, clearly she's been in these three locations. Yeah. Alright, you just did k-means clustering. Now, you did it on a two-dimensional space. We don't know where she was vertically, whether she was on a high perch or a low perch, but we have this two-dimensional projection of all her movements and positions. And we can track it because she somewhat regularly releases poops. She could go whenever she wants, really. Sometimes she goes three times in 15 minutes, and other times she could wait over an hour. But by and large, let's assume it's sort of uniform. It's not like she goes to one spot and holds it, so that we wouldn't be able to observe that she was there. Let's just assume she's not being so devious. In that case, the two-dimensional projection of all of her waste gives us a pretty good idea of where she's been. That's kind of cool, right? Yeah. So each speck of whatever on the paper is a data point, and you want to find a way of clustering those together. Now, you did it kind of by eyeballing it, right? Or that's how you imagine doing it with Yoshi's paper. Yeah. What if you couldn't eyeball it? The data was too complex, or it was in a greater than two-dimensional space. Like we talked about, we just have length and width, that's two dimensions. What if it also dealt with time or a couple of other dimensions? I don't know. You wouldn't be able to visualize it, so you couldn't eyeball it. You can make a 3D chart. You might be able to. But even beyond 3D, it's pretty much impossible for us to visualize something without at least some good visualization tricks. The k-means clustering algorithm is there to solve problems like this for you. So it's great in two dimensions, where you can kind of validate it, but the algorithm also scales up to multiple dimensions. So if you wanted to look at maybe a business use case, where you're looking at how long someone's been a customer and what their average purchase price is and all these other variables, you could line all those up and say: cluster my customers into k number of different groups. Maybe certain groups would get certain offers or coupons or things like that. So k-means is very popular because it's easy to use.
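Sketched the same way, that multi-dimensional customer case might look like the following; the feature names and numbers are hypothetical, and the features get standardized first because the columns sit on very different scales:

    # Hypothetical customer features: months as a customer, average
    # purchase price in dollars, and number of orders. All invented.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    customers = np.array([
        [3,  20.0,  2],
        [4,  25.0,  3],
        [18, 45.0, 12],
        [20, 40.0, 15],
        [36, 80.0, 40],
        [40, 75.0, 38],
    ])

    # Without scaling, the dollar column would dominate the distances.
    scaled = StandardScaler().fit_transform(customers)
    segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
    print(segments)  # e.g. one segment per coupon or offer campaign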
How do you think someone would go about writing an algorithm that did that? Let's go back to the Yoshi cage example. If you had to do it purely mathematically, how would you decide which of those data points goes into each of the clusters? I don't know. Just wherever it's clustered. Maybe describe for me what a cluster means to you. Where there is more than one poop. So if I go off in the corner and I find two coincidentally by themselves and I circle it and call it a cluster, is that a good cluster? We can say so for now. Right, so essentially you could circle any number of the spots and just call it a cluster arbitrarily, right? So ten different people could come up with ten different clusterings. But one must be better than the others, right? There must be one kind of best-case clustering. And the best-case clustering means what? The one that most appropriately and effectively describes the data. For k-means, we introduce this idea called a centroid. The centroid is the middle point of whatever cluster you have. So if you just arbitrarily put a bunch of points together and say that's one group, then whatever the center of that group is, is its centroid. And then you can take the centroid of all the other groups, so every group has a centroid now. And then a new thing shows up. Let's say you come back the next day and there's a new poop. You can compare the location of the newest one to all the available centroids, and you would say the centroid which is closest to the new data element is the one the new data element belongs to. You compare it to the closest one, then what? Well, once you have your clusters established and you're happy with them, to classify a new data point, you just say it belongs to the group whose centroid it's closest to. What if it's a new group? Interesting question, yeah. So in k-means, you generally don't evolve the number of groups over time. But actually, group selection is a very important problem. You have to decide what k you want to use. The machine will figure out the optimal way to fit the data to it, but it's your pick. So I've often found, when I've used this, I'll try a couple of different k's, and I'll see which one intuitively makes sense to me. If you set k at, like, two, then you might accidentally over-group something. So let's say Yoshi sits in three different spots and you say my k value is two, I want two clusters. The output of the algorithm will probably be one cluster that really is a unique spot, and a second that's actually covering two different spots; it's not very descriptive of those two, but it does happen to cover them. Or, if the correct answer really should be three clusters, she has three locations, it might put the first and the third in their own groups, but the one in the middle it might split halfway between the two. So there's not necessarily a correct way to pick how many clusters you want, but it's an important consideration. One thing you can look at is how tight all of your clusters are. If all the data points are very close to their centroid, that's probably good. But if you have some that are very far outliers, like a really big cluster, maybe you would say, oh, one of her sitting locations covers 40% of the cage. Well, that's obviously not correct, right? Because her body doesn't take up 40% of the cage. I think the correct term is accuracy, clustered around one point, right? Yeah, accuracy and precision are the key metrics you want to look at here. The trade-off to that is, I just described this scenario where maybe she has three sitting spots and you only have two clusters. If you said, oh, there are 99 clusters here, then you'll actually have a very high accuracy, in the sense that you'll classify all the points very close to the centroids, because you'll have different sitting locations all over, but you also know it's not quite right. You've overfit your data, because clearly she doesn't sit in 99 unique locations. So we bring in a little bit of domain knowledge about our bird and her relative size and how many spots she'll actually sit in. How many spots do you think she does sit in? Three? Why do you say three? That's about where her poops are. So you're already way ahead of me on this clustering stuff.
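Both halves of that exchange translate directly into code: assigning a new point to its nearest centroid, and measuring cluster tightness for different choices of k. scikit-learn reports tightness as inertia_, the sum of squared distances from each point to its assigned centroid. This sketch reuses the points array and the fitted kmeans from the first example:

    # Classify a new mark: it joins the cluster whose centroid is nearest.
    new_mark = np.array([[10.6, 4.1]])
    print(kmeans.predict(new_mark))  # index of the closest centroid

    # Compare tightness across candidate values of k. Inertia always
    # falls as k grows (at k = number of points it hits zero), which
    # is exactly the 99-clusters overfitting trap: look for an elbow
    # in these numbers rather than the minimum.
    for k in range(1, 6):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
        print(k, round(model.inertia_, 2))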
So in terms of solving this, the exact solution, like the mathematically best way to find the centroids, is actually a really hard problem. It's something called an NP-hard problem, but you can find approximate solutions pretty easily. So you get a pretty good answer in a relatively efficient period of time, which is why a lot of people like k-means clustering. I actually think it's useful, but I have two complaints about it. First is that you have to pick your k very well. If you pick it wrong, you might either overfit or underfit your data. And secondly, I don't like that it is not probabilistic. Every time you run it, you get a classifier that will take a new data point and label it as certainly belonging to one of the spots. But to your point, what if Yoshi picked a new location she wanted to sit in? It would be better if we had a probabilistic way of doing this, saying, like, oh, there's a 99% chance this is from the spot she likes by the water bowl. Or, oh, there's only a 10% chance I classified this correctly, because that added piece of information can give you a little bit deeper insight. And there are probabilistic ways of doing this sort of thing, but for today I just want to cover the standard k-means clustering algorithm. There's also k-medians, which uses the median instead of the mean, which is useful in certain situations. It's actually, I'm fairly sure, computationally more difficult, so I see it used less, probably for that reason. K-means is a little easier. Well, the other thing is, it's really good for continuous data. Do you know the difference between continuous and discrete data? What's the difference? So continuous is like there's a range. For example, income: I mean, you do make a precise income, right, a certain dollar value, but relatively speaking it's a wide-open spectrum of what people's salaries are. Versus categorical data, like male or female. It's not that female is one and male is zero, or vice versa. You might set up your problem that way, and in fact you can still do k-means clustering if you do a numeric encoding like that. But it's a little bit square peg, round hole, because if something's not numeric and continuous, forcing it into some numeric value is a little bit not the right technique. Although it can work sometimes. What about when Yoshi flings food off of her perch? What about it? Where it falls. Yeah, do you think that falls in the clusters? Oh yeah. Oh really? I thought it was just random. So when you look at the floor, how many clusters do you see? Oh, well, I guess I don't really see any clusters, but I feel like she has her head in the same position when she shakes it. So I feel like if she had, like, juice on her beak and flung it, eventually there would be clusters. I wonder if we might need a more Gaussian model for something like that. We should look into this for a future episode.
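For the probabilistic behavior described above, one common substitute, not what this episode covers but sketched here for comparison, is a Gaussian mixture model, which returns a probability per cluster instead of a hard label. This again assumes the made-up points array from the earlier sketches:

    # A Gaussian mixture assigns soft, probabilistic cluster memberships.
    from sklearn.mixture import GaussianMixture

    gmm = GaussianMixture(n_components=3, random_state=0).fit(points)
    probs = gmm.predict_proba(np.array([[10.6, 4.1]]))
    print(probs.round(3))  # e.g. [[0. 1. 0.]]: near-certain the mark
                           # came from one particular spot, rather than
                           # an unqualified hard label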
Speaking of other episodes: now that you've finished listening to the Data Skeptic Podcast, as I announced in the last mini, I'm going to be doing a couple of shout-outs over the next couple of shows to other great data science podcasts. The one I want to recommend for this week is the Data Stories podcast. You can go to datastori.es; it's not .com, it's .es, so datastori.es, and I'll put it in the show notes if that's confusing. Everyone go check that out. It's a great show. They're kind of more focused on data vis; it's two professors who are big in data visualization, and they have on a wide variety of industry guests and authors and stuff. So it's a great complement: if you like the data science aspects of the Data Skeptic show, go check out Data Stories as well. Well, thanks again for joining me, Linda. Well, thank you, Kyle. Do you think you'll use the k-means clustering algorithm, or suggest it in your workplace environment? It's really not my place. I don't really deal with data. So, not likely. But you do clean up the cage. Yeah, I don't think I need that for Yoshi's day-to-day cleaning. It's not important knowing how many different spots she sits in. I guess it is important, but do we really need to use it rather than just look at it? I think we should photograph it every day and open-source the project. Well, you could lead that. All right, perhaps. We'll see if I can find some free time for that. I don't think people want to look at her poop mats. It's pretty gross. It's not that gross. Mostly it's shells. Yeah. Anyway, thanks for joining me. Thank you. [ Silence ]