There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straightforward, but what about the distance between the top of Mount Everest to the bottom of the ocean? What about the distance between two sentences?
This mini-episode summarizes some of the considerations and a few of the means of calculating distance. We touch on Jaccard Similarity, Manhattan Distance, and a few others.
[music] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Our topic for today is distance measures. So Linda, you and I are about to go on a trip to New York City, right? Yup. For what purpose? For your friend's wedding. Yes, the actually second guest of the show, Nikki Athanasiadu, is getting married, and we're going to help celebrate that event in New York City. We'll be traveling all the way from here in Los Angeles, California. How far of a flight is that? Probably five and a half hours direct. Ah, I see you chose to call it in terms of hours, but aren't we actually going to travel a distance that's in miles? Yeah, probably two thousand miles. But how come you said hours? Because it doesn't matter how far it is, the question is how many layovers. Right, yeah. What's important to us is the time of travel. Maybe what's important to the pilot might be the miles because he wants to optimize the jet stream and minimize the fuel and all this stuff, but that's not our concern. We paid a flat fee, right? Well, it better be. I'll check my credit card in a little bit. I'll be like, "Catching for the next extra mile." Yeah, exactly, so distance is something that's tricky to measure, but it's also very important because we often want to minimize or maximize distance depending on our different situations and being able to measure how similar or how distant two things are is a very common problem in data science that we need to worry about. So that's why I thought this would be a great topic for today. Now, interestingly enough, we're going to New York City, right? And where are we staying? Brooklyn. In Brooklyn, but where's the wedding taking place? Long Island. But the ceremony, where's the ceremony taking place? Central Park. Uh-huh. Let's just for a moment assume we're in the, what are they called, the Lower East Side? How do we get to Central Park? No idea. Well, we could take the subway. We could take a taxi. I guess we can't take a Uber because that's illegal or something. We could walk. Walk, yeah. Yeah. Can we go by what they call as the crow flies? I don't know what that means. You don't know what as the crow fly means? What does it mean, though? Well, I'll give you the technical definition. It means the length of the geodesic connecting two points where a geodesic is the curved line because, you know, we live on a sphere or an oblique spheroid. So you don't go straight. You go on a geodesic, which is the curved line because we're going across the surface of the oblique spheroid. It means going straight across. I'm confused. Going from LA to New York, like if we thought about the distance, like you said, it's about 2,000 miles. That's because we assume like our plane will take off from LAX, maybe land in, I don't know if it's JFK or LaGuardia, but basically it probably goes pretty much straight there. Maybe it takes advantage of like the jet stream or it does something weird because of the Rocky Mountains, but more or less goes pretty much direct, right? It's not like we fly perfectly east till we get to like Georgia and then we go perfectly north. I don't really know what direction it flies, but yeah, we could assume. Yeah, it pretty much goes as straight as it can. That's the most logical route. That's like as the crow flies, meaning like if there was a bird, it wouldn't worry about where the roads are. It's just going to go from the shortest distance from may to be. Interesting. We can assume that, but you know what? I bet if they study it, I bet birds follow a certain path that's like easier, which could even be roads or something that like carve it out easily for them. It's funny as you mentioned that because I'm trying to get a guest on the show to talk about bird tracking right now, but anyway, back to our main thing. Let's just assume as the crow flies means the shortest possible distance from a to be. But if we're in Manhattan, we can't go through buildings, can we? Well, some buildings, I guess we can, but we can't necessarily take the perfectly shortest path. Like a crow? No, we can't fly like a crow. So how do we have to travel? Well, we could walk, take a subway. And how are the roads that we decide to walk? How are the roads laid out in New York? I don't know. Some are grid. Yeah, almost a perfect, pretty much perfect grid. There's actually a name for this. It's called Manhattan distance. It's sometimes a slightly different idea, but some similar idea is rectilinear distance, which means you can travel only in the horizontal or vertical direction at one time, never both. But you want to travel both on the journey from your source to your destination. It's a little bit like those grabber machines, you know, that you see at stores where you can maybe win yourself a stuffed animal, where you first move, you know, in the horizontal hand. Yeah, first move that in the horizontal direction, then the vertical, but never both at the same time. So what is this called? Manhattan. Manhattan distance. The distance you would travel in Manhattan to get from point A to B accounting for the fact that you can't go on the diagonal. So why is it called distance? That's just how you measure. It's a way to measure things. It is. It is a distance measure because it's about traveling from A to B. So how would you use it in a sentence? I would like to go from our current location on the Lower East Side to our destination in Central Park for the wedding of Steve and Nikki. And I will be walking there and I will take a series of streets and I would like to minimize the amount of walking I do. So I will cut, I will go, you know, explicitly North at some times and explicitly East or West at other times. But how do you use a phrase in Manhattan distance in a sentence? Yeah. The path that I will take will have a Manhattan distance of 0.7 miles. So it's always expressed as distance? Yes. Not time. Correct. Yes. Manhattan distance. Well, could you ever express Manhattan distance in time? Let's go ahead and say no. Maybe there's some corner case, but no, Manhattan distance pretty much a positional length-based measurement. Positional length-based measurement, what does that mean? It means that you start from some absolute deterministic point. Let's just say it's established by some x and y coordinate and you want to travel to some destination also an x and y coordinate, but the distance is not the Euclidean distance, which is like, you know, the square of the hypotenuse of the triangle that's made up of the connection of those points, but rather the traversal of all the left and right up and down diagonals you have to take or non-diagonals you have to take. Now distance and similarity are two sort of linked ideas, right? Because something that's similar, we think of as being very close and something that's distant as being not very similar. I also want to just mention something about this idea called jacquard similarity. Well, so the jacquard index is the similarity or the jacquard distance is sort of the compliment of that, like how far apart things are. So imagine we went to the grocery store and we saw two different grocery carts that had some stuff the same and some stuff different. Actually, let's just say there were three grocery carts and you wanted to say, which two are the most similar? How might you do that? Well, if some of them have the same thing, like something had chicken breasts, then you could say those are most similar if they have two of the same things. Yeah, yeah. So maybe the more similar things they share, the more similar they are, or put more formally, the intersection of the two sets divided by the union of the two sets would be a useful measure of how similar or how distant they are from one another. Say it again. The intersection, meaning the things that they share, divided by the union, meaning the total items, types of items represented, could be an interesting way of measuring how similar or different two carts are. Well, that's dependent on something being similar. Yeah, yeah. So you can see how distance and similarity are important ideas that have lots of applications. I don't know about applications. That hasn't been made clear to me, no. Okay. What if, just for example, we wanted to build a recommender system on an e-commerce website and you knew the buying history of a bunch of people, would you want to recommend products that were purchased by people who have similar purchasing patterns or products that were purchased by people that have very different and distant purchasing patterns? Well, obviously the similar one. Right. So similarity is some measure of distances required to really define that. So there's an applied version of it. But even as we take our trip to New York, we're going to have to have our own distance metrics on the way there. We're going to measure it in minutes because we want the shortest flight. But once we're in town, we're going to measure it in terms of like miles or maybe meters or whatever, because we want to probably minimize the walking time, potentially, unless we're on like a cool walking tour. Another place this comes up, we previously talked about the K-means clustering algorithm and that uses some distance metric that I didn't really define then. We also more recently talked about the K-nearest neighbors algorithm. Do you remember that? I mentioned like that we might use something like that in finding a house to buy. What was it based on again? So imagine like there are two houses and we want to compare them. And there's the distance apart they are driving, but there's also the distance they are apart in terms of other features, like how many bedrooms they have, stuff like that. And actually a coworker of mine recently pointed out kind of a clever nuance to that. If you have a house immediately on the beach versus half a mile away from the beach, do you think we would just look at that as distance from the beach and evaluating like the value of that home? I would. You would. You would just say that there's some cost trade off in terms of dollars like for every foot you are away from the beach, it reduces the price by $1.75 or something. Yeah, but it's probably exponential the further away you move from the beach. Yes, yes indeed. In fact, I would say it's actually sort of stepwise like being precisely on the beach is probably exorbitant, but being like semi-close like within the first block costs a lot. And then there's like sort of plateau drop off from there potentially. Yeah, because then once you're far enough away, people don't care. Like the difference between like living at La Brea and La Cienega in Los Angeles doesn't matter like the beach is irrelevant at that point, right? Yeah. Distance is this funny thing that we think we always understand, but we have to have pretty specific definitions of at time to make it useful. But since we're on the subject of that came nearest neighbor's algorithm episode, I want to mention that I got an email from a listener, I'll just call him Matt T. And I want to thank him very much for writing because he made an excellent point. I kept talking about like, okay, nearest neighbors, it's multi-dimensional, high-dimensional spaces. And he actually sent me a link to a really great lecture by Trevor Hasty about how as you increase the number of dimensions, can your neighbor doesn't necessarily stand up. Because as you get into more dimensions, the distance between things can be exaggerated. Well, exaggerated isn't quite the right way to put it. I'm going to link to that video in the show notes. I would encourage people to check it out because I guess my moral here is that distance is a fickle thing and you have to be very careful with how you measure the distance between two things. And this is a common problem that's all over data science and carefully and cleverly measuring distance can often make or break your solution. So when you say distance, you're saying how different things are, not just things that so there's concepts like thoughts, how similar and concept they are versus how different they are. If they're similar, they're probably close in distance, or if they're not similar properties than black and white, they're probably further in distance. Yeah, yeah. We're approaching the political season, so I'm sure at some point people are going to start talking about the similarities and differences in the language of speeches, which might use cosine similarity that will have to be a topic for another day because I think we're running over here. But basically, yeah, you can even measure the distance of language if you define it quite well. Any distance measures lacking in your life, Linda? Well, I never thought of concepts as being having distance between them. Oh, everything has distance. Well, I think of Democrats and Republicans because that's a clear scale. Yeah. Well, how about we minimize the distance between you and me right now? Bing! Thanks for joining me, Linda. And until next time, I want to remind everyone to keep thinking skeptically of and with data.