This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.
Data Skeptic
[MINI] k-d trees
(upbeat music) - Data skeptic mini episodes provide high level descriptions of key concepts related to data science and skepticism, usually in ten minutes or less. Our topic for today is K.D. trees. - So I have a book here in front of me. Can you tell listeners what I've just handed you? - It is a book about, looks like a textbook about an inch and a half thick. And it says an introduction to komagorov complexity and it's applications. - Yeah, I don't know if it's komagorov or komagorov. I think it's actually you were more correct, but I'm used to saying komagorov 'cause that's how it was tied. Anyway, I'd like you to tell me real quick, does this book contain the word potato? - No idea. - Well, find out for me. - Well, it's gonna take longer than this message from our sponsor. (laughs) - You mean basically you're gonna have to go read the whole book, is that right? - Maybe, or there might be an index. - Okay, yeah, check the index. - See if the index is by fruit and root. (laughs) - By root, is there root section of the index? - They're root, name potato. So under P.O's, there are no potatoes. So I would guess there is no word potato in this book. - But the word does not in the index either. - You didn't look. - I did actually. I wanted to double check that. - Some people put in jokes, like Colonel Hannah Ritchie's C book, the word recursive also lists the index on which you find the index page that has the word recursive on it. - Yeah, it says theorem, but no though. - Right, so you can't be sure. Maybe they didn't regard potato as important enough to be in the index. So the only way to know if the word potato appears in that book is to read the whole book, essentially. So how many words do you suppose are in that book? - Maybe there's a thousand on each page. Maybe there's a thousand pages. So what's a thousand times a thousand? - A million. - Well then, a million. - A lot though, for sure, right? - Sure. - All right, now I have another book I'd like to give you. Can you tell the people what this book is? - He just gave me the Miriam Webster dictionary. - Let the record show we had to actually buy a dictionary to record this when he didn't have one. - We had to delay this recording until Amazon Prime delivered it to our doorstep. - Now, is the word potato in that book? - Oh, for sure, it must be. - Does it end in an O or a E, though? - Potato, I don't know, maybe O. - How long will it take you to find out? You'll have to read the whole dictionary, right? - Should you time me? - Well, you'll have to read the whole dictionary, right? - No, 'cause I know the order that the dictionary is in. - Uh-huh, yeah, so that order being? - Alphabetical. - All right, text is sorted, so you can take advantage of that and you can use an algorithm to find the word very quickly. Now, you've just kind of cheated, you're going straight to the word. - Potato. - Okay. - No E at the end. - But, so how many pages-- - Except plural. - For that? - Then it is spelled like toes on your feet. - Really? - The toes, there's knee. - Is that the real way that it works? - I think it's the same with tomatoes. Oh, let me know, P, R, S, T. - So how many pages are you needing to look at to find tomato here? - I flipped there a lot more than I did to find potato. - How many pages would you guess? - 20. - All right, well, what if I told you I could almost guarantee you could get to it in 10 pages flips? Actually, I think I can guarantee 10. - I think you're trying to be a magician. Tomato is the same, Justin O on singular, and then plural hat looks like toes. - Here's what I want you to do. Take that book, set aside the fact that you know like how many words begin with X? - I don't know that. - But compared to like B. - Less, less to B. - Yeah, we're less. So you have a lot of this prior knowledge about the alphabet where you could kind of guess where to turn to in the dictionary a little bit and get a head start. But imagine this was a language you don't know very well and you're gonna look up a word. You don't know how letters are spaced out. So the word could literally be anywhere, right? And you wanna look up tomato. Turn to the middle of the dictionary. - Should I have checked the numbers and to go exact? - No, it doesn't have to be really exact. The computer would do it. - I'm in middle. - All right, what's your favorite word on that page? - Mortar. - Mortar. - The J.R.R. Tolkien thing? - A strong bowl. - Oh. - In which a substance is pounded. Mortar and pesto? - Oh, I got it, mortar. Yeah, okay. I heard mortar. Is tomato before after mortar? - Oh, a man, so after. - Okay, so now you're welcome to do this if you'd like or you can just pretend rip that thing in half and throw away the front of the dictionary, okay? - Yep. - I know the word isn't there. That part is no longer of use to us. - So you're trying to eliminate it? - Yeah, and I have to say this dictionary was so cheap, we could actually just tear it and have it if we wanted. - No, thank you. - Okay, now go to the middle of the back half and tell me what word you got. - Slough. - Slough, like skip to the, slew my darlin? - I don't know, they told me to then go to another word to look it up. I hate when they do that. Go to this word. - Do at least tell me the page number? No. - What if it's a circular reference? They have to check, I wonder if dictionary people have to check for circular references. Anyway, is tomato before or after that? - S-T, so after. - Okay, so throw away the half you got in your hand there and take the remaining and cut that in half. - Okay. - Well, after it cut. - Vice president. - Okay, is it before or after? - Before. - Okay, so now cut between that and where your old finger is. Here's where it gets tricky now. - Okay. - All right, what do you got? - Talnet. - Ooh, good word. Oh, I love, do you know about Talnet? - What do you think the definition is? - Talnet is a protocol by which it's kind of fallen out of fashion, SSH has really taken over, but Talnet, I used to use it in the modem days to dial to different computer systems. - Yes, it's a protocol to port 20 or-- - A protocol to use a computer on a network via a local computer. - Yeah, but different from TCIP, you have to really want to connect through Talnet. Talnet's like a user is a tool. - Yeah, well, we used to have a lot of companies named with the word T-E-L and now that's so out of date. Oh my God, today at work, you know, we all use the program Slack. - Yeah. - I said I am like instant message. And this guy, he like graduated from college a few years ago and he goes, he starts laughing and I go, what? He goes, he just brought me back to the 90s. And I was like, well, what are you saying? He goes, I just say Slack. - Anyway, now is it before or after where you're at? I would tell him that after. - After. - Okay, now cut it that in half. - Okay. - So that was our fourth. All right, well, what's the word? - Uncovered. - Uncovered, okay, so we go to the first half. This will be our fifth. - That's getting a little smaller. We're trying to find the middle, okay? - Yeah, it's hard to even find the middle. - What are you at? - Tuleen, liquid hydrogen carbon used as a solvent. - I think that was seven, I might have made a mistake. - I know a men-opique. Now it's very, very hard to cut this. - Hey, you don't have like three pages. - It's before this. - Okay, well, what's the word? - Oh, you know what? When I said Tuleen, it was on that page. - Okay, so well, I mean technically-- - So actually, I had to go back. - No, let's just call it even like seven or eight, maybe. - Okay. - That's pretty rare. I'll fix that dictionary. Bigger than Liam Batani's book we were looking at. - It's an 852 pager. - Wow, and in basically seven or eight steps, you could find a word and that would pretty much be true for almost any word in there. Because if you notice, what happened when you cut it in half? - Well, it cut the problem in half, huh? - Recursively, every time, right? - Is that recursive, but it's not exponential, huh? - No, it's actually logarithmic. - Logarithmic? - Yeah, but it was just related to exponential. - It's an equal slope. - Yeah, so imagine this now. What if that dictionary was twice the size? I got a dictionary with a bunch of extra words in it. How long would it take you? - Same amount. - Well, maybe one extra. - Okay, actually, a few more cuts. - As the problem gets bigger and bigger, the work to solve it gets bigger in a sub-linear way. It's not super bad. Like if I was like, "Hey, Linda, read the Coma Girl "conflexity book." Oh, and by the way, it's twice as long as you thought, it would take you twice as long to read, right? But the dictionary, I can double the size and there's no problem. It's one extra little step. So this is an algorithm called binary search, very famous, and it's good for one-dimensional problems. Did you notice that, what do we know about the dictionary here? What's the order they're in? - Well, they're in alphabetical order. So that attribute, I think, is related to the way it's spelled. - So what if I said, "Hey, Linda, "find me the longest word in the dictionary?" - That would be harder to find. - Yeah, 'cause it's not ordered that way. - So it doesn't work. So you just gave an example where binary, what is this called, binary? - Binary search, yeah, it doesn't work. - You're not hurting my argument, you're helping me here. That's why we need KD trees. - There are some multi-dimensional problems, such as, I was thinking of an interesting one, there are men's pants, I buy my pants, in the length and the waist are the two independent measurements. So if I want to know if they have pants in my style, I can't really binary search, can't I? - Why not? - Because if there's one rack, what's the order they put them in, in two-dimensional space? - I thought they put them by size. - Yeah, but there's two sizes. There's length and waist. - And don't women have a similar thing? Aren't there like three numbers you always have for your clothing? It's like something dash something dash something. It'll be like 22, 32, 42, that kind of like. - What? - Well, those numbers are kind of funny of your triangle shape first thing. (laughing) - Hi, I'm a triangle. I live on Sesame Street next to the letter B in the number four. (laughing) But wait, what if that's somebody's actual size and we just insulted them? - I don't know, but anyway. - But I mean, couldn't it be? I'll cut it out. I don't wanna be mean. - For women's pant sizes, you don't measure. Those three sizes are for your bust, your hips, and your waist. That doesn't apply to pants. So then your hips and your waist. - It's not an SI, so I'm not familiar with this measurement. - What's SI? - It's this scientific standard of measurements, like angstroms and meters and all that sort of stuff. - So I had to cut in during the editing process here to correct or other bone-headed error. - SI, the international system of units, does not include angstroms. The seven base units are meters, kilograms, seconds, amperes, kelvins, moles, and candelas. Angstroms, of course, are a unit of length and are therefore redundant with meters, which are already there. - Anyway, so that you said, bust, waist, and thighs. - No. - Bust, waist, and hips. - Oh, what's the difference? The same thing. (laughing) - Fies, one leg. There's a crumb run. Okay, but hips are plural, and there's no E at the end. - Well, no, 'cause then it'd be hypes. (laughing) It wouldn't be the same word. - You're not confused with tomato potato. Anyway. (laughing) But if you want to organize your men's pants and search them efficiently, they are not one-dimensional, they're two-dimensional. So you can use a KD tree. And what you do here is it's a lot like a binary tree, but at every level, you flip the thing you're storing on. So first you split it on maybe a waist, then you split it on length, and so on and so forth. And when you get down to the leaves, you don't actually sort of store the objects there. You store like a bin number there, and you say, oh, well, these pants can be found in this bin, and then I go straight to the bin I want. So that's what a KD tree does. It lets you store and search multi-dimensional objects using something very much like binary search, which runs, as we said, in very efficient time, sublinear, what they call big O of log n. So as the problem grows, you don't have to grow at the same rate. You can grow at a slower rate. I use it mostly for latitude, longitude stuff, putting geographic things together. It's great for that, 'cause it's two dimensions. But technically speaking, you could run it in any number of dimensions, I guess. - Well, I just want to point out that Kyle, you excel at making up situations and saying how useful your solutions are. (laughing) Like the example with the maze, you're like, oh, just put your hand against the wall of the maze. If you ever find yourself in a maze and you need to get out as quickly as possible, and you don't have a map or any aerial view, just put your hand against one side and just follow that, and you're likely to get out faster than anyone else you don't know. - And you'll never forget it. - Yeah. - And there's not gonna be another sequel to the maze running movies. - So, yeah. - Kyle just comes up with these situations. - No, I didn't invent that. - Well, I just think it's funny that if your situations are so obscure in real life, you have to make it up. Like a maze, like ever, Linda, when you're ever stuck in a maze, which will never happen. - What if you woke up with amnesia in a corn maze? That'd be a great movie. - No. - And you need to find your way out, but you remember this 'cause you heard the solution. Anyway, thank you as always for joining me, Linda. - Thank you. - Until next time, I want to remind everybody to keep thinking skeptically of and with data. Good night, Linda. - Good night. (upbeat music) - More on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music) (upbeat music)