Data Skeptic

[MINI] MapReduce

Duration: 12m
Broadcast on: 10 Jul 2015
Audio Format: other

This mini-episode is a high-level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The idea originates from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. This episode makes an analogy to tabulating paper voting ballots as a means of helping to explain how and why MapReduce is an important concept.

(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - Well, welcome to another mini episode of the Data Skeptic Podcast. I'm here as usual with my wife and co-host Linda. - Hi, I'm Linda. - Linda, our topic today is MapReduce. Have you heard of this before? - Nope. - You don't even have like a crazy engineer running around your office talking about Hadoop all the time? - It's called Hadoop? - Well, Hadoop, okay, maybe we should get into it. I just thought maybe you'd recognize one of these words 'cause people love talking about Hadoop. MapReduce is a particular type of framework, and I'll talk more about that in a second. And Hadoop is a very popular implementation of that. And it's for working on big data. Surely you've heard of big data, right? - Yeah, but I don't really know what it means. - Okay, let's start there. What would you guess that it means? - Lots of data. - Yeah, that's a good start. Do you think it would fit on a three and a quarter inch floppy? Three and a half, right? - I don't remember how much a floppy disk holds. - Virtually nothing by today's standards. - Maybe 500K. - Like what? - No, no, I was wondering that. Basically it would hold like one MP3. - Okay, so like five megabytes. - Actually, it might not hold one. Hold on, a three and a half is 1.44 megabytes, if memory serves. Anyway, small data. The analogy I like to use sometimes when people ask me what big data means is I say, have you ever tried to open a file that's really, really big? Like usually this will be like a spreadsheet for some people, or maybe like a Photoshop file with a bajillion layers, and it just takes forever to get it open. Does that ever happen to you? - Maybe, I think eventually it just doesn't open. - Right, so that is actually sort of the, just like we have the sound barrier, there's kind of like a big data barrier.
When you try and open something on your computer and it just does not work. So let's talk about what your computer has available. It has a CPU, you know what that's for? - Processing? - Yep, the CPU is generally fast. It has memory, you know what goes there? - Well, it's your thinking ability. - Yeah, it's like RAM, right? - Yeah. - And then there's the hard drive, the disk space, or sometimes like the network. So I'll call that IO because it means that your CPU has to go out to either disk or over the network to get information and bring it back. So that's pretty slow. Memory is very fast, it's right on the motherboard near the processor. So have you ever noticed that like your computer will be sluggish when you turn it on? - Oh yeah. - You know what that is? - I don't know, I don't, maybe I don't like my computer. (laughing) - Well, it mostly has to do with the fact that your computer, the operating system in this case, is very smart and it knows how to cache stuff into memory because memory's fast and hard drives are slow. Do you know how much RAM is in your computer? - No, you picked it out for me. - Of course. - Which is not to say I'm not technical actually. - No, you're actually racial. - Your computer before this that you bought on your own was a good computer, you did good. - Yeah, then one day it just stopped working, so I guess it was a bad computer then. - I had nothing to do with that. - And well, so you don't know your memory. I think I put, I want to say eight gigs in there. I might have put 16 if I was feeling generous. - I don't think you put them in there, we just bought it as is. - Yeah, well I filled out the form. - There wasn't a form, we bought it at Best Buy. - Did we? - Yeah, 'cause the earlier one online, there was a problem, and then we had-- - Oh yeah, oh yeah, oh yeah. - So then we just bought an out-of-the-box one from Best Buy. - I forgot that horrible company we tried to use anyway.
Well, so you don't know your memory, let's just call it eight gigs. Do you remember your hard drive size? - Nope. - Take a guess, you must have an idea. - Five gigs? - No, it's gotta be more than that. - Maybe like 30 gigs? - No, you got a terabyte in there, I'm sure. - I don't think so, no. - Yeah, we'll check after this. - I'll check. - We'll do a little prologue and see who wins the bet. (laughing) - I do not have a terabyte, 'cause the other day, I remember I looked, and I was like, oh, I surprisingly don't have as much empty space. - You take too many photos. - I don't take that many photos right now, so I haven't really been filling it up. Most of the photos are on my phone, which don't fill it up that quickly. It's if I use a high-res camera. - Yeah, so getting kind of back to topic. The reason I bring all this up is 'cause in a perfect world, you would just cache everything in memory, and it would be very fast, but you have a limited amount of memory. Eight gigs is a lot, and you can get bigger machines than that, but there's a limit there. The disk space is another limit, so you could swap stuff in and out, but you could be dealing with data that's larger than a terabyte, or larger than the number of hard drives you could put in a computer, potentially. So what do you do when you have so much data that it can't fit on one machine? - It'll fit on two. - Yeah, exactly. And what if it doesn't fit on two? - Three, and you just keep going up. - So the first thing about big data is it's when you have data that's no longer manageable in sort of these traditional ways, it can't be managed by one computer. So a great example, and actually the origin of a lot of this big data stuff, is Google, right? 'Cause they wanna crawl and index the entire internet, but last I checked, there's no way to download the entire internet to your computer, nor could it fit if there was such a link. - Well, when you say crawl and index, what do you mean? - Oh, we should come back to that.
Crawling and indexing is another good topic, maybe for next week or something. But that's just basically how they look at all the pages and summarize them. So the next time you type something, they know what are the good sites to recommend. - Immediately. - Immediately, yeah. Without going out and asking every web server in the world if they have any matching documents and then showing them to you, it's pre-done. The other, maybe more practical, real-world example, and getting into the MapReduce part of our discussion, is, okay, so we've established there's sometimes data that's too big, it can't fit on one computer. So how do you manage it? Well, like we said, you go to multiple computers, but maybe you just get the data off the first computer, solve it, then you go to the second computer and get more data, and the third, and so on, but that could be really slow. You can actually take advantage of doing things in parallel. The kind of hands-on example I was thinking of is, what if you and I were put in charge of handling voter records in the state of California? And we had all the, like, vote punch cards or whatever from the last election. And we wanted to know what the total vote was broken down by age group, like, you know, the decades, people in their 20s, 30s, 40s, so on, for each party. What percentage in each age bracket voted for which of the two candidates? - So Kyle's assuming we fill out our age. - Yeah, yeah. - When we submit a vote. So we're just making something up. And then he's even assuming that we punch holes in the ballots, so Kyle's clearly never voted in the state of California. There's actually these two lines and you connect the line. - Like connect the dots? - Kind of, and you make like a line. - Okay, that's amazing. - Oh wait, that's just one way. And then another way is that there's different numbers and you just fill in the numbers. Like, you know, when you take a multiple choice test in school and they put it in their little auto reader.
- Does that mean they offer none of the above? Is it like that? - No. Well, they have a booklet that has questions and it only has, let's say you have like 200 dots, you'd only answer like five. - Okay, I get it, I think. So assume that the age was there and we wanted to go tabulate this. Would you have every city in the whole state ship their ballots to our house to be counted here? - No, we have a smallish house. Well, we have no storage for all that. - Well, there's no storage and we don't have people to count them. - Yeah, yeah, yeah, exactly. Assuming it's manual. - So this is like the good analogy I came up with for what MapReduce is. Instead of going out, getting all the data, bringing it home and then processing it, not only do you store the data elsewhere, but you send your computation to where the data lives. So you call up friends or whatever and you say, "Hey, you know, tonight this friend needs to drive to San Diego, go get the ballots, count them all up and then come back and just tell me the answer." So you don't have to drive the ballots back. You just go there and do the computation and bring the result back. When we talk about MapReduce, it's two steps. It's Map and Reduce. Map is like the filtering, sorting kind of step. So it's dividing each of the pieces you have available into the age groups. And then Reduce is like the calculation, adding it up and then aggregating it and bringing it all home. As each of our friends going out to all the cities in California to do these counts comes back, they just need to tell us their breakdown by age group. And then we add all those up as they come back. So we don't know each individual ballot, but we know the pieces of the solution that were found at every city. And every city's like potentially a different node on a distributed file system. So every worker just solves for the data it has and then all those solutions get aggregated together.
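As a rough illustration of the ballot analogy above, here is a minimal Python sketch. Everything in it is made up for illustration: the city names, the ballot records, and the helper names `count_city`, `merge_tallies`, and `tabulate` are assumptions, not anything from the episode or from Hadoop's API. Each "friend" counts one city's ballots where they live (the map step), and only the small per-city tallies are added together at home (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical ballots, stored per city as (voter_age, candidate) pairs.
BALLOTS_BY_CITY = {
    "San Diego": [(23, "A"), (37, "B"), (29, "A"), (41, "B")],
    "Fresno": [(34, "A"), (68, "B"), (25, "B")],
    "Sacramento": [(45, "A"), (22, "A"), (31, "B")],
}


def age_bracket(age):
    """Return the decade bracket for an age, e.g. 23 -> '20s'."""
    return f"{(age // 10) * 10}s"


def count_city(ballots):
    """Map step: tally one city's ballots by (age bracket, candidate)."""
    return Counter((age_bracket(age), candidate) for age, candidate in ballots)


def merge_tallies(left, right):
    """Reduce step: combine two partial tallies into one."""
    return left + right


def tabulate(ballots_by_city):
    # Each city is counted independently, so the counts can run in parallel.
    with ThreadPoolExecutor() as pool:
        partial_tallies = list(pool.map(count_city, ballots_by_city.values()))
    # Only the small per-city summaries travel "home" to be added up.
    return reduce(merge_tallies, partial_tallies, Counter())


if __name__ == "__main__":
    totals = tabulate(BALLOTS_BY_CITY)
    for (bracket, candidate), votes in sorted(totals.items()):
        print(bracket, candidate, votes)
```

The threads here only stand in for separate machines. The point of the sketch is that `tabulate` never looks at an individual ballot outside `count_city`, which mirrors sending the computation to where the data lives.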
And the real breakthrough that has made this possible for the common man, such as myself, is what's called commodity hardware. Are you aware of this? - No, what is this? - Do you know anything about virtualized servers? Anybody talk about AWS around you at the office? - Oh, I've heard of it, virtualized servers. You talk about them. - Oh yeah. - Do you ever see that charge for AWS on the credit card? That's Amazon Web Services. - Oh, you talk about that. - They're one of the big players. So what they essentially offer is you can go on their website and with the click of a button, you can have a computer, a virtual computer, but a computer nonetheless, set up for you. And it's like instantly available, which is way different from the old days where if you wanted a server, you had to buy an actual hardware computer, go have it installed somewhere, and it takes a lot of time and it's expensive. I can go spin up a thousand machines, keep them for an hour and then delete all of them if I want to at Amazon Web Services. And one of the tools that makes this MapReduce stuff possible is called the Hadoop Distributed File System, or HDFS. So I can spin up all those instances, install Hadoop, spread out my data across them and then have little worker bees go out, and that's the map step, kind of do whatever they're doing, and the reduce step is bringing it all back and putting it together. Just like counting the votes in each city and just bringing home the totals. - So your worker bees, as you call them, that's like your programming? You're actually sending them out there to count, right? - Yeah, and this part actually is maybe a good point to break it for another episode because my analogy's breaking down a little bit, but more or less you're saying to each of those servers, here's the code I want you to run, run this on my behalf for me, and it's just gonna go locally and like calculate a bunch of things and then bring the answer back. - So it's purely used for like counting, right?
- So I keep saying counting for a couple of reasons. Number one, because that's like the default entry example that most people do. Most people who start this stuff will do the prolific word count example, and truth be told, counting activities are probably the thing that's most done with MapReduce. Though I could be wrong about that, but it's like sort of the simplest thing that needs to happen on big data, counting of some kind, getting frequencies basically, at least in my world. But you can do just about anything with MapReduce. You could use MapReduce to plan chess strategies if you wanted. - How? - That would be a good topic for another episode, but the basic gist of it is that each of, I guess each of the nodes you have could be pondering different board configurations that are spread out. So like, would it be good if I move the knight or would it be good if I move the rook? Each of those, they don't necessarily need to know what the other's doing. They just need to bring back like a final heuristic of how good the board is, and then we would decide, you know, which move to take. - I don't know. We'll have to wait for that next episode. - Okay, well, sounds good. Well, thanks again for joining me, Linda. - Thank you. (upbeat music) - Yeah, here's our addendum. We're gonna check your computer's hard drive. See who was correct. - Okay. I'm clicking around here, looking to see where I could find out how much space. - Aha, 917 gigabytes. - That's from one hard drive. - You only have one hard drive. - Oh, okay. - So not a terabyte, but sort of close. And without talking about bits versus bytes versus all that thing, the rounding. Do you know about that problem? - I don't know. - Well, it's a mini episode for another day. - No, I know what you're talking about. - You have almost a terabyte though, see? - So I have 917 gigs, which means not a terabyte. (laughs) - Right, but almost. - And the point is, I do not have a terabyte. - Okay.
- So you know what that means? - What does it mean? - I'm right. - That's correct. As usual, it's limited, right? (laughs) - Thank you. - Thank you. (upbeat music)
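The word count example mentioned earlier in the episode is the usual first MapReduce exercise. Below is a small single-machine Python sketch of its three phases (map, shuffle, reduce). The sample documents and the function names are invented for illustration, and a real framework such as Hadoop would run the map and reduce functions across many machines rather than in one process.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a chunk of data on one node.
DOCUMENTS = [
    "big data is big",
    "data lives on many machines",
]


def map_words(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Reduce: collapse one word's list of counts into a total."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    emitted = [pair for doc in documents for pair in map_words(doc)]
    grouped = shuffle(emitted)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    for word, total in sorted(word_count(DOCUMENTS).items()):
        print(word, total)
```

Because `map_words` only ever sees one document and `reduce_counts` only ever sees one word, each call could in principle run on a different worker, which is the property the episode's ballot analogy is describing.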