Archive FM

Data Skeptic

[MINI] Monkeys on Typewriters

Duration:
17m
Broadcast on:
14 Nov 2014
Audio Format:
other

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers.

Many readers will be vaguely familar with the idea of "X number of monkeys banging on Y number of typewriters for Z number of years" - the idea being that such a setup would produce random sequences of letters. The origin of this idea was the mathemetician Borel who was interested in whether or not 1,000,000 monkeys working for 10 hours per day might eventually reproduce the works of shakespeare.

We explore this topic and provide some further details in the show notes which you can find over at dataskeptic.com

(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - So welcome back to another mini episode of the Data Skeptic Podcast. I'm here with my co-host and wife Linda. - Hello, hello. - Linda, do you know what our topic is for today? - Is it numbers? - Close, a specific kind of numbers, and I've brought along my game science dice here to generate some random numbers. Do you know about the game science dice? - You told me about them. - So anyone who doesn't can go back to where I interviewed the creator, these are Precision Dice, aimed at giving uniformly distributed outcomes when you roll them, unlike a lot of standard run-of-the-mill dice, which will exhibit a bias. If you had to define randomness, how would you define it? - Random without pattern or logic. - That's pretty good. Without pattern is actually a great definition of random. Can you think of cases in your life where you think things are random? - Sure, whether or not you get the job. Always seemed pretty random to me. - Now that's a great example, because what you're saying is, it doesn't seem to correlate with the available data, but it's not like the hiring person flipped a coin. I mean, maybe that happens sometimes, I guess, but they had some pseudo logic they were executing, right? Presumably. - Yeah. - So I guess random would mean like, you weren't offered jobs, you thought you were qualified for, and perhaps you were offered jobs, you thought you weren't qualified for. - Well, okay, obviously when they interview you, they probably look at your resume. So that's not random. But I think once they narrowed down to say three to five candidates, and then after they interview you, whether or not they pick you, I guess it could be based off what you said during the interview. - Could be. - But it always seems a little random because you just don't know what they value sometimes. - That's very true. On their end, it's probably not like the role of a dice or a coin toss. They had some logic, but to you, because you don't have all of information available, it appears random. So a lot of the study of randomness has to do with the sequences of numbers. So you could talk about how many times I roll the six-sided dice, but mostly we break it down into zeros and ones, coin tosses, 'cause it's easier to process mathematically. Well, let's start this way. What if I gave you a list of zeros and ones, and it was 999 ones followed by zero? Would you find that to be a random string of numbers? - No, I mean, the first step would probably be counting. How many there are of each one? - That's a good thought. So what would your expectation be? - Well, there was only two numbers, and you thought it was random, then it should be about 50-50, depending on how many numbers you were given. - Good instinct. So we can look at the frequency with which you see all of the different digits come up, and that is a good indicator of randomness. But now let me try and break your test. What if I gave you a string of numbers, a thousand digits long, and they went 01010101 and so forth? That would be a 50-50 distribution over zeros and ones. Would you still find it to be random? - No, 'cause then that sounds like a pattern. - Aha! So maybe we could say that the frequency of numbers is necessary, but not sufficient, to describe randomness. You might also look for pattern. - Yes. - How would you detect the pattern I just described? - Well, yours is alternating, so that was a simple pattern to pick up on. - How about this? 0001, 0001, 0001, 0001. - Well, that was still repetitive. It's the same string repeated. Three sequence of numbers repeated infinitely. - What if the repeat sequence is very, very long? - I mean, it could still be a repeat sequence, and that's not random. - That's good observation. So a lot of the times, this is called serial frequency. Let's say you only cared about two digits. So then the possibilities are 0001, 01, 01, 01, right, four options. - Okay. - So you would expect that each of those four options would appear equally as often in the grand scheme of things. Just the way you said the single digit would appear just as often. - Okay. - So there's lots of tests like that that you can use to test if something's random. Have you heard the term pseudo-random? - Pseudo means fake. - Good. - And random we defined earlier. So I do not know what that means. - You seem like you're within arms distance of a good guess though. - I guess they're trying to fake someone into thinking it's random. - That's right. Did you ever make random numbers on your computer? Just for fun, you know. - I tried, but it's wherever my finger is hit. - Oh. (laughs) Well, that's interesting. Actually, we're gonna come back to that when we talk about monkeys and typewriters. But what I was referring to more specifically is that your computer has the functionality that it'll generate random numbers for you. But it's usually doing that in a process that's called pseudo-random. What it does is it bases a calculation on a very complex formula so that hopefully it doesn't exhibit any patterns. But at the end of the day, it's generated based on following a procedure. So if you know that procedure, you can kind of guess what the random numbers will be. Although there are other random number generators that are close to perfect, that are generated based on entropy. For example, I remember what was, I think it was free BSD eventually came out with a version. That's a flavor of Unix where it generated its random numbers based on entropy inputs. So it read like the temperature readings from the CPU, which fluctuate in a non-predictable way. And it would use those to seed its random number outputs. So it took sometimes a long time to generate random data, but it was more random than you would expect, and that was super cool. - So when you say entropy or saying a loss of heat, or what does that mean? - Ah, great question. Entropy is the measure of how governed by regularity or how sort of stochastic a process is. If I had a die here and it wasn't a game science die, so it was weighted. And the six would come up half the time. That would have a lower entropy than a correct dice like this one, because you have some ability to predict. That's not a perfect definition of entropy, but it's sort of a good vernacular whittle look at it. - Okay. - What if I ask you to generate random names? Maybe I was making a video game and I need to have all these characters in a town. - I guess you could already have a list of names and then pick randomly from that list. - What about first and last name? - You could have two sets. - And you would just pick one randomly out of each, sort of uniformly distributed? - Yeah. - But these things aren't necessarily random, right? There's often a linkage between the first and the last name. Not always, but it's there. Like, there are very few Cecil Rodriguez's, right? But there are a lot of Juan Rodriguez's in the world. All right, what if I asked you to generate, okay, so we've named all the characters for this video game? Now we need to specify what their heights are. Would you just roll a dice for that? - I mean, you can. As long as you don't care if they're drawers or one inch tall versus 10 feet, but the range doesn't matter, sure. - Well, good point. And if it does, you need to generate those heights from the expected distribution of heights, which is that people are kind of generally speaking between like, I don't know, four and seven feet tall or so. So there's actually different kinds of randomness that we have to concern ourselves with. The one we're most familiar with from dice is uniform randomness, that every outcome is equally likely. But there's also Gaussian randomness, which you might know what that is, right? From your photo editing days? - Gaussian, all I know is that there's a blur feature called Gaussian Blur. - I don't know how it works. All I know is Gaussian Blur, or click. (laughing) - Well, it works. - I know what it looks like. I'm like, ooh, they use that blur. (laughing) - It works by using the Gaussian distribution to inform how the nearest neighbor pixels should contribute to the blurring of the local pixel, if that makes sense. So for every pixel, you kind of look at what's around it and you merge those colors. But you don't merge them uniformly. You merge them based on the Gaussian distribution, which is from previous episodes. - A ballast shape. - Well done, Linda. This whole process is often called Monte Carlo sampling. If you have an expected distribution, you can sample from that distribution using Monte Carlo sampling so that you get randomly generated values which are within sort of the shape and domain you were expecting. Now let's talk about why we would care about randomness at all. Do you have any thoughts on this? - I imagine somewhere from a data perspective, you need random so that you could tell when it's bad data. - That's a good point. You want to often ask the question, does this data appear to be random or does it appear to have some regularity to it? Do you remember last episode when we were talking about private and public keys in encryption? - Yes. - So you generate those and you generate those based on random data. But if you don't have a random number generator, if you have a generator that's very predictable, then someone could derive your key. So it's important for security reasons to be able to make random numbers. It's also important for statistical sampling. If you wanted to, let's say, get some information by surveying people, how would you decide who to ask? - I guess you could say random. You want to decide random. - Right. - And how do you know if something's random or not? - I don't know. - Well, you can use a random number generator and that's a safe way. There's a piece of this that I'm dancing around and that's the importance of using random numbers rather than trying to generate them yourself because people are notoriously bad at trying to come up with random numbers. So it's good that we have tools to do that. Have you ever heard the analogy of monkeys banging on typewriters? - Only from you. - Really? You've never heard that outside of me. - I don't think anyone cares about monkeys. Would you like to explain to me where it came from? - It comes from this mathematician, Boreal, I think his name is B-O-R-E-L. And he conjectured, what if you had 50, actually it was a million, I think, a million monkeys banging on a million typewriters? Would they eventually, in a coincidence, produce the works of Shakespeare? What do you think? - No. - Why not? - Those are words. - Okay. So if they were banging on typewriters, do you think they'd eventually produce the word Yoshi? - Mm, here's the thing. I think that people bang on typewriters or probably banging with their fists. - Good point. So let's assume that some kind of way, they're banging such that every letter comes up equally as often. It's just completely random. - Oh, well, if it's random, then sure. - How long might it take? - How many words? I mean, let's just talk about one work. Romeo and Juliet, how many words are in there? I mean, I guess you could talk about it from probability, which is the probability is probably like 0.001. I mean, I guess there's probably a small chance, but I just don't think it's possible. - So I'm gonna work out the exact math and put it in the show notes. By the way, the show has show notes. If you go to data-skeptic.com, I often add some extra stuff to each episode, but the moral of the story is it would not be surprising if a million monkeys and a million typewriters, eventually in a coincidence, produce the works of Shakespeare. Because if you have a long enough string of random numbers, it will always have subsections that look like they're not random. - But it has to be the whole like novel or whatever. - Sure. - That's just not possible. It's quite unlikely, that's true. It's like a one divided by 10 to the 10 to the 3.5 or something like that. It's absurdly tiny of a possibility. - Yeah, I mean, I didn't say it was not possible. I just said, no, that's not gonna happen. We're just talking about probability. And so it's so slim. I think it's just like outside the realm. - But with enough attempts, anything that's possible can occur. - Well, I mean, let's think of it this way. There's, I don't know how many humans are on earth. Billions. - I think you have six billions, man. - I don't know how many humans know English and how to put words together. How many have separately, independently of reading Shakespeare? Written, Romeo and Juliet. - Well, you're coming at it from a different angle now. - These are people who know English. - Right. - Not gonna happen. - As well as the Bard himself? - Well, and also they used old English too. So these monkeys would be typing, old English. - Yeah, they need a lot of caffeine too, boy. 'Cause that old English stuff, I can't get through that. - So that's just not possible, no. I mean, yeah, point zero, zero, zero, whatever percentage, but for the purpose of this conversation, I will say it's not possible. - All right, well, statistics will say it's possible, though with some likelihood, so we can get into that. And there's one last point I wanna hit on with randomness. Have you ever compressed a file? - Yes. - And what happens to it when you compress it? - The whole point of compression for those people who've never heard of it, is to make a file smaller. - Do you know anything about how it does it? - Well, I think you briefly told me, which is, you said it looks for similarities, and then writes it as one thing. Like if it's, for example, colors, it goes on, look, there's two blue dots, we're gonna consolidate, and remember that as one, or something, I don't know, something like that, where it's fine, similarities. - That's essentially correct. Compression tries to remove regularity and remove patterns and sort of make them smaller. A perfectly compressed file will be indistinguishable from a random number, isn't that cool? - Indistinguishable. - Right, because if there was anything that was unrandom about it, it means you did kind of a crappy job compressing it. - Well, I don't know what a compressed file versus non-compressed look like, are there numbers? - In the sense that every file can be expressed as a number, right, it's just bits and binary. But to make a good joke, a compressed file ends in .zip, so you do know what they look like. - I don't get that joke. - Because you compressed the file, you said I don't know what a compressed file looks like. It looks like that zip, and it has a folder with a zipper on it. That's really hilarious. - To those listeners, you could tell in the comments if you think this is funny or not. (laughing) - So, what have we learned today, Linda? - I learned that random is important for security. - That's true. Did you learn that there are, well, you even proposed them, ways we can detect if something's random or not? - Similar to the IQ test, you look for patterns. - That's right, that's exactly right. And at a high level, detecting randomness is exactly equivalent to that. So any test for randomness should also be complementarily a test for patterns. And I guess the most important part about being able to generate random data and analyze things to say if they're random or not is that it's useful when looking at data to say what is the likelihood that this data is just random, or that it's based on some pattern or has some periodic frequency or something to it. So, random numbers play a big role in a data scientist's life. And I might even claim in every person's life, what do you think? - Well, the whole time you've been talking about random, I thought it was interesting because there was a time before math, like arithmetic and stuff, where people didn't know zero existed. So zero is a fundamental unit of our math system. And it sounds like random is like the fundamental building unit for statisticians and such. - That's very true. And there's a great book on the number zero that I read when I was in high school. I'll try and find a put in the show notes. But good summary, and thanks again for joining me, Linda. - Bye, thank you. I just had to wake up, roll out of bed, and sit here. (laughing) (upbeat music) (upbeat music) (upbeat music) (upbeat music) [BLANK_AUDIO]