Archive FM

Data Skeptic

[MINI] Markov Chain Monte Carlo

Duration:
15m
Broadcast on:
03 Apr 2015
Audio Format:
other

This episode explores how going wine tasting could teach us about using Markov chain Monte Carlo (MCMC).

[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking and data science. [music] Well, welcome back to another mini episode of the Data Skeptic Podcast. Joining me, as always, is my wife and co-host Linda. Hi! So, Linda, today, as I announced last time, we're going to talk about part two of the discussion on Markov chains that we started last time. So, a Markov chain is sort of like a bunch of circles that describe states and arrows between them that show how you can go from one state to another, and the probability of following each of those arrows, those edges. Now, Markov chain Monte Carlo is a set of algorithms, and there's a bunch that do this. What they try and do is simulate changes in that system in order to find the underlying distribution of those states when they're kind of at an equilibrium point. You think of systems as having a state, and a Markov chain describes how the current state is related to the previous state through what they call a transition matrix. So, where you are, where you could go next, and the likelihood of each of those changes. Actually, let's take a step to the side for a moment and talk about probability distributions. The binomial distribution is the one I want to talk about, and it's like flipping a coin. So, if I flip the coin, and it was a fair coin, it's not like a trick coin, what's the probability I get heads? Fifty. Good. What if I flip it twice, what's the probability I get two heads? One over two times one over two. That's right. So, there's one out of four ways that can happen. So, we could keep going down this path, and you could understand, if I flipped it a hundred times, what's the probability I'll get at least sixty heads? And if you had enough time, you could kind of figure that out, right? Yeah. And that's based on this binomial distribution. And it's nice because it's very smooth, and if we plotted it out, it would eventually converge to look very much like a bell curve, like the Gaussian distribution. That's sort of simple, right? You can picture it. What if the distribution was incredibly complicated? Had all these different parameters and values and spikes everywhere? You wouldn't be able to easily describe it the way you can easily describe the binomial distribution. So, here's an example I thought of. Imagine if you wanted to know the likelihood of the different varieties of produce available at a supermarket. What do you think are some of the variables that affect that? Season, floor space. Mm-hmm. That's a good one. Prices, where that grocery store is located, like the state of California versus, I don't know, Montana or something, I don't know. Even potentially the presence of certain things and not others, like maybe when there's really good cherries at a great price, people buy less grapes, because people that normally buy grapes would rather buy the cherries. So, this is this incredibly complex system, actually, of all these potential interactions that could affect what you find when you go to that grocery store: even the day of the week, when the new delivery trucks come, the drought and stuff going on. It would not be easy to figure out all the possible variables and to model all of them and to describe this very complicated system. But if you understand some of those relationships, like delivery day, what sells faster, the rate at which you diminish your supplies at certain prices, you could build a Markov chain that describes that.
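To make the two ideas above a bit more concrete, here is a minimal Python sketch: it computes the coin-flip probability mentioned in the episode and then steps a tiny, invented transition matrix forward until the distribution over states settles toward an equilibrium. The produce "states" and every number in the matrix are made up purely for illustration.

```python
import numpy as np
from scipy.stats import binom

# The coin-flip aside: probability of at least 60 heads in 100 fair flips.
p_at_least_60 = binom.sf(59, 100, 0.5)  # survival function = P(X >= 60)
print(f"P(at least 60 heads in 100 flips) = {p_at_least_60:.4f}")

# A toy Markov chain: three hypothetical produce "states" with a made-up
# transition matrix. Row i gives the probabilities of moving from state i
# to each other state; each row sums to 1.
states = ["cherries stocked", "grapes stocked", "both stocked"]
T = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Start in an arbitrary state and repeatedly apply the transition matrix;
# the distribution over states settles toward an equilibrium point.
dist = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    dist = dist @ T
print(dict(zip(states, dist.round(3))))
```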
And from it, you could use Markov chain Monte Carlo to try and guess the probability distribution describing how often you will find certain fruits and vegetables at the supermarket. And how would you come to that? Well, that would be a result of running Markov chain Monte Carlo a bit. So, the supermarket problem is what we call highly dimensional, because there's so many little variables like, what about this? What about that? Do they stock it? How much space is there? What's the neighborhood? All of these are different dimensions in that calculation. And when something's highly dimensional, it can be very hard to calculate, even impossible. So, here may be an interesting way to describe and explain Markov chain Monte Carlo to you. This episode will go live the day before we're headed out to Santa Barbara. Okay, can you tell me a little bit about some of the things we have planned for our trip? No, I don't want people stalking me. All right, well, might we go wine tasting? Yeah. You said that so unenthusiastically. You love going wine tasting. Well, if someone's stalking me, I don't want to say it really enthused, because then they know I'll be there. But they're not going to know it. Okay, this is great. So, we're going to use that. They're not going to know which winery we're at, right? Aren't there a whole bunch of them? Yeah, yes. Depends if you're talking about technically the city or outside of Santa Barbara. Let's just say the greater Santa Barbara area. How many wineries do you think there are? I don't know. Let's go with, let's say, like 50, maybe a lot more. Maybe more, but I'm not sure. So, let's talk about how that stalker might be able to figure out where we're going to go in Santa Barbara to all these wineries. He might describe the whole... Or is it she? Oh, good point. It might be a female stalker. All right, let's just call them a stalker in general. They could use Markov chains to describe the way in which people move between wineries, and then they could use Markov chain Monte Carlo to find out the probability of where someone would be. Think of it like a board game where there's a map in front of you. Every winery is there as a location, and you have all the roads that go between them. Put some pieces at random wineries. Let's say there's a winery that's right on the edge of town, super convenient to get to, and, you know, one of your pieces is there. And then imagine that there's a second winery right next door to it, very convenient. What do you think are the odds that someone will go to that super convenient winery next door versus another one? Well, greater odds that they go next door. Yeah, like maybe 50% chance to the next door one. Maybe. Let's say there's, you know, like 50 wineries total, and the very furthest one away that's super inconvenient to get to, what are the chances they go to that one? Oh, very low. Very low, maybe 1%. There's a probability distribution over where they'll go next. We just laid out 50% to the one next door, 1% to the furthest place, and, you know, some variety of weighting in between. If it were 50/50, I'd tell you to flip a coin. But actually what you need to do here is something called Monte Carlo sampling, which is a little trickier. You need a random number generated uniformly between 0 and 1. So you could, like, roll two 10-sided dice, and that would give you a percentage, you know, like 99 or 64 or whatever. It wouldn't have all the decimal precision. But find a way to get a random number, and use that to determine where they go next.
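To make that sampling step concrete, here is a small Python sketch of Monte Carlo sampling from a discrete distribution: draw one uniform random number and walk the cumulative probabilities to decide the next winery. The winery names and every probability (50% next door, 1% far away, the rest spread evenly) are made up in the spirit of the example.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical next-stop probabilities from the winery we're currently at:
# 50% to the one next door, 1% to the far-away one, and the remaining 49%
# spread evenly over the other 48 wineries.
next_wineries = ["next door", "far away"] + [f"winery {i}" for i in range(3, 51)]
probs = np.array([0.50, 0.01] + [0.49 / 48] * 48)  # sums to 1

# Monte Carlo sampling: draw a uniform number in [0, 1) and find which
# slice of the cumulative probabilities it falls into.
u = rng.random()
cumulative = np.cumsum(probs)
choice = next_wineries[int(np.searchsorted(cumulative, u))]
print(f"drew {u:.3f} -> go to {choice}")
```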
That would mean that, let's say you put 10 Monopoly pieces at the first winery, and you rolled some dice and did all this random stuff. Probably half your pieces would wind up going to the nearest winery for their second location. One out of 100 might go to the furthest, and then they'll spread out all over the rest of the board. So, wait, you have to go back -- you said to find a random number between 0 and 1. That could be, like, infinite. Yeah, it could be. Is the computer doing this? Yeah, the computer can do this. So you just say, "Find a random number between 0 and 1," and it will. Yeah, actually. You don't say the depth or how many decimal points? No, and we should actually talk about random number generation as a separate episode. But basically, it will give you as many decimal points as it can generate. So that's, like, 64-bit precision for most of us. Oh, so that's what it means? Yeah. And that's assumed. Yeah, that's assumed. So think of it this way. Let's just say there were three total wineries. What's your next location? You've got three choices. One you're going to go to with 50% likelihood, and the other two with 25 and 25. So you generate that random number. And in this case, it's going to be between 0 and 1, which is between 0 and 100%. If the number is between 0 and 0.5, just assume they go to the first winery. If it's from 0.5 to 0.75, they go to the second, and 0.75 to 1, they go to the third. That's called Monte Carlo sampling. I see. All right, yeah, that is an important point. Thanks for slowing me down, because Monte Carlo sampling is critical to all this. So now, picture our board game. If we start out with a bunch of pieces all over, and then we said, where will they go next? We did Monte Carlo sampling, and we moved all the pieces. Now, you can take a tally of which are the most populated wineries. Like, where are people right now? And then you can do that again and say, where do people go next? Because I imagine there are some wineries where there's like maybe a convenient loop, like a road that curves around, and people kind of go down that whole road and stop at each one, one by one, until they get back to the start. Or maybe there's one that's way out in the boonies, you have to drive to it and come back. But all the pieces will move all over the board sort of randomly as you simulate where people might go. And if you keep track of how many people were at each winery at each time, you can start to build up a distribution over how populated you think each winery is going to be at any given time. So let's say there's one winery that has a terrible reputation. People regularly have to go to the hospital after drinking this poisonous wine. And it's like very distant, and it's on a gravel road. And they're mean to you when you get there, so people don't frequently stop at this winery. Would you? No, not if I had heard that. Uh-huh, yeah, but some people maybe didn't hear that, so those suckers are the ones that still go there. So let's say that the likelihood of ever visiting that winery is very small, like a 1% chance that anybody goes there from any other winery. Now, if you picture this board game of people, you know, little pieces at the wineries and moving them around, it's very unlikely that you're going to have one of those pieces move to this crappy, distant winery. Because there's only a 1% chance, so one out of 100 times someone might pick that.
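Here is a rough sketch of that board game in Python: scatter some pieces, Monte Carlo sample each piece's next stop round after round, and tally how often every winery gets visited. The transition matrix here is just random made-up weights, since the point is the bookkeeping, not the specific numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
n_wineries = 50
n_pieces = 10        # the Monopoly pieces scattered on the board
n_rounds = 1_000     # how long we watch the game on fast forward

# A made-up transition matrix: each row is a probability distribution over
# where a piece at that winery goes next (random weights, normalized to 1).
weights = rng.random((n_wineries, n_wineries))
T = weights / weights.sum(axis=1, keepdims=True)

# Drop the pieces at random wineries, then repeatedly Monte Carlo sample
# each piece's next stop and tally how often every winery gets visited.
positions = rng.integers(0, n_wineries, size=n_pieces)
visits = np.zeros(n_wineries)
for _ in range(n_rounds):
    positions = np.array([rng.choice(n_wineries, p=T[pos]) for pos in positions])
    np.add.at(visits, positions, 1)

# The visit frequencies approximate the equilibrium distribution of the chain.
estimated = visits / visits.sum()
print("most visited wineries:", np.argsort(estimated)[-5:][::-1])
```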
So if you just watch this game being played for like an hour on fast forward, you would see very few pieces going to that far away winery. So, but that's a different question from what you said earlier. What was your question earlier? Well, what's the probability people will be there? Okay. The reason this is interesting is because if you keep track, like I said, watch the gameplay for an hour and roll it back on high speed, pieces won't enter that far away crappy winery very often. If you keep track of that and compare it to the really popular winery that lots of people are going to every round of the game, this is a proxy for the actual probability distribution of where people are on the whole winery circuit. So did you just make up those numbers? Are you saying if we're going to use a Monte Carlo thing, we just make up numbers? No, you have to know the probability distribution for each state. Now, that might not sound like a big deal, but actually that makes things really simple. It means for each winery, you just have to know what's the likelihood that someone will go from A to B. You don't have to know the global transition model. But there's 50 wineries. That's a lot of situations you actually have to know. That seems like a lot of data. Yeah, it can be. It's actually much less than if you knew every possible pairing. Because for each winery, let's say there's 50 wineries, then there's 50 numbers you need to know. And each is the probability that they'll go from that winery to another, or one of those would be staying at the winery. And so, 50 times 50, you'd need to know 2,500 numbers to describe that system. How do you get that number? You would have to kind of know that in advance. But you could also do some educated guessing. So like let's just say you decided that mostly people go to nearby wineries, you know, with the highest likelihood. And people also don't like to drive more than 10 miles between wineries. And you could figure in all these little heuristics and build up rules that describe the probability distribution at every winery. So now that wouldn't account for things like, you know, bad reputation necessarily. But it can get you a pretty close approximation to this highly dimensional complex system. So is that what you're doing in your job? I do this method sometimes in my job, yeah. But what if you don't know the probability? Well, yeah, you have to know the probability distribution at every winery. You'd need to know like, alright, if I'm at A and there's, you know, B through Z, that if B's my neighbor, then I'm highly likely to go there. Z is very far away. I'm very unlikely to go there. You need to get that information or infer it or come up with it kind of by heuristic rules. Hmm, sounds tough. That actually makes things really easy. If you can do that, you can find a shortcut to a generally much harder problem. Have you ever thought about watching a board game being played on high speed? Nope. Not interested. Can you picture it now? If there were some truly popular wineries, how would it look as little pieces moved in and out of there? Do you think it would ever be empty at the popular winery? Yeah, probably. Maybe, yeah, with some probability it could be. But more than likely no, because by definition it's popular. So yeah, it could be, you know, just a matter of circumstances that leaves it empty at one given round. But most often you'll find a lot of pieces there. In fact, you'll have a higher propensity for finding pieces there.
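As a sketch of that educated guessing, here's one way those heuristic rules could be turned into a transition matrix in Python. The winery coordinates, the distance-based attractiveness, and the small weight on staying put are all invented assumptions; only the "nearby is more likely" and "no more than 10 miles" rules come from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

# Made-up winery coordinates (in miles), just to have distances to work with.
coords = rng.uniform(0, 30, size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Heuristic transition weights: closer wineries are more attractive, nobody
# drives more than 10 miles between stops, and there's a small chance of
# staying put. These exact rules and numbers are illustrative assumptions.
weights = np.where(dist <= 10, 1.0 / (1.0 + dist), 0.0)
np.fill_diagonal(weights, 0.1)   # small weight on staying at the same winery

T = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
print(T.shape)  # (50, 50): 2,500 numbers describe the whole system
```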
And that's the key to Markov chain Monte Carlo right there. The key is what, finding a lot of people in one place? That if you simulate a complex system and you know where things will move to and from, and you keep track of how many people are moving through each location, that helps you get an understanding of the distribution that describes the likelihood people will visit each winery. Hmm, okay. So this might be really useful to, let's say, somebody that was deciding where they want to advertise, or they just wanted to put together a guidebook, or maybe the city needs to know what roads are most traversed, and this could be a way of figuring that out. There's lots of reasons people might want to know the probability distribution of the wineries that get visited. Maybe we would want to go to the ones we think aren't going to be too crowded, so we're not slowed down by big crowds and big lines. If we took the time to solve this problem, we could have an idea of where places might be more empty. Oh, you're volunteering to plan it for us? Um, maybe. [laughs] Would you like that? Well, I'm just saying that you often say that you would use this in your day-to-day life, so. [laughs] Yeah. Seems like you're trying to insert yourself here. Well, you know, that's not a bad idea. This would be an interesting problem. Um, I don't want to work on it alone though, so if anyone's interested, maybe we can make this a toy gimmicky thing and work on it together. Just reach out to me on the Facebook page or on Twitter or something. I mean, you don't have to solve this one. I'm sure there's other problems that might be more interesting. Yeah, that's true, but, you know, this one's kind of a simple, fun one. I don't know if the solution would be particularly precise, but it might be something interesting we could do and make some fun visuals out of it. And most of all, use it as a demonstration for what Markov chain Monte Carlo is. You know, these podcasts, the mini ones especially, without having any visual to go along, have to be kind of a surface-level exploration. I know that. So I didn't touch on any of the algorithms, like Metropolis-Hastings or Gibbs sampling, and there's a lot more to be said about Markov chain Monte Carlo, but I hope people get the notion of what it is, or sort of like the kernel of it, and if you go and look it up on Wikipedia, you're a step ahead at understanding, you know, the more practical aspects of what it means and how people use it and stuff like that. And the reason I bring this up is that I'm actually kind of curious: if anyone would like more technical discussions, like if I really got deeper into that, let me know. I don't think that's the right thing for this show, but I'm curious to hear what listeners think. And also curious to see if Linda's going to tolerate the more detailed discussion. Could I get you to do the Gibbs sampling algorithm? No. Why not? I don't know what that is. Well, that's why we'll have to have that episode. We'll see. All right. Well, thanks again for joining me. Thank you. [Music] (whistles)
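The episode deliberately stops short of the named algorithms, but for listeners who want to peek ahead, here is a minimal sketch of the Metropolis-Hastings idea in Python: propose a random move, accept or reject it based on how likely the new point is under the target distribution, and the visited points end up distributed according to that target. The target density and step size here are arbitrary choices for illustration, not anything from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_unnormalized(x):
    # Any unnormalized density works; here, a made-up mixture of two bumps.
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples=50_000, step=1.0):
    x = 0.0
    samples = []
    for _ in range(n_samples):
        # Symmetric random-walk proposal, so the Hastings correction cancels.
        proposal = x + rng.normal(0, step)
        accept_prob = min(1.0, target_unnormalized(proposal) / target_unnormalized(x))
        if rng.random() < accept_prob:
            x = proposal
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings()
print("estimated mean of the target:", samples.mean())
```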