The multi-armed bandit problem is named with reference to slot machines (one-armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which machine was best, you would play that machine exclusively. Any lesser strategy will, on average, earn a smaller payout, and the difference is called the "regret".
You can try each slot machine to learn about it, which we refer to as exploration. When you've spent enough time to be convinced you've identified the best machine, you can then double down and exploit that knowledge. But how do you best balance exploration and exploitation to minimize the regret of your play?
This mini-episode explores a few examples, including restaurant selection and A/B testing, to discuss the nature of this problem. In the end, we touch briefly on Thompson sampling as a solution.
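For listeners who want something concrete, below is a minimal sketch of Thompson sampling on a simulated bank of slot machines. The payout probabilities, horizon, and Beta(1, 1) priors are invented for illustration, not anything discussed in the episode.

```python
import random

# Hypothetical payout probabilities for three slot machines (unknown to the player).
true_payouts = [0.30, 0.55, 0.45]
n_arms = len(true_payouts)
horizon = 10_000

# Beta(1, 1) priors over each machine's payout rate: alpha counts wins, beta counts losses.
alpha = [1.0] * n_arms
beta = [1.0] * n_arms
total_reward = 0.0

for _ in range(horizon):
    # Thompson sampling: draw one sample from each arm's posterior and
    # pull the arm whose sampled payout rate is highest.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
    arm = max(range(n_arms), key=lambda i: samples[i])

    # Pull the lever: a Bernoulli payout with the machine's true probability.
    reward = 1.0 if random.random() < true_payouts[arm] else 0.0
    total_reward += reward

    # Update the chosen arm's posterior with the observed outcome.
    alpha[arm] += reward
    beta[arm] += 1.0 - reward

# Regret: what always playing the best machine would have earned in
# expectation, minus what this strategy actually earned.
regret = horizon * max(true_payouts) - total_reward
print(f"total reward: {total_reward:.0f}, regret vs. best machine: {regret:.0f}")
```

Because each arm's posterior is sampled rather than maximized, arms that might still turn out to be best keep getting occasional pulls, which is how the strategy balances exploration against exploitation.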
[music] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Our topic for today is multi-armed bandit problems. So Linda, you and I have never been to Las Vegas together, right? Not that I recall. No, we've independently been there on a number of occasions. Well, some things might have to stay in Vegas. That's why we're never there together. I don't know, I'm just saying. Yeah, yeah, so our topic today relates a little bit to that. It's the idea of the multi-armed bandit, which most people give the analogy to a slot machine, which is appropriate, but I want to break away from it. So what does it mean? Imagine if you went into a room and there's one slot machine there and you want to maximize your payout, you just play the one slot machine. If it's the only one, sure. Yeah, now what if there are two slot machines and they both have an independent payout, somewhere between 0% payout and 100% payout. You don't know the payout value of either of those machines, right? So which do you play? You want to play the better one, where you're going to win more. But how do you know which that is? Well, the casinos won't tell you. Right, so you have to discover it on your own. Or do you think they just make it even for every slot machine? Well, okay, so in actuality, I'm sure there's a lot of regulations and stuff like this. So there's casino boards and all that stuff. And yeah, I imagine every slot machine is very highly regulated and all that. So let's take a step away from the highly regulated Vegas situation and just consider a different condition. In fact, let's change up the multi-armed bandit problem and talk about a different situation. Let's talk about going to restaurants. You and I, I think it's safe to say, dine out quite a bit. Yep, we like to eat. Not only do we like to eat, but we like to eat at fine dining establishments, would you agree? Well, we do spend money. Prix fixe is a phrase we use a lot. Well, sometimes, yeah. Yeah, and wine pairing is something that comes up on the old credit card bill. You act like it's a surprise. I didn't know that was a wine pairing. And Michelin stars, we're not foreign to that. Well, there aren't that many in LA. Right, we've been to pretty much all of those though. No, we haven't been to Mélisse, which used to have a Michelin star. Yeah, I was there when it had that. Or maybe we have been, but it was like five years ago. No, we've never been together. But I think we did a three-star Michelin restaurant on our honeymoon, right? In Berlin. I don't know, you booked it. But anyway, let's talk about meals, right? Like we like restaurants. We like good food. Would you say we've been to every restaurant in LA? No. Right. Because they're always opening up new ones. Right. Like every week. And are those new ones awesome or terrible? Well, we don't know until we try. Exactly. So we need to explore. Well, I like to explore. Yeah, I like to explore with you. But eventually, like we might arrive at a conclusion that there's certain restaurants we both like and we're both comfortable with, wouldn't you say? And we go to regularly. Yep. So in a way, you could say we're exploiting those restaurants. Well, I don't know. They're making money off of us. Aren't they exploiting us? Chicken or the egg problem, perhaps. But that's the whole thing about restaurants. You can go to a place that you like and you know is good. 
Or you can take a risk and try something new, and maybe it'll be even better. But you're not sure. There's a variance there, right? Yeah. We don't know. So this is the central dichotomy of the multi-armed bandit problem. It's the balance of exploration and exploitation. So the reason, I guess, they phrase it as the multi-armed bandit, meaning like many slot machines, where the arm that you pull is what makes each machine a one-armed bandit, is because it's very simple to think of it that way. So imagine you play a slot machine. It will pay out somewhere between zero and a hundred percent of the time, right? Yes. And what would you prefer? One hundred percent winnings all the time. A high payout. So let's say you're in Vegas. There's a long line of machines you could play. You want to play the one that has the highest payout, right? Yeah. But you don't know which one it is. Assuming one exists. Unless they all pay out the same. Yeah, yeah, yeah. In that case, you have this dilemma of like, should you exploit or should you explore? We've had a couple of different people come clean, right? Yep. Were some better than others? Yep. You have an option. We could ask someone to come back, or we could explore a new person coming. We could potentially try someone who's worse. So what do you think the odds are, if we brought in a new person, that they would be better than the old person? Fifty percent. Oh, fifty percent. That means that the person we have now must be pretty much mediocre, in the middle, right? That's my guess. Because half the people are below them and half the people are above. Yep. So the odds of us finding someone at the seventy-fifth percentile are pretty high then, right? No, I think it's harder. Why do you think that? That most people are mediocre. Oh, so maybe our person is actually above average. No, she's mediocre. But ninety percent of people are mediocre. So to get into the seventy-fifth percentile, your odds get smaller and smaller. Uh-huh. So here we get into exploration versus exploitation, right? There's a cost incurred by trying a new person, wouldn't you say? Yep. Because maybe they'll do a bad job. But there's someone out there who's better, certainly, right? Yep. And then we have to book them. And we have to wait for them to come. It's just too annoying. Actually, I mean, sort of in your setup, you're assigning a cost to the switch, which the multi-armed bandit problem doesn't actually account for. It's more like that you could freely try any of the available levers. But there's always a cost to switch. Okay, so actually, that's a good point. There is a cost to switch. In this case, it's like the cost of your time, right? Yeah, but if you're a company and you're going to switch from A to B, just shifting gears costs time. That's an interesting side commentary about the multi-armed bandit problem: depending on how you formulate it, there are or aren't optimal solutions. So if you're allowed to have an unbounded search space, then multi-armed bandit is actually a very kind of unsolved problem, I think. There's a lot of good PhD theses yet to be written about this topic. But if we restrict it to a fixed set of available levers, or available providers in this case, like, you know, people that will clean the house. And we don't know how good of a job they will all do, but there's a finite number of them. Then you can only switch so many times. Yeah, so that's where multi-armed bandit comes in. 
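A simple, sub-optimal baseline for this fixed-set-of-levers setting is epsilon-greedy: explore a random lever a small fraction of the time, and exploit the best-looking one otherwise. A minimal sketch, with made-up payout probabilities and switching assumed to be free, as the episode discusses:

```python
import random

# Hypothetical payout probabilities, unknown to the player; switching is free.
true_payouts = [0.30, 0.55, 0.45]
n_arms = len(true_payouts)
epsilon = 0.10                # fraction of pulls spent exploring at random
pulls = [0] * n_arms
wins = [0.0] * n_arms

for _ in range(10_000):
    if 0 in pulls or random.random() < epsilon:
        # Explore: try a machine at random (always, until each has one pull).
        arm = random.randrange(n_arms)
    else:
        # Exploit: play the machine with the best observed payout rate.
        arm = max(range(n_arms), key=lambda i: wins[i] / pulls[i])
    pulls[arm] += 1
    wins[arm] += 1.0 if random.random() < true_payouts[arm] else 0.0

print("estimated payout rates:",
      [round(wins[i] / pulls[i], 3) for i in range(n_arms)])
print("pulls per machine:", pulls)
```

The fixed epsilon is the crude part: even after the best lever is obvious, this strategy keeps spending 10% of its pulls exploring, which is exactly the kind of avoidable regret the more principled strategies mentioned later try to eliminate.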
When you have a very confined solution like that, then you can have an interesting exploration of the problem. So a lot of people will apply this to, like, A/B testing. Sure. There's a new feature, or there's actually maybe N number of features you could try. So how do you decide between trying a bunch of features or picking the best one and running with it? Well, what's the cost of testing all of them? It's just, you know, running a test. It takes time. It takes time, and it costs money, before we actually pick a solution that makes money. Well, time and lost revenue also, because ultimately there's one optimal solution, right? If you knew everything, if you were omniscient, you could say like, oh, the best bet is this subject line and this graphic and all that. So if you knew all that, you would have the perfect solution. You would just run with the perfect version. If you don't know the perfect solution and you're trying to figure it out as you go, then we can measure what's called the regret. So if you had known the perfect version, but you didn't, you made choices that were different from the perfect version. And the difference there is your regret: how much you lost out on by not going with the ideal situation. What about the cost of switching and running the test? Well, what if that was kind of minimal? Well, it's not. Yeah, in general, it's not. But in the multi-armed bandit problem, you kind of have to assume it's zero in order to have a good exploration of this problem. But what situation has zero cost of switching things? Well, I mean, you could have a test on a website where you're just checking different colors, right? There's pretty much an infinite number of colors. And you want to see how color affects the conversion rate? Okay, I guess you could test colors. So how do you know when you've centered in on the best color and you want to stop and just exploit it versus explore other colors? Well, it still costs something to change the color. Yeah, that's minimal from a technology point of view. No, like, let's say you want to pick yellow, you have to run it two weeks to get enough data. Then you pick red, you have to run it another two weeks. Oh, why two weeks? You have to gather enough data. You're getting into how this is a very Bayesian problem, right? Like, how many people do you think you should have? I don't know, but more than two? Yeah, so this is the classic trade-off of the multi-armed bandit problem, like exploration versus exploitation. You want enough exploration that you're learning, but not so much that you never exploit. There are sub-optimal strategies in which people say like, oh, let's just try, like, you know, all the different situations. And then after we try them each enough for like some period of time, then we'll exploit. Maybe for like two weeks, we'll randomly pick the font or the offer or whatever it is. And at the end of that, we'll pick the best one and run with it. If that period of time for trying stuff is really long, like a year, you might have lost out on revenue that you could have made by finding the best solution earlier. But if you pick your pony too quickly, like after two impressions or something, then you might have found a sub-optimal solution. There's a lot to be said about it, and more than we can say in a mini-episode, actually, because I'd have to get into the Gittins index and into Thompson sampling. So if you want to get deeper into this, yeah, the Gittins index and Thompson sampling are the two kind of main strategies one should look up. 
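The "try everything for a fixed period, then run with the winner" approach described above is often called explore-then-commit. A rough sketch, with invented conversion rates and horizon, of how the size of the exploration budget affects regret:

```python
import random

# Hypothetical conversion rates per variant; the best is variant 1 at 4.5%.
true_rates = [0.030, 0.045, 0.040]
horizon = 100_000

def explore_then_commit(explore_per_arm: int) -> float:
    """Show each variant explore_per_arm times, then commit to the best-looking one."""
    successes = [0.0] * len(true_rates)
    total = 0.0
    # Exploration phase: give every variant the same number of impressions.
    for arm, rate in enumerate(true_rates):
        for _ in range(explore_per_arm):
            reward = 1.0 if random.random() < rate else 0.0
            successes[arm] += reward
            total += reward
    # Commit phase: run the empirically best variant for the rest of the horizon.
    best = max(range(len(true_rates)), key=lambda i: successes[i])
    for _ in range(horizon - explore_per_arm * len(true_rates)):
        total += 1.0 if random.random() < true_rates[best] else 0.0
    # Regret: expected reward of always running the best variant, minus what we got.
    return horizon * max(true_rates) - total

for budget in (50, 500, 5000):
    print(f"explore {budget} per arm -> regret ~ {explore_then_commit(budget):.0f}")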
But I personally find multi-armed bandit to be a really modern and relevant topic that I think most people aren't talking about, because it's very useful for A/B testing and for online advertising and even for, like, news article recommendations, right? If I was going to suggest what to put in your feed and I have a new article and I don't know if it's a good one, I should put it in a certain number of feeds to see how people respond to it. So when it comes to multi-armed bandit, Linda, if you had to sum it up, what would you say are the key points? It does not take into account the cost of switching? Yeah, so that's a good criticism, because we're dealing with very stochastic problems here that may or may not have the Markov assumption, and this quickly becomes a very complicated space. And depending on your situation, hopefully, you can make a few assumptions that simplify it, and you can have conjugacy and these sorts of things so that you can solve these problems practically, because this is actually a really difficult and interesting area that requires a lot more research. But if the frequencies of the things you're observing aren't changing, in other words, if the slot machines aren't altering their payouts from time to time, then it does simplify to a problem of exploration versus exploitation, and the Gittins index happens to be an exact, but maybe intractable, solution, and Thompson sampling is a more feasible way of doing it that we didn't cover, but maybe those would be topics for future mini-episodes, because I think we're running a bit out of time here. And if you have any feedback on this episode, I would encourage you to go to our website, dataskeptic.com, and on the show notes for this episode, we now have a Disqus plugin at the bottom where you can comment and contribute your responses. And until next time, I want to remind everyone to keep thinking skeptically of and with data. Thanks for joining me, Linda. Thank you. [outro music]