Today's mini episode discusses the widely known optimization algorithm gradient descent through the analogy of hiking on a foggy hillside.
Data Skeptic
[MINI] Gradient Descent
[music] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism, usually in ten minutes or less. Our topic for today is gradient descent. So, Linda, maybe we could start by breaking this down with an appeal to you as a bicyclist. Do you know what a gradient is? Well, ombré. Ombré? That's a big fashion thing going on that I like. Tell me what ombré is. Ombré is where one color fades into another, like from blue to white. That's a gradient. A lot of problems, specifically optimization problems, can be expressed as some function of a bunch of different variables or dimensions that you have. And you'd often like to find the best solution, which usually means minimizing it. So, like, let's say you're running a business. What would you like to do as a business? Make money. Yeah, maximize revenue. I want to talk about a specific algorithm, the sort of, like, entry-level, main algorithm we use when you have a problem like that. And I thought a good way to understand it would be to imagine you and I went to Scotland or somewhere like that. Is Scotland on our list of places we want to travel? No, unless someone wants us to speak there, you know, fly us out there. Is there a Scottish skeptics in the pub event maybe I could go to or something? You'd have to ask your audience. Yeah, and just so you know, what did we learn from our 23andMe tests? What did you learn? I'm a proper Scotsman, despite how my family sort of traces our lineage. Most of my DNA is Scottish. Oh my gosh. I get to say it that way because I'm officially Scottish. I don't think that accent is accurate. So what percentage Scottish are you? Like 11%. That's not very much. But anyway, let's say we went there. We've never been, correct? I don't know. I've been to Edinburgh. I think it's called Edinburg. No, we pronounce it Edinburgh. Isn't that in Scotland? That's different from Greensboro. I think that's in Scotland. Is it Edinburgh?
It might be. Let's look it up. Let's look it up. Yeah, it's in Scotland. So I have been to Scotland. Did you by chance go hiking while you were there? No, I didn't know anything about hiking. But there is a giant castle that looks like Harry Potter's castle up on this, like, cliff. Oh, up on a tall cliff. So you might want to get to that. So let's say I couldn't go because I was invited to go to some skeptics in the pub event. And I'm there speaking. But you've gone off hiking in the moors of Scotland and you're headed for this castle. And suddenly a thick fog rolls in. And oh, by the way, after this, you're supposed to meet me by the ocean. Okay? That's our meeting place afterwards. No cell phone signals. And a thick fog rolls in. You're basically lost. How would you find your way to the ocean? I don't know if you could see the ocean from Scotland, from Edinburgh. I didn't say you were in Edinburgh. You were somewhere in Scotland. Oh, okay. Near the ocean, let's assume. But a thick fog has rolled in, so you can't see it anyway. And you have no compass, no GPS, and you're a bit lost. So you're up on this hill pretty near this castle, maybe even at the castle. But you want to find your way to the ocean. Well, do I see the ocean? No. So I just need to go towards the ocean and I don't know where it is. That's right. Well, you know a little bit of information, right? You know it's down. Yeah, but if I'm on like a mountain, I don't want to walk down because I could walk down on the other side, in which case there might not be any way to the ocean unless I walk back up. Well, you could walk along the coast. Okay. So what's your question? You're up on a high cliff, relatively high. You're not necessarily at the highest point. And then you're supposed to meet me down by the ocean and you're quite lost. And the fog has rolled in very heavy, so you can only see about 20 yards in any direction.
But if you could see about 20 yards in any direction, do you think you could tell which way you tend to go up and which way you tend to go down? I mean, for the immediate 20 yards, yeah. It could also be flat. You don't know. You might have to do the exploration. But you could probably get a sense of where things are sloping up or sloping down. And let's say you got near like the base of like a foothill. What would you kind of see around yourself in the 20 yards? I don't know if it is the base. I would just assume that at some point I'm here and then I go upwards. So you think it'd be fruitful to go up the hills or go in a different direction? Well, I don't know. I'm not really sure. It just depends if there's a lighthouse or any civilization. But I guess if I think the right direction's down, even if down can lead me somewhere more remote, away from civilization, and we're assuming down is the right way to go, then I guess I just go down. Yeah, so you can look kind of in 360 degrees around you. Find the point that most slopes down and just sort of go that way. Really? It just sounds like a nightmare. So you just want to ask me what would happen if you dropped me in this weird nightmare? Should I put in some spooky music in the background here? No, I'm just saying. If you're acting like this is a realistic situation, I don't think it is. Well, there's no werewolves, so that makes it more realistic. Yeah, but you're telling me that there's no visible landmarks. There's no path. And yet somehow I got there. There you go. I wouldn't wander off the path. At different points, you're going to see different things sloping down in different ways, right? What do you think is the smartest way to get down if you see like a flat terrain versus like a little, what do you call it, washed-out area that might be going down faster? Well, I would just go with the one that doesn't slope too bad. Yeah, so actually, yeah, in reality, you'd have to pick a slope that is changing in a safe way.
So for example, if you got to a cliff, the cliff would take you down to the ocean pretty fast, right? But you don't want to jump off the cliff. So that's where the analogy breaks down a little bit. Gradient descent is all about exploring in a greedy fashion based on what it sees locally. But also, I guess a better question in this context, in terms of how gradient descent works, is: how often do you reassess what direction you're going in? You know, should you walk 20 yards, look around yourself, find the right direction and switch course? Do another 20 yards? What do you think is most fruitful if you were walking around? How often would you reassess? I don't know. I think I would just assess as I go. Yeah, that's pretty much what you do. That's actually a little bit more like stochastic gradient descent, which is a topic for another day. But in gradient descent, you're taking observations from this function and then you're picking your next location. So the way you do that is based on the local gradient, which is sort of the derivative. I won't get into that. You remember derivatives from calculus? Not much. Well, it's the rate at which something's changing, so how steep it is all around you. So if you were kind of like... Isn't that a slope? That is also a slope, yeah. And slope actually fits in both contexts here, the mathematical one and the sort of topographical one. You want to find, from what you can see around you, the slope that is going down the fastest and head in that direction. But the gradient descent example is more like making really quick jumps. So like you pick your direction, then you leap forward some number of feet and land again and then pick your next direction to go in. So to do that, gradient descent uses the local change, but also has something called the learning parameter, which is sometimes represented with like a gamma or an alpha or some Greek character.
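The leap-and-re-sample update described here can be sketched in a few lines of Python. This is a minimal illustration, not code from the episode; the function and variable names are my own.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step downhill along the local gradient.

    grad: a function returning the derivative (local slope) at a point.
    learning_rate: the gamma/alpha parameter -- how far to leap
    before re-sampling the slope.
    """
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # move against the slope
    return x

# Minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each iteration leaps from the current point in the direction opposite the slope; the learning rate plays the role of how many yards you walk before stopping to look around again.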
And you can control that to set how far you want to go before you re-sample. Because of course, if you just got stuck in your mind like, "Okay, I'm going to walk 50 yards ahead before I make a course adjustment," what if you walked into a valley after 25 yards and then walked up the other side of the valley? Then you'd be on the other side of a hill. Now you kind of know that because you walked it continuously, but in the more mathematical context, where you jump from point to point, you would have missed this valley, which is a lower area, which is preferable to you, right? Right. Another consideration here is what if you got trapped in a valley? Do you think you could ever get into a low point where everywhere you looked around you, everything went up, but you weren't actually at the ocean? I don't know, you just said valley. I guess it's a valley. But the valley has like a path. There's one point that's low. The valley could always go out to the ocean. I guess I'm thinking of like, basically like a pit. A hole? A hole, I guess. Yeah, you could wind up in a hole and you get to the middle of the hole and everywhere you look is up and you'd be like, "Well, I guess I'm in the best position." That is one of the problems with gradient descent and why there are many variant algorithms that try and avoid these things that are called local optima. Where yes, you're right, in the next like 20 yards around you, being at the bottom of a hole is probably the lowest point, the best solution, but it's not the globally best solution. And when you deal with a very complex problem, especially in high-dimensional space, you can't always know if you're in a local optimum. So anyway, that measure of how often you should stop and reassess, that's the step size or learning rate. You can set that in a couple of ways; the kind of quick and dirty way is just a fixed length, like you'd be like, "Well, we're going to re-measure every so often, every 20 yards."
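The hole described here is what the episode calls a local optimum. A small sketch, using a toy function of my own invention, shows plain gradient descent stopping in whichever valley it starts near:

```python
def f(x):
    # A "double well": two valleys, with the left one slightly deeper.
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    # Derivative of f: 4x(x^2 - 1) + 0.3.
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, learning_rate=0.05, steps=500):
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

stuck = descend(0.9)    # starts near the shallow right valley
best = descend(-0.9)    # starts near the deeper left valley
```

Both runs stop at a point where everything around them slopes up, but only the second found the globally lowest point: `f(stuck)` is higher than `f(best)`, and plain gradient descent has no way of knowing that.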
So 20 yards, course correct, 20 more yards, keep course correcting until you're making no improvements. But you can also decrease that with time, because probably as you're solving a problem, you're getting closer and closer to the solution, right? So the chances you're going to overshoot the best answer get higher and higher, so you might want to take smaller and smaller steps to get there. And then there's also Newton's Method, which is good to look up for listeners that want a little bit more detail, but is fodder for an entire mini episode in and of itself. So yeah, in terms of like taking smaller and smaller steps, I definitely do that when I'm swimming laps. Oh yeah, tell me about that. Yeah, because as you get closer to the wall, you either have to like put your arm out in front of you or you risk ramming your head into the wall. Yeah, good example. We don't... I don't think you take smaller strokes, you just put your arm out in front of you. But you know, if you're going to like do something, like do a flip or something, it should be smaller. And you probably don't want to do a full stroke and slam your arm against the wall as well, so. Good point. And that's just in a swimming lane, which is basically one-dimensional, right, back and forth, and has these two edges, and you have to be careful there. So you can imagine how careful you have to be in more complex, multi-dimensional spaces. I was worried about our example of hiking in that it might over-trivialize the problem, right, because you can really see a little bit more in the fog and you have a general sense of where you're going and that sort of thing. But when you apply these problems in industry and in like, you know, academic problems and larger scales, you really have complex spaces that you couldn't exhaustively search.
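The shrinking-steps idea mentioned here is a decaying step-size schedule. A sketch of one common choice, a 1/t decay (my own toy example, not from the episode):

```python
def descend_with_decay(grad, x0, initial_rate=0.3, steps=1000):
    """Shrink the step size over time: bigger leaps early on when we're
    far from the answer, careful small steps near the end."""
    x = x0
    for t in range(steps):
        rate = initial_rate / (1 + t)  # a common 1/t decay schedule
        x = x - rate * grad(x)
    return x

# f(x) = x^2 has its minimum at 0; its derivative is 2x.
result = descend_with_decay(lambda x: 2 * x, x0=10.0)
```

With a fixed step you'd keep hopping back and forth across the minimum forever; shrinking the step, like reaching out an arm as you approach the pool wall, lets you settle onto it.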
So having a good search and optimization procedure to find yourself a good solution without getting stuck in a local optimum is sometimes a tricky thing you have to be very careful about. So what happens when you overshoot it? Well, it depends on your algorithm. So let's take the example of gradient descent, right, and imagine you're kind of in a valley that's spilling out to the ocean, but you're zigzagging down it, right. So you're on one side and you see the fastest slope going downwards is in a certain direction, you walk there. And what actually happens is you go down through a trough and back up the other side of the next ridge. It would have been better to stay in that trough, but you missed it because you jumped from point to point. But what you'll see then is you'll be like, oh, I'm on a new ridge. And if I keep going the way I just have been going, it's going to go up. So I'm going to turn around and look in all the other directions and I'll notice like, oh, back the way I came and also down is a pretty lucrative path. So I'll go that way and maybe this time you'll end up in the trough or maybe you'll overshoot it again and zigzag down, but hopefully eventually you'll find the very bottom of that valley and you'll stick to it. But of course, there's no guarantees of that. And that's why there are so many variations on and alternatives to gradient descent. But even for being sort of the simple vanilla algorithm for optimization, in a lot of situations, it does pretty well. So why would you use this method instead of measuring as you go? A lot of that comes down to computational resources and also convergence. Computational resources are about how much time you have and how many computers you can throw at this problem to solve it efficiently, if there are so many options you need to check and a certain amount of time you want to dedicate to finding it. There's also the consideration of convergence, right?
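The zigzag-down-the-valley behavior is easy to reproduce: with a step that's too large, each leap clears the bottom of the valley and lands on the opposite slope, so the sign of the position flips back and forth while the overshoot shrinks. A minimal sketch (names and toy function are mine):

```python
def zigzag_path(grad, x0, learning_rate, steps):
    """Record every position so we can watch the overshooting."""
    path = [x0]
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
        path.append(x)
    return path

# On f(x) = x^2 (derivative 2x), a large step leaps clean over the
# bottom of the valley and lands on the opposite slope each time.
path = zigzag_path(lambda x: 2 * x, x0=1.0, learning_rate=0.9, steps=20)
```

Here each update multiplies the position by (1 - 2 × 0.9) = -0.8, so the hiker crosses the trough every single step but still ends up at the bottom eventually, just as described.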
You'd like to get to a good answer pretty quickly if you can. If it's expensive to calculate your local position and calculate the local derivative to find the right direction you want to go in, then you want to make a limited number of those calculations. So you want to be very precise in where you land next and make that calculation. But what you're describing about so-called calculating as you go, that's a little bit like stochastic gradient descent, where you make very small movements based on every observation you have, like if you have a big pool of observations. If you look at one randomly, adjust according to that, then look at another and adjust after that one, hopefully you converge in very quick but tiny steps. If you use one algorithm and then switch to the other one, can you tell right away which one was better? Oh, that is an excellent question, Linda. You've actually just invented a whole branch of optimization. This is what I call ensembling. There's a couple of names for that. But yeah, there's good reasons why you might get into certain situations and say, "Hey, let me switch my strategy, because here I'm in like a very flat region, or here I'm in a region that has certain features that are good for one method or the other." And that might also be called a meta-algorithm. So there is a lot of good work on that and some of those processes can be fruitful, but there's also the no free lunch theorem, which we'll just have to leave as foreshadowing for a future episode. Okay. Well anyway, I think that's a pretty good coverage of gradient descent. Does this make you more or less interested in going hiking in a potentially fog-covered region somewhere in Scotland? Oh, I think this is a region that is more in your mind than in Scotland. So no, I do not want to be lost in this landscape. Many television shows portray it in exactly this way, I'll have you know. Okay. I'll just have to take your word for it.
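The one-observation-at-a-time adjustment described here can be sketched as follows. This is a hedged illustration of the idea only; the names are mine, and estimating a mean by minimizing squared error is my example, not the episode's.

```python
import random

def sgd(data, grad_at_point, x0, learning_rate=0.05, epochs=50):
    """Stochastic gradient descent: adjust after each randomly
    chosen observation instead of surveying the whole pool first."""
    x = x0
    points = list(data)
    for _ in range(epochs):
        random.shuffle(points)       # look at observations in random order
        for p in points:
            x = x - learning_rate * grad_at_point(x, p)
    return x

# Estimate the mean of some observations by minimizing squared error:
# f(x) = sum over p of (x - p)^2, with per-point gradient 2 * (x - p).
data = [2.0, 4.0, 6.0]
estimate = sgd(data, lambda x, p: 2 * (x - p), x0=0.0)
```

Each tiny step uses only one observation, so individual moves are noisy, but across many passes the estimate hovers near the true minimizer (here, the mean of the data).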
Well, thanks again for joining me, Linda. Thank you. And until next time, I just want to remind everyone to keep thinking skeptically of and with data. Good night, Linda. Good night. For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. [music]