Archive FM

Data Skeptic

The Model Complexity Myth

Duration:
30m
Broadcast on:
11 Sep 2015
Audio Format:
other

There's an old adage that says you cannot fit a model that has more parameters than you have data points. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent examples of when it holds and when it doesn't. Some excellent visuals articulating the points can be found on Jake's blog Pythonic Perambulations, specifically on his post The Model Complexity Myth.

We also touch on Jake's work as an astronomer, his noteworthy open source contributions, and his forthcoming book, Python Data Science Handbook (currently available as an Early Release).

[ Music ] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes. [ Music ] Jake VanderPlas is an astronomer and data scientist with a PhD from the University of Washington. He's currently at the eScience Institute. He's a maintainer of and frequent contributor to a number of key data science Python projects, including AstroML, scikit-learn, SciPy, and several others. He's also a frequent speaker at conferences and author of the forthcoming O'Reilly book, "Python Data Science Handbook," which has recently become available as a preview edition. Jake, welcome to Data Skeptic. Thank you. So Jake, I've been following your blog for quite some time and really been enjoying all the entries as well as your talks. It was one blog post in particular that struck a chord with me that I thought would make a nice topic for an episode. It's the one you titled "The Model Complexity Myth." Can you start us out with a high-level summary? Yeah, so basically, in simple linear model fitting, there's this notion that you can't fit models that have more free parameters than data points. And that's a really good rule of thumb for these simple models, but I find that when you're talking about more complicated models, this rule of thumb doesn't hold. And one thing I've found in a lot of my teaching and chatting with people and scientists, especially those who aren't classically trained in statistics but use statistics every day, is there's this assumption that this rule of thumb holds in situations where it doesn't. So that's why I call it the Model Complexity Myth. Yeah, I think we can all agree that the myth isn't universal. Some models are too complex, but for comparison's sake, maybe you could provide an example of a model that is overfit or underdetermined, one where the rule of thumb really does hold. A simple example of a place where this model complexity rule of thumb holds is, let's say you're fitting a line to data and you only have one data point. There are an infinite number of lines that you can draw through that single data point. So that's a really good example of where this rule of thumb is a good thing. If you have a two-parameter linear model that you're fitting to one data point, you're going to be in for a bad time. And that extends upward: if you have a three-parameter linear model, like fitting a parabola to two data points, you have the same problem. You can fit an infinite number of parabolas to those two data points. Those are known as underdetermined models. And like I said, if you have a simple linear model, it's a great rule of thumb to say that you can't fit more parameters than data points.
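
To make that rule of thumb concrete, here is a minimal NumPy sketch (an editorial illustration, not code from the episode or Jake's blog): a two-parameter line fit to a single data point is underdetermined, since infinitely many (slope, intercept) pairs fit the point exactly and the normal-equations matrix is rank-deficient.

```python
import numpy as np

# One data point at x = 1, y = 1.  A two-parameter line y = a*x + b
# only needs to satisfy a*1 + b = 1, which infinitely many (a, b) pairs do.
candidates = [(0.0, 1.0), (1.0, 0.0), (-2.0, 3.0), (1e9, 1.0 - 1e9)]
for a, b in candidates:
    print(a, b, np.isclose(a * 1.0 + b, 1.0))   # every candidate fits exactly

# The normal-equations matrix has rank 1, not 2, so the two-parameter
# least-squares problem is rank-deficient (underdetermined).
X = np.array([[1.0, 1.0]])                      # design matrix columns: [x, 1]
print(np.linalg.matrix_rank(X.T @ X))           # -> 1
```
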
Yeah, you have some excellent visuals that articulate that point on your blog, Pythonic Perambulations. Am I saying that correctly? I think so, yeah. There'll be a link in the show notes, but if listeners want to type it in as they listen, it's jakevdp.github.io. I liked what you were saying a bit ago, that it's really a rule of thumb, and I know many students are taught that rule of thumb, that you can't fit a model with more parameters than data points. Do you think that's still a useful thing to teach early on, meaning that we should reveal the myth later? Or would you be in favor of a more general change to the pedagogy of modeling, to how it's taught? I think it's fine to teach it early on, because early on what you learn to fit are these simple linear models, before you add in some of the other pieces that I think we'll talk about later. I wouldn't change the way people are taught, except maybe to have teachers specify that this doesn't hold for all models, but only for the simplest ones. I thought your explanation of the model complexity myth in the context of linear polynomial expansions was a really elegantly simple way of sharing this insight. Could you describe your line of reasoning? Let me say it a little bit differently. I mentioned that if you try to fit a line to a single point, you get an infinite number of solutions, and the model is not well posed. Same thing if you think about going up a dimension and try to fit a plane to two points: there will be an infinite number of those planes that you can fit. Similarly, if you're thinking in really high dimensions, say a four-dimensional space where you're trying to fit a three-dimensional volume to three points, this gets a little abstract, but you can't do that either. These ideas generalize pretty well as you go up and up in dimension. What I did with the example there was to take a series of higher-dimensional models that people are familiar with: just fitting a line to data, fitting a quadratic to data, fitting a cubic to data, fitting a quartic to data. Each time you add another polynomial term, you have a higher dimension. It maps quite well onto that idea of fitting a line to points versus fitting a plane in two dimensions versus fitting a space in three dimensions. Can you talk a bit more about the underlying mathematical steps and under what conditions the solutions are problematic? From a linear algebra perspective, when you're solving a linear model, you're inverting this matrix that is basically the number of parameters by the number of parameters. The way that you build up that matrix is by combining the parameters with the data points. If you look at the matrix itself, the maximum rank of the matrix is the number of data points, because of the way it's built up. If you have fewer data points than parameters, then you end up with a low-rank matrix, a singular matrix. In linear algebra, we know that a singular matrix cannot be inverted, and this inversion is a central piece of computing the result of the linear model. If you have more parameters than data points, you basically have an uncomputable expression to solve your linear model.
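
That rank argument can be checked directly. The sketch below (again an illustration, not Jake's code) builds the normal-equations matrix for a five-parameter polynomial model fit to only three data points; its rank is capped at three, so it is numerically singular and the plain least-squares solve is ill-posed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=3)              # only three data points
X = np.vander(x, N=5)                      # five-parameter (quartic) polynomial model

# The normal-equations matrix is n_params x n_params, but its rank is
# capped at the number of data points, so with three points it cannot
# be full rank and cannot be inverted in the usual sense.
A = X.T @ X
print(A.shape)                             # (5, 5)
print(np.linalg.matrix_rank(A))            # 3 -- rank-deficient
print(np.linalg.cond(A))                   # enormous: numerically singular
```
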
I know most data scientists, or maybe I should just say most statisticians or students of linear algebra, it's a pretty broad area, but at some point we're taught to condition our matrices. We do that by adding some scalar multiple of the identity matrix in order to make them invertible. In my experience, that's a subtlety of this process that sort of gets lost sometimes: whatever scalar multiple of the identity you choose is going to affect the solutions you find, especially when you have really underdetermined problems. I liked your presentation of the frequentist and Bayesian approaches to this. I was wondering if you could share a little bit about those two perspectives. Yeah, well, first on that conditioning thing, it's actually interesting, because I was the same way; I was sort of taught that, you know, you can do this if you have a problem that's not working, and it was always mysterious what was going on. Actually, that conditioning is the linear algebra equivalent of a divide-by-zero error: if you compute one over x and x is zero, you get a divide-by-zero error. So you can get around that by doing one over x plus a small number, and then you no longer have a divide-by-zero error, but you have an answer that depends on what that small number is. Yeah, it's a great comparison. In linear algebra, adding a diagonal term to the matrix that you're trying to invert is the analog of that. What answer you get out depends on exactly what you add to the diagonal in that conditioning. To understand that, and to understand what this conditioning term is doing to your eventual solution, what I did in the post is take a look at how, in a frequentist sense, something called regularization allows you to compute these underdetermined models, and, in a Bayesian sense, how something called a model prior, a parameter prior, allows us to go ahead and compute underdetermined models. From a frequentist sense, if you look at this regularization term, especially if you're doing what's called ridge regularization, or L2 regularization, or Tikhonov regularization, I think it's called, there are all these names for it, it's essentially equivalent to adding this conditioning term to the diagonal of your matrix. The intuition behind ridge regularization is that you're penalizing the size of the model parameters. Let's say you're trying to fit a line to a point that's at x equals one, y equals one. You could have a flat line, a horizontal line, you could have a vertical line, you could have a diagonal line. There are an infinite number of possibilities. One possibility in there is that you have a y-intercept of one billion and a slope of minus 999,999,999, and you'd fit that point perfectly. But the intuition is that that's just a silly model, because if your parameters are meaningful in some way, you don't want them to cancel out that way. If you sum the squares of those parameters, you get something in that case that's really, really big, and you want to basically downweight that. That's the motivation behind regularization. You're trying to penalize large combinations of parameters. And if you apply that regularization term, it turns out to be mathematically identical to adding this conditioning term on the diagonal. So what the conditioning term does is it penalizes large combinations of parameters, say where the slope is a very large positive number and the intercept is a very large negative number, and their effects cancel out. And still, when I learned about ridge regularization, that penalty was always mysterious to me as well. It just seems like you penalize to some degree and you get out an answer, but what does that answer really mean? If you've read my blog, you know that I'm an outspoken proponent of Bayesian methods, and part of that is that the Bayesian method helps me really think about what's going on and what problems we're solving. If you look at this regularization, and in particular ridge regularization, from the Bayesian perspective, it turns out that there's, again, an equivalency. And the equivalency here is that the regularization is the same as applying a Bayesian prior on the parameters. And in this case, for ridge regularization, it's exactly the same as applying a Gaussian prior of some width that pushes your parameters towards zero and penalizes large values of those parameters.
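
The equivalence Jake describes is easy to verify numerically. The following sketch uses scikit-learn's Ridge estimator (its penalty strength is called alpha) to show that ridge regression with no intercept gives the same coefficients as solving the normal equations with a conditioning term lam added to the diagonal; in the Bayesian reading, lam corresponds roughly to the ratio of the noise variance to the variance of a zero-mean Gaussian prior on the parameters.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=3)                    # fewer points than parameters
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=3)
X = np.vander(x, N=5)                            # five-parameter polynomial basis

lam = 1e-2                                       # conditioning / regularization strength
A = X.T @ X + lam * np.eye(X.shape[1])           # add lam to the diagonal -> invertible
theta_conditioned = np.linalg.solve(A, X.T @ y)

# Ridge regression penalizes lam * sum(theta**2); with no intercept term the
# solution is identical to the conditioned solve above.
theta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(theta_conditioned)
print(theta_ridge)                               # the two agree
```

Shrinking lam toward zero pushes the conditioned solution back toward the unregularized, ill-posed limit, which is why the value you add to the diagonal matters so much for underdetermined problems.
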
When I had that insight, it helped me really understand the relationship between what's going on in simple linear algebraic conditioning, what's going on in the frequentist regularization, and what's going on in the Bayesian method, and after thinking about it that way, I feel like I have a grasp of what these various constructs are actually doing. Yeah, definitely. It was actually from reading your blog that I first saw that insight, and I've been searching for a good problem to apply it to, because I think it's a clever way to go at things. I also like the variety of the examples you wrote about that demonstrate when the model complexity myth is really manifesting itself. I thought we could go through and touch on at least maybe two of those. The first one is compressed sensing. Could you describe what that technique is? Yeah, so compressed sensing is this really cool thing where I think kind of the typical idea is that, let's say you have an image that's 1,000 by 1,000 pixels, and maybe for some reason you lost a lot of your data and you only have like 100 out of those million pixels scattered throughout the image. Can you use that small amount of information to reconstruct the full image? And in a simple model fitting sense, the answer is obviously no, right? How do you predict from 100 pixels what those other nearly million pixels are? But if you have some really strong prior on what the image is, then you can actually do this. And this is kind of the basis of compressed sensing. So what compressed sensing does is it uses our knowledge of the world, and our knowledge of the world might be a whole compendium of images out there that we think are like this image we're trying to reconstruct. And there are different flavors of this, but one of the flavors is that you can build up a dictionary of basis functions that kind of describe fundamentally what images in general look like. And from there, you can use that in a linear sense and try to solve the problem. But if you use a model prior or a regularization penalty that tends to choose small subsets of those basis functions to reconstruct your image, you can actually do these amazing things like reconstructing a full image from just a few of its pixels. There's an example on the website that's taken from a Wired article. It's a picture of President Obama where you have a few pixels throughout in a sea of black missing pixels. And through iteratively applying this, you can actually reconstruct the original image, which is still amazing to me. It comes straight from the sparsity prior along with a good set of basis functions. Yeah, can you describe the sparsity prior a little more? It seems like that's the real secret sauce here. For the sparsity prior, we were talking earlier about ridge regularization, where we're adding the squares of the parameters. The magic of the sparsity prior, which is known as L1 regularization or lasso regularization, is that instead of adding the squares, you add the absolute values of all the parameters. And that doesn't seem like a huge difference, but it turns out that when you do that, the prior tends to favor sparse models, models that have lots of zero-valued parameters. And there are some geometric intuitions to that that may or may not be helpful. Yeah, so if you apply that, then in effect you're reconstructing the model based on a small number of these basis functions. And that's how this works out.
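
A toy version of that recovery, using scikit-learn's Lasso as the L1 (sparsity) penalty rather than the image-dictionary setup from the blog post: a signal with only five nonzero coefficients out of two hundred is reconstructed from forty random linear measurements, a classically underdetermined problem.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

n_features, n_measurements = 200, 40                   # far more unknowns than equations
true_coef = np.zeros(n_features)
true_coef[rng.choice(n_features, 5, replace=False)] = rng.normal(3, 1, 5)  # sparse signal

Phi = rng.normal(size=(n_measurements, n_features))    # random measurement matrix
y = Phi @ true_coef + 0.01 * rng.normal(size=n_measurements)

# L1 (lasso) penalty: favors solutions with many exactly-zero coefficients,
# which is what makes this underdetermined problem recoverable.
model = Lasso(alpha=0.1, max_iter=50000).fit(Phi, y)
print(np.flatnonzero(true_coef))                       # indices of the true nonzeros
print(np.flatnonzero(np.abs(model.coef_) > 0.1))       # lasso recovers (roughly) the same support
```
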
Yeah, we've all recently had the absolute pleasure of starting to see these images of Pluto, so I want to reach out to your other discipline of astronomy and inquire whether this is a technique that could maybe help us enhance or construct better images of our universe. Yeah, that's where I first came across compressed sensing. I haven't seen a good application of this, but there's been some discussion in astronomy circles about whether you can use a compressed sensing technique to characterize the extent of what we see in the sky based on small amounts of observations. Because a telescope like Hubble is really, really expensive and you can't build enough Hubble telescopes to cover the whole sky. Right. So we need to learn about the universe based on tiny little snapshots of little pieces of the sky. So there's been some discussion about whether you can use compressed sensing techniques to reconstruct this information and maybe even to do things like decide where to point your instrument next in order to learn about various problems we're interested in. And like I said, I haven't seen any applications of that, but I'm optimistic that something might come up if someone puts their mind to it. Yeah, absolutely. It's a nice transition as well into one of your other examples, which was detection of extrasolar planets. Now is extrasolar the same as exoplanet? Yeah, exoplanet is short for extrasolar planet. Just want to make sure. I think everyone knows the objective of exoplanetary research, but could you describe some of the methods that people use to detect planets? Yeah, there are a lot of different methods to detect planets. Basically, the thing to realize is that in most cases, we can't actually see the planets themselves. So we're just seeing indirect effects of the existence of these planets. This has changed in a few cases; we actually have a few direct detections. But for the most part, yeah, we're not seeing photons from the planets themselves. So a couple of the more prominent ways of seeing these: there's one called the radial velocity method, where basically, when the planet goes around its sun, there's this equal and opposite gravitational reaction, and the sun, in response to the movement of the planet, is actually wobbling back and forth in its own little orbit. And if you look at distant stars, we can use spectroscopy to look at the radial velocity. We can look at the star and see it moving a little bit away from us, a little bit towards us, through the redshift of the light. And that allows us to detect planets. So that was an important method, maybe up until about four or five years ago. And it's still being used, but it's been completely eclipsed by this project called the Kepler telescope. And what the Kepler telescope is doing is it's looking for eclipsing planets, meaning that if you have a distant star and a planet orbiting around it, that planet might pass in front of the star and make that star just a little bit dimmer. So you don't see the planet itself, but you see it in the total amount of light coming from the star. There's this little dip, and that dip comes back periodically as the planet goes around and passes in front. So this has been the most important way of discovering extrasolar planets recently. And I should say that I've not really been involved in this research, but I have a lot of colleagues who do this, and it's fun to be able to talk about the good work that they're doing.
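
A toy illustration of the transit idea (nothing like the real Kepler pipeline): simulate a flat light curve with a tiny periodic dip plus photometric noise, then phase-fold it at the orbital period so the dip, invisible point by point, shows up in the binned average.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy light curve: flat stellar flux with a tiny periodic dip when the
# (hypothetical) planet transits, plus photometric noise.
t = np.linspace(0, 90, 5000)            # days
period, duration, depth = 3.5, 0.12, 5e-4
in_transit = (t % period) < duration
flux = 1.0 - depth * in_transit + 2e-4 * rng.normal(size=t.size)

# Phase-fold the light curve at the period; averaging many orbits in each
# phase bin beats down the noise and reveals the transit dip.
phase = t % period
bins = np.linspace(0, period, 60)
idx = np.digitize(phase, bins)
folded = np.array([flux[idx == i].mean() for i in range(1, len(bins))])
print(folded.min(), folded.mean())      # binned minimum sits roughly 'depth' below the baseline
```
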
One of the problems with this is that you're looking for really, really small signals, right? If you think about the size of the sun versus the size of, say, an Earth-mass planet, it might be a factor of a million smaller in area. Maybe not a million, maybe hundreds of thousands, something like that. So you're looking for this dip in intensity in the star that's like one part in 10,000 or one part in 100,000 or even smaller. And so you can imagine the difficulty there. There are all sorts of things going on in the background. The instrument has noise in reading these photons. There are astrophysical sources of bias, like stars themselves that get slightly brighter, slightly fainter at that level over time. So when you're trying to model the time series of the stellar brightness to look for these planets, there are all these confounding factors that you have to take care of. And one way people have addressed this is to essentially model everything and have all these nuisance parameters that are free-floating and allow for things like the intrinsic stellar variability or the instrumental error on the Kepler telescope. Once you account for all of these and you kind of average out their effects, or, in the Bayesian sense, marginalize out the effects of the nuisance parameters, then you can go ahead and find these planets. In this case, accounting for all these effects leads to really, really complicated models. I should mention that at the moment I don't know of any of these models that actually have more parameters than data points, because there's just a lot of data. But it's one of those things where you could imagine the models getting sophisticated enough that you do have that situation, and you do have a classically underdetermined model. So then my underlying data stream is that time series of stellar brightness observations? Exactly. So then I've got whatever appropriate model fits that, plus all those nuisance parameters, and it could absolutely seem like you'd end up with more parameters than you can pin down. Yeah, it makes sense.
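
Marginalizing out a nuisance parameter can be shown with a deliberately simple grid-based sketch (made-up numbers, not an exoplanet model): we care about a signal amplitude, treat an unknown offset as the nuisance parameter, and sum the joint posterior over the offset axis.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: y = signal * x + offset + noise.  We care about 'signal' and
# treat the unknown offset as a nuisance parameter to marginalize away.
x = np.linspace(0, 1, 20)
true_signal, true_offset, sigma = 2.0, 0.7, 0.3
y = true_signal * x + true_offset + sigma * rng.normal(size=x.size)

signal_grid = np.linspace(0, 4, 201)
offset_grid = np.linspace(-2, 3, 201)
S, O = np.meshgrid(signal_grid, offset_grid, indexing="ij")

# Log-posterior on the 2-D grid (flat priors assumed for simplicity).
resid = y[None, None, :] - (S[..., None] * x + O[..., None])
log_post = -0.5 * np.sum(resid**2, axis=-1) / sigma**2
post = np.exp(log_post - log_post.max())

# Marginalize: sum the joint posterior over the nuisance (offset) axis.
post_signal = post.sum(axis=1)
post_signal /= post_signal.sum()
print(signal_grid[np.argmax(post_signal)])   # peaks near the true signal of 2.0
```
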
I want to ask you as well about AstroML. I think that's one of the packages you are the initial author and maintainer of. Is that correct? Yeah, that's right. So AstroML is something I started a few years ago. It was actually in conjunction with an earlier textbook that I wrote with some colleagues, Statistics, Data Mining, and Machine Learning in Astronomy. It's a graduate text for astronomers focused on statistical methods. So we wrote AstroML to go along with all of that. And one of the cool things about this textbook, I'll just put in a little plug, is that every one of the 200-plus images in there has associated Python code to run the analysis and generate the image. And AstroML supports all that analysis and implements a lot of methods specific to astronomy, or methods that astronomers find useful. Do you think you're an astronomer that found their way to machine learning or a machine learning expert that found their way to astronomy? Definitely the first one. I started out as an astronomer, and ten years ago I had no idea that I'd be in this world. Sometimes I joke that I'm a fake astronomer now. I don't do as much astronomy research, but I do some. The stuff that I like to work on has started to become known as astrostatistics. Oh, cool. So basically I'm thinking about how to take classic methods from astronomy and generalize them to the cases that are more interesting today, where we have larger data sets or heterogeneous data sets or data sets with interesting noise properties. Yeah, how prolific is the use of ML in astro? Do you find yourself evangelizing a lot, or is it commonly assumed that people will apply these techniques? It's getting more and more common. One of the problems is that if you talk to someone who's a classic machine learning researcher, someone who's applying their methods maybe to text recognition or things like that, where you have a classic training set and a test set, you evaluate your model and make predictions, that doesn't really work. That classic approach doesn't really work well in astronomy. And that's because often the training sets we have and the test sets we have are statistically different enough that you can't just do blind machine learning approaches. So in astronomy, you have to be a lot more careful. You can't just throw a random forest or an SVM at your data and expect to get something useful out. So it's actually a really interesting area of application of these machine learning techniques. And that whole difficulty is one of the reasons that a lot of astronomers tend to lean towards Bayesian approaches, because they allow you to be a little bit more careful about how you treat biases in your data. I want to circle back to one more of the examples. I think it's the last one from your blog post. I thought it was a really novel use case in which, I guess to sum it up at a high level, you have a basic time series, well, maybe it's not a series, but a set of data that you're looking to fit, and you directly model each point in sort of a binary fashion of whether or not it might be an outlier. So of course then, even if it's some simple two-parameter linear model, you have at least one parameter for every data point. So this would be a good example of when the myth demonstrates itself. I thought that was a really novel and useful approach for the many problems where you might have a small series of data but a couple of just weird outliers that could heavily bias your results. Could you share a bit of the details on this approach? This is an approach where I was trying to distill down the kinds of things that people do when accounting for biases and noisy data. One of the types of things you can do is you can look at each data point and ask what's going on with each individual data point. If you look at the data point by itself, you can't really say much, but if you look at the data point in relation to everything else that's going on, you can say a lot, and then you marginalize out those individual effects to kind of find the global model. So the simple example that I showed is fitting a line to data where you have these outliers. And as everyone who's done linear regression knows, if you have outliers in your data, the classic chi-squared minimization just fails. The model is so drawn to those outliers that you end up with this crazy fit. What I did instead here is to fit a mixture model where each point is described by some mixture of a probability that it comes from the normal model and a probability that it comes from a background distribution. And in that mixture, there's one variable for each data point that says, you know, if it's zero, this is a normal point; if it's one, it's an outlier; and if it's somewhere between zero and one, the point is some mixture of a normal point and an outlier. So in the case I did here, you have ten points, and that means you have ten of these nuisance parameters that describe the bias in the data, plus you have the two parameters of the linear model you're fitting. So this means that you're fitting twelve parameters to ten points. In a classic sense, you can't do this. This is not a classic linear model; it's a little bit more complicated than that. And I also put in a sparsity prior on the outlier parameters. So I put in a prior saying that there are very few outliers in the data set, because it would be really silly if every point were an outlier. Yeah, yeah. So the net result of that is you have this nonlinear model plus the sparsity prior, and that pushes down the effective degrees of freedom to the point where you can fit your model to the data, even though it has two more parameters than data points. And the result is something that I think is pretty striking. In one swoop, you get a nice fit to your data and you get an identification of outliers. Because you can flip the tables a little bit, and rather than marginalizing over all these nuisance parameters for outliers, you treat the model itself as a nuisance parameter and you ask, "If I marginalize over everything but what I know about this point, what can I conclude about the point?" In one brush, you identify the outliers and fit the model to the data without having to do any weird iterative pre-processing or anything like that.
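
Jake's blog post implements this outlier mixture with a full Bayesian MCMC; the sketch below is only a rough MAP-style approximation of the same idea, with hypothetical choices for the mixture widths and an L1-style penalty standing in for the sparsity prior: two line parameters plus one outlier weight per point, twelve parameters fit to ten points.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(5)

# Ten points from a line, with two of them deliberately corrupted as outliers.
x = np.linspace(0, 10, 10)
y = 1.5 * x + 2.0 + 0.5 * rng.normal(size=10)
y[[2, 7]] += 15.0

def neg_log_post(params, lam=2.0, sigma=0.5, sigma_bg=20.0):
    """Two line parameters plus one outlier weight g_i per point (12 total).

    Each point's likelihood is a mixture of the 'good' model (width sigma)
    and a broad background (width sigma_bg); the lam * sum(g) term is an
    L1-style stand-in for the sparsity prior (few points should be outliers)."""
    slope, intercept = params[:2]
    g = params[2:]
    resid = y - (slope * x + intercept)
    like_good = np.exp(-0.5 * (resid / sigma) ** 2) / sigma
    like_bg = np.exp(-0.5 * (resid / sigma_bg) ** 2) / sigma_bg
    like = (1 - g) * like_good + g * like_bg
    return -np.sum(np.log(like + 1e-300)) + lam * np.sum(g)

ols = np.linalg.lstsq(np.vander(x, 2), y, rcond=None)[0]   # ordinary fit, dragged by outliers
p0 = np.concatenate([ols, 0.5 * np.ones(10)])
bounds = [(None, None)] * 2 + [(0.0, 1.0)] * 10
res = optimize.minimize(neg_log_post, p0, bounds=bounds, method="L-BFGS-B")

print(res.x[:2])                 # slope/intercept should land near (1.5, 2.0)
print(np.round(res.x[2:], 2))    # g_i should be largest for the corrupted points 2 and 7
```

In the blog post these per-point weights are treated as full Bayesian nuisance parameters and marginalized over by sampling, rather than optimized as they are in this sketch.
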
Yeah, it's a very cool technique, and I like that added benefit of identifying potential outliers, really neat stuff. So if someone listening thought maybe they were working on a problem where they wanted to call the myth into play, they wanted to say, "Oh, I'm going to introduce more parameters than I have data points here," are there any rules of thumb or maybe characteristics of appropriate problems that you could share? I think the main characteristic is you really have to understand where your data is coming from. And I'm saying that in the sense of these Bayesian, classically underdetermined models; one kind of keyword you can look up is forward modeling. And this is something that's really common in the astronomy community. Forward modeling is an approach in which you're modeling everything. In the astronomy case, you might model the detector. You might model the response of each individual CCD pixel. You model the effect of the atmosphere. You model the actual physical processes that are going on in the star or planet or galaxy that you're looking at. And these end up being really complex models in the sense that they have a lot of parameters. But because we understand enough about the physics, we can put in these physically driven priors, essentially, that limit the scope of what these parameters can do. So forward modeling is something where you're essentially simulating the entire system that you're looking at and seeing what happens with the data points. And then you can use a Bayesian approach: given that model, you can constrain what the parameters are and marginalize over the nuisance parameters to figure out what the parameters of interest are. So I should stress that, again, it really depends on a good knowledge of the properties of the system or the properties of the data set that you're looking at. Yeah, absolutely. Well, your book Python Data Science Handbook is now available as an Early Release from O'Reilly.
I've been going through it over the past week and I think it's a really great reference book, so I look forward to getting a printed copy on hand eventually. But I also see it as a really great introduction for people interested in hands-on, applied data science. This will now be on my short list of recommendations when people ask me how to get into data science. Was that your intended audience? Yeah, that was my attempt. I teach a lot of university courses and workshops, and also tutorials at tech conferences and things. And that's a common question that I was unable to answer: someone says, you know, I know a little bit of Python, or I know a little bit of C++, but I want to get into this data stuff and learn how to use Python from the ground up for answering these questions. There are a few kind of partial resources out there, but I was writing this book for those people so that I could give them a good suggestion. And the basic idea of the book is, you know, I assume the reader has some knowledge of programming. They know how to write for loops. They know how to design a program. They know how to run shell scripts and things like this, but they may not necessarily know Python, may not necessarily know the tools like NumPy and Pandas and SciPy and Matplotlib, and may not know the statistical principles and the machine learning principles that you need to be successful in this sort of field. So the book covers those in two sections. The first section is about using Python effectively for data analysis, for manipulating and visualizing and learning about and cleaning data. And the second half of the book, which I'm working on right now, is going to go more into the basics of statistics, you know, just enough statistics and just enough machine learning theory to really start to dive in. One of the things I firmly believe is that no single resource is going to prepare someone for everything. So one of the things I try to emphasize throughout the book is where you can go for more information. Because even the most expert user, the expert statistician, all these people are constantly going to references, whether it's something like a textbook, or something like Stack Overflow, or something different. You know, I never remember how to formulate this function call I want to do, so I'm constantly searching for that online. You've done a lot of great talks that I think people can find with a little bit of Googling and see some YouTube videos and whatnot. Are there any particular events coming up where people can see you? I've kind of cut down my travel schedule because I'm newly a father. Okay, congratulations. I have a child at home and I'm really excited to stay home. So I'm not traveling much this fall or spring, but hopefully sometime in the future. Well, good time to work on a book then. Yeah. Jake, thanks so much for coming on the show. This is a really cool discussion. I'm sure the listeners are going to really enjoy it. So one quick correction before we sign off: Jake is actually going to be having one speaking engagement, September 17 (that's 2015, in case you find this in the future) at the Stitch Fix offices in San Francisco. If you're interested in attending, check the show notes for more information. If you happen to be able to make it out, let them know you heard about it here. All right, everybody, until next time, keep thinking skeptically of and with data. [MUSIC]