Archive FM

Data Skeptic

Machine Learning Done Wrong

Duration:
25m
Broadcast on:
01 Apr 2016
Audio Format:
other

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog post, Machine Learning Done Wrong.

Cheng-tao Chu is an entrepreneur who has worked at many well-known Silicon Valley companies. His paper Map-Reduce for Machine Learning on Multicore is the basis for Apache Mahout. His most recent endeavor has just emerged from stealth, so please check out OneInterview.io.

[MUSIC] Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism. [MUSIC]

Cheng-tao Chu is an entrepreneur working on a stealth-mode startup. He has previously been at Google, LinkedIn, Square, and Codecademy, working on many data initiatives as well as applied machine learning. His paper, Map-Reduce for Machine Learning on Multicore, is the basis of what became Apache Mahout. He blogs at ml.posthaven.com and recently wrote a post called Machine Learning Done Wrong. It's that post that I invited him here to discuss today. Cheng-tao, welcome to Data Skeptic.

>> Hi everyone, I'm Cheng-tao and I'm super happy to be here.

>> I have some listeners that are just getting into ML and also some experts, of course. So maybe for the benefit of the more novice people, could you define something I'm sure we're going to talk about a lot in the show: what is a loss function?

>> The loss function is a function that gives you the estimated performance of each model. Let's use linear regression as an example. The way it works is that you assume the response variable is a linear combination of the predictors, and you want to find, among all possible coefficient combinations, which one minimizes the mean squared error. In this case, the mean squared error serves as the loss function: it gives the estimated performance of each coefficient combination, so that the optimization algorithm can find you the best one. The loss function is sometimes also referred to as the objective function, since it gives the objective for the optimization algorithm to use in finding the best model.

>> Yeah, a very common loss function, as you point out in your blog post, is the mean squared error. In fact, I imagine it's probably the most common loss function used, if we did a survey of some kind. But it's not sufficient for all problems. Why not?

>> Yeah, absolutely right. Mean squared error is a very solid default loss function. As to why it is not sufficient for all problems, we can probably look at it from at least two different angles: the numeric perspective and the more probabilistic perspective.

From the numeric perspective, if you compare it against, say, mean absolute error, it penalizes larger individual errors more than mean absolute error does, because mean squared error squares the difference between the actual and the predicted value. This essentially leads to the optimization algorithm being heavily affected by outliers: outliers give you a much larger difference between the actual and predicted value, and mean squared error squares that. This might not be a desirable effect, or it could be the desirable effect. There's no right or wrong; it's up to us to make the call by carefully looking at our data, and to pick the model and the loss function whose assumptions fit the properties of the data the best.

>> Yeah, I think that's a very key insight.

>> We can also look at it from the probabilistic perspective. Let's use linear regression again as an example. In the case of linear regression, using mean squared error is equivalent to using maximum likelihood estimation, if we assume the error on the response variable is i.i.d. normal. That's at a very high level. So if we believe the error on the response variable is indeed i.i.d. normal, then mean squared error would probably be the best loss function you can use to optimize. But on the other hand, if the error violates this i.i.d. normal assumption, the more it's violated, the worse mean squared error becomes as the loss function.
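To make the numeric point concrete, here is a minimal sketch in Python, not from the episode, comparing mean squared error and mean absolute error on a handful of made-up residuals where one value is an outlier.

```python
import numpy as np

# Hypothetical residuals (actual minus predicted); the last one is an outlier.
errors = np.array([0.5, -0.3, 0.2, -0.4, 10.0])

mse = np.mean(errors ** 2)      # squares each error, so the outlier dominates
mae = np.mean(np.abs(errors))   # grows only linearly with the outlier

print(f"MSE: {mse:.2f}")  # ~20.11, driven almost entirely by the single outlier
print(f"MAE: {mae:.2f}")  # ~2.28, far less sensitive
```

Whether that extra sensitivity to the outlier is a feature or a bug depends on the data, which is exactly the judgment call Cheng-tao describes.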
>> So I'm going to encourage people to go and check out your blog post as well; I have it in the show notes, and we're actually talking through a lot of its finer points. One of the next ones you mention is something that I see very often when I work with students or people I mentor, and that's not addressing the class imbalance problem. Could you share what that is, and maybe some of the techniques you've used to address it?

>> Yeah, but first of all, I don't think class imbalance by itself is necessarily a problem per se. It is more about whether we believe that the loss function correctly reflects our business objective.

>> Uh-huh.

>> Most machine learning algorithms are designed to get as many samples right as possible, regardless of the class distribution. So if the business objective is indeed to get as many samples right as possible, then class imbalance is not necessarily a problem at all. But take, on the other hand, a fraud detection problem, where the majority of the transactions are good transactions and very few of them are fraudulent, but our business objective is to get as many fraudulent transactions right, rather than as many overall transactions right. That's a situation where we have a misalignment, so we need to address the class imbalance problem. Common techniques include over-sampling, under-sampling, or tweaking the penalty matrix if your training algorithm allows it.

>> Yeah, it would be very easy to get even a 99% accurate model to detect fraud if you just always assume there's no fraud.

>> Exactly.

>> Yeah.
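As a rough illustration of those techniques, here is a short sketch using scikit-learn (my example, not Cheng-tao's) on a made-up, heavily imbalanced fraud dataset: one model reweights the loss via class weights, another naively over-samples the minority class. Both will typically catch noticeably more of the fraud rows than a plain fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up imbalanced data: 990 legitimate transactions, 10 fraudulent ones.
X = np.vstack([rng.normal(0.0, 1.0, size=(990, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 990 + [1] * 10)

# Option 1: reweight the loss so mistakes on the rare class cost more.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: naive over-sampling: repeat the minority rows before fitting.
fraud_idx = np.where(y == 1)[0]
resampled = np.concatenate([np.arange(len(y)), np.repeat(fraud_idx, 99)])
oversampled = LogisticRegression().fit(X[resampled], y[resampled])

# A plain fit tends to favor the majority class; compare recall on the fraud rows.
plain = LogisticRegression().fit(X, y)
for name, model in [("plain", plain), ("weighted", weighted), ("over-sampled", oversampled)]:
    recall = model.predict(X[y == 1]).mean()
    print(f"{name:12s} recall on fraud class: {recall:.2f}")
```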
>> So another common mistake you point out is assuming that data is linear. Can you provide some examples of non-linear situations, and what algorithms are maybe most robust for dealing with these cases?

>> Fraud detection would be a great example, yeah. Imagine you are predicting fraud, and you have transaction features like billing address, shipping address, transaction dollar amount, number of past transactions, and, say, user sign-up time, whatever. Something like a tree-based modeling algorithm could give you a rule like: if the billing address equals the shipping address, and the transaction amount is less than $100, and it's not shipping to [inaudible], it's a good transaction. A linear model, on the other hand, would essentially say something like: X times transaction amount, plus Y times billing-equals-shipping, plus Z times user lifetime, and so on. It is more about what you believe would be the better model. Would you believe a linear combination would give you a good result, especially after you combine all the features together? As for what algorithms are more robust for those cases, I would say some high-variance models like tree-based algorithms, or non-parametric models, would be a better fit.

>> You also provide the advice, and I don't know that it would be controversial, but it's certainly interesting advice: forget about outliers. In terms of mistakes being made, do you give this advice because you think people are spending too much time looking at outliers when they might not matter so much, and that distracts from the main problem? Or are you encouraging readers to avoid overfitting by forgetting the outliers?

>> Yeah, so first of all, guilty as charged, the wording was confusing; I should have edited that. What I meant was more that it is a mistake when people forget to deal with outliers. That said, outliers, I believe, need to be dealt with properly, and how to deal with outliers really depends on what causes them. If the outliers are caused by, say, mechanical error, say your thermometer was broken and you collected wrong data, then this is definitely something you would want to throw away. Otherwise your model will be built upon something not generalizable. However, sometimes there is a very interesting cause behind the outliers, so it might make sense to dig deeper to understand what happened. For instance, in a fraud transaction detection case, if I see a really high dollar amount on a fraudulent transaction, I would really want to dig deeper into what's going on, because most fraudulent transactions want to keep the dollar amount low enough to avoid being detected. So if I see something unusual and there's some reason behind it, I want to learn, I want to see if there's any new attack ring I wasn't aware of, before I decide how to deal with the outliers. After I figure out what's going on, I can choose: okay, this is something I want the model to memorize and generalize, or I don't want it to memorize or generalize.

>> Yeah, it makes sense. I find that I often learn a lot about my data sets by looking into my outliers. And I actually have a funny story I'll share; I'll see if you know the punchline before I get to it. I was working on some analysis of location data, and I had aggregated it to sort of a city level. And everything looked very smooth and very in line with a comparison to the US Census, with two exceptions: I had two strong outliers. One was Poughkeepsie, New York, and the other was Beverly Hills, California. Do you already know why?

>> Not really, so share with me.

>> Sure. The data I was looking at was at the city level, but it had been aggregated from user-entered zip codes. And I found it because I first looked at Beverly Hills, and I saw many more users than should have been there in 90210, which is, of course, the famous TV show. And Poughkeepsie happens to hold the zip code 12345. So I guess these are the most popular fake zip codes in that data set.

>> Yeah, yeah, absolutely. That's a very interesting insight, right? So then you can decide, okay, those are indeed something you don't want your model to memorize, right?
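Before deciding whether an outlier is a broken thermometer or a new fraud pattern, you first have to surface it. Here is one simple sketch, not something discussed in the episode, that flags values far outside the interquartile range so a human can investigate them; the amounts and the 1.5 multiplier are just illustrative defaults.

```python
import numpy as np

def flag_outliers_iqr(values, k=1.5):
    """Mark points that fall more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up transaction amounts; the last one is suspiciously large.
amounts = np.array([12.0, 25.5, 8.99, 30.0, 22.5, 18.0, 4999.0])
print(amounts[flag_outliers_iqr(amounts)])  # [4999.]  worth a closer look before dropping it
```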
>> You also talk a little bit about the "N much less than P" problem. Can you define it, and then maybe we get into it?

>> The common convention is that N represents the number of samples and P represents the number of features. This is essentially referring to when you have a lot more features than samples, which means you have a very large space to explore for your model. This is a very common problem in certain domains, like bio-related or medical-related ML problems. For instance, in medicine you can have a certain disease for which you only have tens of observations, but then for each observation, each patient, you might accumulate, say, thousands of different features, and sometimes they can be time series as well. So you have a much larger feature space than the samples you have.

>> There are some people who would say we're living in the era of big data: with commoditized hardware, it's very easy and cheap to capture and store tons of information almost effortlessly. So there are many, many domains that have this N much less than P problem going on. What advice do you have for a data scientist or machine learning researcher working on a problem like this?

>> Yeah, like you said, I think right now we have a lot of data, storage is no longer a problem, and there are a lot of frameworks which allow us to really efficiently process all of that data. That's the good side. The bad side is that, fundamentally, when you have N much less than P, the challenge is how you find the right model when you have so few samples. I think the key is still how you regularize your model. This is the part where people need to be really creative about how to regularize. You need to be able to build your domain knowledge into your regularization approach, whichever works best with your modeling algorithm. Say you are using a probabilistic model: you might decide, okay, I have this insight, so I should add these latent variables here, and that will help regularize all the different outputs. Or I can add a prior distribution, say a Dirichlet, which will act as pseudo-counts for the multinomial. Those will help you regularize your search space. How you translate your domain knowledge into those regularization techniques is the key.

>> Yeah, that's often the most challenging part of applied machine learning, I've found.

>> Absolutely.

>> So your blog post really astutely points out the importance of standardizing your features before doing regularization like that. What are some common mathematical techniques for standardizing features?

>> First of all, let me clarify: I actually didn't mean standardizing features before applying all the different modeling algorithms, nor all regularization algorithms. I meant standardizing features before using certain regularization along with certain modeling algorithms, for instance regression with L1 or L2 regularization. And I think what you're asking is more about the common techniques for feature engineering. For feature engineering, essentially what we want is to extract the most signal out of the raw information while minimizing the noise, and to make those signals easy for the modeling algorithm to take advantage of. There are a lot of things you need to think about, and it's actually a really hard problem. Standardizing features is just one technique for feature engineering. It works well when you have many normally distributed features and you want to put them on an equal footing, essentially zero mean and unit variance. Aside from that, feature engineering is something I actually haven't been able to summarize in a few simple principles. Nevertheless, there are some modeling algorithms, like random forests or decision trees, which require much less feature engineering, as they handle categorical features, continuous features, missing values, correlated features, et cetera. One thing I do want to point out is that for feature engineering, you really want to capture the true signal from each variable. For instance, if you see that a variable is exponentially distributed, how do you capture that core signal? The raw numerical value might not be the best way; you might want to take a log, or whatever, depending on your modeling algorithm. But you have to think about how you capture the true underlying signal.
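Here is a minimal sketch of that "standardize before L1/L2" point using scikit-learn, which the episode doesn't mention by name; the housing-style data is made up and the alpha value is arbitrary. Without the scaler, the L1 penalty falls very unevenly on a feature measured in dollars versus one measured in single-digit counts.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Made-up features on wildly different scales:
# price in dollars (hundreds of thousands) and number of bathrooms (1 to 4).
price = rng.uniform(100_000, 900_000, size=200)
baths = rng.integers(1, 5, size=200).astype(float)
X = np.column_stack([price, baths])
y = 1e-5 * price + 0.5 * baths + rng.normal(0, 0.1, size=200)

# Standardizing first puts both features on an equal footing (zero mean,
# unit variance) before the L1 penalty shrinks their coefficients.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print(model.named_steps["lasso"].coef_)  # both coefficients end up on comparable scales
```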
>> If I fail to do that, what are some consequences of my inaction?

>> Again, that depends a lot on which modeling algorithm you choose. If you use a random forest, it might not be a big deal. On the other hand, say you are using linear regression with L2 regularization, and let's use a hypothetical example. Say you have real estate data, and you are trying to predict the number of bedrooms based on the price and, say, the number of bathrooms. You'll get a very different model, giving very different predictions, if you rescale the price from millions to cents. In this case you didn't add any information and you didn't remove any information, but you changed your model dramatically, which is definitely not a desirable effect. On the other hand, it might also hurt you if you standardize a lot of features which are not normally distributed. So you need to be careful.

>> Too many transforms, you mean?

>> Exactly. Yeah.

>> Define multicollinearity and explain why someone working on machine learning problems should be concerned with it.

>> Multicollinearity is when you have multiple variables that are collinear, meaning one is a linear combination of the others. Why should we be concerned with it? This again depends on the modeling algorithm you pick. Generally speaking, correlated features don't give you much new signal and can confuse some of the machine learning algorithms. One example: if you are predicting, again, a real estate property's price using the number of bedrooms, the number of bathrooms, and other signals, we all know that the number of bedrooms and the number of bathrooms are highly correlated. You can end up with a model which gives a really high coefficient to one feature and a very low coefficient to another, because those coefficients can cancel out in the linear combination. For instance, say your true model is five times the number of bedrooms plus four times the number of bathrooms. An equivalent model could be 50 times the number of bedrooms minus 41 times the number of bathrooms, which gives you the same predictions, assuming the number of bedrooms and bathrooms are the same; sorry, I should have mentioned that earlier. That might not necessarily be a problem by itself if all you want is the predicted value, but it makes the problem very ill-conditioned and also makes the variance of the model much higher than necessary. To address it, we can consider removing the correlated features, or applying PCA, et cetera, to remove the correlation. But of course, we also need to be aware of the potential problems of each algorithm we apply, or not necessarily problems, but the different assumptions those algorithms make. For instance, regular PCA assumes normally distributed features as well, and if you have multiple principal components with similar weight, those components can actually rotate within the space they span. Whenever we apply a different pre-processing or modeling algorithm, we need to be aware of the assumptions it is making and make sure they fit the properties of the data.
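To see that coefficient instability numerically, here is a small sketch, not from the episode, fitting ordinary least squares on made-up data where bathrooms track bedrooms almost exactly. Across random samples the individual coefficients swing wildly while their sum, and hence the predictions, stays stable near the true 5 + 4 = 9.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

for seed in range(3):
    rng = np.random.default_rng(seed)
    bedrooms = rng.integers(1, 6, size=100).astype(float)
    # Bathrooms track bedrooms almost exactly: nearly collinear features.
    bathrooms = bedrooms + rng.normal(0, 0.01, size=100)
    price = 5 * bedrooms + 4 * bathrooms + rng.normal(0, 1, size=100)

    fit = LinearRegression().fit(np.column_stack([bedrooms, bathrooms]), price)
    print(f"sample {seed}: coefs = {fit.coef_.round(1)}, sum = {fit.coef_.sum():.2f}")
```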
>> One of the things I've noticed, as I've worked at different companies and places where machine learning is being applied, is that there are certain algorithms, logistic regression always comes up, and I would maybe also say decision trees, that people like because it's said that they're interpretable. That is, someone who has little or no understanding of machine learning can get a decent intuitive understanding of what the coefficients mean in the case of logistic regression, or how to read and parse a decision tree in the case of C4.5 or something like that. Although I've seen decision trees that have hundreds of nodes, and I don't know how interpretable that is exactly. You shared some insightful cautions about doing this in your blog post, interpreting the coefficients, that is. Can you walk listeners through why that may or may not be a wise thing to do?

>> Let's talk about logistic regression first. I personally haven't found any compelling reason why we should interpret the coefficients, and I've found there are so many reasons not to do so. First of all, in the example we just talked about, if multiple features are collinear, then the coefficients can shift among those features using the same data. So one day you can end up with, say, five times the number of bedrooms plus four times the number of bathrooms. That looks okay, it probably matches intuition. But sometimes you might find the coefficients shift, so it becomes nine times the number of bedrooms minus four times the number of bathrooms, whatever.

>> So tear out all the bathrooms, huh?

>> Yeah. So how does that make sense? How could it have a negative coefficient? Even for one very simple example, it's not interpretable, and if you try to interpret it, it might give you the wrong insight, which gives you worse results. Most of the data sets I have dealt with have collinear features. For instance, you can have temperature, humidity, and time of day; those are highly correlated. It is really hard to find uncorrelated data unless you do a lot of pre-processing, not to mention that if you are talking about big data, you have many, many more features, and it is really hard for them not to be collinear. Also, take fraud detection: if you have one feature called transaction amount, it makes a huge difference whether you use cents, or dollars, or thousands as the unit. Your coefficient will shift just based on your scale, your unit. So all of a sudden transaction amount can look like a very, very important feature, and if you change the scale it can look insignificant, without actually adding or removing any information. So your interpretation would be very, very unreliable. I just don't find any compelling reason to interpret those coefficients.

>> Yeah, it's something that really made me pause when I read it, and it's something I'm definitely keeping more in mind as I'm sharing the results of models with non-technical people, or even within teams.

>> There's another interesting thing: when we work on fraud detection, we also know some industries require you to have a model which is interpretable. I think the credit industry probably requires that; I'm not too familiar, but that's what I heard. So at least for some tree-based algorithms, people use them because regulation requires them to, less so because the scientists really want to interpret them.

>> Yeah, it makes sense.
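Here is a brief sketch, again my own illustration with made-up data rather than anything from the episode, of that unit-dependence point: refitting the same logistic regression with the transaction amount expressed in dollars versus thousands of dollars changes the coefficient by a factor of roughly a thousand, while the predicted probabilities barely move. Regularization is effectively switched off (large C) so that only the unit change matters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Hypothetical fraud data with a single feature: the transaction amount.
dollars = rng.uniform(1, 500, size=1000)
p_fraud = 1 / (1 + np.exp(-(dollars - 400) / 30))
is_fraud = (rng.uniform(size=1000) < p_fraud).astype(int)

thousands = dollars / 1000  # same information, different unit

# Large C effectively disables regularization, so only the unit differs.
m_dollars = LogisticRegression(C=1e6, max_iter=5000).fit(dollars.reshape(-1, 1), is_fraud)
m_thousands = LogisticRegression(C=1e6, max_iter=5000).fit(thousands.reshape(-1, 1), is_fraud)

print("coef in dollars:  ", m_dollars.coef_[0][0])    # roughly 1000x smaller
print("coef in thousands:", m_thousands.coef_[0][0])  # roughly 1000x larger
diff = np.abs(m_dollars.predict_proba(dollars.reshape(-1, 1))
              - m_thousands.predict_proba(thousands.reshape(-1, 1))).max()
print("max difference in predicted probabilities:", diff)  # essentially zero
```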
>> I think this, and everything that we've talked about today and in your blog post, is really sage advice that I imagine only comes from many years of good experience researching and applying machine learning. To wrap up, maybe you could share a little bit of your professional and research experience, and how people who have similar aspirations could follow a similar path?

>> Yeah, sure. I've done machine learning at different companies, again like Google, LinkedIn, Square, Codecademy, and also right now my own company. Actually, I can share a little bit about my own company, since we just came out of stealth mode.

>> All right.

>> What I'm working on right now is called OneInterview; our website is www.oneinterview.io. What we do, essentially, is that engineering candidates come to our site and take a coding challenge, and we help employers assess how strong those candidates are, using machine learning.

>> I could see where that'd be very valuable to many companies that are growing.

>> Yeah, exactly. Especially because we believe engineering skills should be show, not tell, and resumes are essentially very, very noisy.

>> Yes.

>> So that's why we decided to build this product. One of the things I have learned is that to build a great product, especially a machine learning product, you not only have to be the domain expert, but you also need to have as many tools in your toolbox as possible, in terms of modeling algorithms, pre-processing algorithms, different regularization techniques, et cetera. Being able to work on various problems in different settings really helps me look at problems from different angles, which I find really, really interesting. And I assume you also work on different problems at different places, so I'm not sure if you have a similar experience.

>> Absolutely. Yeah.

>> Yeah. I feel like the more tools you have in your toolbox, the better: when you have some data, you can find the right tool from your toolbox, rather than knowing only one tool really, really well that might not fit the properties of the data. So my advice for people who pursue machine learning as a career, as a profession: it is really important to understand more tools and put them in your toolbox. Then you have the flexibility, when different problems show up, when different data with different properties show up, to pick the right tool.

>> Makes sense. I'll be sure to put a link to your newly emerged startup in the show notes, www.oneinterview.io. That'll be there in the show notes. If you're on your mobile device, in most podcast players you can swipe to the right and hit it, or head over to dataskeptic.com and you can check it out there.

>> And there's another thing I want to point out: even though I don't tweet as much anymore, I plan to restart blogging and posting. So if you want to follow me, I'm @chengtao_chu.

>> Excellent. Yeah, I'll put that in the show notes too. Cheng-tao, thank you so much for taking the time to come on today.

>> No, it's my pleasure. I had fun. Thank you so much for the invitation.

>> Yeah. And until next time, I want to remind everyone to keep thinking skeptically, of and with data.

For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher.

(upbeat music)