Archive FM

Data Skeptic

[MINI] Covariance and Correlation

Duration: 14m
Broadcast on: 30 Oct 2015

The degree to which two variables change together is measured by their covariance. This value can be normalized to the correlation coefficient, which has the advantage of being a unitless measure bounded between -1 and 1. This episode discusses how we arrive at these values and why they are important.

[music] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Our topic for today is covariance and correlation. So Linda, I'm wondering if you've ever described something as being highly correlated, or if you've ever read that somewhere? Probably read it in news articles. Yeah, what kind of news articles? The ones about health that I usually read. What sort of health things are highly correlated? How many calories you consume and how much you weigh? Yeah, that's definitely correlated. And we all have this sort of intuitive idea about what correlated means, right? Would you hazard a definition? To me, I just think of related. Yeah, that's a pretty good definition. But of course, there's a mathematical definition that underlies it that we use in statistics. Well, maybe you want to talk about the common person's definition before you dive in. Okay, let's go more into that. You said they're two things that are related. That's a pretty good vernacular definition, I would say. So related can mean a lot of things, you know, like they could be inverse or positive, like as one goes up, the other goes up, or... Yeah, that's a good one. One goes up, then the other one goes down. Yep, that's positively correlated and negatively correlated, for sure. Or as one goes down, the other goes down. Yeah, that's a really good way to think of it. It's the direction things move in together. Intuitively, positively correlated means, as you say, as one goes up, the other goes up. Negatively correlated is that they're doing the opposite, right? As one goes up, the other goes down. Or vice versa. So yeah, to be correlated, you need two variables. That's right. Because otherwise, you just have one, there's nothing to correlate. Astute point. Well, there's this other thing called autocorrelation, but that's a topic for another time.
Okay, well, now you could dive in and tell me what it means on the math level. Let's rewind a couple hundred years and imagine that, before much of statistics had been invented, we're sitting around and we have this idea about correlation, and we'd like to define what it means mathematically. There's a nice property of multiplication we can take advantage of here. What happens when you multiply two positive numbers together? I think its product is positive. Right. How about negative times negative? Positive. Right. What about positive times negative? Negative. And negative times positive, also negative. So you see, if we take advantage of the sign, when the signs are the same, multiplying makes it positive. But when the signs are different, it always goes negative. Isn't that a nice little feature, maybe one we could take advantage of? You don't think that's as cool as I do, do you? No. Okay. Well, that's sort of the starting point. There's this really nice mathematical property, so how can we take advantage of that? We can look at some data, so let's start to talk about something specific. You've just completed your bike trip that I think we mentioned before, right? No, we haven't mentioned it before. Oh, well, tell me about it. So I did a fundraising bike ride called Bike MS, which is in LA. They have them all across the country, but the one I did is from LA to Santa Barbara. So it's two days of biking, 90 miles total, so 60 miles on day one and 30 on day two. Very cool. How much riding is that on each day? How many hours? I mean, the average person probably takes like five hours to do day one. That's great. Yeah. You're going to do it again next year? Yeah. I'm going to do it again next year. It was really fun. I did it with like nine other coworkers. You'd mentioned earlier about calories and weight. I think there's something similar going on here. Did you just have a standard little light meal on this trip?
So I was telling Kyle earlier that, so let's say the average person bikes 15 miles per hour for one hour, and if you want to train, I would train for five to six hours once a week. So I'm biking for five to six hours, going 15 miles per hour. So let's say I burn 500 calories per hour, so biking for five hours is 2,500 calories, which is probably about a pound of weight. Well, 3,600 calories is one pound. That's the metric that sticks out in my mind. Anyway, go on. What that means is, and I learned this after my first really long bike ride, is that you have to constantly eat while you're biking. Otherwise, the calories you're burning exceed the amount of calories that you have in store. And, like, breaking down your fat and muscle and such doesn't happen as quickly. Like, your energy sources can't be replenished as quickly as if you just eat every hour and replenish those 500 calories via eating. So those two things are correlated. The more you bike, the more calories you burn, right? Yeah, I guess so. What about a negative correlation? How about the number of back rubs you get, and how much back pain you feel? So yeah, I mean, I guess the more back rubs Kyle gives me, which is you, I guess the back pain goes down a little bit. I would say a lot, I think. You don't know how I feel. Well, how do you propose we measure it? Well, you could ask me every day what my back pain level is, and if it's above acceptable, you have to give me a back rub. All right, I'm adding skepticism of self-reported data to our list of future topics for this show. In the meantime, I think we can agree that, yes, whatever the scales are, as the number of back rub events goes up, or perhaps minutes of activity, the pain in your back goes down. So as one goes up, the other goes down, negatively correlated, right? Yeah. For each of these, they also have a mean, right?
Like, let's say you had a really distinct way of measuring how many calories you burned while biking, and you went and measured a bunch of times, like, oh, I rode this many miles today, and I burned this many calories. What do you think that graph would look like? Yeah, I mean, the more miles you ride, the more calories you burn. So if we took a bunch of measurements, and I put them all as dots on a scatter plot, what would that come out looking like? Well, I mean, I bike to work, so my commute is five miles each day, so there'd probably be a lot of dots around that. Yeah, and some variance, but around how many calories you burn, it's not always precisely the same number every day, but more or less, it would be like a diagonal line, right? As the miles go up, the calories go up, but it's not perfectly on the diagonal, there's a little bit of variation there. Yeah, I mean, it probably goes up at a steady slope. Yeah, I would say a linear relationship, absolutely. Each of those things, though, has a mean value, so if you took the average of each series, and then you take each point minus its mean and multiply the two differences together, you're going to get this nice sequence of numbers that is generally positive if these things go up together, or generally negative if they go against one another. Sure. On the scatter plot of the back rub stuff, it would be minutes of back rubbing on the x-axis, and amount of back pain on the y-axis. And that would also be sort of a diagonal line, probably also linear, but going down: as one goes up, the other goes down. Take each value minus its mean, and multiply those together, and you're going to more often than not get negative numbers. When you take the product of all that, or the expectation of all that, actually, when you take the expectation of that, you will get this value that is what's called the covariance. And I haven't given the exact formula, because I don't like reading formulas on the show, but that's the intuition of what covariance is.
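For readers who want the formula that gets skipped on air, the intuition above matches the standard textbook definition. This is an editorial aside, not something read out in the episode: X and Y stand in for the two paired series (say, miles and calories), the mu and bar terms are their means, and n is the number of paired measurements. Some texts divide the sample version by n - 1 instead of n.

```latex
\operatorname{cov}(X, Y)
  = \mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]
  \approx \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Each term in the sum is positive when both values sit on the same side of their means and negative when they sit on opposite sides, which is the sign property of multiplication discussed earlier.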
It's how much two things vary together: co-variance, or covariance. So if I had a graph, does that whole graph of all the little dots and data points, does each point have a covariance, or does the whole graph have one covariance number? Good question, it has one number. It is a property of the two series. A series is one of the sets of data, so like the number of miles ridden, and then a second series would be the number of calories burned, and those correspond, right, they all pair up. Covariance describes how much those two series have a relationship to each other. Or actually, there's another case: what if one is flat, they're unrelated? So for example, let's say you plotted the number of miles you biked against the number of, I don't know, people who check out a certain book from the library on that same day. Those two things have no relationship, so we'd expect that they don't move together. On average, since they're unrelated, you'd get just as many positives as negatives, and they'd be of small impact. As you sum that up, you get to the single number called the covariance, which is a value that's really, really big if two series change very strongly together, or really negative if they change together a lot but in opposite directions, and a value kind of near zero if they don't have a strong relationship to each other. So the point of all this is just to say we take advantage of that little arithmetic feature to get to a single number that tells us how much two series change together. But what's kind of missing now is a way to scale this down a bit. So what did we say earlier about the relationship we expect between miles ridden and calories burned? They're just related. Okay, for a reference, I'm going to Bicycling.com, which reports that if you ride between 14 and 16 miles per hour and you weigh about a hundred pounds and you ride for an hour, you will burn four hundred and fifty-four calories. So that's like calories per hour.
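As a rough illustration of that single number, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than data from the episode: the series are randomly generated, the rate of 30 calories per mile is only loosely suggested by the Bicycling.com figure, and the names `rub_minutes`, `back_pain`, and `checkouts` are hypothetical stand-ins for the back rub and library examples.

```python
import random
from statistics import mean


def covariance(xs, ys):
    """Average product of each pair's deviations from its own series mean."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty series of equal length")
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 500

# Hypothetical data: every rate and noise level below is invented.
miles = [rng.uniform(3, 60) for _ in range(n)]
calories = [30 * m + rng.gauss(0, 40) for m in miles]              # rises with miles
rub_minutes = [rng.uniform(0, 30) for _ in range(n)]
back_pain = [10 - 0.3 * r + rng.gauss(0, 1) for r in rub_minutes]  # falls with back rubs
checkouts = [rng.uniform(0, 100) for _ in range(n)]                # unrelated to miles

print("miles vs calories:  ", covariance(miles, calories))        # positive
print("back rubs vs pain:  ", covariance(rub_minutes, back_pain)) # negative
print("miles vs book loans:", covariance(miles, checkouts))       # small next to the first
```

The three printed values land on very different scales because each pairing has its own units, which is the gap the conversation turns to next.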
If we looked at back rubs and pain, well, we never agreed on what the measurement of back pain is. But certainly it's going to have a different slope. Whatever the units of pain are, they aren't going to be on the same scale. Same scale as what? As the calories per hour. They're different units. So I'm not sure what your point is. Yeah, there's always going to be different units. So why are we talking about this? So covariance allows you to understand a sort of relative comparison of two things. But it's on a sort of arbitrary scale. But what maybe we'd like to do is say which of two groupings of things is more correlated. Bicycling and calorie burning: is that strongly correlated compared to, let's say, unemployment and interest rates? Those are two things that are also probably correlated in some fashion. But which is stronger? Well, we can't compare them because they're on different scales. Also, the covariance could be really, really big numbers or really small. It's sort of hard to use it for anything. So there's one extra transformation we can do here, and that's that we can convert it into correlation, or the correlation coefficient. And that's through just a little bit of mathematical trickery. I don't want to get into the details here. But you divide the covariance by the product of the standard deviations of each series. What that does is, kind of in the same way z-scores normalize everything out to the same scale, this process normalizes out covariance to a common scale that we call correlation. And that gives you a value between, always between, one and minus one. So something that has a value of one or very close to one is very highly correlated. Negative, same story. Anything near zero is independent, in that they don't vary together. So they're not correlated. What do you think would be a value of the correlation for bike riding and calorie burning? I don't know. I've never used this formula before. Do they just come out with one?
Well, how often does it come out with one? A good question. It almost never comes out with one. In fact, it will not come out with one unless it's an identical series. I have no idea then. It would be hard to guess what exactly is a good value for biking and calorie burning. Because actually, I find correlation a bit difficult to interpret. People always ask, well, what's a strong correlation? Is it .5? And I think it really is impossible to say; it depends on your data. If your data has a lot of variance in it, two things could be very highly correlated, but also have a strong variance to them. The amount of calories you burn, I'm sure, obviously is related to the exercise you're doing, but it also depends on aspects of your body and what your immune system is doing, all these other types of crazy things, perhaps even how hydrated you are. So they're not really locked in sync. There's all these other factors, and those contribute variance. So you wouldn't get a perfect one-to-one relationship, but you'd probably get a pretty high number. How high makes it significant? Well, it kind of depends on how much variance you expect is in your data already. But at least we know that if it's positive, that means what? Sounds like it's correlated. Yep. And if it's negative? Negatively, inverse. Yeah, yeah. What if it's zero or very near zero? Seems like they're not correlated. They're neither correlated nor inversely correlated. They're unrelated, like the price of rice and the number of TV shows on television. Well, if all the TV shows were about rice, I think there would be correlation there. Or would it be inverse correlation? Would we be more interested in shows about rice as the price went up, or more interested as the price went down? No, I'm not sure. All right. Next time you hear someone say something is highly correlated, what will you think about those statements in light of our conversation tonight? They say it's correlated. They're probably not going to give me a value.
They're just going to say it's correlated. Yeah. I really think we use this word in two contexts. We use it colloquially in this binary way, just like the sign of correlation, positive or negative, or I guess sort of trinary, because there's, you know, near zero, not correlated, we might say. So three classes. But you will see the values in print sometimes. If you look at a scatter plot, itty-bitty in the corner, someone will often write rho equals. So keep your eyes out for that and know how to read it. The important thing is to know the sign, you know, positive is correlated, negative is inversely correlated, zero or near zero is independent. The magnitude is not so easy to interpret, but at least all correlations are on this fixed scale of minus one to one. So you can compare them. You can say that two things are more or less correlated compared to one another, and that's quite helpful. But the other thing I want you to know is that how correlated you have to be before it's significant is a case-by-case thing. Well, anyways, thank you as always for joining me, Linda. Thank you. One quick announcement before we sign off. Data Skeptic is going to be giving away one free copy of Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. That's the print version of the book by last week's guest, Cameron Davidson-Pilon. If you'd like to be entered to win that book, the contest ends November 20th, so you've got about a week left. To find out the details and how to get entered, visit dataskeptic.com and check out the show notes for last week's episode, that's Bayesian A/B Testing. And until next time, I just want to remind everyone to keep thinking skeptically of and with data. [MUSIC PLAYING]
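The normalization step discussed in the episode, dividing the covariance by the product of the two standard deviations, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not anything presented on the show: the data are randomly generated, the 30 calories per mile rate and both noise levels are invented, and `noisier_calories` is a hypothetical series added only to show how extra variance from other factors pulls the coefficient down.

```python
import math
import random
from statistics import mean


def covariance(xs, ys):
    """Average product of each pair's deviations from its own series mean."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # a series' covariance with itself is its variance
    std_y = math.sqrt(covariance(ys, ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance(xs, ys) / (std_x * std_y)


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: the rate and noise levels are invented for illustration.
miles = [rng.uniform(3, 60) for _ in range(500)]
calories = [30 * m + rng.gauss(0, 150) for m in miles]
noisier_calories = [30 * m + rng.gauss(0, 600) for m in miles]
kilometers = [m * 1.609344 for m in miles]  # same rides, different unit

# Covariance changes when the unit changes; correlation does not.
print("covariance: ", covariance(miles, calories), covariance(kilometers, calories))
print("correlation:", correlation(miles, calories), correlation(kilometers, calories))

# More unexplained variance in calories lowers the coefficient.
print("noisier:    ", correlation(miles, noisier_calories))
```

Switching from miles to kilometers rescales the covariance but leaves the correlation where it was, which is why the coefficient can be compared across unrelated pairs of measurements.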