Data Skeptic

[MINI] Auto-correlative functions and correlograms

Duration: 14m
Broadcast on: 22 Apr 2016
Audio Format: other

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The auto-correlative function, plotted as a correlogram, helps explain how a given observation relates to recent preceding observations. A very random process (like lottery numbers) would show very low values, while temperature (our topic in this episode) does correlate highly with recent days. See the show notes with details about Chapel Hill, NC weather data by visiting: https://dataskeptic.com/blog/episodes/2016/acf-correlograms

[music] Data Skeptic mini-episodes provide high-level descriptions of key concepts related to data science and skepticism. Today's topic is auto-correlative functions and correlograms. [music]
So Linda, do you know what a time series is? Is this a series of date and time stamps? Yeah, you pretty much got it. Shortest episode ever. Yes. [laughs] Time series is any data set or series that has a measurement of the same process or signal through time, usually at regular intervals. So, what's the process? Oh, I mean, a process can be anything. It could be like the price of a stock is a time series, right? You measure it every day in the morning or the average or whatever. The number of cups of coffee an office consumes in a day is a time series. The number of observed solar flares is a time series. Yeah, so the first thing that comes to mind is measuring temperature. Oh, it's a good one. Yeah, actually that has some really good properties for talking about time series stuff.
Where did you grow up again or where did you go to college? I grew up in Florida. Yeah, well, Florida has basically flat weather just like we have here, so let's talk about where you went to college. It got freezing. Oh, really? Yeah, there would be ice crystals on the grass during winter. Oh, so tell me in a few sentences what the weather is like in Florida. I don't know, it's classified as subtropical. But I mean, if someone had never heard of Florida, what would you tell them to expect? Oh, it's like high humidity, so like 100% humid. What about just the temperature? Well, I don't remember exactly, but during the summer, the average high is probably like 80, 90, and sometimes 100. The winter, I think the average high was like 70 or something. That's not that cold. Well, the thing is, I remember wearing shorts during the winter and just throwing on a sweatshirt.
So we can describe time series in a couple of ways. They're often like a composition of stuff.
So what we want to work our way up to is something called ARIMA, but we won't get to talk about that today. But you can imagine that a time series is some sort of trend, right, that like it's going up or down. It has seasonal components, as your weather example does, and it has some sort of noise. That's the fluctuation you can't explain, which we call the residuals. You were just describing the temperature in Florida in a seasonal way, right? Yes.
So what would be a good way, if I just said, oh, it's July 4th, what would you guess the weather would be based on the time series alone? In Florida? Mm-hmm. I mean, I don't remember, but probably 90s. If you had to guess a particular day and you had all the historical weather data, but you couldn't like check a thermometer, what would you guess the next day's temperature would be? Oh, still 90 degrees Fahrenheit.
All right, so imagine it was any random day of the year, and I gave you all the historical weather and asked you to predict today's weather. What would be like the most useful information in the whole history? For me to make the correct guess? Yeah. Time of year. Yesterday's temperature is pretty similar to today's, right? Yes. And even two days temperature, two days ago temperature is similar to today, as well, would you say? Yeah, so I guess you could just, well, the weather people, we just look at the day before. Yeah, do you think seven days ago relates to today at all? Yes. What about 70 days ago? 70? Yeah. Well, it does relate, but it's harder because then the season changed. Really, it doesn't relate that much, just much less than seven days. Yes. Well, what about 365 days ago? Well, that's awful, because that's actually like a year ago. Yeah, so time series has these interesting things.
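The questions above (how much does yesterday, 7 days ago, 70 days ago, or 365 days ago relate to today?) can be checked numerically. The following is a minimal sketch on made-up data, not the episode's weather data: the series, the seasonal amplitude, the noise carry-over of 0.7, and the helper names `pearson` and `lagged_correlation` are all assumptions chosen for illustration.

```python
import math
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(series, lag):
    """Correlate each observation with the one `lag` days before it."""
    return pearson(series[lag:], series[:-lag])

# Hypothetical data: six years of daily highs built from a seasonal cycle
# plus day-to-day noise that partly carries over from the previous day.
rng = random.Random(0)
temps = []
carry = 0.0
for day in range(6 * 365):
    seasonal = 60 + 20 * math.sin(2 * math.pi * day / 365)
    carry = 0.7 * carry + rng.gauss(0, 4)
    temps.append(seasonal + carry)

correlations = {lag: lagged_correlation(temps, lag) for lag in (1, 7, 70, 365)}
for lag, r in correlations.items():
    print(f"lag {lag:>3} days: r = {r:+.2f}")
```

On a series built this way, the correlation should be strongest at a lag of one day, weaker at 70 days, and strong again at a full year, which is the pattern described in the conversation.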
It has, one is the autocorrelation, like the local thing, how it would be in a moving average, like how it relates to the recent days, but then there's also patterns that show up in time series, like whether a year ago makes sense, right?
In other systems, like maybe retail sales, do you think there's any regularity to retail sales? Christmas shopping, New Year's, Thanksgiving, yeah, Black Friday. Yeah, so once a year, big spike in retail. Well, I feel like when they change seasons, they probably have big sales, which encourage people to buy and get rid of inventory. Maybe they have the bi-annual weekly sale or whatever they have, blue ribbon sale, something like that. Do you think today's total revenue at a retail company would relate to yesterday's? Yeah. Actually, it would, but you know what's probably a better predictor? Seven days ago, right? Correct, you want the same day of the week. Yeah, because day of the week has a strong effect.
So that's what's called a lag. Do you want to tell me what a lag is? So lag is a number of days. It's a certain time between your current observation and previous ones. In the case of like temperature, you notice that yesterday really strongly predicted today, right? Or it's very similar temperature, most likely. Most of the time we don't have big jumps. So that's a lag of one. But in the retail sales example, you probably expect a lag of seven because in order for you to see a repeat process, you'd expect seven days since there's such a weekly effect there, right? Yes. So we would like to have a nice way to quantify this, wouldn't you think? Quantify the lag. Quantify how autocorrelative a function is in terms of its lag. Okay.
So let's think about your weather example here. So I didn't know you were going to say Florida. I thought you were going to say Chapel Hill, where you went to school. So that's the data I pulled here that's in the show notes. This is a plot of Chapel Hill's weather.
Can you maybe describe it given that this is not a video podcast? Yes. It looks like one, two, three, four, five, six, six little hills that are steep, going back six years then. Huh? And what do you mean by they're steep? Well, because that's an annual effect. Oh, this is annual. Well, it just so happens that when I plot it like this, you can see the annual trend, that it's very cyclic. All right. The troughs, the cold temperatures are the winters and the high temperatures are the summers. But it's not smooth. Is it? It's up and down. Yeah. Even within the normal seasonal effect. Yeah. That's just sort of noise in the data that we can't explain alone by sort of the day-to-day trend.
We'd like to take the autocorrelative function, which is a measure of how much it correlates with itself. And there's a special type of plot. So if you scroll down, you could see a plot. Have you ever seen anything like that one? No. Well, let me explain it to you. I'm going to come over there. There you see the lags going out for almost all those years. Now this is an autocorrelative plot. Could you? It's kind of hard to describe. But what do you see? Well, there's an axis that says zero, and then it looks like wavelengths that go up and down. Yeah. So people should definitely check out the show notes, which might show up on your mobile devices. So as long as you're not driving, maybe swipe one direction and see if your player will show this.
But what this shows is a horizontal line for every day saying how much that day correlates with the current measurement. So correlation, when we talked about it in previous episodes, of zero means there's no relationship. Correlation of one means that they're perfectly positively correlated. Correlation of minus one means they're completely negatively correlated. The reason you see that wavy pattern is because as you identified earlier, very recent days are very highly correlated. But the longer you go back in time, the less correlated it is.
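To make the correlogram concrete, here is a minimal sketch of the sample autocorrelation function and a crude text version of the plot. It uses a made-up seasonal series as a stand-in for the Chapel Hill data, so the numbers, the noise level, and the function name `acf` are assumptions; the show notes plot may have been produced with a different tool or estimator.

```python
import math
import random

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (lag 0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]

# Hypothetical stand-in for the weather data: a yearly cycle plus noise.
rng = random.Random(1)
temps = [
    60 + 20 * math.sin(2 * math.pi * day / 365) + rng.gauss(0, 5)
    for day in range(6 * 365)
]

rho = acf(temps, 400)

# Crude text correlogram: one row every 30 days of lag.
for lag in range(0, 401, 30):
    bar = "#" * round(abs(rho[lag]) * 30)
    sign = "+" if rho[lag] >= 0 else "-"
    print(f"lag {lag:>3}: {sign} {bar}")
```

For a series with a yearly cycle, the values should start near one, pass through zero around a quarter of a year, turn negative around half a year, and come back up near a full year, which is the wavy shape described above.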
In the case of weather here, you get this interesting phenomenon that at a certain point you're at the opposite cycle of the season and it negatively correlates with the present. This also has a property that's a little deceptive because here it looks like all the recent days are all very correlated, which they are mathematically, but wouldn't you like to know which one correlates the most? Yeah. So they introduced this thing called the partial auto-correlative function. Its purpose, and it's below, is to help you understand what's called the AR parameter. So how auto-regressive this is. So here you see, maybe describe what you see here. The first two lines are very high, they're almost at one. The first one's at one and the second one's just right below one, so like 0.9 or something. So you know what this teaches us, that if you want to predict the weather, the most important day, so the second day, yesterday, is the most predictive day. And then two days ago, not very helpful in predicting the weather. Hmm, depends where you live, but for North Carolina, North Carolina has much more variability than LA. So that one's kind of interesting, right? It tells us a little bit about the auto-correlative lag involved in weather, at least for that city.
I did one more that I thought was interesting. We're putting in an offer on a house, right? Yep. Exciting. So I went ahead and looked at the prices in that neighborhood. Is this something that might be interesting to you? I'm interested. Alright, so this first plot is the price. What do you see there? This is an X and Y axis, so the bottom X axis goes from 2011 to present, it looks like. And then the Y axis says median sale price, so it ranges from 200,000 to 800,000. So this, to be clear, what we're looking at, this is the average listing price. So it means it's what people are saying they want for the house. So hopefully that goes up, right? But we'd like to know, is there a trend?
Is the neighborhood moving up, regardless of if we make the home better? The trend appears that it is moving up, however, there's such a greater range of like, it's almost like less stable. Yeah, there's a range here. One thing we'd want to do would be to detrend this data. That's something else we'll have to talk about later. Because if you look at home prices in the whole country, they're all going up. So we should subtract that out. Because if we're just going up at the average rate, then we're actually staying at zero. And I haven't done that sort of analysis yet, because this is all about auto-correlative functions. But I think we could do like eight mini episodes on this topic, and maybe we will if we get this house. We can do a lot of time series stuff that could be a new era for the Data Skeptic podcast.
But below is the auto-correlative function. So what the next plot is telling us is how much the price of homes varies by the prices from days gone past. So just as we saw with weather, the most relevant estimates are recent days. So recent days best predict the current day, and then it declines with time. The day, you mean the cost? How well the prices are predicting the next day's price. So what is it correlated with? So this compares the current day's price, on any given day, to what the average price was the day before, and the day before that, and the day before that, and every day before is an additional lag.
This shows us exactly why we want the partial auto-correlative function. So if you scroll down to the next one, here we see a similar thing to the weather. That it's the most recent day that is the best predictor. It has all the information about the moving average trend. Interestingly, there doesn't seem to be like a seven-day effect here or anything like that, the day of the week that you listed on doesn't have an impact. But that makes sense, right? It's not like Wednesday is a better day than Saturday to list your home.
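The difference between the two plots can be sketched with a process that depends only on the previous day. The example below assumes an idealized first-order autoregressive process with coefficient 0.9 (an arbitrary choice, not a value fitted to the weather or home price data) and converts its autocorrelations into partial autocorrelations with the Durbin-Levinson recursion, one standard way to compute them. The function name `pacf_from_acf` is hypothetical.

```python
def pacf_from_acf(rho, max_lag):
    """Durbin-Levinson recursion: partial autocorrelations from autocorrelations.

    rho[k] is the autocorrelation at lag k, with rho[0] == 1.
    Returns a list p where p[k] is the partial autocorrelation at lag k.
    """
    pacf = [1.0, rho[1]]
    prev = [0.0, rho[1]]  # prev[j] holds the lag-j coefficient of the order k-1 fit
    for k in range(2, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - j] for j in range(1, k))
        den = 1.0 - sum(prev[j] * rho[j] for j in range(1, k))
        phi_kk = num / den
        cur = [0.0] * (k + 1)
        for j in range(1, k):
            cur[j] = prev[j] - phi_kk * prev[k - j]
        cur[k] = phi_kk
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Assumed example: today depends only on yesterday, with coefficient 0.9.
# The theoretical autocorrelation at lag k is then 0.9 ** k.
phi = 0.9
max_lag = 10
rho = [phi ** k for k in range(max_lag + 1)]
pacf = pacf_from_acf(rho, max_lag)

for k in range(max_lag + 1):
    print(f"lag {k:>2}: acf = {rho[k]:.3f}  pacf = {pacf[k]:+.3f}")
```

The autocorrelations fade slowly across many lags, while the partial autocorrelations drop to essentially zero after lag one. Feeding in autocorrelations estimated from real data instead of the theoretical values gives the same picture with sampling noise on top.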
People do it when they do it. Yeah. Or at least it doesn't affect the price. It's not like how we can have an open house two days sooner, so we'll charge more for the house. So what we see here is, so the two most recent days are the best predictor. And after that, not so much. It seems like even around the first 100 days, they vary a little bit further from the zero line. Yeah. But they are very weak correlations and probably due to statistical noise based on the boundaries that you see there. So these boundaries are statistically not significant? Yep. Anything less than that is considered statistically insignificant. Doesn't mean it's not a real correlation, just means it's not statistically significant. The effect size could not be measured by the available data. So the effect size is either very small or doesn't exist.
So in the future, we can get into more of the seasonality stuff and talk about ARIMA and ARMA. But this is a starting point. What's your first takeaways from your first experience with time series? I don't know. I will say this partial autocorrelative graph, I feel like you should change the scale then. Yeah, we should just scale it. If this area in red is not statistically significant, then visually, I do not care. Well I'll let Edward Tufte know. Did he define? No, no. He's just a famous person. So what do you mean? Do you want to delete those points? No, they should just be, they shouldn't take up this much space, they should take up like this much space. Oh, so like scaling the axes. So what's your justification for that? Because you said it's not statistically significant. But that's why the line of statistical significance is plotted there so you can see that. Yeah. Because you do want to see those residuals if there was a pattern, simple statistical significance wouldn't show it.
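A rough sketch of where such boundaries can come from: for a series with no memory at all, sample autocorrelations at nonzero lags scatter around zero with a standard deviation of about one over the square root of the number of observations, so a common approximate 95% band is plus or minus 1.96 divided by that square root. The example below applies this to made-up lottery-like draws; the band actually drawn in the show notes may have been computed differently, and the series and names here are assumptions.

```python
import math
import random

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (lag 0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]

# Hypothetical "lottery-like" series: independent draws with no memory.
rng = random.Random(2)
n = 1000
draws = [rng.randint(1, 49) for _ in range(n)]

max_lag = 40
rho = acf(draws, max_lag)
bound = 1.96 / math.sqrt(n)  # approximate 95% band for a memoryless series

outside = [lag for lag in range(1, max_lag + 1) if abs(rho[lag]) > bound]
print(f"band: +/-{bound:.3f}")
print(f"lags outside the band: {outside}")
```

Even for pure noise, roughly one lag in twenty is expected to poke outside a 95% band by chance, which is one reason an isolated small spike is usually not read as a real effect.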
Like if you look above at the, not the partial, but the regular autocorrelative plots, you see there's some pattern here and that tells me that there's some amount of this phenomenon where something that has a very recent trend kind of geometrically goes into past days. Because, yes, if you only depend on the previous day, then sort of recursively the day before that depends on the day before that and so on. So you might see similar trends or effects that aren't statistically significant, but tell you about ways you need to detrend the data in the future, that there's some other signal you can pull out. Okay. So it looks like even if it's not significant. You can still figure something out. Yeah. It's still sort of a diagnostic if you know how to read it well. I think you learn to read it well just by doing a bunch of these. Well, also you want to make sure there's actually data there. That's true.
Would you be interested in more weather time series or more stuff about this house we're thinking of buying? This house. Yeah. So I should do a series of time series analysis through teaching Linda about forecasting home prices. Sure. All right. Cool. Yeah. Looking for the show while also the home of the Data Skeptic home sales project. All right. Pulls in some data for us. I'm excited. All right. Well, thank you again for joining me, Linda. More on the house next time. Yeah. And as always, I want to remind everyone to keep thinking skeptically of and with data. Good night, Linda. Good night.
For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. (upbeat music)