The chi-squared test is a method for hypothesis testing. When one has categorical data in the form of frequency counts of observations in two or more categories (e.g. Vegetarian, Pescetarian, and Omnivore), split across two or more groups (e.g. Male, Female), a question may arise such as "Are women more likely than men to be vegetarian?" or, put more accurately, "Does the frequency with which women report being vegetarian differ in a statistically significant way from the frequency with which men report it?"
Data Skeptic
[MINI] The Chi-Squared Test
(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - Welcome to another mini episode of the Data Skeptic Podcast. I'm joined as always by my wife and co-host Linda. - I'm Linda, hello. - Thanks for joining me, Linda. - Thank you. So today we're gonna talk about the chi-squared test. Are you familiar with this at all? - Chi, how do you spell it? - Well, you spell it with a Greek letter that looks like an X, so you spell it. - Oh, oh, I didn't know. - Most people say CHI. That's not like using it in most programming languages and stuff, or how you write it in a textbook. - Oh. - So chi-square is a type of distribution, but we're gonna use it for a specific purpose. We're gonna talk about the chi-squared independence test, which is a form of hypothesis testing. Do you remember in episode 24 when we talked about the T-test? - A little bit. - And I was talking about human heights and things like that. So the T-test, these are kinda like sisters in a way. T-test, you can use, well, just go back and listen to that episode to find out how to use it. The chi-squared test is in a different situation. It's when you have categorical data, like labels of something over a few different sub-samplings. Maybe I should mention some more details. I'm not going to talk about the arithmetic procedure for how you actually implement the chi-squared test. You can go find plenty of resources online. And most of the time, actually, software packages do that for you. So whether you use R or Python, in R, it's one command. It's chi-squared.test or something like that. Also, I'm not gonna talk about things like the Yates continuity correction, because the software will usually figure out whether or not you need that. I wanna talk about how and when you should use the chi-squared test, okay? - Okay. - Do you know what categorical data is? - No. - So there's numerical data. 
That's like, how many miles per hour are you going? It's a number, it's continuous, you can measure it, and like, one is always faster than the other. But something like "what is your favorite color?" is a good example of categorical, 'cause there's no real order. It's not like red goes before green, goes before yellow. I mean, there's the color spectrum, but basically, people just have colors. There's no relationship to one another. - Okay. - Or like, whether or not you're a meat eater, a pescatarian, or a vegetarian. Those are three categories, but there's no order for them. They're just labels. If you wanted to say like, are men more likely to be vegetarians than women? That would be a categorical hypothesis you could test, because you put all the men in one group, all the women in the other group, and you get the frequency of how many times each person fits in each. Now, in order for the chi-square test to be appropriate, you have to first make sure all of your observations are unbiased, they're independent and identically distributed, you sample them well. You have to make sure your sample size is big enough, and there's a couple of rules of thumb. The one I typically wanna follow is that every cell has to have at least five observations. Do you know what I mean by cell? - No. - So picture a table where the rows are male and female, and the columns are vegetarian, pescatarian, and omnivore. And then you write the number of people who said they were in each in those blanks. And each little blank, like in Excel, is a cell. - So the number has to be five or greater? - Yeah, as a rule of thumb. Some people argue it should be 10, and some people talk about the overall sample size. 
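For readers following along in code, the rule of thumb just described can be checked mechanically before running any test. This is a minimal sketch with made-up counts for the gender-by-diet table (placeholders, not survey data). It flags sparse cells in the observed counts, as described in the conversation, and also in the expected counts under independence, which is how the rule is often stated in textbooks. The function names and the `threshold` parameter are illustrative choices.

```python
def expected_counts(table):
    """Expected cell counts if row and column labels were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count for a cell: row total * column total / grand total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cells_below(table, threshold=5):
    """Return (row, column) positions whose value is below the threshold."""
    return [
        (i, j)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value < threshold
    ]


# Hypothetical counts, for illustration only.
# Rows: male, female. Columns: vegetarian, pescatarian, omnivore.
observed = [
    [12, 4, 84],
    [18, 9, 73],
]

sparse_observed = cells_below(observed)
sparse_expected = cells_below(expected_counts(observed))
print("Observed cells under 5:", sparse_observed)
print("Expected cells under 5:", sparse_expected)
```

With these placeholder numbers one observed cell is sparse while no expected cell is, which shows that the two versions of the rule can disagree on small tables.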
And I guess the moral here is, I don't wanna get into the statistical debate, but if you have a very, very tiny sample size, you need to be very considerate of whether or not that's going to affect your test, and not just follow a rule of thumb. One important thing I would suggest is imagine if you had one extra data point. If that one single data point can change the results of your test, then your test is very fragile. So yeah, if you have really tiny numbers, odds are the chi-square test, or any test, is probably not going to be a good one for you. The rule of thumb is kind of like five for each cell. So I thought it would be fun to do more of a hands-on project. So I got some data out of the LA City data portal that has crime statistics. - Kyle sent me an email. So on the left, they have the days of the week, Monday, Tuesday, Wednesday, Thursday, Friday, all of them. - And say there's something. - Yeah, all of them. Then the other columns, say vehicle, then another column has bike. Where the rows and the columns intersect, there are numbers under bike, vehicle, day. - Those are the reported number of thefts of each of those types of vehicles. Do you think the day of the week will affect the distribution over which type of thefts occur? - So let me just backtrack for these numbers. Are you saying these are the days the people reported the theft, or these are the days that you think the theft happened? - Great question. And actually, that's a good data provenance question. I would have to follow up. Let's work on the assumption that I'm pretty sure is true, that the count is on the day that the report came in, which presumably should be the same day as the theft, more or less. - Okay, so then you asked me, do I think there would be more thefts on a certain day of the week, is that right? - Or weekday versus weekend kind of situation. - Well, I guess weekends have more thefts, 'cause that's when people go out. So there's more targets. - So do you think it's proportionally more? 
More vehicle and more bike thefts? Or is it that the bike thieves are consistent, but the car thieves have a weekend spree? - Well, I think bikes would also go up, 'cause probably more people bike on the weekend. - So you think that even though there might be higher volume on the weekend, that the proportion is the same? - Is the proportion the same for bike and vehicles? - That was your answer. That's my question, what's your thought? - My guess is for both of them, the proportion goes up on the weekends. Or goes up when people are out. - But I mean, with respect to one another. - From bike to vehicle, with respect? Oh, that is probably different, I don't know. - So you would think that they'll both go up on the weekends and maybe vehicles go up even more than bikes or vice versa? - Yeah, I mean, I guess vehicles would go up even more, just 'cause, I don't know why. I guess bike thieves, the people to suddenly be like, "Monday, I'm in the office. Friday, I'm gonna get your bike." (laughs) I think I really know. I mean, I just feel like the car theft people are more consistent than the bike people. - You've just insulted our bike thieving listeners. I'm gonna get complaints about this. - I just think that there's less money in bikes. So the people who are going to steal them in volumes are low. So they have to work harder, they're probably more entrepreneurial. I don't know. (laughs) - Would you mind giving a quick summary of what's about the average day to day, and do you see, just eyeballing it, any major spikes? - So for vehicles, average looks, ranges between 1100 and 1300. So I guess the average would be 1200. On any given day, that's how many thefts. And bikes hover around 100 to 110. So the average is probably 106. - All right, just to give some clarity, that is the total number of thefts by day for the period that my data covers, which is a partial coverage of 2014. And the chi-square test is all about counting frequency of events. So that's why I did it that way. 
It's not that there are 100 bikes stolen in a day, it's that over this period, there were that many-- - Oh, I thought every day on Sunday. - Yeah, clarification, sorry about that. That would be a lot of thefts. - Yeah, I was like, wow, that's crazy. - We gotta move. - I didn't know there were that many bikers on the road, is what I was thinking. - So, but anyway, where I was going with this is, even though it's kind of consistent day-to-day, it's important to note the point we were making about the relative frequency, because if you were to test something like demographics, the population of Caucasians is very high in the US, followed I think number two by African-Americans, and then Hispanics and Asians, Pacific Islanders, and all that stuff trailing off. So often those smaller demographics, the overall total is much smaller, but that doesn't mean that the proportions are different, so you could look at incidence of some disease, or whether or not they're a smoker, or whether or not they get in car accidents, you have to kind of look at relative frequencies, and the chi-square test sorts a lot of that out for you because it looks row by row relatively. So, if I want you to eyeball your hypothesis that the weekend vehicles have a surge in thefts, do you think that the data kind of supports that? - Well, for vehicle, definitely. Thursday, Friday, and Saturday have the most thefts compared to other days of the week. - The numbers do go up, but it's also a little deceptive because the bikes are such a smaller number, and it's relatively consistent. So the question we want to ask is, yeah, is that perceived incremental increase a real phenomenon, a statistically significant one, or is it just perhaps due to chance that Thursday, Friday, and Saturday seem to be much greater than the bike data, and we can use the chi-square test to determine that, and let me do that right now. The chi-square test reports back a p-value of 0.55. 
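To make the step just described concrete, here is a minimal sketch of a chi-squared independence test on a day-by-theft-type table. The counts are hypothetical placeholders chosen to sit roughly in the ranges read out in the conversation; they are not the LA City data, so the p-value will not match the 0.55 from the episode. In practice a library routine such as R's `chisq.test` or SciPy's `scipy.stats.chi2_contingency` would do this work. The version below uses only the Python standard library, and its tail-probability helper relies on a closed form that holds only when the degrees of freedom are even (here, 6).

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, r in zip(table, row_totals):
        for observed, c in zip(row, col_totals):
            # Expected count if day and theft type were independent.
            expected = r * c / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi2_upper_tail_even_dof(x, dof):
    """P(X >= x) for a chi-squared variable; valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs positive, even degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Hypothetical counts (vehicle thefts, bike thefts) per day of the week.
thefts = {
    "Mon": (1150, 104),
    "Tue": (1120, 101),
    "Wed": (1135, 103),
    "Thu": (1240, 108),
    "Fri": (1290, 110),
    "Sat": (1275, 107),
    "Sun": (1190, 109),
}

table = [list(counts) for counts in thefts.values()]
statistic, dof = chi_squared_independence(table)
p_value = chi2_upper_tail_even_dof(statistic, dof)

print(f"chi-squared = {statistic:.3f}, dof = {dof}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis of independence at the 0.05 level.")
else:
    print("Fail to reject the null hypothesis of independence at the 0.05 level.")
```

Note that the test compares proportions within each row, so a day with more total thefts does not by itself produce a small p-value.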
Do you remember our p-values episode? - Not much, but what does this mean? - This means that the data does not support the alternate hypothesis, so we need to reject the alternate hypothesis and accept the null hypothesis that there is no statistically significant change with respect to day of the week. - For vehicles and bikes, or together? - That the day of the week doesn't affect the number of vehicle and bike thefts. Not the total volume, it can affect that, but it doesn't affect their proportion. - Does it matter if you just run your test just on vehicles, excluding bikes? Will you get the same answer? - What do you mean? - Because then it's no longer two groups being compared. - You have to have two groups? - Well, you have to have two categories at least. You can have actually more than that, but if you don't have two categories, there's nothing to compare. - So what question is your, are you asking with these two categories? - Does the day of the week affect the proportion of the types of thefts that occur? - Well, types, you only mean vehicles and bikes. - Correct. - So then, you ran the chi-test, and it said when you're comparing vehicles and bike, it does not. - Correct. The day of the week does not have a statistically significant impact. - Well, what would a lower p-value be, zero? - The sort of almost arbitrary rule of thumb that we follow is that anything below 0.05, we can reject the null hypothesis. So that means there's a one in 20 chance that we're wrong, that the difference we detected is due to chance, but it seems like, okay, probably if it's only one in 20, it implies that our data is not due to chance. Some people also will take a stricter thing, depending on what your test is. Like, if you're gonna invest your life savings, you might want a p-value of 0.01% or 0.001% or something. 
You want a lot more certainty over things if they're very important decisions, but 0.05 is this sort of arbitrary line in the sand that I think Pearson came up with. - So we're just comparing vehicles and bikes. Let's say vehicles, we keep the data as is, and then bikes, we fake the data. Could the chi test come back and tell us a different story? - Yeah. - Or do you have to change both columns? - No, if you change just one, the story could be different. You wanna try and do that? Okay, let's make a copy. Okay, so I doubled the number of bike thefts on the weekend, or not, sorry, not the weekend, just Friday and Saturday. I reran the test, and now I get a p-value of 4.39 times 10 to the minus 11. Very, very, very close to zero. So we now reject the null hypothesis, accept the alternate hypothesis, which is that the day of the week has a statistically significant impact on which type of thefts are going to occur. - On which type of thefts? - Yep, between vehicle and bike. - Okay. - But I like what you proposed, actually. I think that is a very worthwhile exercise more people should be doing. It's almost not enough to just run your test, get your p-value, and decide whether you accept or reject your hypothesis. It's nice to play around and say, like, all right, what would it take for me to get the other value? How much would I have to tweak my numbers? In my case, I just said, let's double the bike thefts on the weekend, which is a pretty strong scalar, right? Doubling something's a lot. What if I was like, oh, let's make it 10% more? Would that have changed things? So it's a very worthwhile exercise, I would say, to do what you suggested, and just kind of play around with the data and understand how sensitive your result is to slight changes. That's not a formal part of the chi-square test. That's just something I recommend for all hypothesis testing in general. - Will you post this table online? 
- Yes, the table will be in the show notes so people can go look at the data. There's also, I don't know if everyone knows this, every show notes, some of them are just text, but they all have source code that shows any of the work that's been done. So if you wanna repeat the analysis, you can go and follow the exact code I post on GitHub, and you can run through the same analysis yourself and play around with the data if you want to. So yeah, chi-square, I didn't walk through kind of the arithmetic parts because you can go learn those anywhere. I talk more about how to use it. Keep in mind that rule of five, it's arbitrary rule, but it's helpful because if you have very sparse cells, you can often get spurious results. Generally sample sizes of 10 or more are looked for, and let's wind up with one final discussion about the data itself and being skeptical of it. Now I told you I got this out of the LA city data portal. You asked a very good question, which was, is the date the report date or the actual incident date? And I did not know. What I also don't know is if the nature of police procedure can affect this data. So for example, maybe the worst cops get relegated to weekend duty and the best cops get to have the weekends off to spend with their family and their friends. And the worst cops don't report things as well, or don't like to fill out paperwork, and it could change the numbers or skew the data. And I'm not making accusations about the cops per se. Just saying that there could be nuances like that that affect it. So for example, what do you think about report rates of vehicles theft versus bike theft? - I think vehicles will definitely be reported more 'cause they're worth more, whereas bikes people just shrug. - Yeah, you feel like cops aren't gonna do anything. - Yeah, well it's a cop can do. - Yeah, it's a small one with a $200 bike, even a $500 bike, $800, who cares? - Let's get a team of seven people on this. We gotta find this bike. 
- Yeah, and most bikes, they have to be engraved with the registration number and most aren't. And so if you don't have a registration number, they're gonna, when they find the quote unquote said bike, they're not gonna know it's yours. - So it could be that your initial instinct that the weekend made a difference is true, but our data doesn't support that because citizens report bike thefts less. - Maybe. - So even if you think, oh, there's nothing I can do, the cops can't help, even if they wanted to try, it's still worthwhile to always report these things so we have accurate statistics. Okay, let's wind up with a shout out. I'm gonna do some shout outs in these mini episodes for the next couple of weeks because there is a growing community of excellent data related podcasts that I wanted to tell my listeners about in case you don't know about some of them. One that I'm kind of new to is called the Partially Derivative podcast. I will put a link in the show notes. It's kind of like a Diggnation for data science. They cover the week's top news stories and headlines and they talk about what beers they're drinking and stuff. So it's a fun show and it's great because there's no overlap really with the Data Skeptic podcast there. They actually do a lot of stuff I wish I could cover, but you know, I can't turn around the shows as quickly, you know, to cover news headlines and get guests and stuff, different format, but interesting stuff if you're into data science. So everyone should go check out Partially Derivative. - I have a question. You encourage people to report bike theft. - Right. - I have a question. Did you report your bike theft? - No, I did not. - Well then, that just shows you. People do not report bike theft. - Just say it, Linda. Say it. - Say it. - Say the word. - Say it. - Starts with an H. - It looks great. (laughing) No hypothesis. (laughing) - Kyle's a hypocrite. (laughing) - I'm an hypothesis. - Well, our audience can decide what they think. 
- Fair enough. (laughing) - Thanks for joining us. (upbeat music)
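The "play around with the data" exercise recommended in the episode can be sketched as a small sensitivity check: scale the Friday and Saturday bike counts by a few factors and watch how the p-value moves. The counts below are hypothetical placeholders, not the LA City data, so the p-values will differ from those quoted in the conversation. The helper names are illustrative, and the tail-probability formula used here is a closed form that only holds for even degrees of freedom (a 7-by-2 table has 6).

```python
import math


def chi_squared_p_value(table):
    """P-value of Pearson's chi-squared independence test (even dof only)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, r in zip(table, row_totals):
        for observed, c in zip(row, col_totals):
            expected = r * c / grand_total  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed-form tail below needs positive, even dof")
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical counts, for illustration only.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
VEHICLE = [1150, 1120, 1135, 1240, 1290, 1275, 1190]
BIKE = [104, 101, 103, 108, 110, 107, 109]


def p_value_with_scaled_bikes(factor, days_to_scale=("Fri", "Sat")):
    """Rerun the test after multiplying bike counts on the chosen days."""
    bikes = [
        round(count * factor) if day in days_to_scale else count
        for day, count in zip(DAYS, BIKE)
    ]
    return chi_squared_p_value([list(pair) for pair in zip(VEHICLE, bikes)])


results = {
    factor: p_value_with_scaled_bikes(factor) for factor in (1.0, 1.1, 1.5, 2.0)
}
for factor, p in results.items():
    print(f"bike thefts on Fri/Sat x{factor}: p = {p:.3g}")
```

Seeing how large a change is needed before the p-value crosses 0.05 gives a feel for how sensitive the conclusion is to the data.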