This week's episode discusses z-scores, also known as standard scores. A z-score describes the distance, in standard deviations, between an observation and the mean of its population. A closely related topic is the 68-95-99.7 rule, which tells us that approximately 68% of a normally distributed population lies within one standard deviation of the mean, 95% within two, and 99.7% within three.
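For readers who want to check the 68-95-99.7 figures, here is a minimal sketch in Python. It assumes an exactly normal population and uses the standard identity that the fraction within k standard deviations of the mean is erf(k/√2); the function name `fraction_within` is ours for illustration and does not come from the episode.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    # For a normal distribution, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The rule's round numbers (68, 95, 99.7) are these values rounded for easy memorization.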
Kyle and Linh Da discuss z-scores in the context of human height. If you'd like to calculate your own z-score for height, you can do so below. They further discuss how a z-score can also describe how likely it is that a statistical result is due to chance. Loosely speaking, if the significance of a finding is said to be 3σ, a deviation that large would arise by chance only about 0.3% of the time, so the finding is roughly 99.7% likely not to be a fluke.
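The height calculation can also be sketched in a few lines of Python. This is not the code behind the site's calculator, just an illustration under the assumptions made in the episode: height is treated as normally distributed, the US adult means are 69.5 inches (male) and 64 inches (female), and the standard deviation is 3 inches, a figure the episode itself treats as unverified. The function names are hypothetical.

```python
from statistics import NormalDist

# Figures quoted in the episode, in inches. The 3-inch standard deviation
# is an assumption the hosts flag as unverified.
MEAN_HEIGHT = {"male": 69.5, "female": 64.0}
SD_HEIGHT = 3.0


def height_z_score(height_in: float, sex: str) -> float:
    """Number of standard deviations a height lies above (+) or below (-) the mean."""
    return (height_in - MEAN_HEIGHT[sex]) / SD_HEIGHT


def percentile(z: float) -> float:
    """Fraction of a normal population that falls below the given z-score."""
    return NormalDist().cdf(z)


def two_sided_tail(k: float) -> float:
    """Probability that a normal observation lands more than k sigma from the mean."""
    return 2 * (1 - NormalDist().cdf(k))


for label, height, sex in [("5'0\" woman", 60.0, "female"),
                           ("5'11.5\" man", 71.5, "male")]:
    z = height_z_score(height, sex)
    print(f"{label}: z = {z:+.2f}, taller than {percentile(z):.1%} of that group")

print(f"beyond 3 sigma: {two_sided_tail(3):.2%}")
```

With these inputs the sketch gives z-scores of about -1.33 and +0.67, matching the values quoted in the conversation, and a 3σ tail of about 0.27%, which the show notes round to 0.3%.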
[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking and data science. So welcome to yet another episode of the Data Skeptic Podcast Mini Episodes. I'm here as always with my wife and co-host Linda. Howdy. Thanks for joining me as always, Linda. So our topic for today is z-scores. Are you at all familiar with this? Well, you told me that would be the topic. Alright, and I prepped you a little bit. And Yoshi laughs. Well, a z-score, for those who are unfamiliar, also called the standard score, relates to the bell curve, which I prefer to call the Gaussian distribution. Do you know much about the bell curve, Linda? Well, I think it's a way people, it's a common way that describes data, I guess. Did you experience the bell curve while you were in college maybe? I don't remember, but yeah, I think any time. To put grades on a curve. Certain. Oh yeah. I think that was in high school actually. I did have some teachers that graded on a curve. So the common shape we call the bell curve, it isn't always actually, there's a couple of different types of bell curves. But mostly more or less, and I'll just hand wave it for this episode. That's the shape we call the Gaussian distribution. And there's sort of a special case of the Gaussian distribution, which actually isn't that much of a special case. It's just sort of a transformation of all Gaussian distributions. We call it the standard normal distribution, which is the case where there's a mean of zero and a standard deviation of one. But that's a little bit maybe too detailed, let's back up and just describe what the bell curve is. It has a mean, which is the same as its median, which is like the average case of where people are. And then it has a standard deviation, which describes like how peaked that bell curve is. So if you think of something like how much sugar do you think is in like a Snickers bar? A lot. Probably like 20 grams. 20 grams, okay.
Now if you got 100 different Snickers bars and you somehow had a way of measuring exactly the amount of sugar in each, do you think they'd all have 20 or would some have like 19 and some have 21 and so on and so forth? They should be pretty exact. Pretty close, right? It should be very close, yeah. Very precise manufacturing process. But now on the other hand, let's say you went into all the bakeries in town and you bought one chocolate chip cookie in each. How much sugar do you think would be in each of those chocolate chip cookies? I don't know, 12 grams. And do you think some would have 11 and some 13? If they didn't stir evenly, sure. Well, almost for sure, because you're going to go to all these different bakeries. There's going to be a much higher variance. They don't have the machine precise methodology that the Mars Corporation does. I think Mars makes Snickers, right? I don't know, you should do your homework. I should have maybe, but yeah. So when you have something less precise, there's more variance, which means a wider standard deviation. So that's sort of the two primary parameters that define the, well, actually those fully define a normal distribution, the mean and the variance. So mean being like where the center is and variance being how wide it is and how much, the degree to which observations differ from the mean. That gets us very close to what a z-score is. The z-score is, for any given observation, how far it is from the mean of its population. So I thought maybe a good way to talk about this would be human height, which actually, I don't believe human height is normally distributed. But let's not get into that here. Let's just assume it is. And let's also assume that this random statistic I got on the internet, which told me that the standard deviation is three inches, is correct.
According to Wikipedia, and this part I trust, in the US for all people living in the US ages 20 and above, males have an average height of five foot nine and a half inches. Females have an average height of five foot four inches. Does that sound about right to you? Yes. Okay, and the standard deviation of three inches means that if you go out three inches either direction from those two means, that accounts for about 68% of all the population. So round it up just a little bit and let's say 70, 70% of people are within three inches of each of those means. That seem reasonable to you? Yeah, I mean, what it tells me is that I'm outside of that, I'm five foot, so I'm very short. Well, you jumped the gun. I was about to talk about the significant disparity in our heights, seeing how this is a radio. Well, it's not radio. It's a podcast, but an audio thing. We've never talked about height the whole four. So maybe you could share yours and not share mine. I am five foot. And I am six seven. So it's really? No, you just lie. No, I'm somewhere between, I'm like five eleven and a half. I could say six, but I'll just play it cool. I'm five eleven and a half. So yeah, we're about a foot difference. You are slightly below the mean and I'm slightly above. And thanks to your suggestion, I actually set up a nice widget on the dataskeptic.com website where anyone who wants to can go in and get their z-score. So let's put yours in. Female standard deviation is two to three. Five foot zero inches. You have a z-score of minus one and a third. And so your percentile is nine point one percent basically. So you are taller than nine point one percent of the female population. I think that sounds about right. It's very rare I'm taller than anyone. So you trust the numbers. I, on the other hand, I will put in my five foot eleven and a half inches. I have a z-score of point six repeating, two thirds. So I am taller than 74.75% of the male population.
I don't really have an NBA career opportunity here, but I'm vaguely on the taller-ish side. I'm within, I'm still within one standard deviation though. So that's where we get back to z-scores. Z-scores are the number of standard deviations away from the mean that each of us lie. What was yours again? Mine was about nine or ten percent. So your z-score is minus one point three three. So you're a little bit more than one standard deviation, but not quite two. I'm not quite one. So we're both sort of, you know, within the average expectation. There's an important thing here we should talk about called the sixty eight, ninety five, ninety nine point seven rule. And the reason you might want to memorize those numbers is those are the approximate amounts of the population that lie within one, two, and three standard deviations of the mean. So for one standard deviation, which I'm a part of for the male height population, sixty eight percent of people are within that. Ninety five percent of people are within two standard deviations, which includes you, and ninety nine point seven percent of people are within three standard deviations. So for example, if you want to, let's say, just walk out on the sidewalk, and how likely would it be that you meet a woman who is six feet tall or greater, she would have a z-score of two point six six repeating. At that many standard deviations away, there's a little bit less than a one percent chance that that would happen. So a z-score, it just describes how far away from the mean an observation is, and this is really useful for when you are trying to assess the likelihood that a result was due to chance. Do you remember when we talked about Particle Fever, the movie we went and saw? What was the movie about? About the Large Hadron Collider. Oh, yes. So that was Kyle's movie choice. So he remembers. You remember, too, though, on another podcast, because you were like, oh, I remember them being so excited, they got five sigma. Oh, yeah. Yeah. Yeah, yeah.
So that means that their result was, so the likelihood that their result was not just due to chance, due to coincidence, was six nines and then a bunch of other numbers. So very, very close to one, very unlikely that it was just a random result. So wait, five sigma was five standard deviations away? That's right. Oh, so why don't they say five standard deviations away? Because I don't know, z-score sounds cooler, I guess. But I said sigma, they didn't say z-score. Yeah, well, it's the same thing. So some of this comes from an almost antiquated concept that people probably won't talk about much in a hundred years, that it used to be these calculations were really tricky to do before we had computers. So you'd have these big lookup tables, and you wouldn't actually compute something directly, you would transfer your answer to the standard normal distribution of mean zero, standard deviation one. You'd look up the value in a lookup table, and then you'd transfer it back. And that's kind of where z-scores come from, is that everyone could agree that there was this one sort of standard model of the Gaussian distribution. But they're still useful. It's still helpful to say how many standard deviations away from the mean something is. And in particular, in an upcoming episode, next week, we will have a guest talking a lot about z-scores, and in his context, he means how likely the result is due to chance, or due to the alternate hypothesis that someone, in this case, cheated at something. Good to know. So what did you learn today, Linda? That I am very short. And I am? Nah, you're fine. You could have pushed up my ego and you didn't. You said you were six feet first. Yeah, well, I'm almost. If I wore a good shoe, I guess. Well, that's too tall. Well, thanks again for joining me, Linda. Thank you.