This mini-episode discusses Anscombe's Quartet, a series of four datasets that look very different when plotted but share nearly identical summary statistics. For example, each of the four datasets has the same mean and variance on both axes, as well as the same correlation coefficient and the same linear regression line.
The episode tries to add some context by imagining each of these datasets as data about a sports team, and explains why it can be important to look beyond basic summary statistics when exploring your dataset.
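For anyone who wants to check the claim above rather than take it on faith, here is a minimal sketch in Python. It assumes the commonly reproduced values of Anscombe's four datasets (worth verifying against a trusted source), and it assumes the episode's teams A, B, C and D correspond to datasets I, II, III and IV in that order. The names `QUARTET`, `summarize` and `SUMMARIES` are made up for this illustration.

```python
from statistics import mean, variance

# Assumed data: the commonly reproduced values of Anscombe's quartet.
# Assumed mapping: teams A, B, C, D in the episode = datasets I, II, III, IV.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics mentioned in the episode for one dataset."""
    mx, my = mean(xs), mean(ys)
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # ordinary least squares slope
    intercept = my - slope * mx        # ordinary least squares intercept
    r = sxy / (sxx * syy) ** 0.5       # Pearson correlation coefficient
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),         # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "r": r,
        "slope": slope,
        "intercept": intercept,
    }


SUMMARIES = {team: summarize(xs, ys) for team, (xs, ys) in QUARTET.items()}

for team, s in SUMMARIES.items():
    print(
        f"Team {team}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"var_x={s['var_x']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f} "
        f"fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x"
    )
```

The four printed rows should agree to about two decimal places, which is the whole point: none of these numbers separates the datasets, so a scatter plot is still needed.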
(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - Welcome to another mini episode of the Data Skeptic Podcast. As always, I'm joined by my wife and co-host Linda. - Hello. - So our topic today, Linda, is Anscombe's Quartet, which I showed you just quickly before we started this episode. Can you maybe describe what it looks like, this being an audio podcast, so our listeners can get an idea of what you're seeing through your eyes? - Sure, there are four charts. - Let's call them, they're stacked two by two, let's call them A, B, C, D. If you read them, upper left is A, then go to the right for B, then lower left is C, and then D is lower right. - Right, so they look like scatter plots, which is dots, and then they have numbers on an X and Y axis. I guess I could describe A, which kind of looks like generally it's linear, but the points are more scattered, so that's a scattered one. B kind of looks like it's forming a bell curve, except it doesn't fan out at the ends, actually, sorry, it's like a hill. - Yeah. - And then C just looks like actually a straight line, except for one outlier at the top. - Good word. - And then D looks like a straight up and down line with one outlier in the far upper right. - I think that's a great description. Anyone who is at or by a computer, which should be everyone, you're all listening to this on a computer of some kind, go Google this, right? Yeah, they have other phones or something. - Well, they're on their phones, they might be driving, so please don't do it while driving. - Excellent point. Please pull over immediately and look this up, 'cause you should see this visual. Most data people have seen this, but might not remember its name, but hopefully your descriptions have jogged their memory if they're just audio only people. So yeah, we're gonna talk about Anscombe's quartet and why it's important.
So this was defined by the statistician Francis Anscombe as a way of kind of proving a point that certain outliers can shape data and that statistical properties of data sets can look the same for very different data. So for example, what would you guess is the average Y value in all four of these scatter plots? - Like seven? - Pretty close, yeah. 7.50 is the average. - Great, so just to be clear, Anscombe's quartet is a series of four scatter plots. It is not a musical quartet. (laughs) - Fair point. Although that would be a good name for a quartet, perhaps. If you took the mean of all the Y's in each of those plots, it's all the same. Also, all the X's have the same mean. Also, they have the same variance, the same correlation, and if you fit a linear regression with ordinary least squares through them, you get the same linear regression for all of them. So if you had a limited scope of the world, like you couldn't see these images, you were just measuring them in terms of the means and variances and stuff, they would be identical to you, but we can look at them with human eyes and they look very different. If we, let's say, had hundreds of thousands or millions of data sets we were analyzing and we were just comparing means, having an algorithm do that, all four of these would look the same, even though they're quite different from one another. So I thought maybe a way to wrap a nice analogy around these would be to think that this is data about some sports team. Each data set is a team, so we have four teams. The X axis is the player's years of experience and the Y axis is the average number of points that player will score in a game. So could we maybe go through the quartet and kind of describe in that context how each looks? What do you think of the A team? - A team's probably the most normal team, which is generally as people have more years, they seem to score more, but not necessarily, it's kind of scattered.
Like I said, it looks like a scatter plot. - Yeah, there's a variance to it. So yes, there's a trend. More years of experience means more average points per game, but some guys are a little ahead or below the curve. It looks like very natural data. What about the B team? - Well, I'm gonna assume this is a girls' or women's team. We just said it was a men's team. So I would like to point that out and give equal opportunity to both genders who play sports. Anyways, this one looks like a hill or kind of like a bell or-- - An upside down parabola kind of. - Yeah, upside down parabola. - This data, if I were to describe this team, I would say like as your years of experience go up, it improves, but there's definitely like an over the hill moment, like certain athletes, they hit their peak and then they start to decline as they get older. We're seeing that here, that after the sort of middle, the older players on the team are starting to score fewer average points per game. What about the C team? - C is the weirdest. They must be like robots or the coach is like very vigilant because as you have more years, you're very linearly scoring more. - That's right. Yeah, it's very predictable except for one guy, right? - One guy or girl. - Who's an outlier? - Yes. - So if we go into the usual Alice and Bob context that we use in math problems, Alice here is scoring a lot more points per game than you would expect for her years of experience. What about team D? - Team D is probably the least likely. Everyone has the same amount of years experience and they're scoring time. - The variance looks good, that's believable. - Yeah, I guess the variance is okay. - It would just be odd that everyone on the team was the same age except for that one outlier who's very experienced. - He's more than twice everyone's age and yet very good. - So in team C and D we have one all-star, right? Those outliers are all-stars. They're scoring. - Yeah, I guess sure.
- They're really like, if this were a real sports team, those would be like the Michael Jordans, I guess. We would talk about that person. We'd be like, wow, that's the leader of the team. - Or like Serena Williams, you know, just to give a women's example, okay? - It's not a team, she's a solo. - Well, she's an outlier being very good. And sometimes they do play on teams like doubles. So I haven't checked out her doubles playing tactics, but yeah. - Good point. If you were gonna bet on these teams and you just looked at like the team's average points per game or the team's average score, all the teams would look the same. But when you look at the individual players, that's when you start to see that there are different trends among each of these teams. So let's say like you were someone who was going to place a bet. Some sort of legal bet. I'm not encouraging any under the board activity. Some legal bet on these teams. Which of the teams do you think would be a better bet? Or is there any you would rule out and not feel safe betting on? - I would probably just prefer the hill or the linear one 'cause they seem very safe. - The A or the B team? - B or C. - I would get rid of teams C and D because there's a star player on both teams C and D. And if that player is sick or injured, it's gonna drag down the average of the whole team. Whereas in the A and B teams, you could lose pretty much any one of those players, even if it's the best player, and it's not gonna hurt the team too much. - Yeah, okay. I would say your strategy's better. So I would then, I would like to change strategies. - Okay. Yeah, if you just looked at these higher level statistics, the mean, the variance, the correlation, or the linear regression that you'd get from applying those methods to these data sets, all the teams are identical. But when you have a closer look at a different property, sort of the composition of those teams, you realize that there's a better strategy in terms of betting.
So these things all look the same at a glance, but they're really not. - So I also wanted to point out something that Kyle pointed out earlier. The average X value is the same for all of these. And the same goes for the average Y value. So for each chart independently, like let's take chart A, you're gonna average all the Y values, then average all the X values separately. Then compare the X and Y averages across A, B, C, D, and they're all the same. So that's something to point out. - Yeah, that's exactly right. The averages, the variances, the correlation and the linear regression are the same for all four of these cases. So if you intend to tell them apart, you need to look at other things besides those sample summary statistics. So summing it all up, statistically, from just some of the standard metrics, these all look the same. But a visual inspection definitely shows us these are very different data sets. And that's the important lesson of Anscombe's Quartet. - Well, I think that's gonna tell me and inform me for my future betting purposes. - Excellent. - And I don't bet. - Yeah, sometimes the winning strategy in betting is to not play. - Great, then I would say I'm winning. - Excellent, well I'll see you next time. - Goodbye. (upbeat music)
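The betting argument from the episode (losing the star player hurts teams C and D more than losing the best player hurts A or B) can be checked with a rough sketch. It assumes the commonly reproduced Y values of Anscombe's quartet, assumes teams A through D map to datasets I through IV, and treats a team's "strength" as simply the mean of its players' points per game, which is a simplification made for this illustration. The names `TEAM_POINTS`, `mean_without_top_scorer` and `DROP` are hypothetical.

```python
from statistics import mean

# Assumed data: Y values (average points per game) of Anscombe's quartet,
# with teams A, B, C, D assumed to be datasets I, II, III, IV.
TEAM_POINTS = {
    "A": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "B": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "C": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "D": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}


def mean_without_top_scorer(points):
    """Team average after removing the single highest-scoring player."""
    remaining = sorted(points)[:-1]  # drop the largest value
    return mean(remaining)


# How far the team average falls when the top scorer is unavailable.
DROP = {
    team: mean(points) - mean_without_top_scorer(points)
    for team, points in TEAM_POINTS.items()
}

for team, drop in sorted(DROP.items(), key=lambda item: item[1], reverse=True):
    print(f"Team {team}: average falls by {drop:.2f} points without its top scorer")
```

Under these assumptions, teams C and D lose the most, which matches the reasoning for ruling them out, even though all four team averages start out essentially equal.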