There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all its variance?
Linhda and Kyle talk through a few examples, including elections, picking an Airbnb, produce selection, and home shopping, as cases in which the number of observations one has is more or less important depending on how complex the underlying system being observed is.
[ Music ] Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Today's topic is small sample sizes. So I'd like to thank listener John, who works at and writes from a large social network site. His request, just paraphrasing a bit, said he wanted to know more about sample sizes, specifically understanding the power level of small samples and methods to deal with a small sample size, if any. "I would find it very helpful, and I'm sure others would too." And I agree, John, so thanks for writing this in. Linhda, let's start by just asking, do you know what I mean when I say small sample sizes, or sample sizes in general? Well, what are sample sizes? So I feel like it's a measurement. That's exactly right, actually. When you want to measure something, you'd like to have many observations so that you can develop some sort of model or understanding of your underlying data. Let's say, you know, we have an election coming up in this country next year, right? Yeah, I guess so. One might want to predict who will win that election. Oh, everyone's going to guess. Right. Yeah, everyone's going to guess. How many viable candidates do you think there are? I don't think they officially nominated them. Well, that's true too. But in the end, it usually works out to two: one nominated by the Democratic Party and one by the Republican. Yeah, that's pretty typical. Let's say you knew that, and then we were down to two candidates, and you asked four people who they were going to vote for. Do you think that's enough for you to call the election? No, if they're all part of the same family, they're probably all voting on the same side. Yeah, so you're raising the point that it's not representative, which is a really important factor about sample sizes.
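As a rough illustration of why four respondents cannot call a two-candidate race, the sketch below computes the usual normal-approximation margin of error for a sample proportion. It assumes the best case, where respondents are sampled independently from the whole electorate (exactly what the "same family" objection says is not happening), and the function name and sample sizes are illustrative rather than anything from the episode. The normal approximation is itself poor at n = 4, so the first figure is only indicative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n) and assumes
    independent, identically distributed respondents (an assumption).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical poll sizes, worst case p = 0.5 (an evenly split race).
for n in (4, 100, 1000):
    moe = margin_of_error(n)
    print(f"n={n:>4}: +/- {moe * 100:.1f} percentage points")
```

With four respondents the interval spans roughly the entire range of possible outcomes, and that is before any lack of representativeness is taken into account.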
If legitimately you could sample in a way that's independent and identically distributed and have a good measurement of the general populace, then you can get by with small sample sizes in some cases because you can have really good theoretical models for that. But in a case of something like elections, it's actually quite hard to really sample in a way that's representative of all the people who will vote and how they will vote. So I thought we could break this down into a couple of examples and just talk about some of the things that I think you sample in your everyday life. I would say we travel a bit, would you agree? Sometimes. Where are some of the places we've been? Well, we just got back from New York City. Right? Let's just know that. Tokyo, Berlin, Barcelona. Yeah, we've been to a lot of places. So when we travel, we often have a debate about where to stay, right? Sometimes we have a friend or an acquaintance or something we can stay with. Or a family member, that's always nice. But then it comes down to like, well, we don't know anyone. We're going to have to get some sort of hotel or reservation of some kind. Is there a particular way you look at finding places to stay? I use Airbnb. Yeah, you do. You found us some pretty good deals on that site. Sometimes. Right? So this is a vendor-neutral podcast. We're not endorsing anyone. But how do you go about looking for a good deal and deciding between an Airbnb versus a hotel? Well, I look at what amenities they have, like whether they have a kitchen or a microwave, fridge, things like that. Yeah. And then I do a price check on hotels. And then I do a price check on Airbnb. And, you know, I just determine which one to use based on the price check. Yeah, so when we travel, we tend to go to places where there are a lot of hotels and a lot of Airbnb options. Would you agree? We don't, like, go to the middle of nowhere where we have kind of one choice. Yeah.
So how many locations do you look at before deciding you have an understanding of what's available? Well, we could just use New York. Yeah, good, New York, yeah. I looked at at least 30. Well, that's a pretty good number. 30 Airbnb places and at least 12 hotels. How come you didn't look at 15 Airbnbs and six hotels? Well, the hotels all started looking the same. And then, quickly, their price starts rising outside of the budget that I want. So I only looked at the ones inside the budget that I wanted. So you stopped when... When it was outside my budget. Even with the ones inside your budget, there were probably more to look at, I would guess, right? Well, if there's a lot to look at, I could just keep lowering my budget. Even then, you have to discover what's the availability, the volume within your budget, right? Well, yeah, I mean, I just pick a number to start with. Usually my number is around 100. 100 dollars a night. Yeah. I don't want to pay too much more than that. Right. And too little is just odd. Yeah. But I mean, if you keep looking, how do you know you won't eventually find a $4 a night hotel? You know, just maybe you need to invest more in search. I don't think someone would love me. Could be a really rare case, but maybe a, you know, a palatial thing. How do I know I won't play that? I don't know, but I'm not looking for it. But you have kind of a confidence that your sample is big enough that you have a good representation, right? Within my budget and within my quality, sure. Yeah, so let's switch gears and talk a little bit about when we go to the grocery store. You're pretty big on produce, wouldn't you say? Well, how would you describe our fruit basket, Kyle? Oh, full. Okay. Well, we just went to the grocery store yesterday. Indeed. And, you know, there's generally like a pile of stuff, like many, many mangoes, but you only bought two mangoes. How did you settle on the two?
Well, the mangoes were the most expensive. They're $3 each. So you want to get some good ones out of the whole bunch? You didn't check every mango, did you? No, I just picked the first two that looked okay. The first two that looked okay. But, I mean, we could have stood there all day. You could have gone through every mango and like really found the best two. How come you stopped? I don't really know what the best mango looks like other than ones that don't have any bruises or open cuts. Oh, so in that case, like, you even have a limit on your detectability. There's a certain variance in here. Oh, man, look at five or look at 10. I don't know the difference. Yeah, so your measurement tool, your own sort of perceptions, has a certain variance to it. So we've been looking at homes, right? Maybe we can bring it back to talking about that. How many places have we been to so far? I don't know. How many do you think? I was going to guess like maybe 30 we've seen. Sure. Not that much. I mean, just we've been out how many times? No idea. Like four times maybe. Yeah. So if we saw five houses each time, it's about 30. So we're pretty much experts on real estate now, right? We understand everything you need to know. Nope. And even if I saw a hundred, I probably wouldn't be an expert. Yeah, I would have to agree. And we'll never be like precision experts. But the more we observe, the better an understanding we have of the distribution. Distribution of what's out there? Yeah, of value and cost and all those sorts of things. Yeah, maybe. Would you have felt comfortable making an offer on the first place we went to? No. But it had a spiral staircase. Well, actually the first place was very nice. So yes, I would. So why didn't we buy it? Yeah, we should have moved, huh? Oh, I don't know. Are there any other places you feel like we missed out on? We should have bought? Yeah, a lot. In particular? The mid-century one. The mid-century one was very nice.
But do you feel that every time we go out and look at more places, you have a better understanding of what's available and what the market's like? Sometimes. But honestly, the variation is so big. It's hard to say. Is the variation big because we're looking too big or you're seeing just a lot of difference from place to place? Both. This is getting into like how many samples do you need to confidently know there's going to be a good place. And let me just start right here. How likely do you think it is that in the next six months we're going to find a house in the West Los Angeles area selling for $85,000 that has a nice yard and two bedrooms and two baths? That we're going to find a house we like? Yeah. Selling for $85,000. Well, first of all, we won't find that house. It just doesn't exist. Yeah, it doesn't exist. Unless it's like a shed. Yeah. That's why I picked that price. What about at a price of $85 million? Could we find a house? Oh, we'd get to find lots of houses. Plenty of houses. So there's some distribution of cost and value of house we'd like, and we're starting to understand the distribution of cost and amenity or location and all these things. So because there are so many independent variables here, you know, location, walkability, outdoor space, bathrooms, it would be nice to have more than one. Yeah, I agree. Parking on the street availability. We saw some nice houses today that just had terrible neighborhood parking. Yeah. All of those things are independent variables that make our search a little bit more difficult because we're observing something that has a large number of dimensions and we're trying to get a large enough sample that we understand this very highly multidimensional space. So when you deal with small sample sizes, it really comes down to what you are observing. And I feel pretty safe saying that buying a home is a complex proposition. Would you agree? I don't know. Well, I still haven't bought a home yet. Yeah.
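To make the point about dimensions concrete, here is a minimal sketch of how quickly a fixed number of observations gets spread thin. It assumes, purely for illustration, that each attribute of a house is scored on a coarse three-level scale (say poor, acceptable, good); the function name, the three-level scale, and the counts are assumptions, not figures from the conversation beyond the roughly 30 houses visited.

```python
def max_coverage(n_samples, n_dims, levels=3):
    """Upper bound on the fraction of attribute combinations that
    n_samples observations can cover, when each of n_dims attributes
    takes one of `levels` coarse values (a simplifying assumption)."""
    cells = levels ** n_dims  # distinct combinations of attribute levels
    return min(1.0, n_samples / cells)

# Roughly 30 houses seen, against a growing number of attributes such as
# location, walkability, outdoor space, bathrooms, and street parking.
for d in range(1, 8):
    print(f"{d} attributes: {3 ** d:>5} combinations, "
          f"at most {max_coverage(30, d):.1%} observed")
```

Even in the best case where no two houses are alike, 30 visits cover at most about an eighth of the 243 combinations of five three-level attributes, and the share keeps shrinking with every attribute added.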
Right. I can't say. I'll wind this up with just two quick comments. I should mention the two most popular statistical tests, the t-test and the chi-square test, both of which we've talked about in previous mini-episodes. The t-test actually works pretty well with small numbers. I would encourage you to go back and listen to that episode. I didn't go specifically into small sample sizes, but there's some stuff in the show notes that we're looking at, and in general, I find that for a good data set that doesn't have too many independent degrees of freedom, the t-test is pretty robust. And for the chi-square test, as we talked about in that episode, the general rule of thumb is that in all of your contingency cells (so you want to look at a contingency table of values) you have at least five observations for every possible option. And that's the sort of, I don't know who came up with it, but it's a sort of reasonable rule of thumb. And if not, you can use exact tests. And those are sometimes computationally difficult, but very helpful at small sample sizes. So maybe we could sum this up and just give one last anecdote about how many samples one should take. You've been making ice cream for a bit now, haven't you, [inaudible]? Yep. Is there anything you're trying to perfect? The texture. The texture. And what are the variables involved in that texture? Water, fat, air, and water. Okay. Four dimensions. How many times do you want to try a particular recipe just to make sure you nailed it? Yeah. I mean, maybe twice. It's really good. Twice? Okay. So you come to me. I mean, it depends over how long. You mean over five years? How many times I should try the same recipe, or over 10 years? I don't know. Over 30 days. Yeah. I just mean like number of observations in general, because I would assume that your ice cream does not vary with time. The ingredients that are good today will be good in 30 years.
No, you said how many times should I make that ice cream? Yeah. And I said it depends on the amount of time. How so? Well, I don't want to make one ice cream 30 times in one day. Oh, yeah. Good point. But I guess I feel like you're comfortable with a small number of samples there because you know your domain very well. So you know the variables that actually have an impact. So for example, you don't really capture the barometric pressure when you make ice cream, do you? No. Because I would guess that has very little impact on the quality of the output. I don't know. I think it would actually. Yeah. Maybe we should get a barometer for the house. Well, I'm sure it's probably similar day to day. Aha. So in that case, it doesn't play a huge variance role in your calculations. So I don't know. I guess to kind of bring it back to what John was asking, or at least what I think he was asking, which is how do you deal with small sample sizes? Well, if you're fortunate that you have a situation where you have a really trustworthy underlying model of what you're measuring, that it's something that's maybe a physical system, so there aren't too many degrees of freedom, then there are techniques that you can get by with, and a lot of these statistical tests like the t-test and chi-squared are really robust, or at least the exact test for chi-squared is really robust, to small sample sizes. But if you're dealing with a domain, and my favorite one of late is looking at houses, where there are a huge number of degrees of freedom, almost an immeasurable number, then small sample sizes generally won't do justice for making good decision-theoretic conclusions about the observations you've made. Well, I'm a big planner, so looking at more houses just makes me feel better anyway. Alright, well I'm glad we're on the same page then. And thanks as always for joining me, Linhda. Thank you. And until next time, this is Kyle Polich for dataskeptic.com.
I just want to remind everyone to keep thinking skeptically of and with data. Good night, Linhda.
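The chi-square rule of thumb and the exact-test fallback mentioned in the episode can be sketched for a 2x2 contingency table using only the standard library. The rule is commonly stated in terms of expected cell counts under independence, and this sketch assumes that reading. The exact test shown is Fisher's exact test, one common choice that the episode does not name specifically, and the table values and function names are made up for illustration.

```python
from math import comb

def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

table = [[3, 1], [1, 3]]  # hypothetical small-sample data
small = any(e < 5 for row in expected_counts(table) for e in row)
print("expected counts:", expected_counts(table))
print("use exact test:", small)
print("Fisher p-value:", round(fisher_exact_two_sided(table), 4))  # 0.4857
```

Here every expected count is 2.0, well under five, so the chi-square approximation would be suspect and the exact test is the safer option. The enumeration is cheap for a 2x2 table but grows quickly for larger tables, which is the computational difficulty noted in the episode.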