Archive FM

Data Skeptic

[MINI] Selection Bias

Duration:
14m
Broadcast on:
03 Oct 2014
Audio Format:
other

A discussion about conducting US presidential election polls helps frame a conversation about selection bias.

The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. Welcome back to another mini episode of the Data Skeptic Podcast. I'm here, as always, with my co-host and wife, Linda. Hello! Thanks again for joining me, Linda. Thanks again for joining me. No, thanks again for joining me. So, our topic for today is called selection bias. Do you know what this means already? Nope. Well, that's good, because probably some of the listeners won't either. So let's get right into it, and let's start with an example.

When's the next election, presidential election? In two years, so I think in 2015 or 2016. So fast forward to 2016, and imagine we want to do our own informal Data Skeptic Podcast poll to try and predict who's going to win the election. And let's say we walked over to Bel Air, and we asked a thousand people what candidate they intend to vote for. Which party do you think would come out as the leader? I'm not familiar with Bel Air, so I don't know. Alright, let's go down to Orange County, and do the same thing there. Orange County, near LA, is typically associated with being conservative, so they will probably vote Republican. And what if we did the same poll? We went down to the Santa Monica Pier, and we asked five hundred or a thousand people who they intended to vote for. What party do you think might come out ahead? Well, probably more than half of them are not from the US. They cannot vote. Alright, good, so we'll skip those. That's a good point, though. We ask only the US citizens who are registered to vote. I don't know, I don't know where most of our tourists are from. But the point being, and maybe we just stick with the Orange County example, if you ran that poll, you would say, wow, it's a Republican win for sure, right? This is because, in your sampling, you exhibited what's called a selection bias. You biased the outcome of your data, in that you skewed it from the real result you would get if you knew the entire population's response, by sampling in a non-random, non-independent, non-identically distributed way. Whereas the real election is a sampling, I mean, you can't really call it random because not everyone's going to vote, it's obviously biased towards people who vote, but that's sort of implicit in the process, right? Only people who cast votes have their vote counted, except in Chicago on a couple of occasions. But that's neither here nor there. Similarly, if you went and stood in front of the Democratic National Committee headquarters and ran this poll, you would probably get a poll that skewed heavily Democrat in response, because you're putting in what's called a selection bias.

Now, I actually didn't know a few things about selection bias when I did some background on this. When I always said selection bias, I meant what I described, which is technically just called sampling bias, because it's about how you sample incorrectly. Another good example of sampling bias that actually screwed up most of the polls that people were doing until Nate Silver came along and started doing them in a more methodologically sound way, was that they only surveyed people who had landline telephones. And surprise, surprise, those results used to be pretty good, and not so much anymore. Do you have any suspicion as to why that might be? I think only the older generation has landline phones now, or people that have no cell service in their house.
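A minimal Python sketch (not from the episode) of the polling example above. The neighborhood sizes and party splits below are invented purely for illustration, but they show how polling only in one place skews the estimate no matter how many people you ask.

    import random

    random.seed(0)

    # Hypothetical electorate: two neighborhoods with very different party leanings.
    # These counts are made up purely to illustrate sampling bias.
    population = (
        [("orange_county", "R")] * 40_000 +
        [("orange_county", "D")] * 20_000 +
        [("santa_monica", "R")] * 15_000 +
        [("santa_monica", "D")] * 45_000
    )

    def democrat_share(sample):
        return sum(1 for _, vote in sample if vote == "D") / len(sample)

    # A simple random sample of the whole electorate: roughly unbiased.
    random_sample = random.sample(population, 1_000)

    # A convenience sample taken only in one neighborhood: sampling bias.
    orange_county_only = [p for p in population if p[0] == "orange_county"]
    biased_sample = random.sample(orange_county_only, 1_000)

    print("true Democrat share:        ", democrat_share(population))
    print("random sample estimate:     ", democrat_share(random_sample))
    print("Orange-County-only estimate:", democrat_share(biased_sample))

However large the Orange County sample gets, its estimate converges to that neighborhood's split rather than the electorate's, which is the defining feature of a biased sample.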
And what are the political leanings of people who have no cell service, do you think? Well, if they don't have cell service, it might be rural, and rural tends to be conservative. Oh, I was going to say liberal because they want change, because if you have no cell service, you definitely need change. But in any event, that's another good example of a sampling bias.

So there are other types of selection biases that I actually didn't necessarily know about before I did some background research for this. An interesting one is what they call attrition bias. So let's say you signed up... oh, you know a little bit about this: gym memberships. People often sign up for long-term commitments, right? Or at least that's the business model of a lot of gyms. Yeah, a lot of gyms want you to sign up for as many classes or months as possible. And in your gym-going experience, are there any times when they get an influx of new signups? So I used to go to the gym regularly in LA, and the biggest influx of new people is the new year, because everyone is making their new year's resolutions. But after three weeks, the number goes down significantly. It's just mainly like January 1st, 2nd, 3rd, 4th, 5th. There's a lot of people who've never been to the gym, and they just show up, and it's very crowded. So let's say someone working at the gym wanted to do some sort of poll to see how effective their holiday marketing campaign was, and they conducted this in July. The only people they ended up talking to are the people that are still going to the gym, which would not include those people who dropped out in the first few months. Your results would be biased by attrition. It would only include the people who are still in the program. Similarly, you might do a long-term study. This happens a lot with medical studies. Someone might say, "Oh, well, you know, we did this program, or like Alcoholics Anonymous. They post a very high success rate, because they assume people who stop going to Alcoholics Anonymous are cured." Which, I mean, hopefully some of them are, but it's a dark figure; you never know. The only stats you have are on the people who are still participating. So in an informal survey, it's biased to only those people who are there to respond.

Another interesting one is the causal bias. So, for example, let's say you wanted to survey people who have Type 2 diabetes. Do you know of anything that Type 2 diabetes correlates with? Obesity. Yes, it does. So if you did some survey of people who have Type 2 diabetes, then because of that correlation, you're biasing your results to people who are obese. Now, not everyone that has Type 2 diabetes is obese. Some people are not. But a higher than expected share of them are. So you're going to get a response that's reflective of an obese population, more so than of the general population. Is this similar to your polling bias? Yeah, it's... How's it different? A causal bias is when you're sampling based on some parameter, which is okay. Like, you know, in elections you say, "I only want to survey people who are registered voters," which makes sense. Or maybe people who say they will be registered by the time of the election, something like that. But in this case, no one really opts in to Type 2 diabetes. So if you're saying we're going to do a study about people that have this ailment, then you're implicitly, whether you even know it or not, introducing this bias towards obese people, which may or may not matter depending on the results you're trying to determine.
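A minimal sketch of the attrition effect in the gym example above, again not from the episode: the satisfaction scores and dropout pattern are invented, with less-satisfied members assumed more likely to quit before July.

    import random

    random.seed(1)

    # Made-up cohort of 1,000 January sign-ups. Satisfaction runs from 0 to 10,
    # and (by assumption) happier members are more likely to still be attending in July.
    signups = []
    for _ in range(1_000):
        satisfaction = random.uniform(0, 10)
        still_attending = random.random() < satisfaction / 12
        signups.append((satisfaction, still_attending))

    def mean(values):
        return sum(values) / len(values)

    whole_cohort = [s for s, _ in signups]
    reachable_in_july = [s for s, attending in signups if attending]

    print("average satisfaction, all January sign-ups:",
          round(mean(whole_cohort), 2))
    print("average satisfaction, members still there in July:",
          round(mean(reachable_in_july), 2))
    print("share of the cohort the July survey can even reach:",
          round(len(reachable_in_july) / len(signups), 2))

The July survey looks rosier than the truth because everyone who dropped out, disproportionately the unhappy members, is invisible to it; that is attrition bias.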
So it's not always a bad thing because, for example, you work at a major fashion retailer, correct? Yep. Do you guys ever survey your customers about their online and/or in-store experience? I'm sure we have... I'm just sure it's lost in the vaults: data turnover, and people have come and gone. Sure. But the results of any such survey would only be reflective of your customer base, correct? You would not necessarily have a way to survey your customers who didn't make a purchase, because they probably didn't give you their contact information. Yes. That doesn't invalidate the results, it just means you have to take them in context. So a selection bias, which is the one where you kind of pick out who you're going to ask, is pretty much a no-no and something you always want to be concerned about. If you're going to do any sort of sampling, you want to try and select in as unbiased a way as possible, to the best of your abilities. In cases where you can't, you need to be cognizant of this fact. And it's often something I see people, especially in business, using to support an idea which the data might not suggest. So for example, if some person develops some new product and they want to talk about what a success it was, they might say something like, "Well, you know, the top 30% of clients said this was great," where 30% was just the right number. They fine-tune that parameter to get the answer they wanted: people saying that the product was really good. Whereas actually you should be sampling either uniformly across your customers or maybe in some weighted fashion based on the amount they spend or something like that.

So if you do introduce a selection bias, or if one's not avoidable, in certain situations you can still normalize the data. So for example, let's say you were going to try and come up with some new type of toothbrush and you wanted it to be available to both genders. It wasn't like a men's toothbrush or women's toothbrush. I don't think they have such things, do they? Well, I think we're going to turn around and see one on the market shortly. And let's say you just did some informal survey and maybe you stood in front of a CVS or whatever, and it just so happens that more women go to CVS than men, and so your responses, you get like 200 responses from women and 100 from men. Now ideally your customer base will be 50/50 male/female. Yet in your data you have twice the representation of females as males. So a process you can apply is called normalization, where you would weight each group appropriately to whatever you believe is the ground truth distribution. Another example might be if you wanted something that was more reflective of the census population of a particular city, you can find out what that is pretty easily through the census. And if you have, let's say, more than average or less than average of some ethnographic group, you can weight those responses proportionally to their representation to try and counteract the accidental selection bias.

So is this a novel or interesting concept to you? I think it's just step one of looking at data. That's true. Ideally it is step one. If you're about to collect data, you want to consider very carefully how you do it without introducing a selection bias. And after the fact, it's reasonable to try and criticize your own work and say, "Could there be a selection bias here I didn't intend?" I think I'll be more aware of it moving forward. However, I would love to learn more about normalizing data. Oh, yeah?
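A minimal sketch of the normalization step described above, using the 200 women / 100 men split from the toothbrush example and an assumed 50/50 ground-truth customer base; the approval counts are invented just to show the arithmetic.

    # Toy survey: twice as many responses from women as from men,
    # with invented approval counts for illustration.
    responses = {
        "women": {"n": 200, "approve": 140},  # 70% approve
        "men":   {"n": 100, "approve": 40},   # 40% approve
    }

    # Ground-truth distribution we believe the customer base follows (assumed 50/50).
    target_share = {"women": 0.5, "men": 0.5}

    # Naive estimate: pool all responses, letting the over-sampled group dominate.
    naive = (sum(g["approve"] for g in responses.values()) /
             sum(g["n"] for g in responses.values()))

    # Normalized estimate: compute each group's rate, then weight by the target shares.
    normalized = sum(target_share[name] * g["approve"] / g["n"]
                     for name, g in responses.items())

    print("naive approval estimate:     ", round(naive, 3))       # 0.60, skewed toward women
    print("normalized approval estimate:", round(normalized, 3))  # 0.55, matches a 50/50 base

The same reweighting works for the census example: split the responses into groups, compute each group's statistic, and weight the groups by their known share of the population rather than their share of your sample.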
Maybe we should talk more about that in a future mini episode. Yes.

Okay, one last fun one. You've been training our bird Yoshi, have you not? Depends what you mean, but yes, my bird Yoshi. I mean, she is our bird. This bird says my name, so I think it means mine. What else does Yoshi say? She says, "I love you." Yes, she does, but how does she say it? She goes, "I love you." And does she do any tricks? She waves, she turns around, and then she opens up her wings like a pretty bird. Oh, that's cute. People can see the video on YouTube, right? Just search for Yoshi bird, she'll come up. There's a few Yoshi birds, but sure. So, Yoshi often, when you're trying to train her, you will sort of offer her a seed or something, make it clear she's going to get a treat. And what does she usually do in a situation like that? Usually, she looks at you and cannot figure out what you want. She's willing to do anything you want, except she's like, "What do you want? What do you want?" So, what does she usually do then? She does her easiest trick, which is she picks up her foot and waves. So, she's just trying anything? Yes. So, what if by coincidence you actually wanted the wave? Well, then you could reward her. Alright, and what if you wanted her to open up her feathers, the pretty bird trick, and then she did the wave? Yep. Well, you're supposed to negatively reinforce that, so you turn around and don't look at her. And how do you know when she's finally learned the trick? That she does it consistently. How consistently? At least 75%. Okay, yeah, that's not bad. But this is really more of what they call the sharpshooter's fallacy, what I'm describing here. Named after the idea that you shoot at the side of a barn or something, and then you go and paint bull's eyes around all your bullet holes and say, "Hey, look at all my great shots I hit." But it's sort of related to the selection bias in that if you only remember, or only save, or only count the instances where she did what you wanted, you're cherry-picking the data, and you're just selecting what you want to get the intended result. So, another, not the best example, but a reasonable example. And it gives a chance to talk about Yoshi, our unofficial third co-host. So thanks again for joining me, Linda. Thank you, Kyle, Deaf, and Yoshi. Aw.

Thanks for listening to the Data Skeptic Podcast. Show notes and more information are available at www.dataskeptic.com. You can follow the show on Twitter @dataskeptic. If you enjoy the program, please leave us a review on iTunes or Stitcher. A review is the greatest way to show your support. (upbeat music)