Data Skeptic

[MINI] Data Provenance

Duration: 10m
Broadcast on: 09 Jan 2015
Audio Format: other

This episode introduces a high-level discussion on the topic of Data Provenance, with more MINI episodes to follow to get into specific topics. Thanks to listener Sara L, who wrote in to point out that the Data Skeptic Podcast has focused a lot on using data to be skeptical, but not necessarily on being skeptical of data.

Data Provenance is the concept of knowing the full origin of your dataset. Where did it come from? Who collected it? How was it collected? Does it combine independent sources or come from a single source? What are the error bounds on the way it was measured? These are just some of the questions one should ask to understand their data. After all, if the antecedent of an argument is built on dubious grounds, the consequent of the argument is equally dubious.

For a more technical discussion than what we get into in this mini episode, I recommend "A Survey of Data Provenance Techniques" by Simmhan, Plale, and Gannon.

[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. [music]

Welcome back to another mini episode of the Data Skeptic Podcast. I'm here as always with my wife and co-host Linda.

Hello!

So today's episode is a write-in suggestion from listener Sarah L. She wrote me and more or less said, you know, you're supposedly the data skeptic, but you've never actually talked about being skeptical of data. And in a way, she's sort of right. So it's a little embarrassing that we got 36 episodes in, and I've never talked about being skeptical of data exactly. What do you think, Linda?

Yeah, I don't know. I'm not on the ones where you're interviewing people, so I'm not sure.

So we've talked a lot about data science and about how to be skeptical with data. But the show actually does have a double meaning. You can use data as a tool for skepticism, but you can also be skeptical of the data, which is a big topic, and I'm going to touch on it many more times. But the specific area I want to touch on today is what I call data provenance. Now, do you know the word provenance, Linda, from the art world, maybe?

You just talk. I just know about it because you love that word. You mention it as much as you can. Mainly, I think you mean where the data is from.

Well, I want to talk about the art. You know, what about provenance and art?

Well, you wouldn't even say where the art is from.

So, like, if a new painting surfaces, there's a question of, like, you know, let's say it's by Gaudí, well, he didn't paint, so it wouldn't be by him. Let's say it's a Picasso, right? So, somebody finds a Picasso in, like, a closet. Is it really a Picasso or is it a forgery? You have to trace it. Where did it come from? Can you find a time where, like, maybe Picasso was photographed with it? And then, did he sell it? Is there a receipt? Can you find who he gave it to?
Can you connect it from that person all the way down to say, like, yeah, you know, Joel sold it to Andy, Andy sold it to Kim, Kim sold it to Bob, and then it wound up in Bob's grandson's apartment.

So, what's your definition of provenance?

So, oh boy, I don't have the Merriam-Webster dictionary in front of me. So, let's just go right into the topic of data provenance and what I mean by that. And that is not so much who bought and sold the data, as it's used in the art world, but what is the trail from where the data was initially measured or captured until you got it? So, for example, let's say someone shows you a chart that says eating a certain combination of foods is a really good way of preventing cancer. And it's just a chart, and it shows, like, yeah, the more cabbage you eat, the less cancer you're likely to get. What would you think of that?

The more cabbage you eat, the less cancer?

Less incidence of cancer.

I don't know. Maybe. Maybe it replaces all the other stuff they're eating.

Yeah. It could be true. It could be plausible. Not necessarily that there's causation, like cabbage is magic, but maybe people who eat cabbage are negatively correlated with smoking. So people who eat more cabbage smoke less cigarettes, and therefore have less exposure to lung cancer or something like that. So, there could be a correlation there, or maybe some crazy causation, but you need really overwhelming evidence for an extraordinary claim like that. But more, I'm talking about if someone presents you an argument and they say, well, we have all the data. We had, you know, a thousand patients, or we found this from a study. How would you go about evaluating whether or not you trust the results? Let's say the math is correct, like the arithmetic. They did say like, yeah, there's a negative correlation here. Is that enough to believe it for you?

Believe that the data is correct?

Yes. That the conclusion is correct based on the data.

Oh, I don't know.
I think this, this is a trick question.

Wow. Depends on how you answer it, but it may not be.

Oh, I mean, I just feel like from living with you, you would say, no, well, actually, you should do this, and you should do that, blah, blah, blah. But I mean, me personally, I don't know where to take this. I'm like, would I believe it? I'm like, yeah, people write news articles and they say 50% of the population believes this. I go, okay, sure.

So you think it's correct that 50% of the population does believe whatever it is?

It's just, you know, I think a normal person doesn't think about it. They're just like, okay, 40% of Americans don't believe in evolution. They believe in young earth creationism. Maybe, yeah.

Well, yeah, actually that could be true. I don't know for sure. It would be very sad if it was, but when it comes to asking questions about the provenance of your data, I think there's a couple of areas. First you have to ask where was it collected from? Let's say there's a report that talks about animal abuse being on the rise. Did that data come from, let's say, the Humane Society or the police responding to calls or did it come from an activist group like PETA? Which of those three places do you think is going to have more reliable data?

I guess the Humane Society.

Yeah, it's a toss-up. I would say PETA's data is probably not good because they have an agenda. So they're interested in proving one point. That doesn't mean that point is wrong. It just means I would maybe be more skeptical and apply more scrutiny to their claims, which may or may not be correct. But the Humane Society seems like a good one because as far as I know, they're just a good charity. They don't really have some weird agenda of trying to prove something that might not be true. The police seems like a more objective source, though, because they're totally unattached to it.

Well, within the government, I think there is a department that handles animal investigations.

I imagine so.
And it's not the police, technically. So...

Yeah, so the police might not...

So I don't trust the police because I don't think the police can determine whether or not it's animal abuse unless they know about animals.

That's a really good point. So even though the police might track that data, you could go into some database and count the incidents where they had to, you know, respond to a complaint, they might not be the best at evaluating that complaint. That doesn't mean their data is bad. You just need to know where it came from and interpret it your own way. Now that also introduces some other interesting aspects of data provenance. There's always a question of the categorical policy that someone uses. Let's say there's a police force that's very overworked, like, I don't know, New York City, and they're responding to calls all the time. If they go and they say, oh, it doesn't seem like animal abuse, we're just going to leave it alone. They might not record it because they're busy doing something else.

Well, I think you should keep in mind that they have some kind of protocol. I assume it varies by state. So if you call them, record or not, they probably have to document it somehow.

Good point. So some states may have stricter protocols of recording more data. So it would look perhaps as though the state with the more strict data collection requirements has a higher incidence per capita of something like animal abuse because they just have better or different rules on how to record it. You could also have issues if you look through time, like maybe from 1973 to 1991, there was a certain person that did a really bad job with bookkeeping and then they retired and then the new person came on board and did a much better job. It might look like there was a spike in the data. And there's data that could be deliberately misclassified. I think there's accusations.
I don't know if they're true or not, but people accuse certain police forces and sometimes schools of misreporting data when it's going to affect their budgets. So like if you say, oh, crime rates go above a certain number, you don't get some additional funding. Well, it's easy to underreport as you approach that barrier, things like that.

Yeah, I'm sure that happens. Corporations do it too.

Yeah, corporations do it a lot. And you can see employee turnover that could affect the way data is collected or maybe, you know, there was some bug in the system for a period of time and everything was recorded differently there. So you want to ask questions about your data set, who recorded it, and how did they collect it? Have you ever collected a data set?

Let me think. I mean, honestly of the last few places I've worked, I guess the closest I've come to looking at data is just looking at Google Analytics.

Oh, that's a good example. So you know the provenance then, right? You know where it came from. Do you know much about how they're tracking it?

I mean, I assume there's different ways to implement Google Analytics, but Google Analytics has a URL that shows, you know, which page it's tracking and stuff.

Yep.

I think it's generally accepted.

Yeah, essentially you put a little piece of JavaScript code like at the bottom of your page. I can't remember if they say to put it at the bottom or the top, but basically it's on the page and it makes a call out to them so they can record data about the user. But if maybe your web designer forgot to put that on a few pages or they have some crazy JavaScript error that prevented, you know, unrelated, but it prevented that code from getting executed, you could have issues in your data there. But by and large, Google's not trying to do anything but track it accurately because if they didn't, their product would be not so good and people wouldn't use it. So they don't have an agenda.
An agenda is something you always want to ask about when you're looking at the provenance of the data. So I think it comes down to what you want to conclude from the data and how important your conclusion is because although the data set you have may tell a very clean story, it could be that it's based on factually incorrect measurements and observations. So if it's something really, really important, you need to decide you want to make really, really sure that the data is good. And that involves asking questions of where did it come from and who recorded it and perhaps is it heterogeneous or homogeneous? Let's say you wanted to look at 100 years of agricultural data and there wasn't a single organization that collected it the whole time and you're patching together different organizations, disparate data sets. You now have non-homogeneous data, it's different types. So they might have kind of recorded it differently or reported it differently. And it would not be necessarily a good idea to just link it all together. Or if you do, just consider how that could affect your analysis. So the next time you see a table in a magazine or a chart on a website and it's something astounding and you want to know more about whether or not to trust it, what might you do?

I mean, one option that you just went over is that you could look up where it came from.

What if it doesn't say?

Well, that's just suspicious.

Excellent point. And maybe we end on that excellent note. Be suspicious of conclusions based on data that don't report the provenance of the data.

[MUSIC PLAYING]