[music] Welcome back to another mini episode of the Data Skeptic Podcast. As always, I'm here with my co-host and wife, Linda. Howdy. So Linda, our topic today is noise. What does that mean to you? [humming] To me, well, okay, there's two things. There's the noise you might hear on a staticky radio. And then there's also noise that I know from a data perspective, which detracts from the actual trajectory of some data. Well, the show's over, you already know what we're talking about. That's all I know. It's something in the background that you can take out. Wow, you hit all the key words.

Let me pause for a moment, though, and tell you that this topic was inspired, or suggested, really more inspired, by an email I got from a, I want to call him a friend. He's sort of more of a recent acquaintance, but I like the guy a lot, a guy named Karl Mamer. And I'm bringing him up especially because he hosts something called the Conspiracy Skeptic podcast, which I had the distinct pleasure of appearing on this past week. I think it will go live very soon, maybe before this episode goes live. So I would encourage everyone listening to please go check out Conspiracy Skeptic. I believe I will be episode 51, but subscribe nonetheless, because they're all great episodes. Karl has guests on, and they talk about stuff like JFK conspiracies, and the moon landing, and Bitcoin, and all that good stuff. And it's a lot of fun, and it's much like my alternate-week episodes where I have a guest and we talk about the ins and outs of those sorts of things. So do you plan to listen, Linda? I don't know what you said, so... Well, there's only one way to find out, and that's to go listen. The topic was the Bible Code, which is this claim that there are hidden things about the future encoded in the original Hebrew text of the Bible. So I explored how plausible such a claim would be from a data perspective, and talked a little bit about the history, and how one might measure such a thing. So everyone should go listen to that, because it's a fun little data discussion. I actually was kicking myself afterwards, because I feel like I left out a couple of points, but I think it's a really good discussion, and Karl asked me some great questions, and I hope all my listeners will enjoy it.

But anyway, on to our topic for today, which is noise. Or maybe I should say signal and noise, and as Linda very eloquently set up for me, this is a term that comes from the signal processing world, where a great analogy, like you said, is that you have some audio signal. So think of an old transistor radio, and hopefully my younger listeners know what that is, and have had the experience, at least maybe in a car, of turning that tuner dial on the radio. And what happens between stations, Linda? Static. Did you know that a little bit of that static is made up of the cosmic microwave background radiation? No. Yeah, that's really cool, right? Well, if you want to listen to that stuff, sure. It's the signature of the Big Bang. Anyway, the static is what we call noise, and we call it that not just because it's noisy, but because it's basically information we have no interest in. It's just sort of random data. And then you tune it to a frequency, and you start to hear that great AM announcer talking about old-time radio shows or whatever it is, or some great ska band on some FM station. And you want to tune it to minimize how much noise you get and maximize the signal you get. So we're not actually going to talk about any signal processing today.
We won't get into impedance or amplitudes or attenuation or anything like that. But radio is a great analogy in that people kind of understand that, oh yeah, there's some signal and there's some noise. So that terminology is very useful in other fields as well. So have you ever encountered someone in your day job perhaps, Linda, who says, oh, this data is noisy? No, I don't work with any data people. I'm sorry. Well, how about this? Have you ever read the reviews for a restaurant, or the comments on a YouTube video, or anything like that? Yeah.

So let me tell listeners, I dislike Yelp. We live in Los Angeles, so there's a lot of Yelp reviews. Tons. And just from my experience, this is not a scientific study, but I've noticed like 90% of restaurants have four stars and above. And I hate that, because I go to Yelp because I want to, quote unquote, find the best ones in LA. Well, they're all the best. They're all fours. And so I try to read the stars, I try to be like, okay, if it's 4.5, that's above average, the little fractions of stars, which Yelp doesn't tell you the value of, it's just a visual cue. So I try to look at the fractions. But yeah, I just don't like Yelp, because it's hard to see, you know, the difference between good restaurants and bad. I mean, that's just because we have a lot of reviews in LA. I know when I go home to North Carolina, we don't have that many reviews. So I can't speak to other cities.

If you were to read through some of the reviews, would you say there are some that look like good reviews and others that are not? Well, if I'm only looking at four-star restaurants, a majority of them are really good. But I mean, yeah, sure, there are ones where the review doesn't seem that valid, or it just seems like spam, or advertising, or a one-off experience, or the owner of the restaurant posting a review. Well, I don't see that much. Well, that's because they hire other companies to do that stuff. But anyway, all those garbage reviews kind of pollute the signal you're looking for. You'd like to get a clean reading of how good that restaurant is, and everyone who goes and puts in an arbitrary, always-four-stars-no-matter-what answer, that's noise in the data. It makes it harder to get to the actual data you're looking for. Yeah, I'm trying to find a good restaurant so I can eat.

Another way you could look at it is, let's say after someone finishes their checkout, you ask them to fill out a quick survey. Some people will spend the time to read the questions and put intelligent, you know, one-to-ten or one-to-seven Likert scale answers or whatever, and other people just kind of click randomly. Everyone who clicks randomly is contributing noise. This is data that's getting recorded, but it doesn't help you get to the actual thing you're trying to measure. So any data you have is actually the result of a measurement. Now sometimes it's a very precise measurement. Like if you have a scanning tunneling microscope, it doesn't get much more precise than that. You can trust those readings. But then you have things like those police speed radars that are not particularly accurate, or you, you know, call people on the phone and ask them their opinion. There's a wide standard deviation there. So some of the results you get are genuine good measurements and others are just noise. And it's hard to tell the difference.
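[Editor's note: to put a rough number on that survey example, here is a minimal sketch in Python, not from the episode, of how random clickers drag a measured average away from the honest one. The true mean of 5.8 on a one-to-seven Likert scale and the 30% random-responder rate are invented for illustration.]

```python
import random

random.seed(42)

TRUE_MEAN = 5.8     # hypothetical honest opinion on a 1-to-7 Likert scale
NOISE_RATE = 0.30   # assumed fraction of respondents clicking randomly
N = 10_000

def honest_response():
    # Honest respondents cluster around the true opinion.
    return min(7, max(1, round(random.gauss(TRUE_MEAN, 0.8))))

def random_click():
    # Random clickers pick uniformly, so their expected value is 4.0,
    # the midpoint of the scale.
    return random.randint(1, 7)

responses = [
    random_click() if random.random() < NOISE_RATE else honest_response()
    for _ in range(N)
]

observed = sum(responses) / len(responses)
print(f"true mean: {TRUE_MEAN:.2f}  observed mean: {observed:.2f}")
```

[The observed mean lands near 5.3 rather than 5.8, pulled toward the scale's midpoint by the random clicks, and row by row the noise is indistinguishable from the signal, which is exactly the problem described next.]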
So in any data exercise, one kind of has to evaluate, of the data I've collected, how much seems to be good signal and how much seems to be noise. And it's hard on a row-by-row basis to know which is which. So sometimes you just need to estimate the amount of noise in your data and try to find a way to filter it out, as you said earlier, the background noise of the data. So for example, in this audio recording, we can hear crickets in the background. Maybe you can't because you turned your headphones down, but I'm a rocker, so my headphones are at max, and I can hear crickets on the patio. I consider that to be noise in both the classic and the data sense. So later I will apply a low-pass filter to hopefully remove those crickets, because I want to get just the signal and not the noise. I can imagine that. I do not hear crickets, so I just wanted to clear that up about the headphones.

So here's maybe a good example. Have you ever driven around and seen, in the pavement, these kind of circles that are not quite a lane wide? Generally they look black, like they've been filled in with tar, like somebody cut out a circle and then filled it with tar. Have you seen those? They're often right before stoplights and on the highway. I think what you're talking about are sensors in the road, and those are large inductive loops. To me, they look like wire that someone just dropped down when they poured the road. Yep. So those are loops of wire that are there to take measurements of what's going on on the highway. And for a number of reasons, despite the best efforts of really good engineers trying to design these systems and install them at a reasonable price and all that, those are notoriously noisy data sources. If you try to measure anything with precision there, you get the blurriest measurements you can get.

Let's talk about how those work. The loop carries a current that induces a slight magnetic field, and when it gets disrupted, such as when a large piece of metal, aka a car, passes over it, you can measure the change in flux. I think it's flux, anyway. I'm not 100% sure. Basically, it can measure when a large piece of metal passes through there and, roughly speaking, for how long. But something like a long truck, or a truck with two trailers, will set it off, or maybe a boat attached to a car will set it off, or some car might be a little light and not set it off, or the thing took some weather damage and it's not as sensitive as it used to be. So it's incredibly noisy data you get out of those, and yet you want to extract information from them. So what would you do if you had data that you know is somewhat trustworthy, somewhat untrustworthy? I mean, I'm married to a data scientist, so I think of the world a little differently than people who do not live with one day in and day out, but I think what Kyle's trying to get at is that maybe people would try to figure out when these sensors were accurate versus not accurate, and then try to go off the best data. That's my best guess. Yeah, I think that's a pretty good way to say it.
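[Editor's note: as a tiny sketch of Linda's guess, here is what a crude plausibility filter over loop-detector data might look like in Python. The readings and the 0.1-to-10-second bounds are invented for illustration, not real traffic data.]

```python
# Hypothetical readings: seconds each detection event lasted over the loop.
# The stuck-low zero and the 45-second glitch are noise; the 7.8-second
# event might be a slow, long truck, which this simple rule will keep.
readings = [0.42, 0.39, 0.0, 0.51, 7.8, 0.47, 45.0, 0.44, 0.40]

# Assumed plausibility window: a single vehicle at highway speed should
# occupy the loop for somewhere between 0.1 and 10 seconds.
MIN_PLAUSIBLE, MAX_PLAUSIBLE = 0.1, 10.0

signal = [r for r in readings if MIN_PLAUSIBLE <= r <= MAX_PLAUSIBLE]
discarded = len(readings) - len(signal)

print(f"kept {len(signal)} readings, discarded {discarded} as likely noise")
print(f"average dwell time: {sum(signal) / len(signal):.2f} s")
```

[A real system would model the failure conditions far more carefully, which is where the conversation goes next, but the shape of the idea is the same: encode what a trustworthy reading looks like, and throw away the rest.]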
If you understand something about your underlying model, maybe the conditions under which a sensor would fail, you can filter on that. But at some point you'll get to a place where you say: we have this data, it was collected under the best possible conditions and resolution, and we know it's still noisy. Yet if we can estimate the amount of background noise, perhaps through the right filtering or signal processing or pruning, we can do our best to separate out signal from noise. But I think the key takeaway that I'm hoping listeners can hear is what I mean when I say noise, and what future guests maybe mean when they say noise. It essentially means that in any data set you have a mixture of clean signal and random noisy data, and you're always after that clean signal, but that measurement of ground truth is often quite unattainable. So it's important to have some expectation or understanding of how noisy your data might be and take that into consideration.

What's pruning? Ah, pruning. What would your dad say pruning is? Well, Kyle's trying to reference the fact that my father and mom like to garden, so pruning is actually cutting back plants. But I'm asking Kyle, what is pruning in his eyes, the eyes of a data scientist? So much like I use the word noise, which I've stolen from the signal processing community, I also use the word prune, which is stolen from the gardening community. There's a common data structure in computer science called a tree, modeled after the biological tree. It's basically the idea that you have certain nodes that can form branches and lead out to leaves. Think of maybe a website. The home page is sort of like the base of the tree, and you can click through to various pages. Maybe those pages have sub-pages, and those pages have sub-pages, and it kind of makes this tree. If you're exploring a website and you want to find, let's say, the name of the person who started the website, you might click on a certain page and realize, oh, it's not here, and it's probably not on any of the pages below this. So rather than clicking on all the links and looking, you would say, I'm going to prune this path. I'm not going to go down here. But another page, maybe it's called "contact us" or "our founders" or something like that, might have the information you're looking for. So you would not prune those branches. And the gist of it is that in any search, you can do an exhaustive search and look through every possible branch until you get to all the leaves and explore the whole system, but often that's either impractical, impossible, costly, or whatever. So if you can find some path to eliminate and say, I'm not even going to go explore that path, I'll prune that path, it's like cutting off one branch of a tree.
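[Editor's note: here is a small sketch of that idea in Python, using an invented site map and an invented heuristic, just to make the click-through example concrete.]

```python
# A toy site map: each page lists its sub-pages, forming a tree.
SITE = {
    "home": ["products", "blog", "about"],
    "products": ["widgets", "gadgets"],
    "blog": ["post-1", "post-2"],
    "about": ["our-founders", "contact-us"],
    "widgets": [], "gadgets": [], "post-1": [], "post-2": [],
    "our-founders": [], "contact-us": [],
}

def looks_promising(page):
    # Invented pruning heuristic: founder info won't live under the
    # product or blog sections, so don't bother exploring them.
    return page not in ("products", "blog")

def find_page(page, target):
    if page == target:
        return True
    if not looks_promising(page):
        print(f"pruning the branch at: {page}")
        return False  # skip this page and everything below it
    return any(find_page(child, target) for child in SITE[page])

find_page("home", "our-founders")
# Prints pruning messages for "products" and "blog", never visits their
# children, and finds the target under "about".
```

[Here the pruned search never loads the four pages under "products" and "blog"; on a big site, that's the difference between a feasible search and an infeasible one.]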
So we got two topics in tonight. Well, thank you as always for joining me, Linda. It is a pleasure, and I've gotten a lot of good feedback recently about having you on the show. Well, the pleasure is all mine. And do you have any takeaway from this evening's episode? Well, I think it's interesting that data scientists have actually made a phrase for a logical behavior, which is, you know, when you see something that you don't think is worth exploring and you choose not to continue down that path. No, gardeners did. We stole it from them. Pruning is definitely a gardening word. How do they use it there? Well, I can't be positive, but I know pruning is strategic. For example, in rose bushes, you may want to cut back the number of buds so that the ones the plant does produce are actually better and bigger. It puts more energy into the ones it has, rather than spreading itself thin. Oh, that is interesting. Yeah, that is a little different than how I mean it. But yeah, I think it's funny that data scientists have their own phrases for these strategies, when, coming from my background, I just view it as the way things should be done. And how did you view the term noise before tonight? Oh, I mean, for me, the only time I might use noise is when I'm reading something and people have a lot of words and they don't get to the point. This is a great example. That is also noise, yes. Because you want the point, which is what I would call the signal, and all the ancillary humdrum, the dragging on for an extra thousand pages in that novel, from your perspective, yeah, that's noise.

So thanks again for joining me. I want to remind everyone, please go check out the Conspiracy Skeptic podcast. I had a lot of fun being on the show, and if you wouldn't mind, reach out to the host and let him know if you enjoyed the show; if you didn't, keep it quiet. Then maybe he'll get a lot of positive feedback and he'll ask me back sometime in the future, because it'd be fun to go and explore another data-related conspiracy, if I can think of one. Thanks a lot.