Archive FM

Data Skeptic

Contest Announcement

Duration:
12m
Broadcast on:
08 Oct 2014
Audio Format:
other

The Data Skeptic Podcast is launching a contest: not one of chance, but one of skill. Listeners are encouraged to put their data science skills to good use or, if all else fails, guess!

The contest works as follows. Below is some data about the cumulative number of downloads the podcast has achieved on a few given dates. Your job is to predict the date and time at which the podcast will receive download number 27,182. Why this arbitrary number? It's as good as any other arbitrary number!

Use whatever means you want to formulate a prediction. Once you have it, wait until that time and then post a review of the Data Skeptic Podcast on iTunes. You don't even have to leave a good review! The review which is posted closest to the actual time at which this download occurs will win a free copy of Matthew Russell's "Mining the Social Web" courtesy of the Data Skeptic Podcast. "Price is Right" rules are in play - the winner is the person that posts their review closest to the actual time without going over.

More information at dataskeptic.com

(upbeat music) Well, welcome to a bonus episode of the Data Skeptic Podcast. I wanted to add something to the feed to announce a contest that I'm going to be launching related to the show. I am incredibly flattered by the number of people who are listening. To everyone who's shared the show, left a review, or written me personal emails: I can't tell you how grateful I am, and how flattered Linda is as well, that people are doing that and supporting the show. The more you guys do that, the more energized we are to continue this. So I wanted to launch a little contest to give something back and hopefully inspire a few more listeners to share the program with other people. I want this to be a contest of skill, not of chance. So we're going to put you to work on a little data modeling exercise, if you're so inclined. You don't have to be an advanced data scientist. You can make a guess if you want and you still might win. So let me explain the contest. Well, first the prize. In a couple of weeks (we haven't recorded yet, so assuming nothing crazy goes wrong), I have Matt Russell scheduled to come on and be a guest on the show. He's written many books, but the one he and I are going to talk about is Mining the Social Web, which is a fantastic book from O'Reilly. So I'm going to be giving away a second edition, that's the latest, to one lucky listener who wins this contest. And here's how it works. So since this is a contest that I hope will help the Data Skeptic podcast get into the hands of more future listeners to enjoy, I wanted to do something that was sort of a forecast based on our popularity to date. So the popularity metric I'm going to apply is cumulative number of downloads. In a moment, I'm going to give out four different data points that are from the history of the downloads so far. I will also put this on the website for anyone who wants to go there to see not only the data, but also some suggestions I've made for modeling it.
And I wanted to have listeners try and predict when I'll reach a certain milestone. And I'm going to watch very carefully. And when I cross that line, what I'm going to do is go to iTunes and see who the most recent reviewer was. So this is also known as the Price Is Right rules, for any Americans familiar with that program. The review that was most recent without being after the time at which I hit my download goal will be the winner. And I will contact them to arrange shipment and all that. The goal that I'm looking for is to get to 27,182 cumulative downloads. Now why that number? You know, I have to pick some arbitrary number. Why not 50,000 or some round number? Well, I really hate celebrating arbitrarily round numbers. So I picked the first five digits of Napier's constant, better known as the irrational number e, which is present in logarithms and in a lot of exponential functions and modeling and the Gaussian distribution and that sort of thing. So when I hit that number of downloads, I will check the iTunes reviews, and the person who reviewed most recently before that time will be the winner. So it's a contest of skill in that I'm hoping people can try and make a prediction of when I will hit that download point and post the review strategically so that they might be the winner. As I said, you can just, you know, go guess and put your review up at any time and maybe you'll win, or someone who's quite good at modeling might just win this contest. I am going to try and check for cheating. My sources of ground truth here are iTunes, for the time stamp of the reviews, and Libsyn, which is the host of the podcast that tracks my downloads. So if you were to, say, go post a review as soon as you hear this and then write what would probably be a two-line bash script to make 20,000 downloads happen in a few minutes, you would force yourself to be the winner.
So if I see any very suspicious behavior like that, I will probably disqualify whoever the apparent winner is from the contest and perhaps just award it randomly among all the submitters who left reviews. I actually don't think that will happen, but I just want to announce it just in case. Anyone not interested in the contest, feel free to jump forward now and skip to the normal content that will resume on our weekly Friday schedule. But I thought I'd take a few minutes to pontificate about how one might approach this, for anyone who's maybe new to forecasting and linear modeling and that sort of thing. So here's the four data points I'll give you. And as I mentioned, these can all be found at dataskeptic.com. On June 21st, we had a cumulative total of 130 downloads. That was very early on in the program's history. I think it was just a few days after I put the podcast in iTunes, and I don't know if I'd even told any of my friends about it yet. So that's pretty close to zero, but that's a starting point. By July 31st, we had 2,313 cumulative downloads. By August 30th, we had 6,146 cumulative downloads. And by October 4th, we had 13,400 downloads. That's even; I didn't round that. What you will need to do is to forecast out when I might hit 27,182. There's a couple of ways you could approach doing that. So the way I start out a problem like this is I just do a simple plot of those dates, where maybe that June 21st date would be, I'll just say, time t equals zero, and then I'll set the other dates to the right, labeled on the x-axis, and on the y-axis I will plot the totals I gave. Now, four data points is not a lot. It's really impossible to confidently fit this data with the right curve. Now, what do I mean by that? The right curve is all about finding the trend for where these downloads are going. Are they increasing, decreasing? Are they constant? The first step I usually take is just trying a linear extrapolation.
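
As a rough illustration of that first step, here is a minimal sketch of a straight-line fit to the four data points quoted above. It assumes the dates fall in 2014 (taken from the broadcast date), treats June 21st as t = 0, and uses an ordinary least-squares fit; `fit_line` is a helper name made up for this example, and nothing here is the host's own model.

```python
from datetime import date, datetime, timedelta

# Cumulative downloads quoted in the episode; the year 2014 is an
# assumption based on the broadcast date.
OBSERVATIONS = [
    (date(2014, 6, 21), 130),
    (date(2014, 7, 31), 2313),
    (date(2014, 8, 30), 6146),
    (date(2014, 10, 4), 13400),
]
TARGET = 27182


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


start = OBSERVATIONS[0][0]
days = [(d - start).days for d, _ in OBSERVATIONS]  # t = 0 on June 21
totals = [y for _, y in OBSERVATIONS]

m, b = fit_line(days, totals)

# Solve TARGET = m*t + b for t, then convert back to a calendar date.
days_to_target = (TARGET - b) / m
predicted = datetime(start.year, start.month, start.day) + timedelta(days=days_to_target)

print(f"slope: {m:.1f} downloads/day, intercept: {b:.0f}")
print(f"straight-line estimate for download {TARGET}: {predicted:%Y-%m-%d %H:%M}")
```

A single line through all four points is only a baseline. If the downloads are accelerating, a straight line will tend to place the milestone later than it actually occurs, which matters under "without going over" rules.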
Regardless of your math background, I think most listeners will recall the concept of y equals mx plus b, where m is the slope of the function you're trying to understand and b is the intercept with the y-axis, and that you plug in enough x's and y's and you can understand the relationship between these two variables. In our case, the variables are date on the x-axis and downloads on the y-axis. So y equals mx plus b is designed to fit linear relationships, because you have only one m and it's constant, so it can't describe other things like exponentials or parabolas or sinusoidal data or things like that. And in truth, linear is not a bad place to start with a lot of things like this. If I had one listener and they downloaded one episode a week, that would be a linear growth podcast. Or any arbitrary number. If I had 100 dedicated people, no more, no less, and nobody quit, nobody joined, that would be a linear relationship. So you have so few data points here, it would really be frivolous to try and say, I'm gonna let the data alone tell me the underlying distribution. So you kind of have to pull in some outside domain knowledge and say, what do I know about podcasts, or any type of media for that matter, in terms of growth? Or what do I intuitively think about it? Naturally, at some point it starts at zero. There was not a Data Skeptic podcast at some point, and that meant it had zero downloads there. I'm certainly not downloaded by everyone on the planet, nor will I be at any point. So you have an upper bound, and somewhere in between is the growth rate. Starting from zero, it's relatively easy to assume that growth rates are better than linear, meaning they kind of have a convex upward slant. If you have two listeners and you go to four, that's not a huge absolute increase, but you have just doubled.
And if you go to eight, that's, you know, a geometric increase, which won't keep up forever, but it certainly is easy to double and grow very quickly when you're at a small scale. But it also is natural that at some point, growth has to top off. As I mentioned, I won't ever reach a point where everyone in the world listens to this, and I'm okay with that. So where will that drop off be? One might try and forecast what the maximum audience is for something like the Data Skeptic podcast, and then figure out how successful I can be at penetrating that audience, and expect some sort of curve that perhaps would look like the arctangent function or a logarithmic function, maybe. Something that has initial growth and then slopes off. I'm not gonna tell you what curve I'm seeing so far or where I think it's going, but that's part of the fun of the exercise. Take those four data points, plot them, and guess when you think, extrapolating from the data I've given you, I'll hit 27,182 downloads. For anyone who's gonna put some actual effort into this instead of just guessing, although I do respect guesses, I'll give a few more pieces of advice. One thing I've noticed about my downloads, and it will come as no surprise, is that when I release a new episode, I see a sharp spike in downloads that day. That tells me that there are people who are subscribed whose podcatcher software, be it iTunes or Stitcher or Podcast Republic or whatever, is polling for new episodes, and when they see one, they download it right away. I thank all my subscribers and I hope to find more all the time. But there's also sort of a background noise of downloads, a certain fixed number of downloads I get every day that are seemingly independent of the new episodes I'm uploading. Now, why would that be?
Presumably that is people who maybe find the website. Maybe they were googling for something like Chicago street light crime, and the great podcast I did with Zach Seeskin comes up, and this is someone interested in civic open data, of which the city of Chicago is a great leader, and they listened to that podcast, said "cool," and they weren't interested in the other episodes, so they went away. Or someone who just came and picked a random episode, listened and said, no, this isn't for me, and went away. Or users who found the podcast recently and wanna go back little by little into the back catalog and catch up on all the old episodes. So there's lots of reasons why I get these downloads that aren't necessarily the day of or day after a release. And that too seems to be a growing trend, if my sort of extrapolation is correct. I do get more and more of these downloads that aren't related to releases, but at a slower growth rate. The main growth I've seen, and appreciate very much, is that on days of releases, I'm seeing more and more downloads that day and maybe about half as many the next. That could either be accelerating, constant, or decelerating. So you'd wanna make some judgment there and plug in a model appropriately. An important parameter for anyone forecasting when I'll hit that goal is how many shows I will release between now and then. So to make this somewhat easier, I am going to commit that I will not put any more bonus content out between now and the time I hit that goal. So you only have to concern yourself with the Friday episodes. If you're not a data scientist and you're not good with multivariate analysis or generalized linear models and you wanna take a much simpler approach, I think you can do that too. I would encourage everyone to open up Excel or Google Spreadsheets or something like that. Put in these four data points I've given and plot them as a scatter plot.
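
The release pattern described above (a steady daily background, a spike on each Friday release, and about half as many the next day) can be sketched as a small day-by-day simulation. Everything numeric here is hypothetical: the episode gives no figures for the background rate or the spike size, so `baseline=100` and `release_spike=500` are illustrative values chosen only because they average out near the roughly 207 downloads per day between the August 30th and October 4th totals. The function name `simulate` and the `weekly_growth` parameter are also made up for this sketch.

```python
from datetime import date, timedelta

TARGET = 27182


def simulate(start_day, start_total, baseline, release_spike, weekly_growth=0.0):
    """Step forward one day at a time until the cumulative total reaches TARGET.

    baseline      -- downloads per day unrelated to releases (hypothetical)
    release_spike -- extra downloads on a Friday release day (hypothetical)
    weekly_growth -- fractional growth of the spike per release (0.0 = constant)
    The day after a release gets half the spike, as described in the episode.
    Returns the calendar day on which the target is reached.
    """
    day, total, spike = start_day, start_total, release_spike
    while total < TARGET:
        day += timedelta(days=1)
        downloads = baseline
        if day.weekday() == 4:      # Friday: new episode released
            downloads += spike
        elif day.weekday() == 5:    # Saturday: about half the release-day spike
            downloads += spike / 2
            spike *= 1 + weekly_growth  # next week's spike may be larger
        total += downloads
    return day


# Start from the last observation: 13,400 downloads on October 4th (year assumed 2014).
hit_constant = simulate(date(2014, 10, 4), 13400, baseline=100, release_spike=500)
hit_growing = simulate(date(2014, 10, 4), 13400, baseline=100, release_spike=500,
                       weekly_growth=0.05)
print("constant spike:", hit_constant)
print("spike growing 5% per release:", hit_growing)
```

The sketch only resolves the answer to a day, not a time of day, and it ignores any growth in the background rate. Changing `weekly_growth` is one way to express the judgment call between accelerating, constant, and decelerating release-day downloads.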
Let your eye try and predict what the trend is here, and allow some of your assumptions about the domain, that is, podcast downloads, to come into play. Do you think this is something that grows exponentially or linearly, or in what fashion? And given that, what sort of curve do you see going through all four of those data points, and how do you see that curve extrapolating out into the future? So while I generally say go with the evidence and the data rather than intuition, intuition is one way to compete in this contest. So I think that's all I'm gonna add. If anyone has any questions, please reach out on Twitter and I'll be happy to answer them. Best of luck to everyone who wants to compete by leaving a review on iTunes. Apologies to anyone who doesn't listen via iTunes and finds that to be a hassle, but it is fair in that everyone can create an iTunes account just for the purpose of leaving a review if they're so inclined to participate. And maybe next time we'll do something on another platform. But all it takes is a good guess and a review to win this book. I'm also gonna give one other optional prize to the winner. If they're so inclined, they're welcome to join me for a short bonus episode where they get to give, like most of my guests, two recommendations: the benevolent and the self-serving recommendation. So they get a little bit of a chance to promote themselves and something they like on the podcast as well. So good luck to everyone. Thanks for sharing and helping the podcast grow, and I will see you next Friday. (upbeat music)