Well, welcome to Data Skeptic. Before we start our interview I want to make a quick announcement, because we're going to be launching a contest in conjunction with DataSciGuide. So first of all, thanks to Renee for joining me to make this quick announcement. Thank you. Renee, give me a quick description of what DataSciGuide.com is all about. DataSciGuide is a new data science learning directory where you can find and rate data science content like books, courses, or even blogs and podcasts like Data Skeptic. Excellent. So we're going to be giving away a free copy, courtesy of the Data Skeptic podcast, of today's guest's book, which has just come out recently, and we'll come back on at the end of the show to announce the details of that contest. So join me again in just a few minutes, Renee. Data Skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes. My guest today is Cameron Davidson-Pilon. Cameron has a master's degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify. He's the founder of dataorigami.net, which produces screencasts teaching methods and techniques of applied data science. He's also the author of the just-released-in-print book Bayesian Methods for Hackers, which you can also get in digital form. Cameron, welcome to Data Skeptic. Thanks, Kyle. Thanks for having me. Oh, my pleasure. So just to start off, I really enjoyed going through all your videos at dataorigami.net. They've been really helpful to share with my team and collaborators. There are many topics there, and from a lot of your talks online as well, there were so many things we could have spoken about. But the topic I settled on to ask you about on the show in particular is A/B testing, and specifically how you use it at Shopify. That's great. A/B testing is such an interesting subject, especially Bayesian A/B testing. Absolutely. So I think, maybe for listeners that haven't encountered this yet, can you give us a high-level definition of what A/B testing is and why we care about it? The history comes from medical trials. A physician or a medical company would be interested in how effective their new drug or their new treatment is. So they would give half the patients, or some fraction of the patients, one type of treatment and the other half either a different treatment or nothing. And after n number of days or n number of months, they would measure: okay, was this drug effective? What were the differences between these two groups? And the most important point is that there's no contamination between the two groups. So we don't have people crossing over between groups; they're identical except for the drug we give them. A/B testing in modern times has really been applied in the web domain. So you have sites like Facebook and Netflix and Google, even including Shopify. We'll show half of our visitors one website or one version and the other half some other version, and we compare things around conversion rates or click-throughs to see what the differences are between designs or flows, or even something simple like button colors. And there's an interesting and important distinction that I really appreciated you making, which was the difference between the binary and the continuous outcome. Could you share some of your insights on that? Right, yeah, yeah.
I see a lot of headlines just browsing the web, especially out of the marketing and growth-hacker world, a lot of headlines around "we improved conversion rate by 300% or 5,000%." These numbers seem so ridiculous to me, and it seems so implausible that you get a number like a 300% relative improvement. Yeah, what were you doing before? Exactly, yeah. It seems like you've stumbled upon this miracle treatment. An A/B test is really designed to measure the binary outcome. That is, did this have any effect? And that's a binary question, so it's either yes or no. But yet people take the results of an A/B test, take the results of this binary decision, and try to apply a continuous solution, or try to apply it to the continuous problem, which is how much was the effect, which is a very, very different question and much more difficult to answer. If you think about the space of solutions in a binary problem, there are only two: zero or one, yes or no. But the space of solutions in a continuous problem is essentially infinite. You need much more data, first of all, to solve that problem. And this is where you get these ridiculous headlines, like 300% or 1,000%. So the real outcome we should be after is just whether or not we've made an improvement? Is that-- Exactly, exactly. Yeah. So I know there's a tricky component to this as well in terms of the sample size you use. We started talking about medical treatments, and I think it's obvious there, from an ethical point of view, that you want to limit your sample size. The problem isn't necessarily as bad in the web domain, where we have hundreds of thousands of visitors a day, but at the same time there is revenue on the line. So you don't want to expose, let's say, 50% of your population to a treatment and maybe see a substantial drop in revenue or something for that 50%. Do you have strategies for how you decide the appropriate size for your treatment group? This might come down to: should we show the treatment to 50% or 75% or 10%? In a frequentist setting, this can sort of mess up your calculations, or there are some workarounds, excuse me. In a Bayesian setting, it falls naturally into: I have less confidence in the posterior distribution of one of the groups if I have only 10% falling into that category. In terms of things like revenue lost, that's sort of the net loss when it comes to an A/B test, but you have to think about-- you make up for it. If you do have a successful experiment and you've increased revenue, you'll make up for it in the long run. So you have linear growth of that one or two percent increase in revenue. When we look at A/B testing, at least from a statistical point of view, there's this underlying assumption that we have this independent and identically distributed heterogeneous-- or sorry, homogeneous-- sample of users coming in, and that if we just put them in test and control groups, they're all kind of equal. But I actually think that's not really true, because I actually happen to be in Las Vegas as we're recording this, and I'm thinking about the casinos: if they're trying to maximize the amount people play, they never know when a really high roller is going to walk in the door. And they're going to end up in either the A or the B group and have a major effect on that group. Are those challenges that get dealt with in A/B testing? Yeah, that's a really good question.
You have to think: an A/B test is actually an approximation to, if you were to divide the universe in two-- and this is crazy, but this is the best we can do-- in one universe you apply the treatment, and in the other universe you don't apply the treatment. That is the ultimate experiment. That's the absolute, mythical standard, just the absolute best. A real-life A/B test, where you assign half the population to one group and half the population to the other, is an approximation to this amazing universal standard. So you're right, you might have a high roller enter into the A group and then it artificially inflates their metrics, but that person could equally likely have been in the B group. So on average, everything works out OK. That's the approximation part. So if your sample size-- if you only run it with three individuals and one of them is a high roller, of course you're going to have a very biased result. But if your sample size is 1,000 or 10,000, or in the web setting it's easy to get a couple hundred thousand people, on average things will work out. So for that high roller, there might be like 10 or 20 high rollers, and they'll be split about evenly between the two groups. Ah, so is it that the law of large numbers gives us a hand as we get big enough samples? Exactly. I think that's my favorite, favorite law, the law of large numbers. Way better than the CLT. The law of large numbers is so trustworthy. Absolutely. The law never fails. [LAUGHTER] The law never fails. So I'm curious to hear your thoughts on how long you want to run an A/B test for. If you have a substantial website, you can get a lot of traffic in a short period of time, but you might fail to capture seasonal effects, whether those be weekly or monthly or time-of-day effects. What sort of considerations go into how you plan the A/B test you're going to run? If it's something with lots of traffic, I'm thinking like page views on the homepage, you want to have that running for at least a week, so you can capture that weekly seasonality-- definitely not less than a day, because that's quite dangerous. You might have lots of conversions in the early morning and fewer in the evening; you want to see some change in your population. If it's something with less traffic, you want to run it longer, but not too long, because you don't want to introduce these annual seasonalities. It's sort of tricky and based on the experiment itself. An experiment that runs for too long-- so if your experiment's running for six months or a year, that's probably a little bit ridiculous, and at that point the product itself has changed so much that people are interacting with it differently. This might not be true in the medical setting, so don't quote me on it; they might actually do very long-running experiments, and I'm sure they do, actually. But on the web, you want to err on the side of shorter, but not too short. Yeah, and in talking about sample size, I imagine if we pulled the data as every event took place and ran our statistical test, we'd see a lot of fluctuation early on and hopefully convergence to some very reliable solution. As we're, you know, in those early stages and seeing that either the A or B group is a clear winner or a clear loser, there's sometimes an instinct to either, you know, double down or abandon the whole test. Do you have any rules of thumb for how to decide the appropriate length to run an experiment for, and whether you should make any midstream changes?
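As a minimal illustration of that law-of-large-numbers point, here is a small simulation, with a made-up high-roller rate and made-up visitor counts, showing how random assignment splits rare high-value users roughly evenly between groups once the sample gets large:

```python
# A minimal sketch of the "high roller" concern discussed above: with random
# assignment, rare high-value users split roughly evenly between groups once the
# sample is large. All rates and counts here are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

def simulate_split(n_visitors, high_roller_rate=0.001):
    """Randomly assign visitors to A/B and count high rollers in each group."""
    is_high_roller = rng.random(n_visitors) < high_roller_rate
    in_group_a = rng.random(n_visitors) < 0.5
    return is_high_roller[in_group_a].sum(), is_high_roller[~in_group_a].sum()

for n in (100, 10_000, 1_000_000):
    a, b = simulate_split(n)
    print(f"n={n:>9,}: high rollers in A={a}, in B={b}")

# With n=100 the split is often lopsided; by n=1,000,000 the two counts are
# close, which is the law of large numbers doing the work.
```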
I tend to avoid midstream changes, just because it's one more factor you have to think about when you do the analysis, even if it's a very minor one. So the way we do it at Shopify is we apply the Bayesian A/B test, and so we can, even early on-- especially early on-- see the uncertainty in our estimates, and that's less true with the frequentist test. You can do it, but you tend to ignore it. So by just displaying the posteriors, we can see early on how fat they are and how much overlap there is between the two tests. If you were just to naively compute the ratio-- let's say it's a conversion rate, naively compute the ratio of number of conversions divided by the population size-- even though that difference might be large, like a few percentage points, if you look at the posterior, which sits underneath that ratio, hidden by that ratio, you'll see there's lots of overlap, and you can say, okay, this test isn't even close to being over, we're still very far away; there's too much uncertainty in both of these estimates. So you see a lot of A/B testing dashboards that display a line chart between the two groups-- I know a few major web companies do this. They'll have time on the x-axis and conversion rate on the y-axis, and it'll be super volatile initially and sort of smooth out and converge to something. I don't like those dashboards, I tend to avoid them, just because early on they must be so volatile, they must confuse business owners so much. Absolutely. Seeing how we're starting to touch on one of the key points here, maybe we could go a little deeper: tell me first what distinctively makes an A/B test a Bayesian A/B test, and what the advantages of that are. A Bayesian A/B test implies applying the Bayesian methodology, or Bayesian stats, to the problem. So instead of relying on the CLT, which a frequentist will do, we rely on just a simple mathematical relationship between the beta and the binomial model. This is the simplest Bayesian analysis. We assume that we don't know what the probability of a conversion is, but we have a prior for it, and when we see conversions come in and we see the population increase, we can apply the beta-binomial relationship-- if you haven't heard of it, I'd encourage you to check it out-- and eventually get posteriors, and these posteriors look like bell curves. For two groups there are two posteriors, for N groups there are N posteriors, and they represent how certain you are as to where that unknown probability of conversion is. That's sort of it in a nutshell. And then to decide, okay, is this experiment over, you compute-- this is a mouthful-- what's the probability that the probability of group A is greater than the probability of group B. And that sounds sort of weird, but if you write it out mathematically, it's nice and pretty. The advantage of Bayesian A/B testing is exactly that: the business owner, the business manager or the statistician, or whoever is involved in this experiment, can directly ask what's the probability that this group is better than the other group, and that is exactly what they want to know. That's the probability of me being wrong or me being right if I choose one or the other right now, and everyone understands that, that's very clear. You can report that up to the CEO and that's perfect. I'm not saying the CEO is the least literate, but it goes all the way up.
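Here is a minimal sketch of the beta-binomial test described here, using invented conversion counts, a flat Beta(1, 1) prior, and Monte Carlo sampling to estimate "the probability that the probability of group A is greater than the probability of group B":

```python
# A minimal beta-binomial A/B test. The conversion counts are invented; Beta(1, 1)
# is a flat prior, and P(rate_A > rate_B) is estimated by sampling both posteriors.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: (conversions, visitors) for each group.
conv_a, n_a = 120, 1_000
conv_b, n_b = 100, 1_000

# Flat Beta(1, 1) prior + binomial likelihood gives a Beta posterior per group.
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# "What's the probability that the probability of group A is greater than the
# probability of group B?" -- estimated by Monte Carlo.
prob_a_beats_b = (samples_a > samples_b).mean()
print(f"P(conversion rate A > conversion rate B) = {prob_a_beats_b:.3f}")

# Wide, overlapping posteriors early on are the signal that the test isn't over yet.
```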
Yeah, definitely. I've been frustrated a lot in my life where, I mean, people would love to have point estimates for everything, but those are simply insufficient for really describing the world, at least statistically. How have you found that describing it via distributions like this, and the variances they have-- how effective has that been in communicating that message in a business to people who maybe don't have a full statistical background? Yeah, that's another really good question. You can't hand someone a distribution and expect them to know what to do with it; that's a very rude thing to do. You have to give them a point estimate, especially if they're not well versed in this style of stats. So you need to derive some sort of summary stat from your A/B test or your distributions, and that's typically this probability: what's the probability of you being wrong, or conversely, what's the probability of you being right, by picking this group right now. We also do try to solve the continuous problem that I mentioned earlier: by looking at the uncertainty in these two groups as to what the probability of a conversion is, we can compute the expected lift, the expected relative increase in this experiment. And for that we tend to err on the side of being less optimistic, and that's to avoid problems where the best estimate might be like 200%, but we tend to be conservative and say, yeah, that's probably about a 5% increase. And that's by looking at the distribution of the relative increases and choosing a value that we can rely on, at least. What sort of decisions do you generally find yourself using A/B testing to solve? It can be as minor as moving a button or including a link in a page, or it could be as radical as completely changing the checkout design. Both of these are valid reasons for using an A/B test. One obviously has more money and time involved than the other, but they're both valid use cases; you can test both of those. So I know I have to be a little bit careful about not revealing anything sensitive, but sometimes simple things like the layout or some of the imagery around a checkout process can increase a consumer's confidence that the business they're engaging with is very protective of and concerned about their privacy and things like that, and those small changes can have big impacts. Do you find that that's the case in your work as well? Yeah, we've seen that too, definitely. Small changes are sometimes very surprising: you don't expect much to happen, but then there's quite a bit of interest in it. So I imagine there's sort of a multi-armed bandit problem here, in that we could literally test anything, and I don't know myself of any way to figure out a priori what's the best thing to test. What's the business process for deciding what's most important to test and how to evaluate it? That's a good question. I think the best way to solve this is to test as much as possible. So that means making it very, very easy to start an experiment, to complete an experiment, to add an experiment to the framework or to the analysis portion of your experiment pipeline. That encourages developers to just try everything, just try whatever they want, and that sort of opens up the funnel to lots of different experiments. It can be somewhat tricky because you might have conflicting experiments later on, but it usually doesn't come to that.
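A minimal sketch of the conservative-lift summary described above, with invented counts and an arbitrary choice of the 5th percentile of the relative-lift posterior as the "value you can rely on" (not Shopify's actual rule):

```python
# Conservative expected lift: report a low percentile of the posterior of the
# relative increase rather than the best guess. Counts and the 5th-percentile
# choice are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)

conv_a, n_a = 130, 1_000   # hypothetical treatment group
conv_b, n_b = 100, 1_000   # hypothetical control group

samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# Posterior of the relative increase of A over B.
relative_lift = (samples_a - samples_b) / samples_b

print(f"mean lift:               {relative_lift.mean():.1%}")
print(f"conservative (5th pct.): {np.percentile(relative_lift, 5):.1%}")

# Reporting the low percentile avoids the "300% improvement!" headline problem:
# it's a lift you can rely on, not the most optimistic reading of the posterior.
```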
So I think the multi-armed bandit solution here is to just have many people pulling arms and find the best one that way. Makes sense, yeah. Are there any special considerations you worry about in having multiple experiments going on at the same time? Right, so that's been a serious discussion here. I tend to be okay with it as long as the number of concurrent experiments isn't too large on the same experience. Other people like to keep it low, like one or two per page. The most important thing about all this is keeping the experiment assignments independent so they're not conflicting, so you don't have correlations between experiments. If they're independent, you can rely on stats to sort of wade through these other experiments and get to the core. Do you ever worry that things like the equivalent of the placebo effect or the Hawthorne effect can come up in A/B testing? Right, yeah. So there's a great paper, and anyone who's doing experiments has to read this paper. It's out of Microsoft and it's maybe five years old now. The name escapes me, but it goes through maybe six or seven mistakes that the Microsoft A/B testing team made early in their research. And one of the ones they described was the novelty effect-- I think that's called the Hawthorne effect. Yeah, yeah, in other literature. Cool. Basically what it is is you introduce a new feature and people just like it, or they're interested in it because it's new, and they click on it. They're not actually converting or doing anything else; they're just clicking on it because it looks new. Most websites have this problem. So dealing with that is a serious issue, and you have to carefully craft your experiment to avoid it. In terms of the placebo effect, that can happen in medical trials; I can't imagine it happening on a website, because most users don't know they're part of an experiment on the website, or they don't know which part of the website is being experimented on. Oh, yeah, good point. In a medical trial, you have to sign off and say, "Yes, I'll be part of this experiment." On a website, it's sort of hidden; it's implicit. And so I know there are a lot of tools out there, and at the same time an A/B test isn't something that one necessarily needs a tool for; the mechanics of it are pretty simple. Can you share a little bit about your experience with approaches and tooling to implement a test? So we have a custom A/B testing framework here that we developed in-house. One important point of it is that the assignments are independent of any analysis. So the assignment framework just does that: it just makes assignments and shoots them to our data warehouse. It doesn't do any checking of conversions or anything like that. All that analysis happens in the data warehouse, where we combine it with other data sets-- data sets around, like, did you convert, or did you buy this, and stuff like that. So that separation of concerns is super useful and very important to have. You can have your assignment engine just fire off assignments, completely independently of what's going to happen with the analysis afterwards. There are tools like Optimizely, which I think tie them together a little bit more, but it's nice to have them separate so they're two distinct components of your architecture. During the analysis phase, once you've collected all your data, you're going to have the results for your test and your control group, or your A and your B.
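A minimal sketch of that separation of concerns: an assignment engine that only hashes a visitor into a group and emits the event, leaving any conversion analysis to the warehouse. The hashing scheme and the log format here are assumptions for illustration, not Shopify's actual framework:

```python
# Assignment engine that only assigns and logs; it knows nothing about conversions.
# The hashing scheme and event format are illustrative assumptions.
import hashlib
import json
import time

def assign(visitor_id: str, experiment: str, n_groups: int = 2) -> str:
    """Deterministically assign a visitor to a group, independently per experiment."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    group_index = int(digest, 16) % n_groups
    return "AB"[group_index] if n_groups == 2 else str(group_index)

def log_assignment(visitor_id: str, experiment: str, group: str) -> str:
    """Build the assignment event; in practice this would be shipped to the warehouse."""
    return json.dumps({"visitor": visitor_id, "experiment": experiment,
                       "group": group, "ts": time.time()})

group = assign("visitor-123", "checkout-redesign")
print(log_assignment("visitor-123", "checkout-redesign", group))

# Because the hash mixes in the experiment name, assignments across concurrent
# experiments are effectively independent -- the property emphasized above.
```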
And there's going to be some effect there, or let's assume you see some effect. Some of it will be due to just noise from the imbalance-- maybe some particular highly converting user went into the A group by chance-- and then you'll have some difference that's due to the actual treatment you created, the improvement or enhancement you're testing for. And you can't directly measure which of those things caused the improvement. So I'm curious how much of that becomes a part of your analysis phase, and trying to separate out these two not directly separately observable features. Right. So early in my experimenting career, I guess, this occurred to me and I was very scared about it. So I rushed off and did a bunch of math and simulations and really worried about it. The conclusion I came to was that the practitioner has to distrust A/B tests with small sample sizes, at least when the variance from the class imbalance problem is on the same order of magnitude as the variance of the treatment effect. And what that means is, if your treatment effect is like 1%, then the variation within the classes has to be less than 1%. So I did a nice blog post on dataorigami.net; it's called "The class imbalance problem in A/B testing." I basically just do a bunch of simulations and test this hypothesis, and there's some math there, I think, to work it out. But it's a really serious problem, so you need to make sure that you don't stop your experiment too early with a small sample size, because you might just be seeing that class imbalance problem. Yeah, that's a good observation. I'll be sure to link to that blog post in the show notes if any of the listeners want to go check that out. So ultimately, I guess, the objective of any A/B test is that you're trying to maximize some particular metric related to a change. Can you tell me a little bit about how that then relates back to what you inform the business of and how they move forward with it? Right, so the simplest A/B test is the conversion test, and that just measures: did you convert, did you not convert? But the business owners might want to know, okay, how much revenue came from that user, which cohort brings in more money or more revenue to the business. So with Bayesian A/B testing, it's really easy, after the conversion analysis, to apply a decision function onto the analysis-- onto the posterior, really. And this gives you a different view of the data, and you can see, okay, given these conversion rates and these other factors, what sort of revenue is the company bringing in? And that's really possible because it's super easy to apply a decision function in a Bayesian A/B testing setting. Do you mind if I ask what's the most successful test you've done? I know there's probably some confidentiality involved, but in terms of realistic improvements-- not these thousand-percent type things-- is there anything in particular, or sort of a general class, that you've seen has a large net effect on positive gains? I probably can't discuss the best A/B tests we've done, but I can discuss one that we found really interesting. I think the most interesting one was, I really got into the idea of behavioral economics, specifically Thinking, Fast and Slow and Nudge and those sorts of ideas. And I like to think of the Shopify business as a little mini-economy, and I do little experiments as a government figure in that economy.
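A minimal sketch of applying a decision function to the posterior as described above, re-expressing invented conversion-rate posteriors as expected revenue per visitor under an assumed average order value (all numbers are illustrative, not Shopify data):

```python
# Decision function on the posterior: map sampled conversion rates to expected
# revenue per visitor. Counts and the $45 average order value are invented.
import numpy as np

rng = np.random.default_rng(2)

conv_a, n_a = 120, 1_000
conv_b, n_b = 100, 1_000
avg_order_value = 45.0  # assumed, in dollars

samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

def revenue_per_visitor(conversion_rate):
    """Decision function: expected revenue per visitor for a given conversion rate."""
    return conversion_rate * avg_order_value

rev_a = revenue_per_visitor(samples_a)
rev_b = revenue_per_visitor(samples_b)

print(f"expected revenue/visitor: A = ${rev_a.mean():.2f}, B = ${rev_b.mean():.2f}")
print(f"P(A earns more per visitor than B) = {(rev_a > rev_b).mean():.3f}")

# Pushing posterior samples through the decision function is all it takes --
# the "different view of the data" mentioned above.
```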
So one experiment we did: we wanted people to fill out a shop meta description, and this meta description shows up in the Google rankings, so it's important to fill it out-- if Google detects it, it ranks your website more highly. But we found that not many people were actually filling out that box. Originally, the box just looked like a blank text box. And we decided, okay, well, let's try to put some filler text in there, the text that sort of goes away when you click on it. So-- and this is me going back to behavioral economics-- let's put in a small nudge, and it's a negative nudge. So we put in the box: enter your description here, or you might lose SEO juice, something like that. And we found that it was hugely successful: we got about a 10% absolute improvement in people filling out that box. So it did really well. Next, we thought, okay, well, maybe it was the text itself that encouraged people to fill it out, maybe it was just having text. So we said, okay, let's try positive text. The positive was: enter your description here to rank higher on Google. The negative was: enter your description here to avoid ranking worse on Google. And the control was left as it was, the empty box. So we did these three groups, and again we found the negative just blew away all the other groups. There was some effect from the positive, but the negative just really killed them. So we rolled this out, we put it into production. We didn't do too many more after this, because we didn't want the admin to have negative copy everywhere, always harping on the user to include this or you'll lose money. So we sort of keep it as a secret weapon. Oh, how interesting. It's really cool. Well, while we're on the topic, tell me a little bit about Shopify and what services you offer to your customers. So Shopify is an e-commerce, or more generally a commerce, platform now. Small businesses, medium businesses, even large businesses now can sign up for Shopify and have a store running really quickly. They can use our e-commerce portion-- setting up a storefront, selling online; we do all the shipping, the payment gateway, inventory, stuff like that. You can also do retail, so if you want a point-of-sale stand, we can do that. We also allow our merchants to sell on Facebook and Pinterest, and we have mobile SDKs, stuff like that. One of the interesting services we're working on now is Shopify Home, and Home is an overview of your shop. It's less of a dashboard and more of a "here's what's really cool, here's a quick overview of how your shop is doing." So half of the page is simple metrics, some graphs around sales today and stuff like that, and the other half is data-generated insights about your shop. So my team and I will data-mine data sets like your visit data sets and your sales data sets, and publish really interesting insights that we found into this feed. So the merchant, looking at this, can say, "Oh, great, this is a really cool idea, I'm going to try this." And that idea is based upon your store's data. Oh, that's really fascinating. And just in the event we have either a local merchant or a medium or even large enterprise, or perhaps an aspiring entrepreneur, listening who might need services like that, where's the best place for them to go to learn more? Shopify.com. All right. Cool. So Cameron, this has been great; really glad you came on the show to share some of your insights.
I want to bring it back to discussing the fact that your book is now being released in print. We're recording this a little bit early so we could get it lined up just right. Where's the best place for people to check that out if they're interested in a print copy or a digital copy? For the print copy, definitely, I would say, search for it online. It's by Addison-Wesley; you can find it on Amazon. Also included with it are chapters that weren't in the original digital version, including a chapter devoted just to Bayesian A/B testing, so it's a nice tie-in with what we've been discussing today. If you're interested in the digital version, which is open source, you can find it on GitHub under Bayesian Methods for Hackers. Awesome. Can you give us a couple of quick bullets on some of the other topics? Sure, yeah. There's a whole chapter on decision functions, which is probably my favorite thing about Bayesian testing, or Bayesian analysis in general. There's a chapter on logistic regression, a chapter on picking good priors, and some neat applications to privacy. Yeah. Very cool. Yeah. I think it's a great resource, so I was glad to have you on to share a little bit about one of the chapters, and I hope a lot of listeners will check out some of the others. Great. Thanks, Kyle. Excellent. Well, thanks so much for your time, Cameron. Have a good rest of your day. You too. So courtesy of the Data Skeptic podcast, we're going to be giving away one free print copy. It's a free ebook as well, but if you like print versions as I do, we're going to be giving away one free copy of Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference by today's guest, Cameron Davidson-Pilon. So as we announced at the beginning, I'm back here with Renee from DataSciGuide.com. So welcome back, Renee. Hi. To get started, maybe tell me a bit more about DataSciGuide.com. Sure. It's a site for people who want to learn data science. So you can search online and find all kinds of resources-- online courses, books, podcasts, blogs, conferences, tutorials-- on all kinds of topics: machine learning, data visualization, Python, R, statistics. But as great as it is that there's so much content available online, it can also be pretty overwhelming. So where do you start? Which of the many topics should you focus on? If an online course you're taking suddenly gets too hard, you might start to wonder whether you're even cut out to be a data scientist-- but maybe that course is hard for everyone. The course might not be well taught, or it should have had more prerequisites listed. Or maybe it's the opposite: if you already have a lot of relevant skills, what book should you buy that will help you learn an advanced topic without being too basic and a waste of your money? So that's why I built DataSciGuide. I set it up so that not only is it a directory of all of these data science resources, but when you review content, you include your level of expertise on that topic. So if a beginner reads the reviews, they might relate more to what another beginner has to say about the content. Then eventually I'll use these reviews to build learning paths and a recommender system. It could say: if you liked this course, you might also like this book. Or: it looks like you're an intermediate-level learner on this particular learning path, so this might be a good tutorial for you. But for now, that's all future stuff. First I need to collect some reviews.
So the site is a work in progress, and I'm currently fleshing out the directory of data science content. You can check it out at DataSciGuide.com. Well, excellent. I appreciate you teaming up with me to help me find a lucky person I can give one of these books away to, and hopefully in the process get you some more of those reviews to kick off that recommendation project. Great. And I really appreciate you mentioning me on your great podcast. Oh, my pleasure. Let's tell the listeners a bit about how they can participate. What does one have to do to get entered to win this book? Okay. You can enter the contest by signing up for an account on DataSciGuide.com; you just click "get account" in the menu at the top. Then you can click on "all content," or you can search for a book or course or other content you would like to review. You rate the content, write a short review, and then submit it. If it's your first review, I'll need to approve it in order for it to show up on the website. Once it's visible, you can click the tweet link. I'm going to add a special link to the site just for this contest that includes a pre-written tweet with the #winDSbook hashtag. And then you're entered into the drawing once you've got the tweet and the review. We'll start this off as soon as this episode goes live, and it'll run for two weeks, so please get your reviews done by November 20th. Unfortunately, I'm going to have to limit the winners to US residents only, or at least people who have shipping addresses in the US, so that I know I can handle that and get the book to you. Maybe next round we can do something for international listeners as well. So I'll be looking for that hashtag-- that's hashtag #winDSbook-- and there'll be some details in the show notes if you want clarity on that. At the time the contest ends, we'll pool all those people, use something like random.org to generate a random number, and pick the winner from there. And then I will DM them some details about how I can get the book to them. Okay. And I'll also put up a blog post that explains all of this. You can go to datasciguide.com and click on "updates" to find the blog that has these instructions. Thank you again, Renee, and best of luck to everyone who wants to participate. I look forward to reading all these reviews. Thanks, Kyle. Until next time, this is Kyle Polich for dataskeptic.com, reminding everyone to keep thinking skeptically of and with data. [Music]