The Data Skeptic Podcast is a weekly show featuring conversations about skepticism and critical thinking in data science. So welcome back to another episode of the Data Skeptic Podcast. I'm here this week with my guest Thomas Levi. Thanks for joining me, Thomas. Thanks for having me, I'm happy to be here. I thought maybe we could start with some of your academic and professional background if you wouldn't mind sharing. My original background is I've got a PhD in particle physics and string theory. I did my undergrad at Dartmouth. I did my PhD at the University of Pennsylvania. I did a fellowship at the Kavli Institute for Theoretical Physics during that, for about six months. I did a postdoc at NYU and then a second postdoc at UBC, which for American people is the University of British Columbia. So that was about five years for the PhD, six years for the postdocs. I don't know, I wrote 15, 20 papers, was an actual string theorist, and then about three years ago switched into industry. And now I'm the senior data scientist for the dating site Plenty of Fish. Excellent. And in a way, I'm disappointed I didn't have you on to talk about physics, because it sounds like you've got a lot of knowledge to share, but maybe another day. Our topic for today is the work you're doing over there at Plenty of Fish. To my understanding, you guys are one of the leading online dating sites. But for any listener who might not be familiar with your company, can you tell me a little bit about Plenty of Fish and how you guys are different from other sites? Absolutely. So we're actually, by a lot of measures, the world's largest online dating site. We're free, and dating sites normally break up into free and paid. And the main sort of separation is, can you actually interact and send messages with other users without paying a subscription fee? So we're free. We're the largest free site. We get somewhere between three and a half to four million daily unique logins. At any given time, there's about 26 million people registered. And I think our highest user ID, I think we just passed the sort of 90 million mark. Now, that's not, you know, we don't have 90 million active at any given time, but we've had 90 something million sign up since the site began. It's been around for about 10 years, and we're number one in the US, number one in Canada, number one in the UK. I think we're number one in Australia. So there's tons of people. To give you an idea of the volume, we're processing between 25 and 30,000 messages a minute. Wow, that's huge. Exchanged, yeah. So it's quite a bit of volume. So what are some of the data science problems you've been able to work on and are currently solving? So I've done a bunch of things. The first sort of projects I did were redesigning the sort of fraud and fake profile detection for the site. There were a couple systems in place that were working well, but they required a lot of human intervention and things like that. And it was getting to be, at the scale we're at, quite difficult. So myself and one other person designed basically an AI system to catch general scammers. And that's a self-learning system. So basically scammers don't want to get caught, so they try to change up their tactics and things like that as new scams keep coming in and so on, and we look into that and basically the system learns new patterns and tries to catch them. And then I developed another one to effectively hunt botnets and things like that. Cool. Other things I've done, I've looked at some of the matching algorithms.
I've done some studies on user behavior. So for example, you know, someone's first few hours on the site or first day on the site, what are sort of key actions that they take that predict, like maybe even two weeks down the line, whether they're going to be active members or delete their profile. So that's very important for us for, you know, things like onboarding and changing up the flow and understanding what our users are using. I've also been developing a new thing which, depending on when this podcast actually goes live, may be a feature that's released, which allows for basically free text contextual search for things people are interested in and giving you matches. So if you're interested, like me, in things like skiing and video games and craft beer and stuff like that, you can type that in and it will basically find you relevant matches in your area. In this case, it would find you people into, like, outdoor sports, you know, video games and comics and things like that, and foodie, craft beer, wine and so on. And it does that all kind of actually automatically. So we can talk a little bit about that in more detail if you're interested. Yeah, yeah, absolutely. I'd love to know a little bit more about it. Basically on people's profiles, you have sort of a few different areas where you give, you know, very basic metadata, you know, your age, your profession and things like that. And then we allow you to write, like, you know, a short essay about yourself, which we call a description, but there's another area where we allow you to say what you are interested in. And at the moment, that is basically completely free text entry. Some people will have one or two interests and some people will have like a hundred and something, depending on what they like to write about. And I wanted to use that for matching, and my original motivation for this is, I'm not single now, but when I was single, it was quite difficult for me to find, for example, like the nerdy girl who was into skiing and outdoor sports. It's not that easy to just sort of filter that on a site. On ours and on other sites, a lot of sites will allow what's called keyword searching. You know, if I type in skiing, it'll basically just show me every profile where someone's written down skiing, but it won't show me someone that's into snowboarding or climbing, or in some cases, if I wrote skiing, it won't even show me people that put ski, depending on if it stems or not. So I kind of wanted to actually see if this was a solvable problem. So to give you an idea of the sort of problems you have to tackle with this, let's think about like biking, right? How many ways can people write that they're interested in biking? So there's biking, there's bicycling, there's cycling, there's mountain biking, there's road biking, there's variations of those and there's misspellings. And really you want to group all those things together. And it sounds very easy when, you know, I say this now, and it's like, well, why can't I just by hand group things together and match? But when you're talking about, you know, 20, 25, 30 million profiles with people writing all the different kinds of things, it rapidly gets out of hand. Like, for example, who's going to sit there and list every action movie and group that into the right thing, you know, by hand? We just don't have the capacity for doing that. So you want a way to do it automatically. So that's one of the problems. The other problem you deal with is the idea that people are not any one thing.
So if I just say I'm into skiing, that's not really a great description of me, because I'm into skiing and a bunch of different things. And you want something that captures, it's called a mixture model, the idea that people can be a mix of different things. So you might be a little bit into skiing and, you know, a little bit into foodie stuff and a little bit into, you know, reading novels and things like that. And you want to get all of that sort of captured. It took a few months, but I basically hit on a couple of ways to do this. And it's hopefully going to be going live in the next few weeks, actually. We're just building the front end for it right now. So hopefully I won't try and cross any, you know, competitive secret lines or anything, but I'd like to ask you a few follow-up questions about that, because it's a really interesting topic. You had mentioned stemming, so for anyone who's just getting into data science, could we share a description of that? Sure. And that's not going to cross any lines, so I'm good to go there. So stemming is basically, if you think of words like, let's think of the word party, for example. So I can write party, parties, and various examples of that, or movie, movies. What stemming does is it gets to what's called the root of the word. And there's various algorithms for this. I use something called a Porter stemmer, which is standard. You know, it's so standard at this point that I don't even think I can describe the algorithm, because it's something I read about three years ago and now I just use. Basically it will reduce, for example, movies and movie and so on to M-O-V-I. So similarly, like skiing, ski, skier, all that kind of stuff will get reduced down to ski. And the idea of stemming is basically there's lots of times when you have words where, if I say I'm interested in movies and somebody else says, I like to go watch a movie, that is effectively the same thing and you want to get at the root word. Now, there are times when you don't want to stem. For example, for this, I didn't assume a priori that I should use stemming, and I actually built models where I stemmed, where I didn't stem, and I tweaked some of those things a little bit, basically to see which ones gave me the best results. So it's a relatively standard thing in NLP, or natural language processing, when you want to go to like a root of the word. And you had mentioned also the comparison, I think it was like cycling versus bicycle riding, a stemmer won't match those two. So how do you kind of get that semantic connection to link? So now you're into sort of more complicated stuff. So there's lots of ways that you can do this. One way is you can look for co-occurring words. This is not what I actually use, but you can actually do this. So for example, if somebody writes cycling and climbing on a profile and somebody else writes biking and climbing on a profile, and you see lots of commonalities across that, then you can kind of say, well, I wonder if biking is like cycling and so on. That's one way to do it. At this scale, that's still not very effective, because you suffer from what's called the curse of dimensionality. There's just so many words that people can write that trying to look for co-occurrence and saying, well, this person wrote 45 words, 12 of them match this other person's set and maybe these other 33 shouldn't match, it gets very complicated very fast and it's not really effective.
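As a quick aside on the stemming step described above, here is a minimal R sketch using the SnowballC package's Porter stemmer. The interest words are just illustrative examples, not anything from the actual system.

```r
# Minimal stemming sketch (illustrative only): collapse inflected interest
# words down to their Porter-stemmed roots. Requires the SnowballC package.
library(SnowballC)

interests <- c("skiing", "ski", "movies", "movie", "parties", "party")

# wordStem() applies the Porter (Snowball English) stemmer to each word.
stems <- wordStem(interests, language = "porter")

data.frame(word = interests, stem = stems)
# skiing/ski     -> "ski"
# movies/movie   -> "movi"
# parties/party  -> "parti"
```

Stemming only collapses variants of the same word, though; linking lexically different terms like cycling and biking takes something more, which is where the approaches described next come in.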
There are various sort of online dictionaries that have groupings of these things, or try to map that, that I guess somebody has done by hand or something like that, so you could use that. What I actually did is a technique called topic modeling, and the idea of that is you're trying to find the latent topics in what's called a corpus. So a corpus is just a set of documents. So this technique was originally developed by David Blei, Michael Jordan, and Andrew Ng, I want to say. And to say it was originally developed by them is not really right; they wrote the original paper, but there was a bunch of work trying to make latent semantic analysis fuzzier and things like that, people who did LSA, fuzzy LSA and so on, that they sort of built upon. And the idea is you're trying to find these latent topics. So I think the original example was they had a bunch of journal articles. So if you take a bunch of articles in, like, Science or something like that and you want to say, well, you know, some of them are clearly going to be about physics, some are chemistry, some are neuroscience and so on, but even within those topics there's a lot of mixes of them. So for example, actually my PhD advisor, who's a string theorist, also has a lab in computational neuroscience and information theory. So a lot of the papers he writes have, even though they're in neuroscience, a lot of hints of physics and higher level mathematics and so on. So when you're categorizing that article, it might be like 70% neuroscience, 20% you know physics, 10% information theory or something like that. And you want to basically not only discover what the mix of the documents is, but you want to discover what the topics are themselves. So there are various ways of doing this. So for example, there's things like latent Dirichlet allocation and NMF, or non-negative matrix factorization, and so on, that fall under the general rubric of topic modeling. I basically wanted to use this for our users. So in other words, people write all their interests in and I can treat that as a document in a sense, and basically every person is going to be a mix of these things, and I'm going to let the users define what the topics are, and that's sort of what topic modeling does for me. So for example, if everybody who writes that they're interested in snowboarding also writes that they're interested in puppies, there would be a topic that has snowboarding and puppies very high in it, like the first two words in that topic would be snowboarding and puppies. Now that's not actually what happens, right? People that write snowboarding tend to write climbing and hiking and things like that. And so that becomes a topic that gets centered around action sports or outdoor sports and so on. I should mention these things don't actually give you the names of the topics. You just get topic five, topic seven and so on. But by looking at the words, you can very quickly sort of see what at least some of these topics are. So you get ones around team sports, ones around outdoor sports and so on, but each person is a mix of that. And to me, the reason I like this is that I'm not pre-biasing the results in any way. So I'm not saying that snowboarding and skiing and things like that should go together. I'm letting the users' own profiles basically define what goes together. And then I'm using that very information to then match people. So it's very sort of organic, I guess, is the way I would say it.
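To make the topic-modeling idea concrete, here is a toy sketch of fitting LDA in R with the tm and topicmodels packages. The profile texts and the choice of three topics are made up for illustration; this is not the production system.

```r
# Toy LDA sketch on "interests" text (made-up profiles, not real data).
# Requires the tm and topicmodels packages.
library(tm)
library(topicmodels)

profiles <- c("skiing snowboarding hiking climbing",
              "craft beer wine foodie restaurants",
              "video games comics movies",
              "snowboarding climbing camping hiking",
              "wine cooking foodie travel",
              "comics video games anime")

# One row per profile, one column per term.
corpus <- VCorpus(VectorSource(profiles))
dtm <- DocumentTermMatrix(corpus)

# Fit LDA with k latent topics; k is a knob you have to choose and tune.
lda_fit <- LDA(dtm, k = 3, control = list(seed = 1234))

# Top words per topic. The topics come back unnamed (topic 1, topic 2, ...),
# but the high-weight words usually make clear what each one is about.
terms(lda_fit, 4)

# Each profile is a mixture over topics, e.g. mostly "outdoors" with a bit
# of "foodie"; this per-user mixture is what you would match on.
posterior(lda_fit)$topics
```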
I would guess also that since it's data driven, you don't have any sort of ontology you're trying to superimpose on it, that it's also pretty robust to new things. So when the hottest new musician who hasn't even come out yet starts showing up in profiles, that'll kind of bubble to the top. There's a couple of cool things in this. So one is I can actually build this in languages that I don't speak. So I basically only speak English, I have a horrible understanding of French and that's about it. But if I wanted to build a model in German, there's nothing stopping me from actually building it and using it to match people. Now I would look at the model and not really be able to tell you what most of the things are, but it would completely, robustly work. Now for something new coming out, you still have to update the model. And there's a couple options for that. You can do online updating, which would show that bubbling to the top. In my experience at this point, what I've found is, in terms of scaling it out and making it faster, it's not as worth it to online update the model as to just batch update, like every few weeks or so, and just replace it with a new model. The reason for that is, one, actually computing a new model is quite intensive. That's not a fast step. That takes several hours, although in these days of data, several hours is actually quite fast. It's not easy. The other thing is that not every model that you compute ends up being great. Some of it is you're searching through parameter space. So how many topics do you use? That's one of the things you have to input. Another thing is, because I'm using latent Dirichlet allocation right now, there's an integral that basically can't be computed exactly. So you have to numerically approximate that. There's various approximation schemes for that, like do you, for example, use Markov chain Monte Carlo, do you use some kind of variational algorithm like expectation maximization, things like that; those change things up a bit. Also, what words do you filter out of the vocabulary? To me, I wanted to actually build the model and test it out a little bit and kind of inspect it before I actually go live on users. The other thing is whatever I build, it has to be able to run fast enough that people can search in basically real time. When people change their profiles, it has to update on their profiles in real time. If I went and launched my own dating site tomorrow, and my strategy was I'm just going to randomly sample two individuals out of my database and present that as a suggestion, that's probably going to be pretty weak in making successful matches. And I'd imagine all these techniques we've been talking about go into helping you guys match a little better. But it opens the question, how do you know the degree to which you're being successful? What's your metric in evaluating whether or not the matches you present to users are being appreciated by users? So this is actually, it's quite a deep and involved question. There's actually kind of a bunch of sub-questions that you're getting at here that I'm actually currently struggling with. The most obvious way is when people leave the site, we have an exit survey. So if you're deleting your profile, it asks, why are you deleting your profile? One of the options is you found someone here. And if you did, we ask you, if you can, to tell us what the username is. So we've actually built up quite a database of couples that have formed on the site.
So we can look, either in training or in practice, at whether these various matching algorithms we're trying and so on, whether they're doing better at either predicting the couples or, once they're in practice, whether they're actually leading to more relationships and so on. The tricky thing about that is people that answer an exit survey, and even people that actually delete their profiles as opposed to going inactive, are already sort of a biased subset. So there's an issue there. And the second bit is that there's a large number of things that happen between us presenting the user a possible match and them actually going on a few dates and leaving the site. So there's not exactly a direct causal link that you can make. I mean, there's a direct causal link, but there's a lot of steps in that chain. So you might want to think about proxies for this, and even whether you're optimizing over the right thing. For example, I created another matching algorithm because, basically, when we looked at onboarding, one thing you might imagine is that when people get into a positive conversation initially, that tends to make them much more active on the site even down the line. In that case, I designed a matching algorithm where the criterion for success and failure was a reciprocal reply to a message. So it tries to predict, if I'm user A and I send a message to user B, will that person want to reply to that message, based on our profiles and things like that? So you can optimize for that. You can optimize instead for not just a reply, but a conversation of a certain depth. Maybe you want, you know, three, four, five pings back and forth kind of thing. So a more significant interaction. These are things that we're actually struggling with right now. There's a separate problem here, which is what's your criterion for success? So if I'm comparing two matching algorithms, how do I determine that this one's performing better than the other one? So one thing I can look at is overall relationships and so on. So I could split test and then wait two, three weeks and say, well, this one created this many relationships, this one created this many. That's definitely a measure. It's tricky because it's not really a conversion based thing. You don't know if it's just that more people are being exposed to one or the other. You need, to me, some kind of denominator there or something like that. Something you can look at instead of just the relationships is, for example, a matching algorithm, let's say it shows 50 profiles, right? So you go to one of our matching pages, we show you 50 photos and descriptions and so on. You can click on as many as you like. And if you like someone, you can send a message. So I can represent that as a funnel, right? So I present 50 people. It's now basically a multinomial trial. So you can open up none. You can open up 12. You can open up two, whatever it is. So it's how many successes do I get in that 50? And then of those that get opened up, how many of them result in a message? How many of them result in a reply, and then how many of them result in a conversation? And you can track the funnel down, and at each part in the funnel, you can basically test the algorithms off each other and you can decide what you want to optimize for. That requires actually quite a bit of tracking on those pages and so on.
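To make that per-stage comparison concrete, here is a minimal R sketch with invented counts (not real POF numbers) that tests two algorithms against each other at one stage of the funnel with a simple two-proportion test.

```r
# Hypothetical funnel counts for two matching algorithms (invented numbers).
# Stage shown here: of the profiles presented, how many got opened?
shown  <- c(A = 50000, B = 50000)   # profiles presented by each algorithm
opened <- c(A = 6100,  B = 6550)    # of those, profiles that were clicked open

# Two-proportion test: is B's open rate genuinely higher than A's, or is the
# difference consistent with noise? The same pattern repeats at every stage
# of the funnel (opened -> messaged -> replied -> conversation), each stage
# using the previous one as its denominator.
prop.test(x = opened, n = shown)

# Per-stage conversion rates, for eyeballing the funnel.
opened / shown
```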
So that's actually something we're building out right now, the ability to fully track and test all of that, so we can fully sort of test one algorithm off the other for the full funnel. Yeah. With respect to that tracking, and given the really large volume of data that we talked about earlier, I'm sure there's a lot of pipeline challenges here in how you amass, store, archive and are able to aggregate all that data. Do you find the algorithmic approaches to be the real biggest challenge in your job, or is it developing that scaling hardware pipeline? It's almost equal. So, like, I should caution and say, I'm not the person who's doing the hardware-side pipe stuff. I basically in a lot of ways act somewhat as a client for that, although I interact with the team and work with them quite a bit. What I do there is I try to come up with questions that we want to answer, things we want to test, studies that we want to do. And then we sit down and say, okay, well, what's all the data that we need to do this? Do we already have it, and if we don't, what do we need to, you know, add tracking for? You know, we're not just a website. We're actually shifting quite a bit more towards mobile. In the last few years, the dating industry, like online dating, but also sort of social media, all that kind of stuff, has shifted drastically towards mobile. And that introduces a whole other layer of complication, because if you want to, for example, do testing on Android and iPhone and things like that, native testing is a lot more complicated. You have to deal with release schedules, app store approval processes and all that. And we're actually doing that. So, like, we're running native tests on Android, in app. I work with them. We define those things. And then we basically figure out what we need to do this. So we're actually in the somewhat early days of trying to build out this whole pipe. At POF, we've very much been on the sort of relational database model. We've actually pushed that very, very far, in fact much further than I think possibly most other companies. That said, we're definitely kind of bursting at the seams and breaking there. So we're switching over to a more distributed model and we're investigating; so we're building out a Cassandra cluster. When you move to a lot of these things, basically there's no great one single answer. And so you kind of have to, you try stuff, you break it, you try something else, you break it, you eventually come to a compromise solution. So there's a lot of work. There's actually an entire team that's called data ops. It's about four or five people and we're hiring more for that. So if people who are listening to this podcast are interested, feel free to get in touch with me or us, because we're always looking for people that are interested in this sort of stuff. And for that whole team, that's basically it: that and the split testing platform is their entire job. And that's all they do. So there's a tremendous amount of work there that I work with them quite a bit on. That said, there's also a lot of stuff on the actual data science side, data mining, cleaning data, all of that stuff. So it's really tough for me to say that, like, you know, one requires more work than the other. And I really want to emphasize that it's a team effort. Yeah. It's not me working in a vacuum. I'm the only data scientist, but there's a research team, there's other research developers. So I work closely with them.
And then the data ops team, some of whom float between what some people would call a data architect, data engineer or data scientist. You know, some of these are people with PhDs in computer science or image analysis, who have a lot of experience with machine learning and data analysis and stuff like this. So it's very much a team effort there. So I would say it's probably equal, and the problems are really difficult at this scale. Like, once you really get to the point where things can't just be in a nice SQL database where you can just, you know, join as many tables as you want, you really have to worry about these things. It really creates a lot more issues and problems. And to me, it's actually really quite interesting, because you really have to think hard about what you want to ask, what data you need and how you can actually get it. And that's a very interesting problem in its own right. Absolutely. When I look at the challenges you guys are facing trying to make good matches with users, it's sort of an optimization problem. And every optimization problem has this objective function you're trying to solve for. So I think if someone wanted to be really, really cynical, they might claim, well, the objective function you guys really have is you want to keep people on the site as long as possible, right? Keep subscriptions going and keep ad revenue going, and that you actually don't want to make a successful match. I can't believe that's actually what happens in practice. So what's the reality of how you guys are looking at solving that problem? Yeah. So I would say a few things about that. One, there's other dating sites out there and so on, and so if we were giving poor matches consistently, then that would probably drive people to another site. We want to be known as the dating site that gives the best matches possible. That means that every time someone's out with their friends and they're saying, oh, I just broke up with my boyfriend or girlfriend or partner and I'm newly single, someone's going to go, oh, I met my husband on Plenty of Fish and it's a really great site and you should go there. That kind of word of mouth advertising and sort of viral spread for us, I think, is much more valuable than trying to squeeze three more days out of a person on the site. The other thing I'd say is it's quite difficult to actually match people. So even if you're trying your very best, and believe me, we are trying our very best, we've committed a lot of money, hardware and very smart people to do this, it's still a very, very difficult problem. No matter how well you're doing or how hard you're trying, I'd love to say that you sign up for the site and in the first five minutes I give you one person and this is going to be your soulmate for life, but that's not really very likely to happen. I mean, it does occasionally happen. We'll get people that say, oh, this is the first person that messaged me and we're getting married and so on, but the reality is it's quite difficult. So no, it's very much the things we talked about earlier: should I drive towards a positive conversation, or an excellent answer on the exit survey, or something else? But those are all still trying to get at proxies for positive interactions and eventually finding the right partner and relationship. Absolutely. I'd love to delve a little bit deeper into why matching is such a hard problem. I hate to rely on an old adage, but like the opposites attract concept.
I think there's a certain amount of, when you see a happy couple, sometimes I see two people that are together and I say, oh, this is obvious. Of course they belong together. And other times I look at a couple and I say, I don't know why they got together, but I'm so glad that they are. It just works, but from the outside it almost seems irrational. So maybe there's a certain percentage of successful matches that are just pure randomness. What's your perception on how solvable the matching problem is? I think in some cases you're dealing with incomplete information, right? So we don't have every single little bit about people, right? We have what they enter, and we try to do our best with that. That said, there's a lot of things that even in your description are implicit that people don't realize. So even with your opposites attract kind of thing, there's a lot of people that you wouldn't really even consider as potential partners. For example, the average person that's very highly educated, that doesn't smoke and doesn't do drugs, let's say, they're very unlikely to even consider someone with a high school education who is a chain smoker. So a lot of what you can do, and what gives you a very good initial lift in a lot of these optimization things, is weeding out people that we know you're very, very unlikely to be interested in. Now, of course, there's always outliers, right? But what we're trying to do is give the best experience and the best sort of optimization for the middle, the bulk of the set. So you can actually get quite far with that. So it's something that's not immediately apparent to the average user on the site. They go and they see some photos and some profiles and so on. But behind the scenes, there's actually quite a lot. So for example, if you don't smoke, you're very unlikely to be shown smokers on the site. You'll occasionally see them, but you're much less likely to. No matter where you are on the site, you're less likely to see that. Those things actually get you quite far. So if there's 3,000 potential matches in your pool, like let's say in your area, vaguely the right age, if I can eliminate 1,500 of them off the bat, because I know that you're very unlikely to consider them, I've already made the search much more efficient. Then you can get at what most people actually match on. And it's not any one thing. So the model that we actually use for matching is actually quite non-linear, like we run neural nets and things like that. So it's actually looking for all those combinations. So I think it's solvable in the sense that, to me, what I'm trying to do is I'm trying to get you to that first date and that first interaction, and I want that to be as positive as possible. I don't think that I can predict your soulmate just based on demographic data and some words in a profile; I think that might not be a completely solvable problem. What I can do is make it much more likely that in a given time frame, you're likely to meet someone you have the potential to actually have a positive relationship with. That I think is a very solvable problem. And I think when you phrase it that way, which is to me the way a data scientist tends to think, which is I'm trying to optimize and increase the probability of something happening, I think when you phrase it that way, it's very solvable. So all I'm trying to do is increase the probability that you'll find your positive relationship or your partner as quickly as possible in a given time frame. That makes a lot of sense.
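As a toy illustration of that pre-filtering step (the columns, thresholds and rules here are hypothetical, not POF's actual criteria), the first pass just removes candidates a user is very unlikely to consider, and only the surviving pool gets scored by the nonlinear matching model.

```r
# Toy hard-constraint pre-filter before model scoring (hypothetical columns
# and rules, not the real criteria).
candidates <- data.frame(
  user_id = 1:6,
  age     = c(24, 29, 41, 33, 27, 58),
  smoker  = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  dist_km = c(12, 8, 300, 25, 40, 15)
)

# Implicit constraints for a hypothetical searcher: non-smoker, late 20s to
# early 30s, reasonably local.
pool <- subset(candidates, !smoker & age >= 22 & age <= 35 & dist_km <= 50)

# Only this reduced pool would be passed on to the matching model,
# which is where the actual ranking happens.
pool
```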
So you'd mentioned having incomplete data, which can make your job a little bit more difficult. I would also speculate that you're getting a certain amount of questionable data as well. So I might fill out my profile and, you know, say I'm just a notch more successful than I am, say I'm 20 pounds lighter than I am, maybe use an old picture, and whether I'm intentionally trying to manipulate the system or doing it unintentionally, portray an image of myself that isn't quite reality. In some sense, you get garbage in, garbage out for any regression or whatever process you might want to look at. But how difficult of a problem is bad data for your work? So it definitely comes up. So for example, men tend to lie about height and income. Women tend to lie about age and weight or body type. You can look at a population distribution on our site, and we've done this, and there is a giant spike for women at age 29 and for men at six feet tall. And why is that? Because every guy who's like 5'10" and 5'11" is saying they're six feet, because they think, and it's true, that a lot of women when they do a search will say I want at least six feet. And for men, they'll say I want her to be under 30. And so you can see it's a nice little bell curve that follows a normal population, and then there's like a giant spike and then a giant dip. And so you can look at this stuff. Now, the problem is there's not really a great way to adjust for that, because there's a lot of people that are 29, or say they're 29 who are actually 29, and there's a lot of people that are six feet, or say they're six feet and are actually six feet or six one or something like that. So it's not very easy to say, well, you know, these 30% of people, I'm just going to mark down to 5'11", because they might be the wrong 30%. In that case, you just have to sort of try your best with it. Now there are other cases where you can filter out outliers. So for example, there's like a default date, I think, for your birthday, like on all sites, right? And when we start looking at how birth month and things like that might influence people's behavior, you will filter out a lot of people that say, for example, they're born on January 1st, because if that's the default date, then, you know, a lot of those aren't real. You still have all the other people born in January who say they're January 2nd and so on, so you're taking out a much smaller outlier set, and in this case, you're not really distorting or breaking the rest of the population dynamics. But when you're talking about the age of 29 or the height of 6' or a certain income level, there's not really a great way to filter around that. And so unfortunately, yeah, that stuff does sort of pollute things a bit. But, you know, when you're talking about having a whole bunch of attributes of which age, for example, is only one and height is only another, the hope is that that one bad data point will mostly wash out. So one thing you can do is you can look at your models, your optimizations, and you can just see if there's any weird spikes in them. So if, for example, it's doing phenomenally only on people who are exactly 6' tall, right? That's a problem with your model, and that's like, okay, we need to move away from that and maybe drop height as a possible attribute, or down-weight it, or look at using a different algorithm or something like that.
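Here is a small sketch of that kind of sanity check, using simulated numbers rather than real data: tabulate a self-reported field to spot an implausible spike, and drop rows sitting on an obvious default value (a hypothetical January 1st default birthday) before doing a birth-month analysis.

```r
# Simulated self-reported ages (not real data): a smooth distribution plus an
# artificial pile-up at 29, like the spike described above.
set.seed(42)
ages <- c(round(rnorm(9000, mean = 32, sd = 6)), rep(29, 1200))

# Eyeball the counts around the suspect value; one age sticking far out of an
# otherwise smooth curve is the tell-tale sign of strategic reporting.
table(ages)[as.character(27:31)]

# For birth-month analyses, drop rows sitting on the signup form's default
# date (hypothetically January 1st) instead of trying to "correct" them.
birthdays <- as.Date(c("1985-01-01", "1985-01-02", "1990-07-14", "1988-01-01"))
is_default <- format(birthdays, "%m-%d") == "01-01"
clean_birthdays <- birthdays[!is_default]
clean_birthdays
```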
So we do do a lot of stuff like that, right, where you'll post-inspect it: you'll run it on, you know, a training set or a validation set and make sure not just that you're getting good sensitivity and specificity, but you'll actually, or I'll at least, you know, pull up a hundred or a thousand examples and just kind of eyeball them and see if they're making good sense to me. And if they're not, or I see weird spikes, then I'm going to go back to the drawing board and say, why is this happening? Was this a mistake in my preparation? Did I not clean the data properly? And do I need to revisit that? So I'm not sure if that really answers your question, but sort of the answer is really you do the best with what you have. Yeah, absolutely. So I'm curious if you've seen any surprising sort of correlations that have been useful in your matching efforts? Like earlier, we talked about how you can detect the topic of skiing and maybe detect that mountain biking is correlated, and that people that tend to like these things are potentially good matches. You know, while I'm sure you have the data to back it up, it's not surprising. It's, I don't want to call it obvious, but it's intuitive. Yeah, I'm wondering if you've encountered anything that has a high predictive power that wasn't intuitive, like perhaps left-handed people really love other users that enjoy The Office, the TV show, or something like that. No, not to that level, but I think one of the things I remember when I first started, and I'm coming from a very, very rigorous academic science background, was when I first sat down, the line I was given was basically astrology actually does matter. And you can actually show this. And I remember I sat down and I was like, "Oh, Jesus, what am I doing in this job? Astrology matters and this is data science?" and so on. But it's actually quite interesting. So there's two reasons why astrology matters. And the answer is it's not really astrology. So one is astrology matters because of something called confirmation bias, right? Where if you get enough people that think if they're a Gemini, they should match with a, I don't know, whoever Gemini matches with, let's say Virgo, and there's probably someone listening to this that's tearing me to pieces right now, then of course they're going to seek out those matches. And because we actually display a horoscope on the site, because people like it, they find it interesting, they search specifically for those matches or only reply to those, and then you're going to get a lift there. But to me, the more interesting one is it's actually correlated with birth month, and it's different in different countries. And you might say, why is that the case? And it has to do mostly with, for example, when does the school year start, or what's the age cutoff for a school year? And why does that occur? Because if you're one of the oldest people in your class, then on average, you're going to be a little bit taller, a little bit bigger, and probably a little bit smarter. Like you will have been reading for a little bit longer, you will have done some arithmetic a little bit more. That means a few things. One, you're treated differently by your peers. So there's been a lot of research showing that people that are taller or more attractive or things like that, for men, and I can't remember what the attributes are for women, but they're similar things, people behave towards them a little bit differently. Like taller, you tend to be a little bit more alpha, things like that.
Also, you'll get a little bit more attention. So there was actually an interesting study done, I believe, on NHL players, because also most NHL players come from a set of months. Like if you ask when their birthdays are, it's clustered towards a set of months, and I forget when it is. And it's for a very similar reason, in that when they played like junior hockey, they were a little bit bigger and a little bit faster on average. That meant that they got a little bit more coaching attention initially, because the coaches just unconsciously tend to focus on the better players. So then they got a little bit better, and then it kind of exponentiates and snowballs. And so you end up with all these people, and that's why they're all around a certain birth month in the NHL. Now, of course, it's not every single person, but statistically speaking, they're clustered around that. And there's a peak there. And for us, the astrology birth month thing is actually quite similar. And I can't remember exactly what it is. It's not like a one to one thing. What you can do is, for women and men, you can sort of draw a correlation matrix. And you can say, for people that left the site successfully in a relationship, if someone was born in January, what's their correlation, positive or negative, compared to the set with all the other months. And you can look at how that changes in time, obviously, and it's different for different countries and different age demographics, but it's actually quite interesting. And I remember it being very surprising, because I would have said initially, birth month, who cares? I'm born in November and somebody else is born in January, why would that ever matter? But it's actually interesting. It's a very sort of latent, implicit sort of thing, in how it factors into our behavior, which I find very interesting. It's one of the things I find most interesting, not just about data science, but data science at a site like Plenty of Fish: it's a lot of data about actual human behavior and interaction and so on. And it's kind of an interesting window into that. And I think the sort of birth month correlation thing was one of the first examples of that for me that I found very surprising and very interesting. Yeah. I actually want to go back to your background and ask a little bit about transitioning from physics to data science, because I'm myself not a physicist, but maybe an armchair physicist. I've kind of always admired physics in how it tends to pursue the simplest of models. And I see physicists don't like extra terms creeping in or constants that don't need to be there. And there's also the benefit of, it might be hard to measure, but you do have access to ground truth, being what reality will do. And data science is not always as clean, or ground truth as easy to access. There's a lot more fuzziness. Was that a major change you were expecting? I mean, it certainly was a change I was expecting. I would even say the bigger change is going from academia to industry, and it being a business oriented industry. In that, in academia, it's always about: if it's going to take you six months to do this calculation, and that's the right calculation to do, then that's what you're going to do. And it's about producing a scientifically interesting result. In industry, it's much more about producing a useful result for the business.
And there's a lot of compromises that you have to make, in that it might be great if you can spend a year finding the absolute best predictive model for something. But if in six weeks you can get to something that's 80 to 90% of that, then probably that's where you want to focus your time. Because there's a million other things that need to be done, and the business is not going to wait. I can't go and say, look guys, it's going to take me two years to finish this project. So that's definitely a big change. Now, for the data and stuff, physics is messier than people think it is. It's true, I was in string theory, which is sort of the most abstract, sort of like clean, not clean, but I guess, yeah, clean is the right word, sort of thing. But there's a reason why you look at these simplest-model type of things. Like, for example, I did a lot of work on black holes, not on astrophysical black holes that we actually observe, but on the theory and fundamentals of what actually goes on inside a black hole: how do you resolve quantum mechanics and gravity inside one? Is there a singularity? And so on. You oftentimes work with, I don't want to say toy models, but examples of black holes that don't exist in nature, because they're almost too symmetric and too clean and so on. And the reason we work with those is because those are the ones that we can do computations for. So if I was to say, oh, I want to do this exact computation, but for an actual black hole that we've observed, I would get two lines in and be like, well, this is now not a solvable equation. And it's not like I can't do this in a day; it's just not tractable. Instead of giving up, what you do is you try to find examples where you've cleaned it up enough, but you're still preserving the fundamental nature of the problem and the system that you want to study. And I think it's actually quite similar in data science, in that, you know, I'll have 300 possible features, some of which are messy and so on. And the first thing I'm going to do is clean the data: eliminate missing values, weird outliers, you know, for two weeks this field meant something else on the site, get rid of those people, that kind of thing, map it to the right thing. And then say, okay, well, I have these 300 possible features, you know, that's way too much to actually build a model off of, because of the curse of dimensionality and overfitting and all that kind of stuff. So now I'm going to try to get at what are the essential pieces that I have to feed into the model, right? And then from there, what are the essential bits and things that I want to use to, sorry, to select the algorithm: I don't want to pick an algorithm that's very sensitive to outliers for a set that has a lot of outliers that are important. On the other hand, maybe I want to get rid of the outliers and use that algorithm, and so on. And that's very, very similar to decisions that I made when I was in physics and string theory, because most computations you can't do exactly. So you're always trying to do approximations, things like that. So you'll do a Taylor series or Laurent series or some more esoteric computations and so on. Am I getting at the essential bits of the problem, such that I believe this result and I believe it will generalize? And that's a very, very similar thing to building a model in data science. You know, I'm going to select only these 10 features and I'm going to build a model on that.
And my hope is that that will generalize enough to whatever actual problem I'm trying to solve. It's very similar to the matching thing, right? Even if I had all the information we have, it's still incomplete. I don't know, for example, you know, people's speech patterns, body language and stuff. There's no way to capture that on a site or an app, at least not easily, not yet. What can I do without that, that I could hope will sort of generalize? Well, you know, if someone has a high education, high income level, and this type of job, they're likely to be very confident. And, you know, maybe that's a good enough approximation for that. So I think they're very, very similar in that sense. Yeah, absolutely. Algorithm selection was one of the last topics I wanted to touch on. You'd mentioned earlier doing a little bit of work with neural networks. I was wondering if you could highlight maybe a few of the other methodologies or algorithms that you guys have found most useful at Plenty of Fish. Oh God, I don't know. I mean, my thing with algorithm selection is I think people focus too much on it, and you get it two ways. There's some people that just want to know every algorithm out there. And there's some people that become very fascinated with one type of algorithm. There's a lot of buzz right now around deep learning and neural networks and so on. And they're very, very good for solving a class of problems. Yeah. So what I always try to get at is look at the type of problem, like what is it that you're trying to solve. And for me, it's also looking at what the constraints of the thing are. So for example, if you're going to put this thing in production, how fast does it have to operate? I think the classic example of this is, do you know what the Netflix Prize is? Yeah, of course. So they had this, I think it was like a five year long competition. And it was, can you beat our recommender algorithm for rating and recommending movies and so on? And there was a team that eventually did it. But I do not believe that is actually the algorithm that Netflix is using, because it's too slow in production. Yeah, you're exactly right. I think they also complained that the implementation might have had more complexity than they wanted to tackle. Yeah, exactly. And so that's a consideration. So, you know, for something like scam detection, we want to be able to catch these people as fast as possible, ideally like on sign up and profile creation and things like that, or even before profile creation. So if I'm like, well, this is going to take eight hours to run on a new user, it doesn't really have a lot of value. And there's always a balance there. So, you know, some things we've looked at, obviously neural nets of various types, you know, SVMs, even really simple models; I don't know, you can get really far sometimes with straight up regression, logistic regression, some more advanced regression models, like stepwise and lasso and things like that. I do a lot of my initial exploration either with regression or with things like decision trees. The reason I like those is they're very human interpretable. Yeah, I can look at a decision tree and I can see, oh, these are the most important features. This is what it's splitting on. This is a rule. And that oftentimes informs me for building a more predictive model. If I look at a neural net or an SVM, it spits out weights or a vector or something like that.
It's very hard for me, at least for me, to eyeball a vector that's, you know, 97 numbers, some of which are bigger and smaller than others, and be like, oh, you know, these people are matching because they have similar education level or something like that. Yeah. So, you know, then you can get to more advanced things. I've looked at various random forest, bagging and boosting models; like I've gotten far with boosted trees on some things. But again, it really depends on what the type of problem is. And boosted trees have worked really well for me on some things and not on others. And then in other cases, like for example, when I did that onboarding study where we were trying to find what are sort of key actions users take initially that predict whether they're going to have a good or bad experience on the site, in that case, absolute predictivity is not the most important thing. I'm not trying to identify the users. I'm trying to identify the features, and there decision trees are the way to go or something like that, because I can interpret them. And I can go to product and I can be like, guys, this is the most important aspect, right? People who do this are, you know, 90% more likely to stay on the site a week out. So we should get more people to be doing this on the site. And so for that, something as simple as a decision tree model was actually, not the most predictive, but the most useful, as opposed to a very fancy model. So I think there's a sort of temptation, and I'm like everybody when it comes to it: you always want to go to the sort of sexy, fancy, whatever the newest algorithm is. For a lot of cases, that's not necessary. Yeah, you know, if you're an engineering team at Google and you're trying to create hierarchical things for image recognition, right, you know, the idea of deep learning with these very deep networks is that first you're studying like macro properties of an image and then you get to smaller structures, and it learns how to actually categorize and cluster on those things. Yeah, that's super interesting and it's really, really powerful for that. But I've seen people be like, you know, I'm going to use deep learning to predict whether or not a user is going to click on this, you know, button. And it's like, I mean, yeah, you can do that, right? But you probably could have just done a logistic regression with five features and gotten like 90% of the way there in, you know, one tenth of the time. So, you know, it's always a balance, and I feel being a good data scientist in industry, you know, as opposed to being like a research data scientist in academia or Microsoft Research or something like that, or Google Research, is really about figuring out how you can solve a problem most efficiently and actually solve the problem within the constraints, right? Does it have to go in production? What's the speed it has to operate at? What scale does it have to operate on? Are you constrained to a specific language or architecture? There's certain things I do that, for example, based on the way our site is architected, have to be in SQL, at least some layers have to be in SQL. That means that I might not be able to do really fancy matrix operations for something. So I need to go another way. So that might close off a whole set of algorithms for that particular layer of that particular problem. A lot of things I've also done combine various algorithms.
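Circling back to that interpretable-first point, here is a rough sketch (simulated onboarding-style data with hypothetical feature names, not the real study) of the kind of small decision tree you can read off and take to a product meeting.

```r
# Toy sketch: an interpretable decision tree on simulated "first day" actions
# (hypothetical features, invented outcome). Requires the rpart package.
library(rpart)

set.seed(7)
n <- 2000
onboarding <- data.frame(
  msgs_sent_day1  = rpois(n, 2),
  profiles_viewed = rpois(n, 15),
  photo_uploaded  = rbinom(n, 1, 0.6)
)
# Simulated outcome (still active two weeks later), loosely driven by the
# early actions so the tree has real signal to find.
p <- plogis(-1 + 0.5 * onboarding$msgs_sent_day1 + 1.2 * onboarding$photo_uploaded)
onboarding$active_2wk <- factor(rbinom(n, 1, p), labels = c("churned", "active"))

tree_fit <- rpart(active_2wk ~ ., data = onboarding, method = "class")

# The printed splits and variable importance are the human-readable part:
# "photo upload and early messages are the big splits" is something you can
# act on, unlike a vector of neural-net weights.
print(tree_fit)
tree_fit$variable.importance
```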
So, you know, I've done things where I've combined, let's say, a hierarchical clustering step with some association rule learning and things like that, where you step it out, or you use bits of each of the algorithms, you combine them together, and sort of the whole is more than the sum of the parts kind of thing. Yeah, I think that's, all put together, some of the best takeaways from the whole episode and just great advice for practicing data science in general. Earlier, you'd mentioned you guys were hiring, but I don't think we've yet said where you guys are located or any of the details. Could you share any of that? Oh, I should say that. So we're located in Vancouver, British Columbia. It's our only real office. We've got data centers and stuff like that, but almost everyone works in Vancouver. We're a shop of, on any given day, somewhere between 60 and 80 people. I think we're around 70 right now, just the normal sort of ebb and flow in tech of: you hire a bunch of people for a project, some people leave, some people transition and so on. And yeah, I mean, we're always looking across the board for really smart people, even outside of the data science realm. So yeah, we've got a website, we've got an iPad app, an iPhone app, an Android app and a Windows Phone app. All of those apps are actually native. So we have a full-fledged Android team. We have an iOS team. We have a much smaller sort of Windows Phone team that's mostly the web team sort of moonlighting. We've got a full database team. And we're now, as I said, we've got this data ops team whose job it is to build out this data pipe. And we didn't talk about it yet, but one of the other things we do a lot of is sort of testing and split testing and things like that. And we have constructed a very robust split testing framework that allows for setting up, you know, you can run 10 tests at once, you can split off people in very specific ways, so you're not just doing like a 50-50 mod-on-user-ID split, you can do very complicated splits, you can use different functions to randomize, you can do conditional splits. All of that stuff is actually built into our testing and logging platform. So we can run very sophisticated tests. So if we want to run a test on, you know, only male Android users in a certain city who fit certain criteria, and do like a 70-30 split with like a Bernoulli trial, we can set that up. And we can do that at the same time we're running a test on everyone in the US on a different platform with a 50-50 split and so on. So there's a lot of really, really interesting data engineering, data architecture work that we're doing there. And then obviously the sort of data science and research stuff. And I would encourage people, even if there's not like an obvious job description, especially in the research data science realms, if you're really smart, we will talk to you, we will at least talk to you. So if people are interested, they should, you know, kick a resume over, and we have this sort of general application thing, or you can always get in contact with me. Yeah, where's the best place to either get in contact or, if someone just wants to follow you online? So if you want to follow me online, my Twitter handle is @tslevi: T as in Thomas, S as in Scott, because that's my middle name, and then Levi, which is my last name. So you can follow me there. I oftentimes tweet about data science.
I oftentimes tweet about what grilled cheese I had for lunch or funny things I saw, so feel free to sort of filter that. You can email me at thomas@pof.com, which is my sort of company email. So it's T-H-O-M-A-S at P-O-F dot com. And then you can always add me on LinkedIn, Thomas L-E-V-I, on LinkedIn. I just do ask, if you're going to add me on LinkedIn and I don't know you, at least send me a note on why you're adding me or how you found me. If I just see, like, "I want to add you to my network on LinkedIn" and I've never heard of you before, I usually won't add you. But if you say, like, oh, I heard your podcast, or I saw some interview you did, or read an article or something like that, and I wanted to chat, then I'm always more than happy to add people and talk with them. Excellent. So, you know, lastly, I like to ask my guests to provide two recommendations. It can be anything: a book, a Python package, a paper, whatever you like. The first is the benevolent recommendation, something you have no affiliation with, but think listeners would really get some benefit from. And second, the self-serving recommendation, something that ideally you derive direct benefit from the exposure here. Well, I'll give you two benevolent ones, because I think following me on Twitter and adding me on LinkedIn is probably the most self-serving I can get. So the benevolent one I would do is there's a great book called All of Statistics by Larry Wasserman. It's W-A-S-S-E-R-M-A-N. I read it cover to cover a couple times. I highly, highly recommend it. And the reason I do is a lot of data science, and the buzziness about data science, is on the machine learning side of things, which is extremely interesting, and I use it all the time. What I find emphasized less is really a good foundational knowledge of statistical thinking: understanding, for example, what a biased estimator is, an unbiased estimator, what parametric and non-parametric inference are and when you need to do one or the other, how to actually properly run a test, what the multiple testing problem is. All these sorts of things are, I would argue, way more important for doing real data science. And the reason for that is if you just blindly do machine learning, or even split testing and things like that, you're always going to get a number or an answer out. The thing you won't know is how right or wrong that is. And the way I actually describe my approach to statistics is it's me constantly panicking about the mistakes that I know I'm making but am unaware of, and trying to minimize those as much as possible, and then going back and revisiting ones that I made. Statistics is a very subtle field. It's not just about knowing a bunch of probability distributions and F tests and H tests and this kind of stuff. That's not statistics. It's understanding what these things actually mean, and when they're reliable and when they're not, and how to analyze that. And the Wasserman book and the references in there are phenomenal for this. And I can say, if you wanted to do data science with me and we were hiring, you're going to get a lot of interview questions around those sorts of things. Not so much that there's a right answer, but we want to see how you think about those things. So when a lot of people say they know how to do split testing, what I find is they know how to do binomial testing, which is conversion based testing, and they know how to go to an online calculator and get an answer out.
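For what it's worth, here is roughly what's under the hood of that kind of conversion-based (binomial) split-test calculator, sketched in R with invented counts: a pooled two-proportion z-test, computed by hand and then via the built-in prop.test.

```r
# What a conversion split-test calculator is doing under the hood
# (invented counts): a pooled two-proportion z-test.
conv   <- c(control = 480,   variant = 532)     # conversions
trials <- c(control = 10000, variant = 10000)   # users exposed

p_hat  <- conv / trials
p_pool <- sum(conv) / sum(trials)                       # pooled rate under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / trials)) # SE of the difference
z      <- (p_hat["variant"] - p_hat["control"]) / se
p_val  <- 2 * pnorm(-abs(z))                            # two-sided p-value

c(z = unname(z), p_value = unname(p_val))

# The built-in version without continuity correction agrees:
# prop.test's X-squared statistic is just z^2.
prop.test(conv, trials, correct = FALSE)
```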
When you actually say, okay, well, what are you doing? Why can you phrase the test this way? What's actually going on under the hood? A surprising number of people don't actually know about that. If you read Wasserman and the references in there, you will know that. So I highly, highly recommend it. Another recommendation I would do is, so I use R, I use R and Python, I principally use R. I'm not going to get into the war of which is better for data science. Both are phenomenal. Both have their pros and cons. I use R principally right now because that's what my shop uses, and I first try to solve a problem in R and then I go to Python or something else when there's not an appropriate solution in R. There's a great package, I want to say it's Max Kuhn that developed it, he's a researcher at Pfizer, I think, right now, or some other pharmaceutical company, but he's sort of a biostats guy, bioinformatics guy, called caret, C-A-R-E-T. And what caret is, is actually a wrapper for about 100 to 200 other machine learning packages. And it sort of brings it all together in one sort of API-like interface. So instead of having to remember the syntax for the nnet package versus this other random SVM package, and this one needs a model in this form and so on, it's one unified framework. Now, like one person trying to unify 200 things, there's bugs and sort of tricky things in there, but it's really, really good. And the other thing that it does very nicely is it has a lot of built-in functionality for splitting up data sets into test and training, running cross-validation, and doing parameter and feature exploration in a parallel way. You can register what's called a parallel back end with packages like doParallel and doRedis, and you can use, for example, this other package called foreach. So for example, if you want to explore 150 different model parameter and learning parameter combinations for a neural net with tenfold cross-validation, and you have 80 cores available to you, this will do it and it will return an answer. You might have to tweak it a bit and kind of deal with issues around memory consumption and so on, but it's really great. So those are sort of my two recommendations. My non-benevolent one is follow me on Twitter, @tslevi. If for nothing else, my friend and I have a running competition to see who can get more followers, and it's sadly very small. So, you know, help us out and keep me ahead of her, just so I can send her annoying G-chats about it. Absolutely. I hope the show can contribute quite helpfully to that. Well, thanks again so much for your time today, Tom. This has been a fantastic conversation. I'm sure the listeners are going to really enjoy this and I appreciate you coming on the show. My pleasure. And again, I encourage them to feel free to tweet at me, add me on LinkedIn or shoot me an email. I'm always happy to talk to interested people about data science, machine learning and those types of things. Excellent. Take care. Yeah. And thanks again. I really appreciate being invited. (upbeat music) (gentle music)