This episode contains coverage of the 2015 Data Fest hosted at UCLA. Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings. This year, data from Edmunds.com was provided, and students competed in three categories: best recommendation, best use of external data, and best visualization.
Data Skeptic
Data Fest 2015
[music] The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science.
Well, welcome to a special edition of the Data Skeptic Podcast. In this episode, I'm going to be playing some recordings I made this past weekend at the 2015 Data Fest hosted at UCLA. I had the opportunity to attend this weekend-long event as an industry mentor and a podcaster. Data Fest is a data analysis competition which presents students with a brand new data set they've never seen before and gives them 48 hours to derive some interesting results.
So I'm here with Rob Gould from the Statistics Department at UCLA. I think you're the chair or primary organizer, is that correct?
That's correct.
Can you, in a nutshell, give me a sense of what Data Fest is all about?
So in a nutshell, Data Fest is a big data hackathon. It's like a hackathon in that teams work really intently around the clock to solve a problem. I think it's different from a hackathon in that the problem is a little more specified, so it's maybe more puzzle-like. In this case, Edmunds.com very generously gave us a very large, complex data set and asked the students to give them recommendations for improving the customer experience, and they worked around the clock and in great numbers. It was the lowest attrition rate we've ever had at this event, too. Yeah, it was a great event.
What kind of numbers did we see in terms of teams and participants?
We started with 60 teams and we had 240 participants, and we ended with about 49 or 50 teams and I don't know how many participants.
That's still a good turnout, I would say. I was really impressed with the diversity of different projects and approaches. Is that something you were expecting?
Yeah, every year part of the fun is to see how many ideas people came up with that you wouldn't have thought of. So that is a lot of the fun.
I used to think that I saw regional differences, that the different schools would have different approaches, but I think over time either I can't see them anymore or they've gone away.
Tell me a little bit about the scope of the whole Data Fest. Are satellite locations participating as well?
Yeah, so this year the American Statistical Association adopted Data Fest as one of its primary organizations. There were seven around the country. Three years ago, maybe four years ago, Duke started to do one. The year after that we had one at the Five Colleges in Massachusetts and also Emory and Princeton. And this year I think it was at Duke, Massachusetts, and Emory and Purdue and Washington, D.C., and UCLA and another place I can't remember right now.
No worries. And how long have you been planning all this?
Oh, we start planning almost right away. I mean, my dream is to have two data sets lined up for the next two years, but it never works out that way. So we're going to meet in a couple of months with the people at the other sites and start talking about any leads we have with data. I'm welcome for any leads if anyone wants to donate data. Around about December we're going to start talking to begin the process of getting the data set together.
Well, excellent. This has been a fun weekend for me and, I think, all the participants. I'm looking forward to next year as well.
Yeah, thank you. Thanks for your help. It was great to have you around.
Maybe we could start with you introducing yourself and your role at Edmunds and your role at the event today.
Sure. My name is Annie Flippo and I'm a data scientist at Edmunds.com.
And what do you do specifically in data science at Edmunds?
We have special projects where we're trying to answer business questions, challenges, or issues that we're trying to resolve. We use data to answer those questions and try to have some actionable insights so that we can get back to the business and they could maybe implement them.
What's been the most insightful thing you've been a part of since your time there?
The most current project is pretty interesting. We're doing audience segmentation, so based on the browsing activities of a user we can see, or kind of infer, their intent on buying a particular make and model and whether they're in the research phase or very close to buying.
And tell me a bit about the data set you guys contributed for the event here this weekend.
The data set is aggregate data for our visitor transactions, so for each visitor that comes to our site we start a journey record for them: when they came, how often they came, how much time they spent on the site, what makes they viewed. If they looked at a particular Honda, Toyota, and stuff, we collect in aggregate how much time they viewed certain things and the clicks on ads they might have. Then we look at how many leads they've generated. A lead on our site is a general concept where they give us a little bit more personal information in exchange for a specific price quote. We pass that along to the dealer so that they can contact the prospective customers, and then hopefully they will buy a car from them. Our data is about the visitors, their shopping habits, the leads that they generated, the cars that they configured. It's pretty comprehensive. This is for the last three months.
Very interesting. What do you find is the typical journey, without revealing any trade secrets or anything? What is the typical journey for a customer? Do I show up and make my decision in a day or two, or am I kicking tires for a couple of weeks, looking around?
The typical user, I would say, would come to our site just for one day. You've got to be very quick in deciding how you optimize for them. Maybe they've already thought about buying a car for a while, but typically they come for a session, and that could vary from a few minutes to 30 minutes. We try to give them as much relevant information and spend time as possible.
That's an interesting challenge.
I've seen a lot of great stuff. I'm sure you have too, yesterday and today, volunteering here. What's been maybe some of the highlights of the student work you've seen going on?
There are a lot of ideas and there's no bad ideas. Sometimes it's very hard to find the external data you want in this short amount of hackathon time to kind of flesh out your ideas. I'm very interested in what they will come up with tomorrow.
Me too. Thank you so much, Annie.
Tell me about your current approach to the problem.
Basically, we've decided that we're first going to try to look at the visitor behavior data and try to essentially limit it to as small of a project as we can. We're sort of running through PCA models of some of the ad data, for example, to say, "Oh, the sort of data for people looking at different kinds of sedan is pretty correlated." We can just create a sedan ad category so that we're not analyzing just a mess of various different variables. Do you want to go in there?
Of course, one of the biggest parts of this project has been actually cleaning the data sets so far. We have about 2.5 million leads, and we're just looking through the leads. We see that the top 1% of users, those in the 99th percentile, generate at least 10% of all the leads we have. Unfortunately, this data is not filtered, as is the case with all real data, so we spent more time than we thought we would just organizing our tables.
Makes sense. What's been most surprising about the data so far?
I want to say it's just the amount of strange cases you find in that, because he was talking about the filtering. We have multiple cases where somebody will be clicking on thousands and thousands of ads, which is just ridiculous. As a question of our time, we were considering, just because there were so many of these weird cases, analyzing them. At that point, we're just going down a rabbit hole, I think, so we decided to stick to what's humanly possible.
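The concentration the team describes here, a small share of visitor IDs producing an outsized share of leads, is the kind of thing that can be checked with a few lines of code. The following is a minimal Python sketch of that check, not the team's actual code. The function name `top_share`, the `top_percent` parameter, and the toy visitor IDs are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_share(lead_visitor_ids, top_percent=1):
    """Share of all leads that come from the most active `top_percent` of visitors."""
    counts = Counter(lead_visitor_ids)  # number of leads per visitor ID
    if not counts:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    # Integer ceiling of n * top_percent / 100, keeping at least one visitor.
    k = max(1, (len(ranked) * top_percent + 99) // 100)
    return sum(ranked[:k]) / sum(ranked)


# Toy data (hypothetical): 99 visitors with one lead each, one visitor with 11.
leads = [f"v{i}" for i in range(99)] + ["heavy"] * 11
share = top_share(leads)
print(f"Top 1% of visitors generate {share:.0%} of leads")
```

On a real leads table, the input would be the visitor ID column, one entry per lead. A very high share for a very small group is a hint that some IDs may not correspond to single people.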
Do you think those could be bots, or are those really active car buyers?
Well, I'm initially thinking they're bots, but there is also a scenario where our unique ID is based on IP address, and with an IP address, they could be libraries, they could be schools, they could be commonly shared computers. But unfortunately, when we're looking at customer behavior data, it means that when we have our unique ID, which is supposed to be a unique customer, the patterns in their online behavior aren't going to be consistent. Neither are their purchasing patterns. So it does add noise to the data, although it does represent a significant portion, at least 10%, of our leads information.
Interesting. So I know there's three categories you can compete in. Have you guys picked which one you want to do yet?
Well, we thought we would go for best recommendation.
And what do you think will be the recommendation you'll give?
We're thinking it's going to be something along the lines of, based on the different sort of ad behavior that users depict, thinking about what sort of information Edmunds can present to those users as different between the two groups. For example, we had some idea that when users are clicking on not so much the ads for cars, but for stuff like insurance programs, they may be more interested in used cars, in terms of specifically selling their own used car, maybe even looking up online to see what their used car would go for, stuff like that. And so basically, Edmunds could sort of predict that kind of thing and then offer differential services, based on, for example, Google's "did you mean" kind of idea.
The sales that are generated from our data set are, for the majority, new cars sold. But the majority of our visits are for used cars. This could be a trend where people are looking up a car from a dealer or from a third party, and they're looking at Edmunds to get a price category or a sense of what the car is worth.
And so in that sense, maybe targeting ads from insurance or car services to those customers who aren't necessarily in the market to sell or buy a used car from Edmunds. Possibly, Edmunds could get more revenue by targeting ads towards that group.
So, Edmunds is ad delivery primarily? Is that the primary revenue stream?
Well, the primary revenue stream is subscriptions from dealers. They're allowed to show their ads on the Edmunds site. And ideally, the leads that customers submit to the dealers will convert to sales. But that's not always what happens. And if a dealer receives enough leads that don't convert to sales, they might be tempted to cancel their subscriptions. The factors that we know will affect car sales, or conversion rates from leads to sales, are price and distance. How close somebody is, and the difference between the suggested retail price and the listed price they have on the website, will have the most significant effects on whether the customer converts or doesn't.
It's really interesting. Have you guys tried to plot that yet?
I think we're still at the stage where we're mainly limiting our data both vertically and horizontally. And so once we have that a little more settled, we'll feel more comfortable doing that. Because I think if we jump into plots, we run into the problem where we still have all those crazy cases and we're just wading through a massive plot.
Makes sense.
Well, I think we have a better conception of where we're going to go and what type of data we need. So in the next 22 hours, we're going to be coding and trying to create the tables and statistics we would ideally like to present on. And so the first 25 hours were more about discovering other data, trying to clean, and conceptualizing the recommendation we want. But now it's about actually implementing it, which should take longer.
Makes sense. And tell me a little bit about your team, your guys' names and who you were working with.
So my name is Guyon. My name is Freddy.
And we are working with three other members of the statistics program at UCLA. Spencer, who is actually running the event in some sense. Then Jeremy, who's a friend of ours from statistics, and then Noah. I think he's the only one who's not a fourth year on our team.
So we just talked with your collaborators a little bit about the approach you guys are taking. Tell me a bit about how you guys started to focus on that. You went from a data set you'd never seen before to now deciding what you want to deliver as a key insight.
Okay, so we actually took a bunch of different approaches. And the very first approach we took was just going to the Edmunds.com website and looking at the data itself and really just trying to get a feel for what was going on. Personally, what I did, I focused mostly on the website and just clicked around and tried to think about what kinds of decisions Edmunds.com would actually have to make when designing their website. I think other people started doing exploratory data analysis, making histograms of variables and kind of seeing what we had to work with on that end. The way that I liked to look at this after a certain period of time was in terms of the decisions that Edmunds.com would make whenever they're setting up their website. And so these could be things from what menu options to show people when they're searching for cars, to what ads to show, to what information to provide about a car whenever you do see the car, all that sort of stuff. And then see in what ways we could use the data to see what the implications would be of different decisions that Edmunds.com would make. And the idea is that by focusing things around decisions, we can make a recommendation to Edmunds.com saying, if you decide this, this will happen, given what the data we have shows.
Nice, that's a good approach.
So hi, we are Lucky Three. My name is Lucy. I'm a math major at UCLA. And here's my team member Chloe.
Hi, I'm Chloe. I'm a second-year communications major at UCLA.
So just a minute ago, you were telling me a funny story. Would you mind sharing it again?
Okay, sure. So we are working on a data set from Edmunds. And as you guys know, they sell cars. So I was like, "Oh, why don't I just go on the website and experience how it feels to be the customer?" And I went on their website and eventually I reached the stage where I just needed to give them my email and my phone number to get on to the next stage. And then just this morning, like nine dealers called me and asked me if I wanted to buy their cars. And I was like, "Oh, I'm sorry guys. I'm just so sorry. I'm still in school." And I had to come up with excuses and I was just like, you know, all the midterms and the finals. Guys, just, you know, all the homework. I will call you back. Let's just get in touch and then I will call you guys back after the summer. Things like that. Yeah. So it was interesting.
I wonder if this event will cause a big outlier in the Edmunds data set from everyone doing research.
Probably.
So how has the competition been so far?
It's been fun. It's really fun. We've been learning R because neither of us had learned it before. We learned a lot and we have ample snacks provided by Data Fest. So it's good. And it's also really cool to see there are so many people who are interested in stats and can come up with cool ideas. We've seen so many cool competitors. Look around and you can see a lot of fancy plots and graphs in all colors.
My name is Zija. I go by Steven. I'm a junior stats major, doubling in economics.
My name is Hyunee. I'm a stats and business double major.
Hi, I'm Sharon. I'm also a stats major.
I'm Bruce. My minor is stats. My major is actuarial mathematics.
Excellent. So when we talked last night, you guys were just getting started and had some good ideas. How are things going this morning?
It's great.
We sort of, like, wanted to narrow down and, like, ended up going down, like, two separate tracks. Basically looking at different types of people who are looking at different brands of cars. For example, the people who tend to look at Ferraris spend a lot of time on researching, like, inordinately compared to other people. So maybe, like, we could look at what types of articles they're looking at on Edmunds and make it easier for them by, like, giving them suggestions and, like, having a curated list, like, right there on the Ferrari pages. Like, something like that.
It's an interesting idea. And what about the second one?
So that one is looking at fuel-efficient cars and what kinds of people are more likely to buy fuel-efficient cars. So since we have zip code in the visitor database, we can look at, like, median income, mean income, political affiliation, like, more liberal people tend to buy environmentally friendly. Political affiliation, we took, basically, as an estimate, like, who they voted for in the last election. Income, we got, I think, from a government source. I think there was one where, oh, yeah, pricing by zip code for gas. So that was, I think, yeah, also, like, just searching online. It wasn't too hard to find it by state and then merge that with zip code.
So you're thinking of competing in the best use of external data category, then?
Well, we took best recommendation, just because, like, it was filling up. It seemed like the most prestigious award, still, even though there's no first. But, yeah, we could switch out, since there's a waiting list for best recommendation. We could always switch out like this.
So how's it gone so far? You have an interesting theory that maybe gas prices would affect how interested a buyer is in a fuel-efficient car. Has that turned out to be the case when you looked at the data?
I'm actually, like, merging the data. I'm, like, almost done with, like, getting the entire data set together, like, just adding the political affiliation now. So, just about to run the model, yeah.
Exciting. So, there's what, 20 hours left or so? Are you guys feeling on track?
Not exactly. But we are trying to figure out the right track. All right. And, like, right now, what we're going to do is just try to think about one thing and try to work it out together, instead of dividing into two teams with two ideas. And I think the best, the most efficient way for us to finish on time is to just keep talking and keep sharing our ideas. And even though, like, I think the two ideas are pretty good and they sound fantastic, we have to give one up. And we have to, like, cooperate with each other and understand this idea. Even though, like, we've already finished almost half of each idea, we have to learn how to keep up and learn how to pick up and use that. Yeah.
Various students.
I'm Alexander Chan. I'm a statistics major at UCLA.
And I'm Kevin Lee. I'm also a statistics major at UCLA.
And what was you guys' focus for your particular presentation?
So, in our presentation, we were trying to look for predictors. What makes a consumer buy a car? And we used a random forest algorithm, and we determined the most important things were time spent and the total page views.
Time spent, in what way?
The time spent on the website at Edmunds.com.
Interesting. Tell us a little bit about the visualization you guys put together to show that.
So, for the visualization, we actually just used a website, an infographic maker. And a lot of the graphs and stuff were already custom made on the website, and we just entered the data points we found from R onto the website.
Nice. Well, congratulations again on placing. We'll see if you guys take first or second in just a little bit. Overall, what do you think was the best thing you learned this weekend?
I came here knowing, like, barely any R, and I didn't know, like, what a random forest algorithm was. And I feel like I left here, like, with a pretty good knowledge of what that algorithm does. And I've probably learned, like, two months' worth of R in, like, three days.
That's excellent. How about you?
I took my first statistics class this quarter. I installed R four weeks ago for the first time. So, it was a great, enjoyable learning curve. I had a lot of fun.
Awesome, guys. Well, thanks and good luck.
Thank you for helping us for, like, I don't know, two hours or three hours that day.
It was my pleasure.
I'm here now with the other team competing for best visualization. Either way, you both win. So, congratulations. Can you tell me a little bit about your team and what brought you guys here this weekend?
Right. So, for the most part, we're all statistics majors, but I think we all come from different backgrounds, and we just landed in this major. So, it's great to be out here. This is, I think, the first Data Fest for all of us. So, we did not expect this at all. And we actually worked with the other team the entire competition. We sat next to each other. So, we weren't working on the same things, but we kind of felt each other's vibes. Especially, like, 3 a.m., Friday night, Saturday morning.
In terms of motivation, I thought it was really great that we weren't expecting to be finalists. But I think we're both content that we're finalists. So, especially being here and having this experience, it makes us really happy, actually. And it's humbling, you know, to have, like, professionals. Professionals are the ones judging us, you know, to have them sit here and think we're worthy of speaking in front of everyone. That's just, that's amazing. I mean, it was grueling, I think, those 48 hours. But honestly, apart from being tired, and correct me if I'm wrong.
But I don't think, like, in those 48 hours, there was a point where we said that this wasn't something we wanted to do. We were tired, but we were, like, still on our computers and still, like, we fought through everything, you know? So, it's cool, and it's cool to share, like, that experience with other people. And I couldn't think of anyone else that I'd like to be in the same category with, like...
Awesome. Tell me a little bit about the visualization you guys presented.
We presented a map of the United States, and on it are points that represent all the leads, which are, like, customers going on to the Edmunds.com website who are interested in a specific vehicle and go on and continue to want more information. So, they've submitted forms to Edmunds to ask for information from the dealer, and we're focusing on that piece of data. And also, among those leads, we're focusing on who did end up getting the car and who didn't. And we plotted that on the map, and there were several patterns. We didn't go into very deep detail modeling because we're in best visualization, so we want to help the company sort of visualize, as a country, who's doing what and who's not doing what. And we saw that a lot of the purchases were made in clusters, so they're not alone in one area all over the place. And the leads are a lot more scattered. We saw that graph, and then we sort of wanted to build on that idea, and that's basically the heart of our project.
Awesome. Tell me a little bit about some of the tools you used to put your visualization together.
We did a lot of trial and error with ggplot2, since we don't have much experience with ggplot2, but that's what we used to plot it. We were only given zip codes, so at first we were like, "How are we going to plot this if we're only given zip codes?"
So, we actually got external data and got the longitudes and latitudes of the zip codes, and we were able to plot every single point onto our map.
Very cool. Thanks guys, and good luck.
Hi, my name is Christian. We worked on an HMM training model to predict what car a viewer is going to look at next on a certain website.
Awesome. So, give me a sense of how Edmunds would implement that and benefit from it.
Well, if you know what the user is looking for, you can target them with the cars that they like. And for example, if the model showed that the variability of cars that a person views is high, then he's probably more prone and more open to seeing more cars that people usually don't look at. So, in both ways, distinguishing these two types of people would help advertising, and it would also help the search engine too, because instead of click, drag, scroll, look for stuff, you could just click, click, based on previous learnings and previous behaviors.
Can I ask you guys the same, to introduce yourselves and talk about how you contributed?
Hi, my name is Shin. For me, I mostly reported the data and got the charts, pretty much.
Cool, how about you?
Hi, I'm Feja. And I did some primary data conditioning and produced the results slides or graphs to show our results, something like that.
I'm Ben. I assisted with data cleaning, I merged datasets, and I found descriptive statistics for some variables.
So, how dirty did you find the data to be? How much cleaning had to go on?
It wasn't that bad, but because the data set was so big, it took a while. Every single time we ran a for loop, it would take 20 minutes or so. And if we messed up, or if R crashed, we would have to reload everything and run every function again, so that was just a pain.
At first, if you don't know what you are looking for, you don't know what kind of data you want, like this variable or that variable.
So, you have to extract this variable from the original very large dataset, so that's kind of time consuming.
And how did you guys decide on the methodology you were going to use?
Well, I pretty much specialize in time series analysis. So, whenever I can put anything in time series, I do. And I really like time series analysis because it's more dynamic than regular statistics and spatial statistics, and there's more to predict and more hidden things.
Well, congrats guys, on placing. Good work.
So, I'm here now with some of the grad students that helped out today. Can you tell me a little bit about what brought you guys to the weekend?
Well, we were pretty much organizing it and trying to get everything together. Hopefully everything ran as smoothly as we hoped it would.
And how long have you been planning it?
Months.
Months. Yeah, or years, if you count all the work that Rob Gould has been doing since 2010, I think.
And what's been the highlight of the weekend for you guys?
It's just always fun to see the students coming up with new ideas and figuring out ways to answer their questions with the data. And I love playing with R, so anytime I get to try and work through tricky problems, that's really fun.
Nice. So, yeah. What do you think is more fun, participating or helping out like you guys did?
I would freak out if I were a participant. I wouldn't do anything as well as these guys did. Like, I don't do well under pressure. I'm surprised they did so well, and they actually used pretty advanced models. And they brought it together in the last five minutes, so I'm very impressed, yeah. So I think I'm glad watching them present, so.
Yeah. Can you give me some sense of the scale of things, like how many teams and participants there were?
I think we had about 200 students who participated.
250?
Okay. And then, what, it was about 60 teams, because we had about 20 in each room.
We had a few teams that dropped out, or a Fantastic Five that dropped down to just the Fantastic One guy who was left standing by the end. But I think we had less attrition this year than we have had in past years, and also the quality was higher. So, yeah, I was really impressed.
Well, thanks guys.
Yeah, of course.
Thanks for joining me for this event coverage edition of the Data Skeptic Podcast. We'll resume our normal format this coming Friday with a brand new interview. But before I sign off, I just wanted to remind all the listeners of something Rob Gould said earlier.
I'm welcome for any, any leads, if anyone wants to donate data.
So please consider that a call to action for any forward-thinking companies or organizations like Edmunds that might want to donate an interesting data set for next year.
(upbeat music)