(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - Welcome back to another episode of the Data Skeptic Podcast. I'm here this week with my guest Randy Olson. How you doing, Randy? - Hey, doing great, excited to be here. - Well, thanks so much for joining me. So I initially found you, I don't know if I found your blog first or I found you on Twitter, but have really enjoyed just kind of digesting all your writings on your blog and following you as a wealth of information. I hope all my listeners will go and check out after hearing you speak here, if they haven't already. - Yeah, that'd be great. - In fact, it was almost hard for us to pick, gee, what should we come and do the show on? 'Cause there are a bunch of different interesting options. But we finally settled on your paper and related blog posts and the interesting visualization that talks about the mapping of Reddit based on the interests of users and subreddits and that sort of thing. Is that a fair characterization? - Yeah, I'd say so. And just generally trying to figure out how can we map the internet into a sort of 2D spatial domain? - So before we get into the thicker things, maybe you could tell me a little bit about your background. - Sure, so I've been studying computer science for about a decade now. The first five years just undergrad in central Florida and in the past four or five years up in Michigan, snowy Michigan, doing my PhD there, studying artificial intelligence and machine learning and things like that. As I was working through my PhD, I found that I actually really enjoy working with data. I'd get data from these experiments that I was running. And so I kind of started up a weekend hobby, I guess you can call it, which is what turned into my blog now that you were mentioning earlier. - And am I correct in saying your focus is computational biology? - Yeah, I'd say so. 
I mean, it's sort of taking a different angle on artificial intelligence: rather than trying to reverse engineer the human mind, we're trying to use evolution to build the human mind. So it's very much drawing from biology and seeing how can we build a human mind from the ground up inside a computer. - Yeah, it's surprising. I've gotten a number of episodes in, and I haven't yet focused on genetic algorithms or anything like that. So that's for a future one, and anybody who can't wait can go and check out some of your papers as well, I presume. - Yep, yep. There's definitely a few papers out there that cover it. - All right. So yeah, let's jump into the study you did on Reddit. Can you tell me a little bit about what was the motivation for doing the work? - I mean, anyone that follows Reddit, especially the data is beautiful subreddit on there, knows that I'm an avid fan of Reddit. I'm on there pretty much every day. And I just really enjoy discussing things, especially data science related topics, on there with people. And so for the longest time, I'd realized that Reddit, like most social media websites really, is really disorganized. If you go to the default front page of Reddit, there's a whole bunch of links all over the place, there's images here and there. I've shown it to some friends who have never looked at Reddit before, and they're completely overwhelmed. - Yeah. - The unfortunate part is that once people get used to Reddit, because of course there's tons of interesting stuff on there, they kind of stick to what's called the front page, which is just a collection of links that come from the 50 most popular subreddits. But they don't really delve into the deep communities that they might be more interested in. The front page focuses on just 50, but there's actually thousands of these communities that people are missing out on, and that I think they'd be really interested in. 
So my motivation was really wanting to connect more people to those smaller communities, to sort of flesh out Reddit, and make it a better experience for everyone. - Yeah, no, I think it's very noble, and a great way to expand the knowledge and value people are getting out of it. - Tell me a little bit about how you started approaching that, how you can figure out what someone's interests are, and then use that to learn more about what else they might be interested in. - It actually started out that I wanted to make a pretty basic recommendation engine. Someone just says, I'm interested in funny pictures and movies. You could say, okay, well you'll probably be interested in these subreddits. But it's actually pretty inefficient to just try to poll everyone on Reddit. There's millions of people visiting this website every day, and they don't want to spend the time filling out a poll, saying what they're interested in. But then that's when I had an idea, where whenever you post or comment somewhere, Reddit keeps a record of that. And so you can actually go to each person's user page, and it tells you where they've posted, when they've posted, and even what they've posted. And I think some people don't know that, because I've found some friends on Reddit, and they have some pretty embarrassing histories. (laughs) I should probably go clean that out. But anyway, so you can sort of use that information to get a guess of what this person is interested in, because generally, the more they post somewhere, the more they comment somewhere, that's a pretty good indication of what they're actually interested in. If they're posting a lot in the funny pictures subreddit, then they're probably interested in funny pictures. If they're posting in the World of Warcraft subreddit, then they probably play World of Warcraft, and they're interested in related things. 
That was sort of the base idea of how I figured out what people are interested in on Reddit. - So you can look at posts, are there any other metrics available? - I mean, it's really just posts and comments that are publicly available. I know that Reddit tracks other things, like where people go, where people click around in Reddit itself, and even where people subscribe, but that information isn't available publicly. And really, that's because Reddit is huge on user privacy. Even though they want to make as much data as possible public, they don't want to sort of give away private information about their users, which is actually a good thing, so I respect that, and I worked with what I had. - Yeah. So is there any distinction between posts and comments or were they more or less the same indicators? - Posts are when they actually share a link within one of those subreddits, whereas a comment is when they're saying something about those links. I didn't make a particular distinction between them. I mean, you could theoretically maybe weight posts more, if people are actually sharing content in a subreddit, but I decided not to do that for this study. - Sure. So then you can go on, I guess, scrape all the data off user pages. You now have a mapping from user IDs to posts and to comments. What's your next step in turning that into some sort of data structure that can help inform the recommender? - With this project, I not only had the goal of building a useful tool for Redditors, but I also wanted to sort of practice network analysis and things like this. So I wanted to sort of picture Reddit as a huge network that's linked by users. So if you're familiar with networks, you know, networks basically have nodes, you know, which are typically pictured as a circle in a diagram. And so each subreddit is a node, then there's edges between them. So, you know, typically a line is drawn between these nodes to indicate that there's some sort of relationship between those nodes. 
So in this case, I had to think about what does it mean for two subreddits to have a relationship with each other? And for me, that meant that two subreddits are highly related, if you will, if there are many people posting in both of them. So for example, if, let's say, there's a thousand people that are regularly posting in both the gaming subreddit and the World of Warcraft subreddit, then that's probably an indication that there's some sort of relationship there, right? So that's sort of the gist of what I did: after I scraped close to 900,000 user pages, I basically looked at where they cross posted and just counted up how many users cross posted in each pair of subreddits. - So your example of gaming and World of Warcraft is very intuitive. Were there any highly correlated subreddits that were a little surprising during your analysis? - Yes, there were some interesting ones. The one that surprised me the most was, I wasn't surprised that these subreddits were related to each other, but I was just surprised that they existed, which is this huge community of My Little Pony subreddits. - Oh, yeah. - All these people come together on Reddit and they just love to talk about My Little Pony and post pictures of it and everything. You know, they basically had their own little island when I ended up visualizing them. And then they were related to the anime group after that. So I guess that kind of makes sense. - Yeah, I wouldn't have necessarily guessed it on my own, but it's intuitively plausible to me. - Right. And another one that I have to say was really surprising was there was a fairly strong link between the guns subreddit and the Gone Wild subreddit. (laughing) - Okay, this is an interesting one. (laughing) - Gone Wild is where people go to share basically naughty pictures of themselves. 
And so I thought that was kind of strange, kind of disturbing that there's a link there between where people go to discuss guns and where people go to either share pictures or comment on pictures of naked Redditors. Yeah, that was interesting. - So it sounds like there's a huge amount of data here. Tell me about some of the challenges around just acquiring and processing all of it. - This was sort of a side project. It wasn't funded by a research grant or anything like that. So I kind of had to do this with limited resources, which means I have my own little web server that I purchased some space from. So I kind of had to fit this in a relatively small database, where we're storing information about nearly a million people and their interests. I think I was limited to a database of about five gigs or something. So I had to do a lot of data compression. You know, unfortunately, I wasn't able to do things like save every single user's post and comment. That would have easily filled up the database immediately. And so I had to do a lot of, you know, summarizing of the data. So I would say if a person commented or posted in a subreddit at least 10 times, then, you know, I would create a link in the database saying this person posted in this subreddit. Definitely another challenge was even after all that summarizing, once I put everything in the network, I have this fairly funny picture that I like to show when I talk about this. It was basically a giant fuzz ball when I tried to visualize the network. And that's because, you know, if we're looking at where people post, you know, where nearly 900,000 people post, at least one person will create at least a tiny connection between every pair of subreddits, right? You know, most every pair. So when I was trying to visualize this, for which I used an open source tool called Gephi, you know, just sort of laid out in a 2D domain, it was just this giant fuzz ball that didn't make any sense at all. 
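The counting step described so far (summarize each user down to the subreddits where they were active at least 10 times, then count shared users per pair of subreddits) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the user names, activity counts, and the function names `active_subreddits` and `co_posting_edges` are all made up for the example.

```python
from collections import Counter
from itertools import combinations

MIN_ACTIVITY = 10  # the "at least 10 times" cutoff mentioned in the interview


def active_subreddits(activity, min_activity=MIN_ACTIVITY):
    """Return the subreddits where one user posted or commented often enough."""
    return {sub for sub, count in activity.items() if count >= min_activity}


def co_posting_edges(user_activity, min_activity=MIN_ACTIVITY):
    """Count, for every pair of subreddits, how many users are active in both."""
    edges = Counter()
    for activity in user_activity.values():
        subs = sorted(active_subreddits(activity, min_activity))
        # Every pair of subreddits this user is active in gains one shared user.
        for a, b in combinations(subs, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical users and counts (posts plus comments per subreddit).
user_activity = {
    "alice": {"gaming": 40, "wow": 25, "pics": 3},
    "bob": {"gaming": 12, "wow": 30},
    "carol": {"gaming": 15, "pics": 11},
}

edges = co_posting_edges(user_activity)
for (a, b), shared_users in sorted(edges.items()):
    print(a, b, shared_users)
```

Storing only the summarized user-to-subreddit links, rather than every post and comment, is what keeps a data set like this small enough for a modest database.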
So that was definitely the biggest challenge that I faced in this project, trying to figure out, okay, I have all of these connections, you know, all of these relationships between the subreddits. But not all of that information is necessarily created equal, right? Some of it is useful. Some of it is just spurious. So I had to figure out a way to sort of get rid of some of these edges, so I could actually see something meaningful, you know, see the meaningful relationships between these subreddits. And so the typical approach that people take when they do this is that they just say, we'll keep all of the edges or all the connections that are at least of strength, you know, some random number or some arbitrary number like 10, you know? So if at least 10 users posted in both of these subreddits, then we'll maintain that link. Otherwise we get rid of it. The problem with that method that I quickly found out is that that method gets rid of all of these smaller communities, right? Because my goal was to take people from these really large, really popular subreddits, which have hundreds of thousands of people posting and commenting in them, and to sort of help guide them down into these smaller communities that may only have five or 10 people there. By choosing an arbitrary threshold, I was completely destroying all of these communities. You know, they just disappeared off of my map. So that's when I turned to a different method that's fairly new. It was published only about seven years ago, I guess now. Basically the idea is that it tries to use a statistical model to extract the backbone of these networks. And what they mean by that is stripping out all of these connections that basically seem to be there by chance. And I don't know if you wanna go into the details of this. - Yeah, I think that's really interesting. - So the gist of it is this method goes through on a node by node basis and it looks at all the subreddits that it's connected to. 
And it sort of looks at the distribution of the strengths of those connections. So let's say if we look at one subreddit, like the pics subreddit, and I'm just making a very simplified example here. I mean, it's connected to four other subreddits. Each of them shares 10 users, you know? So it's fairly uniformly distributed, right? The edge weight is 10 between each subreddit. So in this case, this method would say, there doesn't really seem to be any preferential attachment between these subreddits. So let's just throw out all of those edges. So then let's go to a different subreddit now, say funny, and we look at four subreddits that it's connected to. And let's say it's very strongly connected to one of those subreddits, and then it sort of has the same edge weight of 10 to the other three subreddits. Then it would say, okay, well, let's get rid of those three other connections that just seem to be spurious, but let's leave this abnormally strong connection between, you know, let's say pics and funny or something. And then it also works the other way where it looks for abnormally weak connections as well. The gist of it is it's looking for abnormal connections and not just sort of random looking connections. The nice part about this really is that what counts as an abnormally strong or abnormally weak connection is decided on a node by node basis. So it doesn't, you know, it doesn't belittle or it doesn't destroy these smaller communities by applying the same sort of thresholds that it does to the more popular ones. - Yeah, it's sort of invariant to scale then in a way, would you say? - Exactly, yeah. Yeah, so it turned out to be a very important part of this project. 
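To make the node by node idea concrete, here is a minimal sketch of one well-known backbone method that fits this description, the disparity filter. The interview does not name the method, so treating it as the disparity filter is an assumption, and this sketch only tests for abnormally strong edges. For a node with k edges, an edge carrying a share p of that node's total weight gets the score (1 - p)^(k - 1), and the edge is kept when that score falls below a chosen significance level for at least one of its two endpoints. The subreddit names, weights, and the 0.05 level are made up for illustration.

```python
def disparity_backbone(edges, alpha=0.05):
    """Keep edges that are unexpectedly strong for at least one endpoint."""
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    kept = set()
    for node, nbrs in neighbors.items():
        degree = len(nbrs)
        strength = sum(nbrs.values())
        for other, weight in nbrs.items():
            share = weight / strength
            # Chance of a share this large if the node's total weight were
            # split at random among its edges. Judged per node, not globally.
            p_value = (1 - share) ** (degree - 1)
            if p_value < alpha:
                kept.add(tuple(sorted((node, other))))
    return kept


# Hypothetical edge weights (number of shared users).
edges = {
    # Uniform case: four equal links, nothing stands out.
    ("pics", "sub_a"): 10, ("pics", "sub_b"): 10,
    ("pics", "sub_c"): 10, ("pics", "sub_d"): 10,
    # One abnormally strong link among three ordinary ones.
    ("funny", "gifs"): 200, ("funny", "sub_e"): 10,
    ("funny", "sub_f"): 10, ("funny", "sub_g"): 10,
    # A tiny community: small numbers, but one link clearly dominates.
    ("tiny", "niche"): 8, ("tiny", "sub_h"): 1,
    ("tiny", "sub_i"): 1, ("tiny", "sub_j"): 1,
}

backbone = disparity_backbone(edges)

# For comparison, the "keep everything with weight >= 10" approach.
thresholded = {tuple(sorted(pair)) for pair, w in edges.items() if w >= 10}
```

With these made-up numbers, the global cutoff keeps all of the uniform links and drops the small community entirely, while the per-node test does the opposite: it discards the uniform links and keeps the dominant link of the tiny community alongside the strong funny link.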
- We briefly talked about the visualization and now we're at that difficult moment of an audio podcast where we need to describe something that's entirely visual. So I certainly want to direct people to your website. I'll have not only your blog posts and the article, but there's also an interactive version of the visualization that's very fun to play with. So listeners who are at a computer or will be at one later should go check out the links in the show notes. For the sake of giving, you know, a high level description, could you tell me a little bit about what the visualization shows? - Sure, so there were actually two versions that I ended up making. The first version sort of looks like a paint splatter on the screen, actually, I guess both of them do. The algorithm that I used placed all of these nodes according to these abnormally strong and abnormally weak connections. Subreddits that had sort of stronger connections were pushed closer together, whereas subreddits that had weaker connections were pushed farther apart. And the neat part about this was that when you look at this map, you sort of see some communities breaking out. You know, there's a big blob in the center and that's really just the default subreddits and all the really popular subreddits, but you start to see these communities breaking off. So there's that My Little Pony island that I talked about that you'll see on the top right. Bottom left, if I remember right, there's a whole bunch of sports subreddits there. - The really funny part was when I shared this online with people, they started taking screenshots of it and they actually started marking it up, you know, by saying, oh, I mean, like on the top, there's sort of a peninsula coming out, you know, if you try to visualize this as a landmass, and that's the huge group of porn subreddits on Reddit. So people started labeling that like the porn peninsula. - Oh, too funny. 
- Yeah, you know, and they had, what were some other funny ones? - Well, the porn peninsula is not too far from the furry colonies, so I guess that makes sense. - Yeah, there's LGBTists and all these other things. Yeah, it was really interesting. And when I saw that, I sort of kicked myself, you know, I was like, well, you know, I've been learning all this stuff about network analysis. And of course, one of the great things you can do with networks is you can cluster the nodes within those networks, you know? So I thought, well, obviously people are seeing clusters here. So I just sort of applied a standard modularity maximization clustering algorithm, which was actually just nicely built into the Gephi tool that I was using for visualization. And the really cool part was that when I visualized it again, and I had it color these communities by clusters, a lot of these same communities started popping up, you know? So you see this huge gaming community where there's like Guild Wars 2, Diablo, StarCraft, there's the sort of tech communities, all colored the same color, you know, for people who like Python or Ubuntu, Windows, things like that. I thought that was really interesting that, you know, this fairly simple method of, you know, just throwing things into a network and clustering actually kind of produced some intuitive clusters of communities that people could use to navigate Reddit. - One of the things I found really interesting about it was your approach is totally context free. There's nothing hard coded that's saying these categories exist or these topics exist. It's a completely emergent analysis. Yet when you do that clustering, you're getting something that is more or less comparable to what a human would label when you ask them to put their own clusters on it. So it has this very independent lines converging on the same answer property that's really nice to see in a good analysis. 
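Modularity maximization searches for the grouping of nodes that scores highest on a quantity called modularity: the fraction of edge weight that falls inside groups, minus the fraction you would expect if edges were placed at random with the same node strengths. The sketch below only computes that score for a given grouping; it does not perform the search that a tool like Gephi runs. The toy network and the function name `modularity` are assumptions made for the example.

```python
def modularity(edges, community_of):
    """Score a partition: weight inside groups minus what chance would give."""
    total = sum(edges.values())
    inside = {}    # edge weight with both endpoints in the same community
    strength = {}  # summed edge weight touching each community's nodes
    for (a, b), weight in edges.items():
        ca, cb = community_of[a], community_of[b]
        strength[ca] = strength.get(ca, 0.0) + weight
        strength[cb] = strength.get(cb, 0.0) + weight
        if ca == cb:
            inside[ca] = inside.get(ca, 0.0) + weight
    return sum(
        inside.get(c, 0.0) / total - (s / (2 * total)) ** 2
        for c, s in strength.items()
    )


# Hypothetical network: a gaming triangle and a tech triangle joined by one edge.
edges = {
    ("starcraft", "diablo"): 1, ("starcraft", "guildwars2"): 1,
    ("diablo", "guildwars2"): 1,
    ("python", "ubuntu"): 1, ("python", "windows"): 1, ("ubuntu", "windows"): 1,
    ("diablo", "windows"): 1,
}

intuitive = {"starcraft": "gaming", "diablo": "gaming", "guildwars2": "gaming",
             "python": "tech", "ubuntu": "tech", "windows": "tech"}
lumped = {name: "all" for name in intuitive}
scrambled = {"starcraft": "x", "diablo": "x", "python": "x",
             "guildwars2": "y", "ubuntu": "y", "windows": "y"}

q_intuitive = modularity(edges, intuitive)
q_lumped = modularity(edges, lumped)
q_scrambled = modularity(edges, scrambled)
```

On this toy network the gaming/tech split scores higher than either putting everything in one group or mixing the two triangles, which is the sense in which a clustering algorithm can land on the same groups a person would draw by eye.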
- Yeah, so that was really neat to see that even without knowing it, you know, by posting and commenting on social networks, people were actually giving a structure to that social network that we can automatically detect using, you know, fairly standard network analysis methods. So I thought that was a really neat result for this project. - Yeah. How do you suppose it will evolve with time? Like if you repeated this in 10 years. - That's definitely a good question. So when I made this, I mean, this current version here was built off of user behavior in mid 2013. So it's already a little dated. And that was also when there were only 25 default subreddits. More recently, Reddit has actually expanded their default subreddit set to 50 now. So I'm going to guess that this middle cluster of the defaults is going to expand a little bit. Every day, you know, at least a dozen new communities are created on Reddit. And so over time, you know, we're getting hundreds more communities on Reddit. So I presume that if I were to rebuild this map today off of current user behavior, we'd probably see a lot more nodes and especially a lot more clusters on there, hopefully more sort of islands like the My Little Pony Island. That'd be really neat to see. - Yes, actually, we touched on the My Little Pony Island earlier. It might be a nice kind of use case for anyone in particular who's pulled up the map and can look at it. It very much is an island. Those are the nodes that are most pronounced as being disconnected from the glob in the center. Pop culture and music are very close, relatively speaking. There isn't a lot of white space between them. But, at least to my eyes, the most isolated cluster is the one you've identified as the My Little Pony Island, or so-called My Little Pony Island, off on its own as if to imply that those nodes are in some way distant from much of the rest of the mass. 
How does one interpret that sort of distance that's presenting itself there? - Right, so that basically means that the people of those My Little Pony communities tend to only post within those communities, and they don't tend to post so much outside of those communities. So they sort of formed their own isolated metacommunity within Reddit that actually doesn't interact that much with the other subreddits. And so that's pretty interesting that that was already identified there. And the funny thing is, when I shared this with the people who regularly post there in My Little Pony, they're like, yeah, that makes sense. A lot of people here, we sort of stick to ourselves here. We're a tight knit community. The funny thing is, if you actually go to the interactive version and you look at the communities in there, they're sort of My Little Pony themed versions of other subreddits on Reddit. So it'll be My Little Pony pictures or whatever. And they wanna say, if you wanna share pictures, especially related to My Little Pony, this is where you share it. Don't bother going to this larger community in Reddit, just stay here in our community. And so that was really interesting that that popped out. And a few other communities popped out like that too. I mean, especially if we focus on the porn one, it seems like the people who go there for porn only go there for porn, they don't comment much in the rest of the website, which probably makes sense. I mean, people do have multiple accounts on the website, so maybe that's their private time account that they use to make comments there. - That makes intuitive sense, absolutely. So if someone who's, let's say, a very casual front page Reddit user is looking to kind of interact with the visualization as a discovery tool or their own sort of recommendation engine, how can they go about using it for discovery? - So there's two ways I can see this being used. 
The first way, which I think is perhaps the most intuitive way, is if you know there's a topic that you're interested in, let's say you're interested in the NFL and football, then you can use the search feature on the interactive version and you can just type in NFL and it will zoom you in to that node or to that subreddit there. And so from there, you can sort of get an idea of, okay, what are the most related subreddits? Where else do people who talk about the NFL a lot commonly go to post links and comment? And so if we zoom into the NFL, well, there's a sports subreddit, that makes sense. And it also lists like the related ones on the side here. So, people who talk about the NFL on Reddit a lot, they also talk about beer a lot, they talk about college football a lot, fantasy football, fitness, things like this. So this is how I can imagine this being used. You can either look directly on the map or just look at the recommendations on the side there and go, oh, I didn't actually know there was a college football subreddit on Reddit. Oh, there is a fantasy football subreddit. And so in a sense, this is how I've actually discovered a lot of really, really neat communities that I never would have found on Reddit without this. I mean, I knew I was interested in data is beautiful and so I could go there and find out, that's how I discovered the MapPorn subreddit, which isn't actually a porn subreddit, it's a subreddit for people who like to share really cool maps, and there's a data subreddit and things like that. And so that's the one main way that I think it could be used, but also just if you really have no idea what subreddits you like on Reddit, let's say you're a first time visitor, you can sort of scan over these communities. 
So I'm looking at the clustered version and let's say if I go to the top left and there's this huge yellow community here and I'll start scrolling over and seeing what these subreddits are and I go, oh, interesting, there's StarCraft here, there's Dota 2 Trade, the Star Wars subreddit and various other things. And so you can sort of use it to just get a quick glance of what these major communities are and if you see interesting things there, then you can zoom in from there and find out what are the smaller communities. And so really one thing that I would love to do that would help with that is to automatically do what people were doing when they were taking screenshots of my map, which is just to automatically label these clusters. Like I would label this yellow gaming cluster here gaming or something and I would label the My Little Pony island My Little Pony, things like that. And so I think that would be a really interesting future avenue to go down that would help with this because we could automatically label them like that. - It could be a neat community effort too. - Right, absolutely. I mean, it seems possible to just crowdsource the labeling right now. But also, I mean, I think it's really neat to think about how this could be done automatically without human input, that's the cool part of this project I think. - Yeah, so maybe a listener who is interested in named entity extraction and things like that might wanna reach out and see if they can make a contribution. - Yep. - Cool, so tell me what's next for you, whether it's related to this or your academic work or your other writings. - Well, I guess I can talk about both. I'm coming to the end of my PhD. So I sort of started cutting off my side projects and focusing on my PhD. But for this project, I was actually talking with someone earlier this month about it. And one issue is that if you type in the name of a subreddit, it doesn't necessarily give the absolute best recommendations. 
It says here's related subreddits. But it doesn't take into account, for example, that a person may like multiple subreddits. - Sure. - I like pictures and I like football. I mean, those are very different subreddits in this map. But there's information there, right? If I'm posting a lot in pictures and I'm posting a lot in football, that might also mean that I'm really interested in football pictures, it would be really interesting if we could make a method that would take into account all of the person's interest and make recommendations based off of that. The one way we were talking about doing this was, basically, we know that there's information in the XY coordinate of each subreddit here, right? You know, the fact that the My Little Pony Island is far away from everything else tells us that all of the My Little Pony subreddits are highly related, but they're not related to the other subreddits. So if we could, you know, sort of automatically create, let's say 30 of these maps and then measure the distance between every subreddit, that might be able to, you know, sort of make use of the information. I mean, 'cause it was, it's still very much like the planning stage of how we're gonna actually take into account multiple user interests at once when making these recommendations. - Sure, yeah, exciting stuff though. - Yeah, so that's sort of the future for this project, but it's, like I said, it's sort of slow going. And then as far as the future for me, well, I hope to be graduating this spring. Well, actually, I'm, I should say, I'm graduating whether I'm ready or not this spring. The date's been set, so now I just need to finish up the dissertation and everything. And then after that, I'm headed off to the University of Pennsylvania to work with a professor there working on bioinformatics and AI research there, so for a postdoc. So that should be fun, you know, a different scene and a different research group to work with, so I'm pretty excited about that. 
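The multi-interest idea is described as still being in the planning stage, so the following is only one possible reading of it: generate several layouts of the map, average the distance between subreddits across those layouts, and recommend whatever sits closest to all of a person's interests at once. The coordinates, the subreddit names (including `football_pics`), and the `recommend` function are hypothetical.

```python
import math


def recommend(layouts, interests, top_n=3):
    """Rank other subreddits by summed distance to all interests, averaged over layouts."""
    candidates = set(layouts[0]) - set(interests)

    def score(candidate):
        total = sum(
            math.dist(layout[candidate], layout[interest])
            for layout in layouts
            for interest in interests
        )
        return total / len(layouts)

    # Smaller summed distance means closer to everything the person likes.
    return sorted(candidates, key=lambda c: (score(c), c))[:top_n]


# Two hypothetical layouts: subreddit name -> (x, y) position on the map.
layouts = [
    {"pics": (0, 0), "nfl": (4, 0), "sports": (5, 0),
     "football_pics": (2, 0), "mylittlepony": (20, 20)},
    {"pics": (0, 0), "nfl": (0, 4), "sports": (0, 5),
     "football_pics": (0, 2), "mylittlepony": (-20, 20)},
]

suggestions = recommend(layouts, ["pics", "nfl"], top_n=2)
print(suggestions)
```

Averaging over many layouts matters because a single force-directed layout depends on its random starting positions, so any one map's distances are noisy.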
- Excellent, congrats on the postdoc. - Thanks. - So I like to end each interview by asking my guests for two recommendations. The first being the benevolent reference, something you're not connected to, but you think is interesting and worthwhile you'd like to share. And the second being the self-serving reference, something that hopefully you get direct benefit out of from being on the show. And we've already featured your blog, randalolson.com, and there'll be a link in the show notes. So feel free to plug anything else as well. - Sure, so I guess for the benevolent recommendation, if you're just learning Python and you're wanting to get into visualization and statistical analysis, there's actually a really great library that's becoming a little bit more popular over time. It's called Seaborn. And it's a really great library because it makes beautiful graphs, you know, right out of the box. Also it makes statistical graphs. So it can fit a linear model to your data or look at a correlation between variables and all kinds of other really neat things. So I highly recommend checking out Seaborn. As soon as I ran across that library, I fell in love with it. And then I guess the self-serving recommendation. Well, of course, I mean, anyone interested in data analysis and visualization should definitely check out the data is beautiful subreddit on Reddit there. I'm actually a moderator of that community. I've been there for, well, actually I guess I just celebrated my one year, not too long ago. And so we have about 2 million subscribers there now, but definitely, you know, the people who don't know much about data analysis and visualization vastly outnumber the people who do know much about it. So we're really trying to build a bigger community of people that are knowledgeable or even just interested in data analysis and visualization there. So definitely come check out data is beautiful on Reddit if that's your thing. - Awesome, yeah, that's a great destination. 
And so I would also say, in addition to your blog, which is great for casual readers, I would encourage some of the more technical and seasoned data scientists to go check out some of your publications as well. There's links on your website and it's a lot of great stuff. - Absolutely. - Well, thanks again, Randy, for joining me. I think this has been a really fun chat and I enjoyed hearing more about the project. I'm sure the listeners are going to enjoy going and checking it out. - Yeah, awesome. I hope it inspired some people to go and check it out and maybe extend it, you know. I put everything out there online. So if anyone wants to build off of it or anything like that, I'm more than happy to work with you. - Oh, fantastic. Well, thanks again. - Yep, thanks for having me. (upbeat music)