(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - So Linda, I'm gonna have you jump on a quick intro to this episode. Are you cool with that? - Yeah. - All right, so I'm interviewing a researcher who used machine learning on Facebook likes data to help predict personality traits in a couple of categories, categories like openness. And what they found was very interesting. I'm not gonna ruin it until we listen to the actual interview, but basically they compared how well their machine learning output was able to predict people's traits, measured against their self-ratings, with ratings by people like their spouse and their friends. What do you think? - Sounds interesting. I think my vote is for the computer. - Oh yeah, that's interesting. Well, we'll see at the end. So I'm gonna go do that interview now. We'll come back at the end and I'll reveal the scores. - All right. - Well, welcome to another episode of the Data Skeptic Podcast. I'm joined this week by Youyou Wu. How are you, Youyou? - Good, how are you? - Doing very well. Thanks so much for joining me. I asked you on to discuss your interesting paper from the Proceedings of the National Academy of Sciences titled "Computer-based personality judgments are more accurate than those made by humans." I thought it was a really good read and I wanted to share it with my listeners. - Oh yeah. - Maybe we could start with your background and what got you interested in the subject. - I'm a personality psychologist. I study people's perception of each other, people's judgments of each other's personality. And traditionally, I look at how we perceive the people living with us, like our parents or friends or romantic partners. And recently, I got interested in looking at how our personality can be inferred from digital records of our behavior.
Like when you use Facebook, when you use Google, you leave a lot of digital traces of your behavior, like who you friended, who you talked to and what you liked on Facebook. And we can infer a lot of information about you, including personality. And I got interested in how accurately personality can be inferred from your own digital records of behavior online, and in comparing that with how well it can be judged by people who know you in real life. - I guess you're getting the ground truth measurement from the consensus of people who know you and can vote on the accuracy of the personality traits? Is that correct? - Yes, that's one thing. And another ground truth, the one that we mainly use in the paper, was people's self-report of their personality, like how you see yourself. - Makes sense. - Mm-hmm. - So as you were pointing out, yeah, we're leaving more and more of a digital footprint online. - Yes. - And you mentioned a couple of things, like Facebook likes and who you're connected with. Was that the extent of the data that you looked at in this analysis? - Yeah, so the data that we look at in this paper is Facebook likes, though there are certainly many other signals online that are indicative of our personality, like your language. And there are several other papers out there by other researchers looking at how, say, the language you use on Twitter, or in your Facebook status posts, can be predictive of your personality as well. - Can you give me some examples of personality traits that you're able to predict? - Yeah, so we look mainly at what we call the Big Five personality traits: openness to new experiences; conscientiousness, like how reliable you are at work and whether you are well-organized, that kind of thing; how extroverted you are; how agreeable, nice, cooperative you are, as opposed to being competitive; and, lastly, how neurotic, or how emotionally stable, you are.
- And what were the approaches you took to get computers to try and learn these models? - Yeah, so the assumption is that when people click on different things that they like on Facebook, people with different personalities will like different things, because they have different preferences, behavioral styles and tendencies in life. And we basically take a large dataset of people. We asked them to provide a list of their Facebook likes, as well as a self-reported personality. They rated themselves on the five personality traits I just described. And we used machine learning techniques, so we use computer models to figure out the association between liking something and having a personality trait. So for example, we found that extroverted people tend to like some things that are obvious, like partying and meeting new people, making people laugh. And there are also some less obvious associations, such as being introverted being related to liking watching Doctor Who. And yeah, so we basically have the computer models figure out the association between liking something and a trait. And then once we have the models, we can feed people's Facebook likes into the model and use the algorithm to make a prediction about this person's personality. - Very interesting. Can you tell me a little bit of the specifics around the machine learning techniques you used? - We use a very simple algorithm. We basically just use linear regressions, inputting all the likes that you have. And we did it in a cross-validated way, meaning that we are not predicting your personality from data that already contains your personality. We took an independent sample and we trained the models on that sample. And then we applied that to another sample, what we call the test sample, where we don't reveal the truth, the self-reported personality, to the computer. And we validate the accuracy that way.
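[Editor's note: the procedure described above (regress a trait score on likes, train on one sample, predict on a held-out sample, then correlate predictions with self-reports) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the data are synthetic, the sample sizes and the helper names `ten_fold_predictions` and `pearson_r` are hypothetical, and plain least squares stands in for whatever regression variant the paper actually used.]

```python
import numpy as np


def ten_fold_predictions(likes, trait, n_folds=10, seed=0):
    """Return out-of-sample trait predictions from a user-by-like 0/1 matrix."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trait))      # shuffle users once
    folds = np.array_split(order, n_folds)   # 10 disjoint test samples
    preds = np.empty(len(trait), dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # Fit a linear regression (with intercept) on the training users only.
        X_train = np.column_stack([np.ones(len(train_idx)), likes[train_idx]])
        coef, *_ = np.linalg.lstsq(X_train, trait[train_idx], rcond=None)
        # Predict for held-out users, whose self-reports the model never saw.
        X_test = np.column_stack([np.ones(len(test_idx)), likes[test_idx]])
        preds[test_idx] = X_test @ coef
    return preds


def pearson_r(a, b):
    """Pearson correlation, the accuracy measure quoted in the interview."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in data (an assumption, NOT the study's dataset):
# 2,000 users, 50 possible likes, one noisy trait score per user.
rng = np.random.default_rng(42)
n_users, n_likes = 2000, 50
likes = (rng.random((n_users, n_likes)) < 0.2).astype(float)
true_weights = rng.normal(size=n_likes)
trait = likes @ true_weights + rng.normal(scale=2.0, size=n_users)

preds = ten_fold_predictions(likes, trait)
accuracy = pearson_r(preds, trait)
print(f"cross-validated correlation: {accuracy:.2f}")
```

[Every user's prediction here comes from a model trained without that user's self-report, which is the point of the cross-validation described in the interview. The correlation printed for this toy data says nothing about the accuracy figures reported in the paper.]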
And we rotate: each time we take a different independent sample, and we did it 10 times. So we cross-validate to make sure that our model is applicable in real life and to many different kinds of populations. - Makes sense. About how many users did you use in your training set? - So in our training set, we use a total of around 80,000 people. Yeah, that's the number of people that we had in our dataset. - And roughly speaking, about how many likes on average did most users have? - I think in our sample, the average number of likes was 100. And our sample came from a dataset that was collected over five years, from 2007 to 2012. In our paper, we also estimated how the computer would do on average in terms of predicting personality, because we found that the more likes you have, the more information you reveal, and the more accurate the predictions we can make. So we needed the average number of Facebook likes that a typical Facebook user would have. And we collected another sample in 2014. And it was about, I think, 227 likes, and we used that as a baseline to estimate how our model would do on average if given an average Facebook user. - Interesting. Yeah, knowing that it's, you know, 100 likes or 200 likes or so, that's a lot, but it also isn't, because I would imagine there's a lot in there that isn't particularly predictive. So it's impressive that the computer is getting a nice result with a limited dataset. - Yes. - How did your predictions compare to the personality questionnaires you had for your ground truth measurements? - To measure the accuracy of our prediction, we correlate it with the ground truth, which is people's self-judgment of their own personality. And on average, using the average number of Facebook likes a typical user would have.
We found that our model would predict with a correlation of 0.56. - And how did that compare to when the participants' Facebook friends themselves tried to assess the people for those traits? - In our sample, we did ask a group of their Facebook friends to evaluate them, and that was an average of 0.52. So we determined from these numbers that on average, computer models can already be better than humans at judging personality traits. And as I said, how accurate our model is is also a function of how much information you reveal. So we think that with more likes, the computer is just more accurate. And it's a matter of time, as people become more open and tend to rely more and more on digital devices when it comes to activities in life. So I think, given time, as people leave more and more digital traces online, computer models will just do even better. - As I recall, you broke it down in terms of the Facebook friends that you asked to give those ratings for comparison. You broke it into work colleagues, friends, family, and spouse. - Right. - How did those categories affect the accuracy of the ratings? - Yeah, so the theory is that people know us from different life contexts, and the accuracy of their judgment of personality varies accordingly. And we found that romantic partners usually tend to be the most accurate judges of our personality. On average, they have a correlation accuracy of 0.58. And in our sample, it wasn't statistically different from the computer's average of 0.56. And so we determined that these two are comparable. We also looked at how many likes it would take for a computer to overtake a romantic partner. It turned out that you only need 300 Facebook likes for it to be as accurate as a romantic partner. - Well, that doesn't sum up a relationship too carefully.
- Yeah, no, but I mean, there is much more to a relationship than judging the Big Five personality traits. But to some extent it shows us that maybe in the future, a computer will have the capacity to do more than just judging personality, such as interacting with us and responding to our thoughts in a social way, based on analysis of our digital records. And I think this is the start of that social engineering to make computers able to interact with us in a socially intelligent way. - So I noticed some of the areas you're looking at are things like political attitudes and physical health. These things weren't too surprising to me, because people tend to like maybe politicians or charities or movements that are indicative of political attitudes. - Yes. - Similarly, I might like sports or hiking, which could imply a little bit about my physical health. But I was surprised you were able to have some validity in predicting substance abuse, because I presume most people don't like things like doing drugs or cocaine or anything like that. - Yeah. - How did those models work out? - Yeah, so it was surprising to us as well that personality predicted from Facebook likes can predict those things. And I want to emphasize that we use personality predicted from the likes to predict those outcomes. In a previous paper by my colleagues, my two co-authors, they actually found that if we use likes directly to predict those traits, it would be even more accurate. And I think for substance abuse, it's based on an assumption that people who are more, say, risk-taking and less concerned with consequences tend to be more likely to be engaged in drugs or substance use. And it's evident in the likes that they have; there are certain things that indicate whether you're a risk-taking person. - Do you think the computer or the machine learning approaches have advantages that we don't have as people? - Oh yeah, definitely.
And there are several advantages of computers over people in judging personality. First of all, the computer has more information. The information that you leave in your digital records gets recorded and memorized. All your records get retained on the computer. In comparison, even if you display a behavior that indicates your personality in real life, people might not pay attention, and for your friends or your family, it might not stay in their head. So I think computers have the advantage of accessing and storing more information, more relevant information about your personality, than our friends or family. Another advantage is in how we used an algorithm to figure out the association between behavior and personality. The computer is rational; it uses the optimal algorithm to figure out how to make a judgment. Human beings, meanwhile, use intuition, and a lot of times intuition can be wrong. And we can also be biased; we sometimes are motivated to see a person in a certain way. Say, for example, you want to judge your romantic partner as being a nice person, or you're motivated to believe that he or she has a certain kind of personality. So to some extent, computer models are less biased than people. But I think, on the other hand, people also have advantages over computers. We're able to capture a lot of subtle cues that are not available in the online environment, like body language and facial expressions and those kinds of things. - So how does this work fit into your overall academic interests? - I'm really interested in how we can infer a lot of social psychological traits from people's online behavior. And for now, we are looking at how personality traits, those stable characteristics of a person, can be inferred.
And I think following that, I am going to also look at how people's psychological state, like how happy you are, your emotional state, can be inferred from your digital records, like your status updates and your messaging and those things as well. - You had mentioned that computers at the moment don't have access to body language and these sorts of things. But there's no reason we couldn't be doing some video recognition, trying to capture those. So I'm curious about your thoughts on what are the most useful features that you would make available if you wanted to extend this work beyond just using Facebook likes. - I think that since we now have Apple Watches, they have the capacity to capture a lot of physiological measures, such as your heart rate and how active you are, those kinds of things. And we know that a lot of those physiological measures are very related to people's state, like how tense you are, how happy you are and those kinds of things. And I think those would be interesting features to look at in predicting people's psychological trait or state. - So what are some of the novel uses that we could apply, assuming work like this is going to continue and improve in accuracy? What are some of the benefits we can get from having computer-based personality judgments? - There are definitely some practical implications. One is that we can use it in the marketing context, where, say, advertisers or companies want to promote their products. Instead of targeting you universally on Facebook, where people are annoyed by how many ads they see, we can tailor ads according to people's personality. Say, if you're a person open to new experiences, I will probably recommend, say, extreme sports to you. And that way we basically improve people's ad experience.
And I think that's an important online experience to have. Now it's horrible: a lot of people are annoyed, and ad quality is not as good, because the ads are not tailored. That changes if we know people's personality and we know it at scale. In the past, the problem was that advertisers had no way to figure out people's personality. You cannot just ask people to fill out a questionnaire. It's time consuming and it's costly, and there's no way that we could do it at a large scale. Now we have this technology for turning digital records into personality, and it can be done at scale in no time. And so we can improve people's ad experience. Apart from that, there are some other practical implications, such as using it on, say, dating websites. The dating websites are already asking people to fill out tons of questions about themselves. First of all, that is time consuming. And second of all, a big problem with people's self-report is that on dating websites everybody tries to look as attractive as they can. And so people lie, without even realizing it. You definitely want to be a nice person, you want to be extroverted, you want to be attractive. And I think the technology that we're using now solves this problem, because it's much harder to lie. Of course you can click on Facebook likes intentionally, but it's harder to do it consistently for a few years to make yourself look like someone you actually are not. So I think the personality predicted from your own digital records is much more objective and accurate than people's self-report in the context of dating websites. And, as I said, it can be done very fast and cheaply. And we can use the personality that got predicted to match people on those platforms, so that people can have a better experience of finding their romantic partners on dating websites. - Yeah, absolutely. I'm wondering about your sense of the state of this line of research.
In other words, are we at the very basic stages and we're going to see lots of advancements over the next decade? Or are we nearing the upper limit of how predictable human personality actually can be? - No, I definitely don't think that we're near the upper limit at all. I think in the area of prediction, computers are pretty good at predicting some things. Google Maps is super accurate at figuring out directions, at predicting where you would go, at quickly figuring out where your home is and where your work address is. But I think in terms of predicting social traits, like psychological traits, we're at a very early stage. I think that we're definitely going to see a lot of advancement in the next decade, as more and more social scientists, like me, are starting to work on those things. And I think advancement is also going to be driven by the implications, such as those I just described. And I think the research and the implications are going to evolve together, and they're gonna stimulate each other. - Do you think people should be worried about privacy at all? For example, if I like Red Bull, is that going to imply that I'm a risk taker, and maybe a future employer will be a little bit averse to entertaining me as a candidate? - Yeah, so it's already obvious that people are super concerned about their privacy, and they freak out when they hear that a lot of information that they didn't reveal is actually obvious to companies like Facebook and Google. And I think you definitely should worry; people have the right to protect their privacy. And I think that it's mostly the responsibility of the company, the service provider, as well as the government, to make sure that people's privacy, people's data, are protected.
And first of all, I think that the government should design policies in a way that minimizes associated risks, like making sure that companies are transparent about how they're gonna use their users' data, how it's being analyzed and how it might be used in the future. And second, I think those digital service providers should make sure that users have full control of their data and can decide for which purposes it can be used. Also, I think that it's important for companies to show and inform users of the benefits of how their data is going to be used. A good example would be what Netflix does. It takes a lot of information from you. It asks you what movies you've watched in the past and that kind of thing. But people are happy to give out this information because they know that it's gonna improve their viewing experience, because giving out information lets it recommend movies more accurately. And I think in the social sector, companies like Facebook and Google should do the same thing and try to show users first how their data is going to be used and how it's going to benefit us. And it's gonna make people more willing to accept this technology. - Yeah, I'm really glad you made that point. That's the other side of the privacy argument that I don't think is discussed enough: that there is a benefit to revealing information, and that the services you get back can be more custom tailored to you. So what's next in your line of research? - As I said, now I'm working on something to predict people's psychological affect. And we are possibly looking at Facebook status updates to see how they can reveal how happy a person was when he posted a thing, and how happy this person is in general, by looking at all his past posts. - So I'd like to wind up my shows by asking my guests for two recommendations.
The first is the benevolent recommendation, a nod or link to a book or paper or something like that that you think the listeners would benefit from knowing about but that you're not directly connected to. And the second is the self-serving recommendation. Hopefully something you get direct benefit from by appearing here. - So I think a paper that I would recommend would be a previous paper by my two co-authors. I'm not involved, but it's a good paper. It's called "Private traits and attributes are predictable from digital records of human behavior." It was also published in PNAS, in 2013. And that paper took a much broader approach, looking at what kinds of traits are predictable from people's digital records. In my paper, it was personality. In that paper, they look at a lot of other things, like demographic traits such as sexual orientation, ethnicity, religion and political views, your intelligence, happiness, substance abuse, whether your parents were divorced by the time you were 21, and your age and gender. And that paper just showed us how much information we reveal, just by looking at people's Facebook likes. So I think it's an interesting paper to read as well, and it is related to what I did in my research. And I think another interesting app I would like people to try is called Apply Magic Sauce. So it's just applymagicsauce.com. Basically it's a prediction API, and it's an app that is built based on our research. People can input their own Facebook likes and see our predictions of the traits that I just described, including personality. And you can judge whether our models are accurate or not. - Well, that's really cool. I'm gonna go take that test myself and I'll post the results on the podcast's official Facebook page for anyone who wants to check that out. - Yeah, let me know. - Maybe if listeners want, they can do the same and we can all share our results and our experience with the test.
So that'd be a cool thing to do on the page. Oh, and one more thing. I have a wide spectrum of listeners, some who don't necessarily delve into academic journals very often. But I have to say, if people are looking to read their very first academic publication, yours is quite accessible. Not all listeners will necessarily understand the Pearson correlation coefficient, or some of the technical pieces like that. But I think it's a very accessible paper anyone can read. So I highly recommend it to people looking to take the dive into more technical literature. It'll be linked in the show notes for anyone who wants to do that. And thanks again for your time today. - Yeah, it's a pleasure. (upbeat music) - So the listeners have now heard the interview. We're back, Linda and I, to reveal what my scores were and to get her ratings of me so we can test the accuracy. I should just say before this, this is totally anecdotal. This is just one data point. This doesn't validate or invalidate the study. It's just a fun way to explore it. So first off, all these are on a scale of zero to one, or zero to a hundred if you want to think in percentages: the degree to which these are statements that describe me. All right, the first one is openness. How do you rate me on openness? - Did they define each of these traits? - They did, but we're out at a restaurant and I don't have the definitions. So go with whatever you find to be the sort of colloquial definition of that. - Openness. - Yeah. - On a scale of how open you are? - Yeah. - 75%. - 75%. - Okay. I gave myself a 0.5 and the computer gave me a 0.53. - Wow. - All right. How about agreeableness? - 60%. - 60%, okay. I gave myself a 0.5. The computer gave me a 0.46. (laughing) - Computer still won, huh? - Yeah. - Extraversion, or the degree to which you're extroverted? - 40%. - Uh-huh. I also said 40% for me and the computer said 32%. Conscientiousness. - Of what?
- Of the universe, I don't know. - I'm confused what that means. Of yourself? - No, conscientiousness is like of your community and those around you. - You mean like considerate? - Yeah, I think. - I don't think you're defining that right. So I'm gonna go with my original thought. - Okay. - Okay, let me think. - Just remember, I rated myself on the definition I gave, so. - It was a definition you thought it was. - The one I just said. - Considerate. Okay. Like 40. - 40? Well, the computer said 0.37. So you're close to that. - They're at bay now. - I said 0.8. So, don't laugh. (laughing) - This is why I said you need to have the definitions there. - Okay, and the last one is neuroticism. - 20%. - I said 30%. The computer said 38%. - Oh, the computer thinks you're more neurotic. - Than I claim to be. - I think that's good for me. (laughing) - It also guessed me to be 24. It thought I was not that likely to be married, and a few other interesting things. So, I'm gonna post all that on the show's Facebook page if anyone wants to check that out. Linda, will you take your scores too after this? All right, and we'll run you through the same experiment as well. So, more with that in a second. - Hold on. So, we've gone to applymagicsauce.com. Linda has now logged in. And she has given her ratings of the five traits in advance. Let's go see what the algorithm says. - I'm scrolling down. - Okay. - It thinks I'm not in a relationship. - That's all right. It said about the same amount for me. - Not enough posts about you, girl. - I guess not. - 70 percent chance of being married. - Okay, here are your personality traits. Openness. The computer gave you a 56. You gave yourself? - 60. - Oh, I gave you a 53. - Oh, close. - So, I got close. And conscientiousness? - 50. Oh, I'm like 90th percentile, for sure. - I gave you 82. - 50, you're saying I'm just medium? - The algorithm gave you 50. - It's wrong. - Extraversion. You gave yourself? - 31. I thought it would think I was extroverted.
- The algorithm gave you 31. You gave yourself? - 75, would you? - I also gave you 75. - Yeah, you know what? - I actually grew up a really shy kid, so I could swing either way, personally. - Agreeableness, the algorithm gives you 45. - Oh, I got 25. - I said 46. - Really? - Yeah. - Neuroticism? - 41 percent. - The algorithm gave you 41. - And I thought I was 60. - Oh, I said 31. - But what is neuroticism? It says calm and relaxed. I'm calm and relaxed, what is the most? - Find you to be relevant. - A fundamental personality trait characterized by anxiety, fear, moodiness, worry, envy, frustration, jealousy, and loneliness. I would say sometimes I have anxiety, fear, I'm moody. I'm frustrated, and sometimes I'm lonely, yes. - Well, I think of neuroticism as like an extreme of all those. Anyway, we both, I think, outperform the algorithm on rating each other, but that's kind of consistent with her finding that spouses are pretty good at this sort of thing. So anecdotal stuff, but really interesting. - Thank you for participating, Linda. - Thank you. - And we'll see you guys next week for another mini episode. - Yep, hopefully we'll learn more about my neuroticism. (laughs) (upbeat music)