(upbeat music) - The Data Skeptic Podcast is a weekly show featuring conversations about skepticism, critical thinking, and data science. - So welcome to another episode of the Data Skeptic Podcast. I'm joined this week by my guest, Emre Sarigol. How are you doing, Emre? - Good, thank you. - I asked you on to discuss a really interesting paper I had read that you were a co-author on, related to online privacy. Before we get in, would you share some of your background and why this topic was an interesting research point for you? - Well, initially I studied computer science, about 10 years ago. And since then I've worked in different jobs, as a developer and then as a researcher and so on. And somehow I've always been interested in privacy, mostly due to the political aspects of privacy, because this has become quite a vulnerable point, especially in Western societies, in the last few years. Also in my own country, this is a problem. People get profiled due to their activity online, and then the boundary is not really clear: where should one stop, what should not be included anymore, when does privacy become an issue, when, for instance, the government can use this against you, and so on. So this issue really started interesting me. My friend David Garcia and I were interested in doing some quantitative analysis, in the line of our research, of privacy at large in a social network: how much privacy is leaked or preserved and so on. - And I don't think I asked where you're from. - Oh yeah, I'm from Turkey, by the way. My home town is Istanbul. - Istanbul's on my wife's list of places we need to visit. - Yeah, well, I recommend it. It's a very mixed land. - Excellent. - You will probably observe that yourself.
- Yeah, I think privacy and government surveillance are very interesting and important topics. At least in my country, the United States, we've learned, because of everything with Edward Snowden, just how much surveillance is going on. So it's certainly a very relevant topic in today's world. There's a specific phrase I think we'll use a lot that I don't know every listener will be familiar with. So I was wondering if you could define what shadow profiles are. - Okay, so shadow profiles, this is actually a term that we came across in a blog article. It's cited in the paper; I cannot recall it now. It refers to a bug that was disclosed by some users when they downloaded their Facebook data. It turned out that when you share your contact list with Facebook, it basically makes a list of all your email addresses, all your contacts and so on. And then another person also shares their email addresses and so on. Eventually, they are able to build these directories of email addresses. So even if you change your email address, they can somehow join it together. - This bug was discovered by a guy when he downloaded his Facebook data. You know, you can do this; I don't know if you're still able to, but you used to be able to download all your Facebook data. And then he saw the contacts of people that he didn't know, basically. This immediately gave him the hint that Facebook is building these directories based on the email addresses that have been shared. So then this term was coined, shadow profiles, and it referred to basically any kind of information that Facebook is able to mine just from the information that users are sharing. So this can include email addresses and so on. But then we have split this term into two. - The first one is the partial shadow profiles.
These are the profiles that a social network, in this case potentially Facebook, can build about its own members who have not shared any information, but have joined the network. They have some contacts, and through these contacts who have revealed some information, you're able to build partial shadow profiles. This is one of the types. And the other type is full shadow profiles. This is exactly what the bug referred to. These are shadow profiles that can be built about individuals who are not part of the social network yet, but whose email addresses or some contact information has already been shared by the users of the social network. And so we made these two distinctions to see basically which side is more vulnerable. Obviously the partial shadow profiles are somewhat easier to construct, because the subjects are already part of the social network and so on. But still, it was more interesting to analyze and to figure out basically how vulnerable the people that are not part of the social network are. - Interesting, yeah. I have a friend who is very concerned with privacy and refuses to make a Facebook account whatsoever. But it sounds like, from some of the techniques you mentioned, Facebook almost certainly knows about this friend of mine already and has some amount of data about him. - Yeah, I mean, this is the main focus of this study. This is exactly that. I'm not accounting for the extreme cases where you're totally on the anonymous side and so on. I mean, even then, of course, obviously there are risks, but individual efforts alone will not save your privacy. This is the main point of the paper, because here we try to show that privacy, as we know it, is actually a public concept. It is something that the collective should take care of. It's a collective value.
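The contact-directory mechanism behind full shadow profiles, as described above, can be sketched in a few lines. The names and addresses below are made up, and the merge rule is only a guess at the simplest possible implementation:

```python
# Sketch (hypothetical data): merging shared contact lists into "shadow"
# entries keyed by email address, as in the Facebook bug described above.
from collections import defaultdict

def build_shadow_directory(uploads):
    """uploads: list of (uploader, contact_emails) pairs.
    Returns email -> set of users who disclosed that address."""
    directory = defaultdict(set)
    for uploader, contacts in uploads:
        for email in contacts:
            directory[email].add(uploader)
    return directory

uploads = [
    ("alice", ["carol@example.org", "dave@example.org"]),
    ("bob",   ["carol@example.org"]),
]
shadow = build_shadow_directory(uploads)
# carol never joined the network, yet two users already disclosed her address
print(sorted(shadow["carol@example.org"]))  # prints ['alice', 'bob']
```

The point of the sketch is that the directory grows with every upload: non-members acquire entries without ever interacting with the service themselves.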
If I want to respect others' privacy too, I should interact with the social network in such a way that I not only protect my own privacy, but also other people's privacy, because there's certainly a network effect. And so privacy is basically lost in cascades. This relates to one of the terms we introduced in the paper called the leakage factor. It basically shows that the more information I leak, the more my neighbors in the network, or my contacts outside the network, are liable to be compromised. So there's a cascading effect here. - So let's get into the specifics of your research. Can you talk a little bit about the data set you chose to do the exploration? - Well, I wouldn't say it's a choice. I mean, if I had Facebook data, of course I would do it with Facebook data, obviously. But anyway, we used data from Friendster. Friendster is the predecessor of Facebook, according to many people. The data set is available in the Internet Archive. What we did was basically download the first 20 million profiles. And the first 20 million users have mostly, with a very high percentage, been from the United States. So we downloaded these profiles and extracted all of those that had some public information in them. Of course, there were also holes in the data; not all of the first 20 million were there. It was basically what the Internet Archive provided. So this boiled down to six million or so profiles that had some data in them, and out of the six million profiles, about three and a half million had at least one contact. So this was our basic data set: the three and a half million users that helped us build a full social network, because of the friend lists that they have exposed and so on. Since we also knew the sequence of joining times, we were able to split the network into time sections, so to speak. - And what sort of features are available in the Friendster data you acquired?
- Actually, these are pretty primitive features, mostly categorical ones. Obviously the name, the birth date, the gender. For gender, you can select male, female, or unspecified. Then your relationship status, which has five categories: single, married, in a relationship, it's complicated, and domestic partnership. And then there come interests. There you can choose out of 10 options: why are you in the social network? This includes things like friends, as in I'm looking for friends, I'm looking for activity partners, I'm just looking around. And then there were some more explicit things like dating men, dating women, dating men and women, relationship with men, relationship with women, and relationship with men and women, and so on. So these were the six fields that we extracted: name, birthday, gender, interests, relationship status, and the user ID. Other than this, there was, I think, the last login location and something else that I don't remember now. But basically what we wanted to do was, in the end, frame our research around sexual orientation by using the gender, the relationship status, and the interests, classifying users with respect to their sexual orientation using these three features, so we didn't make use of the other ones. In the end, it boiled down to analyzing the predictability of the sexual orientation of users and non-users of the social network. But this is not specifically a paper that studies gender or behavior and so on. This was somehow a proxy for us to analyze what happens to privacy and so on. - Sure. - Yeah, it could have been political interest, if that had been available; that's my point.
- Yeah, you referenced a number of great other works where people had done similar things, where they could infer the gender or the age or the political affiliation or someone's home location. All of which is kind of interesting, although these are harder things to hide. As soon as you see a picture of me, it's clear I'm a male, and maybe I don't want you to know how old I am, but you can guess within plus or minus five years, perhaps. But I thought that your choice to look at sexual orientation was a really interesting feature, because it's one that you most likely cannot tell on the surface, just from a photograph or something. And it's something that, unfortunately, some people think needs to be private. And by unfortunate, I mean I would like to live in a world where everyone is free to share this if they wish to share it. But there are certain places and situations where people would be in a dangerous or bad spot if they shared it. So they really need to keep it private, unfortunately. - Yeah, you're absolutely right. This actually was really a fitting choice, not only from the technical perspective, where you have an unbalanced data set, unbalanced classes: you know that there are more heterosexuals than homosexuals in your data set and so on. But not only that; exactly as you remarked, the study is also about how vulnerable minority classes are. Privacy, for many people, carries a different meaning. If I have nothing to hide, privacy is basically not an issue for me. And just as you said, if, because of the political context or because of social norms or whatsoever, I have to hide my sexual orientation, then privacy is an issue for me. And I am on the minority side. And then it becomes a minority problem, basically. Because of this, I think you pointed out an important point about the choice of the data, yes.
- Yeah, and I would say, even though you made a very good attempt to gather as much data as you could, the result is a somewhat limited data set, especially in comparison to the volumes of data Facebook has. - Yeah, certainly. - How much of a handicap do you feel you were at compared to what they have access to? - I mean, we had a huge handicap. Of course, whatever I say now would be an estimate. - Right, right. - Somewhere in the paper we have the feature vector that we used for the classification, for the machine learning side. And this had the number of homosexual males, heterosexual males and so on, in my first neighborhood, in my second neighborhood, and in my third neighborhood. This amounted to some 30 features or so. And we used age and things like that too. So in the end we had around 36 features in the feature vector, all of which are derived from the counts of the different classes of individuals in my neighborhoods and in my broader neighborhood. So if I imagine what Facebook has access to: first of all, their network is much more dense, obviously. Their network is much more layered. Here we have just friendship links, but there you can construct different layers of network. You like something, you like a post, somebody else likes a post; you have a bipartite network in this case, and so on. So there are many different types of networks you can build, and the network is much more dense. Each node has much more information. Here we just had this little profile that people built, where they selected some things about their interests. There you have actual text. This is already a lot of information on its own, because you're writing text about something, most of which is often very emotional or emotionally engaging. Of course, you have much more data, in many dimensions.
- So let's talk a little bit about the machine learning aspects you employed to try and do the predictions. - For the partial shadow profiles that I mentioned earlier, the idea was to get a random subset of the users. So we have a parameter, which is the percentage of the users that are revealing information about themselves in each simulation. This parameter is called R. For each R, we had 10 simulations. So for the case where 10% of the users revealed some information, we made 10 simulations, and for 20% we made 10 simulations, and so on. And then, using these R% of users that are sharing information, we trained a classifier using random forests, which are, in many cases, my personal favorite classifiers. And then we kept growing this amount: how much of the network is sharing information? We grew this until R = 90% to test whether privacy is more and more compromised for the users that are not sharing information. Once you get to more than 30, 40% of the network, you see that the predictions are already going well beyond the base rate and so on. This was the first aspect. It was basically using the people that shared some information as the training set and then trying to figure out the sexual orientations of the other people that didn't share anything. For the full shadow profiles, it was a bit different. We also had another parameter, the disclosure parameter. This was the fraction of people that actually shared their contacts. We somehow tried to simulate the case where, for instance, when you join Facebook, you share your contacts. Now, with Facebook Messenger, you share all your contacts, but in the old days, in the desktop version, you didn't have this obligation. So it became much more of an issue. But in the Facebook situation, let's say you shared your contact list; the fraction of users doing this was called the disclosure parameter in our paper.
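The first procedure described above, sampling R% of users as a training set and predicting the rest, might look like the following sketch. The toy graph and labels are invented, and a simple majority vote over revealing neighbors stands in here for the paper's random forest over neighborhood class counts:

```python
import random
from collections import Counter

# Toy graph: node -> neighbors, plus a hidden binary attribute per node.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}

def simulate(r, seed):
    """One run: r = fraction of users revealing their attribute.
    Predicts hidden users from the majority label among revealing
    neighbors (a stand-in for a trained classifier)."""
    rng = random.Random(seed)
    revealed = set(rng.sample(sorted(graph), max(1, int(r * len(graph)))))
    correct = total = 0
    for node in graph:
        if node in revealed:
            continue
        votes = Counter(labels[n] for n in graph[node] if n in revealed)
        if votes:
            total += 1
            correct += votes.most_common(1)[0][0] == labels[node]
    return correct, total

# As in the setup described above: several simulations per value of R.
for r in (0.2, 0.5, 0.8):
    runs = [simulate(r, seed) for seed in range(10)]
    c = sum(x for x, _ in runs); t = sum(y for _, y in runs)
    print(f"R={r:.0%}: accuracy {c}/{t}")
```

The structure mirrors the experiment: grow R, repeat the random draw several times, and watch whether predictions for the non-revealing users pull away from chance.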
This is a parameter we increased in time. When the network is at 10% of its actual size, only 10% of the 3.5 million users that we have had joined the network. So the network is small. Some of the users in this 10% are disclosing, with a probability P, all their contacts in the outside world, contacts who haven't joined the network yet. Then the network grows; it becomes 20% and 30% of its actual size and so on. So we grew the network, and at each snapshot we used three different disclosure parameters: 50% of the users are disclosing their contact lists, then 70%, and then, if I'm not mistaken, 90% of the users were disclosing their contact information. So basically this is how we grew the model in the second case. And for the machine learning part, what we did was train the classifier on the known part of the network. Let's say the network is at 50% of its actual size. Of that 50%, we pretend we don't know some of them, train the classifier with the remaining ones using this known part of the network, and then try to estimate the sexual orientations of the users that are outside the network. And we did this for each size of the network and each disclosure parameter. In the end, what we found was that as the network grows, no matter which class you are in, you are getting more and more compromised. The minority classes are more and more affected; it becomes more of a problem for them. For instance, if you are outside and you belong to a minority class, in this case homosexual males, if you didn't join the network yet, it's enough if more than one person discloses your information, and then your sexual orientation is most probably predicted. What we found was that this went all the way up to 60, 65% in precision, while the base rate for homosexual males was less than 5%.
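The gap between those two figures can be made concrete with a toy calculation. Only the 5% base rate and roughly 65% precision come from the conversation; the population and prediction counts below are invented:

```python
# Precision vs. base rate, using the figures quoted above (5% base rate,
# ~65% precision); the prediction counts themselves are invented.
def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)

population = 10_000
minority = 500                  # 5% base rate: guessing blind is right 1 in 20
base_rate = minority / population

# Suppose the classifier flags 200 outsiders, 130 of them correctly.
p = precision(130, 70)
print(f"base rate {base_rate:.0%}, precision {p:.0%}, lift {p / base_rate:.0f}x")
```

In other words, under these assumed counts the classifier's guesses about non-members are about thirteen times better than chance.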
So this is a scary amount if you think about it. What we also found was that if you're outside the network and you connect to one person, it is not just this connection, but also how connected this one person is. The size of your second order neighborhood within the network is also an important predictor. If more popular people are leaking information, this amounts to a higher precision value, basically. And this is somehow intuitive, because if you're a popular user in a network, what makes you popular is the interaction. If you're having a lot of contacts, sharing a lot of posts, making a lot of friends, you are in that sense very active, very connected, and really embedded in the network. That means that you reach much more information than someone that doesn't have as many contacts and so on. But this is, of course, also a dilemma, because if you think about it, what makes you popular in a social network is precisely feeding information into the social network. This raises a problem: how do you not give any information to the social network and still be popular? Because at the end of the day, what you like in the social network is attention and popularity and so on. This is true in many cases, in many studies. So a dilemma is already present. - Yeah, so let me see if I follow everything as best as I can. Let's say there was one user who was a homosexual male and wanted to keep that information private, so they didn't join any social networks. If I have no information on them, the base rate is about 5%, meaning that maybe one out of 20 potential people fit that category. - Yeah, exactly. So that would be the best I could guess, yeah.
Even though that person doesn't join, simply having one person share a contact list that's connected to them brings in not only that connection but all the second order connections, and at that point you're getting a 65% precision on predicting the non-joining user. - Yeah, exactly. - Wow. That's strikingly high compared to the base rate. - Yeah, it is, unfortunately, given the fact that you don't own the data and so on. There are all these aspects; the legal side of the whole thing is a mess anyway. This is just one of the many scary aspects of not owning your data: basically they can do anything with it, because it belongs to them, right? - So based on the tuning you were doing with the disclosure parameter, how minimal of a shared set do you think has to be achieved before we have good coverage? In other words, if only one out of 10 users will share their full contact list, does that give the social network enough data, or is there a tipping point? - In the beginning we were thinking about analyzing the disclosure parameter starting from 10%, from 0.1; I think we put it in those terms in the paper. But then we asked ourselves why we should care what happens if only 10% of the users share their contact lists. The rationale behind it was that in Facebook, which was of course our main example, mobile users are anyway obliged to share their contact lists, which means that for the mobile user case, 100% of them are sharing it. - That's a great point, yeah. - Starting from 10% would be too much of an underestimation. That's why we started from 50% and kept growing it, 60, 70, up until 90, and then we left out 60 and 80%. I think we report only 50, 70 and 90%, because that already captured enough of a difference between the levels of disclosure. - Yeah, so basically, I wouldn't expect to see a tipping point somewhere before the 50% barrier.
I think the increase is pretty much monotonic: the more people share contact lists, the more vulnerable outside users get. - I think sexual orientation was a really good choice as the hidden feature you were trying to discover. But if we were to brainstorm, what are some of the other things you think you could uncover with time about users? Like estimate their income, or how healthy they are? This is somehow like a rabbit hole, right? I mean, what you can do with machine learning and data. The thing is, initially you have to keep in mind that you are limited to the data that you have. In this case, we have these user profiles that are providing us with some information. And then, of course, I can imagine that you have a different classifier trained from a different data set; take your example about income distribution with age, income distribution with location, and things like this. If you have other data sets to consult, you can infer more things than what the social network provides you in the first place. So, for instance, using somebody's location and age and so on, I could predict their income level, their social level, or their education level, and things like this. Location is a very good predictor, and there are many, many studies that have been done with this. It's quite a unique predictor, because you don't follow the same path as many other people; it is already individual, like a fingerprint. So it basically boils down to what other data sets you have, what other correlations you can pull out of these data sets. If you just have the social network data itself, like in our case, you are limited to what you put in the system. I put in certain features and I expect to get out only a set of features and something about them, if you know what I mean. Yeah, rabbit hole is the best word to use when you talk about location data.
- Yeah, I agree, yeah. So there's a term we mentioned earlier that I think you coined that I wanted to revisit, because I think it's an especially useful metric, and I hope I see other researchers following the same line of reasoning: the privacy leak factor. So would you mind sharing again the formal definition and how you used it? - There are two definitions of it, for partial shadow profiles and full shadow profiles, but they are more or less the same. In the partial shadow profiles case, it is the weight of the size of the known network, the percentage of the network that's known, the R parameter that I mentioned earlier, in a linear regressor, where the regressed quantity is the kappa coefficient. The kappa coefficient measures the agreement across all classes. It is given by the relative observed agreement between the classifications and the test data, and the probability of chance agreement. The reason why we use the kappa coefficient is that it can weigh different classes equally, because it is calculated as one class against the rest of them, then another class against the rest, and so on: a one-class-versus-all classification problem. For instance, we could have used the F score as the regressed quantity, but the F score would depend on the distribution of classes in the data set. The F score would naturally be higher for heterosexual males in our data set, which is about half of the whole data set. With the kappa coefficient, you don't have this. So it gave us a neat sort of metric: the increase of the kappa coefficient over R, in the partial shadow profiles case. And for the full shadow profiles, we did the same regression over the size of the known network. So in both cases, the privacy leak factor refers to how much the network is able to learn depending on the information that it already has.
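A minimal sketch of the two ingredients just described, assuming an ordinary least-squares fit: Cohen's kappa as the regressed quantity, with the slope of kappa against R playing the role of the leak factor. The (R, kappa) points are invented for illustration:

```python
# Cohen's kappa for a small labeled sample, and a least-squares line of
# kappa against R; the slope is what the paper calls the privacy leak factor.
def cohens_kappa(y_true, y_pred):
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    classes = set(y_true) | set(y_pred)
    p_chance = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes
    )
    return (p_obs - p_chance) / (1 - p_chance)

def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )

print(cohens_kappa(list("AABB"), list("ABAB")))  # chance-level agreement, prints 0.0
r_values = [0.1, 0.3, 0.5, 0.7, 0.9]
kappas   = [0.05, 0.15, 0.25, 0.35, 0.45]        # invented, perfectly linear points
print(slope(r_values, kappas))                    # the fitted slope, ~0.5 here
```

Unlike accuracy or the F score, kappa is zero for a chance-level classifier regardless of how unbalanced the classes are, which is exactly the property described above.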
- And I noticed the leak factors for the partial shadow profiles are all noticeable, but one really stood out. I think it was homosexual males, which is incredibly high. Did you have any follow-up thoughts on why that class was much more predictable than some of the others? - Yeah, we did. This somehow occupied us; we were thinking of actually focusing more on this point in the paper, but in the end we didn't. I think we cite one paper that did a study about this; it's called Gaydar, as far as I remember. These guys were able to predict gay individuals based just on their assortative relationships. And it was more geared towards gender studies also. As far as I remember, they were suggesting that it is the assortative nature of their communication that makes them more predictable. Since we use neighborhood information, first order, second order, and third order neighborhoods, I think these also captured this assortative nature of the communication. Of course, in our case, since we just have friendship links, what assortativity means is friendship: you are friends with people of the same sexual orientation. That is, in our case, what assortativity is. The Gaydar people, I think, used data from Facebook, a relatively small data set, but still their assortativity metric was somewhat more advanced, in terms of messages or something like that. But I'm really not clear; I'd have to double-check that. - A lot of what's highlighted is that there's an asymmetric information distribution. So as a private user, I can choose what to share, and I can control that a little bit, but I cannot control what my peers are sharing. And since the social networks have access to all the data, they can aggregate it, and they have a very complete picture compared to my incomplete picture.
I'm wondering, from your perspective, does there need to be an intervention, or do users just need to accept that we now live in a world where nothing is private? - Well, I personally think that the concept of privacy is changing quite fast anyway. It became accepted that you give up your privacy. For instance, five years ago or so, it could be chilling to see some ad on your screen that relates to your browsing practices. Now nobody cares about this, you know? This is already a privacy breach, but it became accepted. So the perception of privacy is changing. I think it's becoming more and more public, in a sense. It's a complicated matter, I think. On the one hand, you're out in the open; your privacy is not yours anymore. On the other hand, of course, there are things you can do to stop this. The concept of privacy being breached is often perceived as something beneficial to the users too, and this maybe has to change. Often the belief is that I give up some information, but in exchange I get some services, I get benefits from this. In the search case, you get certain ads and stuff like that; maybe you like that and so on. In the social networking case, you get access to your peers, the phenomenon called peer surveillance: I give information, and I take information about my people, people like me, or people that are different from me, and so on. As long as there is this concept of receiving benefits, I don't think this will change. What can be done, of course, is that you become more aware of the fact that your interaction is not just your interaction; you are part of a system. And the system is growing out of your control. What you can do is basically provide less information to it somehow. This, of course, requires some awareness, some expertise; not that much, but still some.
This comes to the point of: you would think, I have nothing to hide, why would I care, right? - Sure. - And this, of course, raises a problem, because you get into the situation where other people have to care about it, and other people are suffering from it and so on. For instance, there's this case with WikiLeaks that happened a few days before Christmas, right? Google revealed that they had given information about three people to the US government, two years after actually doing it, and so on. And now those people have to fight this. I may not care that these things are happening, but they are happening. And what I do, my contribution, by providing a lot of information to Google, to Facebook, to systems like this, is to empower them, basically, because they are capitalizing on our interactions. They live and exist based on our interactions. Back to the question, what happens to privacy in this case? I think we have to remember what privacy means, and we have to treat it as a collective value. Personally, it hurts to see that when I use Facebook, for instance, the same tool is being used to catch demonstrators on the street that demonstrate for the same things that I would; it's a privacy breach for me, too. We have to somehow realize that this is a collective concept. - Yeah, I think if there's only one takeaway people get, it should absolutely be that: when I'm sharing something, I'm not only giving up my information, but I'm giving up some secondary information about all the people around me. - Exactly. This is the main takeaway. This is what we wanted to show, at least from a quantitative, systematic perspective. - Yeah, I think it's well done; you definitely achieved it. - Thank you. - So what's next for you in terms of research? Is there going to be a follow-up, or is this kind of the end of the chapter? - Well, there are a few things on the table.
One of them is certainly to look specifically into network features, to take a more network science approach in analyzing the privacy leak factor of individuals that sit at different levels of centrality in the network. There are many, many measures you can utilize to do this. The other thing is to create categories to get deeper into the data, and try to figure out what leaks more information and what leaks less: what user behavior is basically malicious, in this sense. We have a few ideas, and some data, and we are constantly trying to do more on this, because we think it's important. - Definitely. - The first milestone is, I think, to really show the systematic aspects, like the network science approach that I just talked about, and then some more follow-up from that. We want to do some text analysis and so on, once we get our hands on some data: how much text, how much speech, discloses information about contacts and things like this. Speech meaning, of course, written text. What we did here is really just scratching the surface of this phenomenon; I feel that there's much more to explore. - Definitely. - Much more to disclose. - Yeah, I'm eager to see what you come up with next. So, in closing, I like to ask my guests to give two recommendations. The first is what I call the benevolent recommendation: something you're not connected to, but you think is worthwhile and would like to share. And the second is the self-serving recommendation: something that, ideally, you get some direct benefit from by appearing on the podcast. - Well, for the benevolent recommendation, in terms of the research that we have done here, I would suggest using some tools that are, to some extent, privacy preserving.
You could take any VPN service. I quite occasionally use ZenMate, a company based in Germany; back when there were some internet outages in Turkey two years ago, because of some massive demonstrations happening, I was constantly using it and also recommending it to other people. But you can take any VPN, or any service that, to some extent, anonymizes your traffic. The idea behind it is that, even without getting into the content, we have seen how much can be done just using metadata. Metadata is any trace that you leave behind, and it can be quite informative, as we have seen: not just about you but, as our research suggests, through you about your contacts too. So when you connect to a social network, at least do it either through Tor or through some VPN, or use what is called site-specific browsing. For instance, Google Chrome has this feature, and I think Firefox has it too, where it isolates all your browser information to just a single browser session, such that Facebook cannot track the other sites you're visiting in another tab that have some Facebook like or comment widgets. Things like this would be my suggestion: being more aware of what we are using and what we are consuming online, and doing it more consciously. - Yeah, I think that's great advice. And then, if I may ask, your self-serving recommendation? - Well, my self-serving recommendation is in the same direction. It would be to follow up on our next paper, which will hopefully appear at some point, and not just our paper. Of course, it is important to follow anything that will raise some awareness, but we will keep doing research on this subject.
And I would suggest that anyone interested in learning the depths of this business should follow our research, because I think we'll have more insights on this. Like I suggested, we've just scratched the surface so far. - Yeah, that's clever and poignant, that your self-serving recommendation would be the same, because as you've shown, there are these second order effects, so. (laughing) Wonderful. Well, thank you, Emre, so much for your time today. It was great chatting, and I'm glad I got a chance to share your work with my listeners. - Thank you as well. It was great for me, too. - All right, good evening. - And you too, bye. (upbeat music)