[ Music ]
>> Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism.
[ Music ]
>> And as we start here, I'll mention that I usually read a bio of my guest to get going. I'll write something up, cribbing from their website or wherever, and then I'll do a first read-through where I'll give my guest an opportunity to give me any edits or corrections. But in the, you know, theme of wikis that we'll get into here, I thought it might be interesting to do something different with you and read the first paragraph from your Wikipedia page, which I presume should need no editing.
>> I mean, hopefully not. I don't track it that closely, but I know it's there.
>> All right, well, we'll see in a moment how well that works. And so let's get right into it.
>> Okay, great.
>> Aaron Halfaker is an American computer scientist who is an employee of the Wikimedia Foundation. Halfaker earned a PhD in computer science from the GroupLens Research lab at the University of Minnesota in 2013. He is known for his research on Wikipedia and the decrease in the number of active editors on the site. He has said that Wikipedia began a decline phase around 2007 and has continued to decline since then. Halfaker has also studied automated accounts on Wikipedia, known as bots, and the way they affect new contributors to the site. He has developed a tool for Wikipedia called Snuggle, the goal of which is to eliminate vandalism and spam and to also highlight constructive contributions by new editors. Aaron, welcome to Data Skeptic.
>> Yeah, thank you for having me, Kyle.
>> Oh, my pleasure. So maybe to get started, we could talk a bit about how you first got interested in wikis and the Wikipedia project in general.
>> I started as a volunteer for Wikipedia. It was really while I was doing my graduate work. I didn't really see Wikipedia as a subject of research. Coming from a background in computer science, at the time I was looking at programming languages and compilers and machine learning algorithms and trying to figure out if there was something in that space that would be interesting for me. So in the meantime, I was working on Wikipedia, but not really on articles. There's actually a relatively large community of tool developers for Wikipedia. What we do is use the technological skills that we have, which might involve JavaScript or writing a web app and using a database or something like that, to enable Wikipedians to do more powerful things on top of the wiki. So, for example, a little bit of JavaScript that allows somebody to make several edits to several pages with one button click, something that works with some Wikipedian process. So that's really how I got started working on Wikipedia. The research lab that I was working in, which my article, I guess, aptly notes, is the GroupLens Research lab at the University of Minnesota. Anyway, there was a grant that came through that had some funding for looking at conflict patterns in Wikipedia, and that's really what got me started. It got me started both digging into Wikipedia as a data analysis project and thinking about Wikipedia and its social spaces as a focus of research.
>> Interesting. What does it mean for there to be conflict resolution, or conflict in general, perhaps, on Wikipedia?
>> The first studies that I did on Wikipedia were looking at one of the primary expressions of disagreement that happens in the behavior of Wikipedia editors: the revert.
This is an action where one editor removes the changes that another editor made to an article. Generally, this is very common for vandalism and other types of obvious damage, but it also happens in the case of content disputes or when somebody makes an edit that has a substantial mistake in it. So it was those things that I was looking into, to try and understand how they affected editor dynamics. Really, my first two publications were just trying to answer the questions of where do these reverts happen, what's predictive of them, and then how do they affect people's behavior afterwards? I was able to show that there are some interesting psychological patterns that are really common in the real world that, of course, play out on Wikipedia, such as an ownership bias: when you touch something and you feel like it now represents your work or your interests, you will judge it more positively. And so we were able to show this ownership bias playing out in these reverting patterns. I was also able to show that reverting, or rather being reverted, was very demotivational, especially for newcomers. And so there was a trade-off between training newcomers to do things right versus demotivating them and kicking them out of the site, because, of course, we can't let bad things stick. So it sort of opened up this idea that maybe we can do this better.
>> Yeah, I think there are some very novel results you shared in the paper, "The Rise and Decline of an Open Collaboration System." And I want to get into that in a moment. But first, I would like to ask you a bit more about how you ended up going from being a researcher with an interest in Wikipedia to an employee of the Wikimedia Foundation. Could you share the story of that transition?
>> It actually turns out that those two questions essentially share an answer. When I was working on the studies of conflict and these sorts of things in Wikipedia, it was around the time that the first studies started coming out suggesting that the population of Wikipedia editors was declining. And so I thought that this was a very interesting problem, and I started aiming my research in that direction. In parallel to this, people at the Wikimedia Foundation were looking at this with much concern. The director of the community department at the time, Zack Exley, was traveling around to various universities, talking to researchers who were looking at this problem, trying to learn from them and trying to figure out: what do we know, what do we not know, what do we do next? So one of the labs that he visited was the research lab at the University of Minnesota, the GroupLens Research Lab, and he asked, you know, what did I have on the subject? It turns out that I had made a substantial amount of progress since the news stories first started coming out and since I had first seen this research pop up. So I had a lot to say. Based on this conversation and conversations with a few of the researchers at various universities, the Wikimedia Foundation developed a single-summer internship program, where we pulled in, I think it was seven or eight researchers, to the Wikimedia Foundation, mostly grad students but some professors, to explore this problem, to essentially dump a bucket of science on it and see what we could figure out. There were a few things that came out of that summer of research internship program, but the Rise and Decline paper was really, I think, one of the most seminal things.
It's at least one of the most well-cited things that came out of that summer research experience. My second and third authors, Stuart Geiger and Jonathan Morgan, were also interns who were participating in that program. And so we were working together the entire summer on this particular set of questions and hypotheses and analysis methodologies to try and see what we could work out. That paper was eventually the publication of what we thought we learned.
>> Yeah, I have it in front of me. It's an excellent read. I recommend it to all the listeners. This being an audio podcast, they won't be able to enjoy the figures the way I can, but maybe to set the context for anyone who isn't familiar with the trend, could you describe what happens to be Figure 1, the plot of active editors through time?
>> So this graph is showing essentially the size of the Wikipedia editing population over time. When I looked at this graph, I thought it made sense to divide it into three phases. The early phase of Wikipedia is roughly between 2001 and 2004. Wikipedia was really, really small. There were very few edits coming in. The internet just really wasn't aware of Wikipedia at the time, but it was starting to slowly get picked up by search engines. And so if you did a search that had a factual answer, you might likely get a Wikipedia article. And of course, there was a lot of question about what this Wikipedia thing was and whether anybody should believe anything that they see inside of it. Around the beginning of 2005, an exponential growth curve started for Wikipedia. As Wikipedia started getting pulled up in search engines, more people found the site and more people decided to contribute to it, which boosted the quality and coverage of content on it, which made it get pulled up in search engines more and more. And so as you can imagine, this sort of feedback loop resulted in this exponential growth pattern. Or at least that's why we think this growth pattern happened. This exponential growth was sort of harshly interrupted at the beginning of 2007 with a sudden and linear decline. So this is a somewhat unusual shape for a population growth pattern. If you take a population model and you apply it to, say, rabbits on an island eating carrots, you wouldn't expect to see this peak and sudden fall. This doesn't suggest that we suddenly reached capacity and had enough Wikipedia editors, or that the pool that we were drawing from, the available people on the internet who might like to edit Wikipedia, had reached some threshold and was starting to level off. This suggested some underlying sudden shift. And so anyway, that's how we set the stage for the paper.
>> So when I looked at it, obviously, before I got into your research, I thought of just a few hypotheses, like, well, maybe the quality of newcomers is declining, or maybe the low-hanging fruit has all been eaten, so there's a scarcity of edits that a newcomer can easily make, or maybe the standards of Wikipedia themselves have risen. Could you share maybe some of the hypotheses you guys looked at as you started studying the problem?
>> Yeah, and I appreciate that you bring these up, because really when I started in the summer of research, I was positive I already knew the answer. And I thought that the answer was that Wikipedia was running out of apparently available work.
>> Seems very plausible.
>> Yeah. So the idea is that most things you look for on Wikipedia right now have at least a substantial article, if not a really well-written one. And so it would be very rare for you to find something that you might actually make a contribution to, as opposed to, say, five years ago. That might not be totally wrong, but it doesn't seem to be very clearly the case. There would be a lot of indicators if that kind of thing were happening: the rate at which people make their first edit should substantially decline, the rate at which new people register accounts should decline. We didn't see a sudden shift in those metrics. That sort of refutes the hypothesis, but I wouldn't say it's conclusive. There's probably still some of this effect happening. As we were trying to address these kinds of hypotheses, we actually stumbled across the hypothesis that it might be related to quality control tools. You know, I had known from my past work that having your work rejected was demotivating to a newcomer, but I didn't expect there to be any substantial shifts around this time. It was really when I was talking to my collaborator Stuart Geiger, whose publication history is really about the history of tooling and automation in Wikipedia, where bots came from and how these human computation tools that Wikipedians use have substantially affected how Wikipedians operate and how their social spaces work. He reflected that some of these tools, some of these high-powered quality control tools, came online around this time. In exploring that hypothesis, it was really critical that we checked one of the others that you brought up: that newcomers just aren't getting worse. We're seeing this higher rate of quality control hitting these new editors who are coming to Wikipedia, but maybe that's good. Maybe they're vandals. I mean, in an exponential growth curve like this, Wikipedia is hitting the popular media right around the middle of that growth curve. Mid-2006 is when there was an episode of The Colbert Report where Stephen Colbert vandalizes the article on elephant on TV. By the way, the article on elephant is still protected because of that. Usually it's only things like politicians who are currently in office; as you can imagine, I believe the page on George Bush has been protected for a very long time. But elephant is an unusual one. I give this example so that you can sort of understand how the external media affects Wikipedia, because it's so popular. So it would make sense that maybe the newcomers are just bad. What we did is get a random sample of newcomers from each year throughout the history of Wikipedia that met some basic criteria: they registered an account and they made an edit to an article. And we took the first activities that those new users had on Wikipedia and showed them to a bunch of Wikipedians and myself. We manually went through their edits and tried to figure out: are they contributing productively? Are they at least trying to? Editing in good faith is the term that Wikipedians use. Or are they trying to be funny and not contributing in good faith? And then, of course, on the worst side of that, are they actively trying to offend and cause harm and that sort of stuff? And so we did that analysis and looked over the history of newcomers joining Wikipedia and starting to edit.
And what we found is that there was a transition, an early-to-late adopter transition, in the rate of newcomers who were coming in already productive, who looked like they knew what they were doing and were contributing well to the wiki. But that transition happened at the beginning of 2005, almost two full years before the transition that we're really concerned about. From 2005 forward, the rate is basically constant: about 40% of newcomers who join Wikipedia are already productive. They already know what they're doing, they're already contributing productively to articles, and that's great. Another 40% are at least trying. So they're trying to contribute productively and they're making mistakes. The remaining 20% are various types of vandals, whether they're just trying to be funny and say something silly in Wikipedia or putting in racial slurs and actually trying to offend somebody. And so anyway, this 80% good faith and 40% productive is the trend that we've seen ever since this 2005 shift. It suggests that the sudden shift in the retention of newcomers and the sudden activation of quality control towards newcomers wasn't really reflected in the quality of newcomers who were coming to the site. In fact, one of the graphs in this paper is one that I really like to show people when I'm going over these hypotheses, because I think it really solidifies the problem and where it's focused. We made a graph of the retention rate, the proportion of good faith newcomers, the ones who are trying to contribute productively, who stick around for at least two months in Wikipedia. And you can see that there is a sudden shift in 2007. There's a slow decay in this graph, I will admit, but there is a step where we lose more than 60% of the good faith newcomers who might have stayed before this transition. That's substantial, and it usually catches people's attention.
>> Yeah, it's a notable plateau. So I imagine you're after some explanation of something that happened in 2006, 2007. Could you maybe walk through some of the things you looked at and what might have changed in that year?
>> I'm going to cheat a little bit and use some of my work since the Rise and Decline study to explain what was happening here. In the Rise and Decline study, we do some quantitative analysis, but we don't really describe why these sorts of things happen. Here's what I think happened, now. Around this 2005-2006 exponential growth curve, Wikipedians were watching their incoming edits turn into a fire hose. What was essentially a drinking fountain, sort of the life spring of Wikipedia, was now overwhelming; it was its own problem. And then they see something like Colbert saying something on TV that has dramatic consequences for what happens on the wiki and for quality control work. I mean, it's very rare that I give interviews like this and don't have somebody at least ask the question, "How do we know that Wikipedia has any quality in it at all?" I think that Wikipedians at the time were really feeling this. Which, by the way, the answer is that Wikipedia has a ton of quality, and there have been a bunch of studies, and it's really fascinating that Wikipedia works as well as it does, but it does.
>> Yeah, absolutely.
>> Anyway, Wikipedians at the time, they're the ones who made it work. They're the ones who made the quality control work.
So anyway, they're looking at this fire hose, and they're looking at developing some automation to make this work faster so that they can spend less time doing quality control and more time writing encyclopedia articles. And so they designed these tools with that point of view in mind: we have a fire hose of incoming edits, and we need to filter out the damage as fast as we can so we can get back to writing the encyclopedia. These tools capture that mindset really well. They're very efficient for removing damage. They're actually a multi-stage process, where the first filter is a fully automated robot that only reverts the most egregious vandalism. It turns out that natural language processing is not so well developed that we can catch subtle vandalism, but a machine learning model based on some simple statistics of an edit can catch vandalism well enough that at least a sizable proportion of it can be automatically removed by a bot. The stuff that makes it past the bot goes to the human computation system. This is where technology shows edits to an editor really fast and asks them to make judgments really quickly about whether each one is damaging or not. They essentially have two buttons that they can click: "not damaging," in which case they just move on to the next example, or "damaging." When they click that button, the system will make several edits across the wiki, which involves reverting the edit, sending a warning message to that user, and potentially posting about that user on the administrators' noticeboard for Wikipedia. So we've learned that there are some unintended consequences of these tools. By making quality control very efficient, so that the humans who were doing quality control work didn't have to invest so much of their time in the activity, they also made that experience much less human for the newcomers who are having their edits reverted. There's, I think, an important thing to point out about the problem that happens, which we didn't know at the time but we do know now. When you split edits into good and bad, there's a wide range between a good edit that should stick in the wiki and the most egregious type of vandalism. A lot of that space in between is edits that might be good but are violating some relatively obscure Wikipedian policy about how we do things in Wikipedia. Just to give you a quick example, we have special policies about biographies of living people, because it's really important that we don't have statements that are false on a biography of a living person; that can really hurt somebody. So on Wikipedia, you can add statements without a citation, and we'll flag that they need a citation, anywhere other than a biography of a living person. If you add a statement of fact to a biography of a living person and it's not supported by a citation, then that should immediately be removed and you should be sent a warning for violating the policies of Wikipedia. As you can imagine, this happens to newcomers all the time. When these tools give you this Boolean option of good or bad, then everything that's not perfect is bad and gets the exact same response. So it doesn't matter if you're adding racial slurs to Wikipedia or you're inadvertently adding a statement to a biography of a living person without a citation, you will still get the same warning message telling you to stop vandalizing Wikipedia or we will ban you.
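To make that workflow concrete, here is a minimal Python sketch of the two-stage triage just described. The thresholds, function names, and return values are illustrative assumptions for this write-up, not the actual bot or Huggle code.

```python
# Hypothetical sketch of the two-stage quality-control triage described above.
# Thresholds and names are illustrative only, not the real bot or Huggle internals.

AUTO_REVERT_THRESHOLD = 0.95   # the bot acts only on the most egregious cases
HUMAN_REVIEW_THRESHOLD = 0.50  # anything else that looks suspicious goes to a person

def score_edit(edit) -> float:
    """Stand-in for the machine-learned damage model: returns P(damaging)."""
    return edit.get("damage_score", 0.0)

def triage(edit) -> str:
    p_damaging = score_edit(edit)
    if p_damaging >= AUTO_REVERT_THRESHOLD:
        # Stage 1: a fully automated bot reverts obvious vandalism and leaves
        # a templated warning on the author's talk page.
        return "auto-revert + warning"
    if p_damaging >= HUMAN_REVIEW_THRESHOLD:
        # Stage 2: human computation -- a reviewer sees the edit and clicks
        # "damaging" or "not damaging"; "damaging" triggers revert, warning,
        # and possibly a noticeboard report, regardless of the editor's intent.
        return "queue for human review"
    return "let it stand"

print(triage({"damage_score": 0.97}))  # -> auto-revert + warning
print(triage({"damage_score": 0.60}))  # -> queue for human review
```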
>> Can you talk a bit about what it is that machine learning and NLP are doing to find those really egregious vandalism examples?
>> That's a good question, and that sort of leads us to the work that I've been doing that has been getting some media attention recently. The way I see it is that these tools are not bad. They're not inherently bad at all. In fact, the developers of these tools are my collaborators, and they're wonderful people. The people who use these tools are also wonderful people. The core technology that allows these tools to work is the machine-learned damage detection model. This is a prediction model that will take some statistics about what happened in an edit. For example, the number of words from a bad-word list that this edit added to an article, so curse words and racial slurs and that sort of stuff. As you can imagine, that's quite predictive of vandalism. It will take all these statistics, figure out the correlations that those statistics have with an edit being damaging, and turn that into a prediction model. Now this prediction model can be used to sort the incoming feed of edits by the probability that each edit is damaging. The really cool thing this lets you do is take a fraction of the edits that are coming in, a fraction of the edits that you might have needed to review, and just review those instead. And so with a state-of-the-art algorithm for this type of stuff, you can review about 10% of the incoming edits to Wikipedia and expect to have caught all of the vandalism, with a very, very small false negative rate.
>> That sounds like a massive success.
>> Yeah. There are several of these quality control tools that are used across the wiki, and every single one of these tools has its own damage detection model that roughly does the same thing.
>> Do you have a sense of how many bots are out there doing this sort of work?
>> There are not too many bots that automatically revert vandalism. As far as I know, there's only ClueBot on the English Wikipedia. On other wikis, especially wikis that are not as large as the English Wikipedia, there is less and less support for these kinds of things. But there are several human computation tools. And I'm using this technical term, human computation, which really just means that the machine organizes the information for a human to make a judgment, and then once the human makes the judgment, the machine carries on with its operation. So it's sort of an algorithm with the human in the loop. And as you can imagine, when detecting vandalism, this human judgment is very valuable. So there are several of these on English Wikipedia, including the one that you referenced as you were reading my intro, Snuggle, which I developed.
>> Yeah, could you talk about the Huggle tool and why you built Snuggle as a response to it?
>> The Huggle tool is one of these tools that sorts edits by the probability that they're damaging and then has editors review the most damaging edits. This tool is really awesome. And it's kind of easy to read my work and come to the conclusion that I'm critical of this tool and its developers and the people who use it. It's really important to me that they don't get that impression. I think these people are doing really critical work for Wikipedia. It's just sort of problematic how their tool is interacting with the social system in these really counterintuitive ways that nobody would have seen in advance. And, you know, I'm looking to fix that.
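As an illustration of the kind of damage-detection model being described, here is a hedged Python sketch: a few simple statistics about an edit (such as bad-word counts) feed a classifier, and incoming edits are sorted by predicted damage probability so reviewers only need to look at the top slice. The feature set, word list, and toy data are invented for this example and are not the features any particular Wikipedia tool actually uses.

```python
# Hypothetical damage-scoring sketch: simple edit statistics -> classifier ->
# sort incoming edits by P(damaging) and review only the top fraction.
from sklearn.linear_model import LogisticRegression

BAD_WORDS = {"badword1", "badword2"}  # stand-in for a curse/slur word list

def features(text_added: str, chars_added: int, is_anonymous: bool):
    words = text_added.lower().split()
    return [
        sum(w in BAD_WORDS for w in words),  # bad-list words added
        chars_added,                         # size of the change
        int(is_anonymous),                   # edit from an unregistered user
    ]

# Toy training data: (features, label) where 1 = damaging.
X = [features("badword1 lol", 12, True), features("born in 1972 in Ohio", 250, False)]
y = [1, 0]
model = LogisticRegression().fit(X, y)

# Sort an incoming batch by damage probability and review only the top ~10%.
incoming = [("rev1", features("badword2", 8, True)),
            ("rev2", features("added a sourced sentence", 300, False))]
scored = sorted(incoming, key=lambda r: model.predict_proba([r[1]])[0][1], reverse=True)
to_review = scored[: max(1, len(scored) // 10)]
print([rev_id for rev_id, _ in to_review])  # the small slice a human would check
```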
Anyway, this Huggle tool sorts edits by the probability that they're damaging and has people review them. The tool that I built as sort of a critique of and an alternative to Huggle we called Snuggle, as you can imagine, playing off that other name. Instead of sorting edits by the probability that they're bad, Snuggle sorts editors, new contributors in Wikipedia, by the probability that they're contributing productively. I was telling you about the analysis that we did for that Rise and Decline paper, where we manually labeled a bunch of new editors by whether they were trying to contribute productively or not. Well, we used that same data set to train a new machine learning model that predicted whether new editors were contributing productively or not. It turns out that this is a surprisingly easy prediction problem, so we could get really good fitness with just a couple of edits saved by that new editor. Essentially, using Snuggle, an editor who is interested in mentoring newcomers or finding collaborators among the new editor pool could use this tool to find the best newcomers. But one of the other things that we incorporated into this Snuggle tool that I think is important is that it highlights the negative reaction that newcomers receive. I actually did an interview study as I was setting up this Snuggle tool and writing and publishing about what it is and why I think it's important, where I had Wikipedia editors sit down, look at the tool, note that there were a lot of good faith newcomers who were getting this strong negative reaction, and talk to them about that. Across the board, Wikipedians were very, very surprised that there were so many warning messages on newcomers' talk pages. This is how Wikipedians communicate with each other, via these talk pages. It was very surprising to them to see this sort of pattern. It seems to have been strongly assumed among Wikipedia editors, at least before these studies, that quality control was really catching the bad editors, and so if you saw a talk page that was full of a bunch of warning messages from these tools and bots, then this was probably a bad faith editor, probably not going to contribute productively. It turns out that that's not nearly as right as it ought to be. So I think that changed a lot of perceptions, having that tool highlight these issues. It sort of made it real. I could suddenly give a whole bunch of individual examples, happening right now, of this problem in Wikipedia.
>> Absolutely. You know, in my experience, there's a challenge in applying machine learning to problems like this when you have a certain subset of users with very few observations. People with a rich history or time series you can generally build good models of, but there's always this sort of bootstrapping problem when someone is new and hasn't done a lot yet. Do you have any insight into why just a few edits gave you some predictive power?
>> I was cheating to some extent. I was rerouting the signal that we got out of the damage prediction models. Damage prediction in Wikipedia is a really hot academic topic for computer scientists right now. Because the data is open and it's a popular website, there have been a lot of research papers published about this problem: how do you detect damage in Wikipedia articles? That wonderful research has led to very, very high quality prediction models that give me lots of signal in their output.
So essentially, when I was modeling good faith and bad faith editors, I was actually taking the scores that were produced by these other prediction models and remodeling on top of them. I used a really simple naive Bayes strategy, where I'm just asking the question: how likely is it that this damage prediction score for one of the edits by this newcomer was actually associated with a damaging edit? It turns out that good faith newcomers, even the ones who are making mistakes and getting reverted and sent warning messages, tend to score really low on that damage prediction. They might have one or two edits that score high, but it's going to be generally low. So if I have even two edits from this new editor, three is great but two still works, then I can know with pretty high certainty whether this editor is working in good faith or not. What you see from editors who are contributing unproductively is a series of edits with these high damage prediction scores. And so they're really easy to pick up.
>> Winding up on the Rise and Decline paper, I was very glad to see you guys applied some really rigorous empirical methods in validating your conclusions. Could you talk about how you used logistic regression and what you found there?
>> We were sort of doing a poor man's survival analysis. There are some really powerful methods for doing survival analysis in populations that were developed for medical studies, like the Cox regression. The problem with these robust survival analyses is that it's hard to take the things that the model learns and turn them into something intuitive. And so we looked at these advanced survival analysis strategies, but we ended up turning back towards the logistic regression because it's powerful, it can describe what we were really trying to get at, and interpreting its coefficients was easy. It was very intuitive. What we used the logistic regression for was to try to predict whether an editor would continue editing after this two-month threshold. We put a few different data sets into this model. First we put all newcomers into it. But of course this would have the problem that there are some newcomers who are not contributing productively, and so if they don't stay the two months, that's okay. So we also put into this model the editors that we had manually labeled as trying to contribute productively, and compared the two results to see if we saw anything unusual when we didn't have that many observations but they were manually labeled, or anything unusual when we were using the full data set but it was tainted by potentially bad editors. Happily, there were no fishy smells. The coefficients were all generally pointing in the same direction. And it allowed us to tease out the difference in effect between, say, just plain having your edits reverted versus having your edits reverted by one of these quality control tools. Having your edits reverted by one of these quality control tools was one of our strongest predictors that somebody was not going to continue to edit in Wikipedia. There are also a few other things that I should note. In these multivariate models, you get the opportunity to control for some potential confounding effects. This is something that I find really useful when we're doing really large data analysis and we can't run an experiment. We can't really take these quality control tools away and then put them back; that would be very damaging to Wikipedia. And so we're stuck with this offline data.
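For readers who want a concrete picture of this kind of retention analysis, here is a hedged Python sketch on simulated data: a logistic regression predicting two-month survival from whether a newcomer was reverted by a quality control tool, with first-session edit count included as the kind of confound discussed next. The variable names, effect sizes, and data are invented; the actual study used real editor histories and more predictors.

```python
# Hypothetical sketch of a logistic regression "poor man's survival analysis".
# The data are simulated; the coefficients on the predictors stay interpretable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
reverted_by_tool = rng.integers(0, 2, n)      # 1 if a QC tool reverted the newcomer
first_session_edits = rng.poisson(4, n)       # investment confound

# Simulated ground truth: tool-mediated reverts hurt retention; investment helps.
log_odds = -0.5 - 1.2 * reverted_by_tool + 0.3 * first_session_edits
survived = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = sm.add_constant(np.column_stack([reverted_by_tool, first_session_edits]).astype(float))
result = sm.Logit(survived, X).fit(disp=False)
print(result.params)  # roughly recovers the simulated effects: negative on the revert term
```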
So one of the things that we can do is take these potential confounds and make them predictors in our model as well. Our leading confounding variable that we included in the model was the number of edits that a user saved in their first session. So the first time that they sit down and start editing Wikipedia, how many edits do they save before they get up and leave for a little while? It turns out that this is extremely predictive of how much future work a user is going to do, whether they're going to survive, and that sort of stuff. And by including it in the model, we could still see that it has a really substantial effect: even if you get reverted, your work gets rejected, and you get sent aggressive warning messages, if you've already invested a lot in Wikipedia by making a lot of edits in that first session, you're still probably going to survive through that. Meanwhile, the effect that being reverted, having your work rejected, has is still roughly the same; it's just lowering the probability that you're going to make it past that threshold. These are the various things that we could do with this logistic regression, and it made for very powerful hypothesis testing for the things that we really wanted to understand.
>> And how has this work overall affected the Wikimedia community?
>> There have actually been a lot of initiatives started by volunteers within the Wikipedia editing community and by the Wikimedia Foundation. One that I think is really worth highlighting is this social Q&A, question-and-answer, community on Wikipedia called the Teahouse. One of my collaborators on that Rise and Decline project, Jonathan Morgan, gathered a group of collaborators to work on this social Q&A space specifically for newcomers, so that newcomers who were having a bad time joining Wikipedia would have a group of people to turn to and ask questions of, you know, a safe space to say something dumb and not have somebody call you dumb for it. We've actually been doing long-term analysis on this project recently, and it's showing a significant effect on newcomer retention. That's been really interesting, and in this case, too, I should mention that we could actually run a controlled experiment, because we developed it and we can choose who gets sent invitations and have a purposely set-aside control group. And so we've actually been able to show a cause-and-effect pattern there.
>> Oh, that's fascinating. Has that been published?
>> We've made a few publications at the Wikimedia Foundation, and we have wiki pages describing these results. So right now it's just a technical report, but we'll probably be submitting it to journals in the next cycle.
>> Oh, very interesting. Really looking forward to that. That's very interesting work.
>> There are a few other initiatives that I thought would be worth mentioning. There's the Inspire Campaign. We do small grants at the Wikimedia Foundation: if you're a Wikipedia editor and you have an idea on how to make Wikipedia better by, say, building tools for people or building a help space, anything other than directly editing the encyclopedia, then you can apply for a small grant that will fund you for maybe three or six months so that you can get that work done. This Inspire Campaign was focused around addressing the gender gap in Wikipedia, which I believe is a newcomer retention issue, where we have far fewer female editors than we do male editors.
There have also been initiatives from the Wikipedia community, such as WikiProject Women, which is focused on getting more articles about notable women into Wikipedia, and Art and Feminism, which is focused around getting in a set of content that's hard to get into Wikipedia because it's not part of the Western scholarly tradition, so it's hard to find reference works for it. We've been doing harassment surveys of Wikipedia editors so that we can know where else they're having trouble, where else they're getting these sort of aggressive reactions. So there are a lot of social activities around trying to improve this pattern in Wikipedia and focusing on groups that we think are primarily affected by this negative environment.
>> What sort of areas do academics find interesting in studying wikis?
>> Oh gosh, it's really broad. You know, there's this kind of term that those of us who are really close to Wikimedia use for ourselves: we call ourselves wiki researchers, and there are maybe five conferences and a couple of journals that we're very frequent in. It's a very interesting group, because they're computer scientists, physicists, sociologists, psychologists, ethnographers, people with a very, very broad set of skill sets and a very wide set of lenses aimed at Wikipedia. They ask questions about what Wikipedia's organizing mechanisms imply for governance, and it was actually that work that we were able to draw from when we discussed Elinor Ostrom's work in the Rise and Decline paper. Elinor Ostrom, if you're not familiar, is a Nobel Prize-winning economist who talked about how you build rules in common spaces so that they make sense, they work for people, and people actually follow them. As you can imagine, that's pretty applicable to Wikipedia. There are also studies like the vandalism detection I was talking to you about; there are a lot of computer scientists who are interested in those sorts of machine learning problems around Wikipedia. There are also a lot of studies around natural language processing using Wikipedia as a data set, to learn how human language works and to build interesting tools with it. And there's all sorts of interesting user interface development, because collaborating asynchronously over the internet is a hard problem. So presumably you can use the history of articles, the way that people interact with each other, and psychological theories on motivation to make improvements to the technology, to make it easier for people to operate in ways that they like.
>> Very cool. I think you've touched on perhaps at least four or five well-done PhD theses potentially to be written from all those topics. So I hope we inspire someone here with this episode.
>> That would be great.
>> Any other tools you think we should mention? Maybe things like Mr. Clean or ORES that you've been working on lately?
>> Oh yeah, I suppose it would be a shame if I didn't at least talk about ORES for a little bit. ORES implements the core infrastructure for efficient quality control in Wikipedia, the damage detection model. It actually does a few other prediction models too, but this one is core to the discussion that we're having right now, and it's a big reason why we even started getting into the machine learning as a service business in the first place. So essentially what the system does is host these prediction models. All you need to do is pull up a URL or send an HTTP request to the service that contains the identifier for the edit that you would like to have scored.
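As an illustration, here is a small Python sketch of asking the service for a damage score over HTTP. The endpoint and response layout follow ORES's public v3 API as generally documented, but treat the exact URL and JSON shape as assumptions and check the ORES project page for the current interface.

```python
# Hedged sketch: fetch the "damaging" probability for one revision from ORES.
import requests

def damaging_probability(rev_id: int, wiki: str = "enwiki") -> float:
    url = f"https://ores.wikimedia.org/v3/scores/{wiki}/"
    resp = requests.get(url, params={"models": "damaging", "revids": rev_id})
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]

# e.g., for a revision ID picked up from the recent-changes stream:
# print(damaging_probability(1234567890))
```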
So let's say that you're listening to the changes coming in to Wikipedia; we have nice streams for that sort of stuff. As an edit happens, you'll get its ID number. You send that to the service and ask: is this edit damaging or not? If you're writing a tool that takes advantage of that, then you might show that edit to a human and say, hey, would you like to review this? If you have a bot and the score is strong enough, then you might decide that you're going to revert it outright. We really wanted to get this core infrastructure in place because I don't think the algorithms, or the fact that we put technology in the space of quality control, are the problem. I think what we really need to do is innovate on top of the way that we were already doing it. This sort of comes back to what I was saying about Huggle and how I don't think that Huggle is a problem. We shouldn't get rid of these quality control tools. We actually need them. If we didn't have them, we couldn't deal with the half a million edits that Wikipedia gets every day. What I wanted to do was get this really hard problem of building a machine learning model that works in real time out of the way for this community of Wikipedian tool developers, so that they could spend their creative energy thinking about how people should interact with newcomers and what sort of facilities they should have within the user interfaces they're building, rather than having to worry about the really complex engineering problem of building a real-time machine learning model. This tool Huggle is about 50,000 lines of code. It's a really substantial engineering project. It's hard to do this kind of analysis, but just looking at the code myself and making my own judgments about what was related to what, about 20 to 30 thousand lines of code were devoted to damage detection and sorting things based on that damage detection, and that's the part that corresponds to the ORES system. The ORES system is about 20,000 lines of code. And so what I like to say is that standing up this real-time machine learning system will cost you about 20,000 lines of code, which is a substantial engineering project, and an advanced degree in computer science, or at least that level of understanding of machine learning algorithms and evaluation, and you'll have to read the literature and that sort of stuff. What ORES does is take that out of the way. Now there's no longer this massive barrier to getting into the quality control business. We actually have tools that have been built that use ORES as a back end to support Wikipedian quality control work that are just 80 lines of JavaScript. And that 80 lines of JavaScript is all you need to sort edits, present them on a page, and then go query our service so that you can find out which ones are likely to be damaging. That's one of our most popular tools. It's called Scored Revisions, and it just runs on top of Wikipedia and highlights edits, in place on Wikipedia, that are likely to be damaging. We have this thing called the recent changes feed, which lets you watch edits coming in; it'll highlight edits there. We have lists that include edits by a particular user, and it'll highlight edits there. And of course you can look at the history of articles, the edit history, if you click on that history tab at the top, and it'll highlight edits there. And so you can see: is the most recent edit to this article likely damage?
Maybe I should go review that. Is the edit that came in five seconds ago damage? Are the edits by this particular user likely damaging?
>> And I believe the outputs of those predictions from ORES, or maybe I should have called it by its full name earlier, the Objective Revision Evaluation Service, are public facing, are they not? So a developer or researcher could consume those predictions?
>> Yeah. And this is something that we took really seriously in setting up this system. I mean, one thing that I think is really worth being concerned about, just as a technological community using machine learning in social spaces: I think we're just becoming aware of the potential harm that can come from having a high-powered algorithm that profiles people or their actions aimed at a social community. Yeah. So, for example, our model could pick up some bias or something like that. One of the things that we train our model on is past reverted edits. We've actually had editors come to us and encourage us to use extreme caution when looking at past activities in Wikipedia, because people are biased. This is something that we just can't get away from; we have all sorts of biases in place. The machine could then learn these biases and perpetuate them. So we're taking very seriously keeping everything as out in the open as we possibly can: the way that we train our models, the way that we generate our training sets, the fitness measures that we use to try to figure out which models are good, the ones that we should actually host in the service. We actually even include those fitness measures as endpoints on our API. If you have a service reading from our API, you can read in the fitness measures as machine-readable JSON and make decisions about which models you're going to use. What we're hoping is that by keeping everything out in the open, all the scores, all of our process, all of our code, we're going to make it much easier for the communities that we're serving to maintain control over how they're using these things and which models they're using. And we've actually been getting a really, really positive response from this. We have reams and reams of false positive reports, where people report edits that were flagged as damaging that are probably not damaging. And so right now we're pushing on the state of the art in the machine learning literature by using these false positives to try and find new sources of signal that will let us get around them: new ways to do our natural language processing work and build better features for this model to make predictions from.
>> Yeah, that's a really interesting opportunity. I know one of the challenges of studying Wikipedia for researchers is just the scale and being able to manage the production pipeline. So I think having those fitness measures available via the API is a really novel step forward in getting more people involved. And seeing as how we're winding up 2015 here, I'm thinking about my schedule for 2016 on this show. I hope someone will do some very novel things picking up those scores and working with them, in the same way you used the output of the damage prediction algorithms as an input to your system. So I hope someone comes along and does some interesting work there and I can have another interview next year on the topic.
>> Yeah, I agree. I think that's very interesting. You know, I'd like to add to that: I would like to see the same from some of the major tech firms that have algorithms affecting social spaces.
So Facebook has been getting a lot of flak for their feed and for running experiments in that space. Google's been getting flak for the way that their algorithm picks up images in image search, which can suggest gender stereotypes. The way that Twitter flags hashtags as trending doesn't work for some types of trending patterns. Don't just give us a list. Don't give us a sorted list. Show us the output of your algorithm. Give us your score and let us do things like what I did in Snuggle, which is take that score as a source of signal and do something else with it. Give us your fitness measures. Give us your output. Don't just give us some product that is walled off from us. Give us a source of signal. Let us work with that, and maybe we can work together to solve these problems, or find out whether they are actually a problem in the first place. If this work could inspire those other tech companies and developers who are working on this in the future, I think that would be a major success.
>> I couldn't agree more. And this has been a great conversation. Anything you want to touch on, maybe some of your upcoming work, before we sign off?
>> One thing that I'd really like to encourage: I'm not sure how many of your listeners do technology development for user-facing things, but I think Wikipedia is a fascinating space to operate in. A lot of people don't really realize how much technologists can do for Wikipedia. There are a lot of problems there that benefit from automation. We have a lot of problems with automation, but we still need it, and it's still a really cool space to operate in. I'm really hoping that by releasing this ORES system, we can get a lot of people thinking creatively about how to do quality control better. And maybe some of your listeners have some really good ideas about how we could do that. I hope that they give it a try and run some experiments and reach out to me. I'd like to help.
>> Excellent. Where's the best place to go if they want to learn more about ORES?
>> We have a project page on a wiki, which is linked to from all the articles, but I'll give you the link afterwards so that you can list it with the podcast.
>> Excellent. I'll be sure to get that in the show notes so people can go and check that out. Where can people follow you online?
>> I'm on Twitter as halfak, H-A-L-F-A-K. That's a good spot. I post all my updates there.
>> Excellent. That'll be in the show notes as well. Aaron, thank you so much for your time, and also for all your work on Wikipedia. It's one of those projects that I think is going to have a lasting effect on, not to speak too broadly, but on humanity for centuries to come. So for all the good work going in there, I can't tell you how much I appreciate it.
>> Well, thanks for having me, Kyle. It was a pleasure talking.
>> Excellent. Take care.