PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high level description of this algorithm and how it might apply when trying to establish who writes the most influencial academic papers.
Data Skeptic
[MINI] PageRank
[ Music ] Data skeptic is a weekly show about data science and skepticism that alternates between interviews and mini-episodes, just like this one. Our topic for today is the PageRank Algorithm. This is an algorithm that was invented by the founders of Google, and it was originally kind of the main innovation in what made Google search results different. Do you remember back in the day before there was Google and we had just a wide slew of different search engines that usually gave pretty mediocre results? Yap, Yahoo, Altavista, Webcrawler, hotpot, AskGeez, Dogpile, InfoSeekLikos. Oh, Likos, I remember. Geez, it's still around. Yeah, let's just ask. Anyway, the thing that made it different was instead of trying to just look at the words on the page and that sort of thing, which Google obviously did also, but they had this idea. What if the incoming links to your page help us evaluate how good it was, and not just necessarily how many of them there are, but how important they are. So, for example, if, I don't know, the whitehouse.gov links to your site, you're probably more important than your data skeptic.com links to your site. Yap. There's a bit of trivia here that people love to point out, PageRank is not named this because of web page, but because of Larry Page, one of the founders who helped build this algorithm. And the reason I think that's important here is because I actually want to talk about PageRank not and how it's used in search, but in how it can be used in a much more general case. And what case is that? Well, any case, you can solve lots of problems with this. Any problem, I would say, where you're trying to assess the importance of something, and you can ascribe the importance by how many things reference it. So, PageRank has also been applied in the academic world, like scientific papers, where you cite like other papers. So, you can say like, well, how important is a paper, and it's largely measured by how many citations it gets, but not just how many arbitrary citations it gets, but perhaps how many citations from important, innovative papers it gets, right? So, it's about influencers, who they are. Very much so, yeah, that's kind of what PageRank tries to figure out from a graph, where you have nodes are different elements and edges between those nodes are suggesting references, citations, links, or just general nods of influence in some kind. And you want to weight each of those incoming links. So, nodes are elements, elements of what? Elements of anything. So, in the case of Google, it would be web pages. Well, maybe you could give more examples of what a node is, versus the other one? An edge. A node and edge. Yeah, so picture a graph, a bunch of circles that represent elements, and then edges are links between them that are directional arrows that convey when one thing says, yep, that's important, or look at that. So, the circles you're calling nodes, so they're points. Yep. And then the arrows are like vectors, right? Yeah, yeah, basically, yeah. What's another example, so you can help people understand? Well, the other example I gave was academic papers, where you cite your references. So, which one's the node, and which one's the vector? The node is the paper, and the edge is the citation. So, if there's a really important paper, we expect that many people will cite that in the future, so it'll have a lot of citations linking to it, whereas sort of like an inconsequential paper might be cited zero or only a few times. So, in comparison, we would regard it as less influential, but there's also a critical point here. It's not just the number of links you have incoming. It's the importance of those links. If there's a groundbreaking paper in a field, and it cites five other papers, the fact that later it will be established as groundbreaking means that everything it cites is also really important and influential, or at least the fact that it was cited by the groundbreaking paper conveys more influence than if it was cited by just a random paper that no one ever reads. Now, there's a really important point here, too, is that if you have a lot of outgoing links, or like in the case of the papers, a lot of outgoing citations, you cite thousands and thousands of papers, you're just kind of citing everything at some point. And this actually makes more sense on the web, I think, where people, if they know links are important, they'll just link to all their friends' sites just to be nice. So, part of PageRank that's also critical is that how much influence is conveyed through a link is the influence of the individual page or node, and then you divide that across all of the things it links to. So, if I have a really important page and I link to two other pages, those each get like half my weight, whereas if I link to 100, they each get about 1% more or less. There's also an important factor here in that we need this to be discounted with time, so we don't convey 100% of the influence across nodes. We convey just a little bit less than that so that this converges. So, the value of an incoming edge is not only where it came from, how important that node is, but also how many outgoing edges that node has as well. And maybe we just call every from here on out to kind of keep those two points nicely packaged up, we just call that the influence of the link. So, you mean the sum of all your links? So, like, let's say I'm a website and 100 people link to me is the zero plus one, 100? It would be the sum of the influence of those 100 people. So, how did you rank them though? Because if everyone's a zero, it's almost like no one has a rank. No, no one starts as a zero. People all start as one over n, where n is the size of this network. So, everyone starts with a very small value. And then what? And then in the next iteration, you are that small value plus the sum of the influence of all your incoming edges. So, we can't make up our own to show us an example. Any page rank example that's, that even a trivial one is kind of involved because to show all the nuances, like, I think if you go to Wikipedia and look up page rank, their example has like five or six nodes and takes a couple rounds of iteration to converge. So, you have pretty much most page rank problems tend to be big data problems, not necessarily, but they tend to be. As in, we need multiple computers to calculate this. Good job. You've been paying attention. Okay. So, you're just talking about in theory. So, we're never going to get a number. You haven't done this in the past. You can't talk about an example of the numbers. It's not in a mini episode. And in fact, this is a bit of a foreshadow. In the next interview next week, I'm talking to a researcher who did just that and applied page rank in a very novel and interesting context. So, listeners will have to tune in next time to hear about a specific application of page rank. Yeah. So, I think this is a powerful way to establish an importance measure and to algorithmically derive how importance is conveyed through some network. The criticism of page rank I just wanted to mention, because while I am encouraging people to think how this might solve their everyday problems, it's a pretty powerful algorithm. It also can tend to favor older things over newer things, because what it wants to do is favor authoritative things right and rank those highly. But if something is new, but of high quality, it may not yet have had time to bring in lots of links. So, it's not good at picking up trending. Not in its native form. However, that doesn't mean you can't modify it or solve trending as an auxiliary problem, because actually, this is not, even though, as we started out, this is one of the secrets to Google becoming a good search engine. It's not the only trick in the toolbox. And in fact, I'm told that page rank has declined an importance over the years as people have figured out how to gain the system. So it's still there, but there's tons of other factors they look at, something like, I think I heard over a hundred different parameters go into their ranking. Well, it is their livelihood, and they are worth a lot of money. That is very true. This is Kyle Polich for DataSceptic.com, reminding everyone to keep thinking skeptically of and with data. Thanks for joining me, Linda. Thank you. [MUSIC] [MUSIC PLAYING]