A Matrix View of PageRank

http://www.ams.org/samplings/feature-column/fcarc-pagerank

This article discusses the methods behind Google’s PageRank algorithm for ranking the value of webpages for search. Because the web has grown to a massive size of roughly 25 billion pages, search is no trivial task, and calculating the rankings of these pages is a daunting computation that most people take for granted when using Google every day. Because the web is constantly changing, Google must recompute the PageRank values on a regular basis, and doing so demands an efficient way to calculate them. The article describes how the power method and the eigenvectors of the PageRank matrix can be used to efficiently find the ranks of all the webpages on the World Wide Web. The matrix contains a numerical description of the links between pages: if there is a link from page j to page i, then the (i, j) entry of the matrix is nonzero. It turns out that most of the entries of this 25-billion-row-and-column matrix are zero, and this sparsity makes the necessary eigenvector easiest to compute through the power method. Of course, a couple of concerns needed to be addressed in this approach, such as dead ends and sub-webs that would make the computation unfairly favor parts of the web. To fix this, probabilities were added to the matrix so that the computed PageRank eigenvector contains the information we desire.
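To make the power method concrete, here is a minimal sketch on a toy four-page web (the link matrix below is illustrative, not taken from the article). Entry M[i][j] is the probability of moving from page j to page i, so each column of M sums to 1; repeatedly multiplying a start vector by M converges to the dominant eigenvector, which is the PageRank vector.

```python
# Toy column-stochastic link matrix for a 4-page web (illustrative values).
# M[i][j] is nonzero when page j links to page i.
M = [
    [0.0, 0.5, 0.0, 0.0],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.5, 1.0, 0.0],
]

def power_method(M, iterations=100):
    """Apply M repeatedly to a uniform start vector; the result
    converges to the dominant eigenvector, i.e. the PageRank vector."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

ranks = power_method(M)
# ranks sums to 1; larger entries mean higher PageRank.
```

In the real computation the 25-billion-column matrix is stored sparsely, so each iteration only touches the nonzero entries, which is what makes the power method practical at that scale.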

The math behind this process can become rather cumbersome to delve into, but the article offers an analogy to help understand the concept behind the scenes. The article explains how the algorithm uses the idea of random web surfers who traverse the links of the web. These surfers sometimes get stuck at dead-end nodes, in which case they are allowed to jump to any other node in the network with equal probability, which avoids dead ends. They may also get stuck in sub-webs with no escape, in which case they either continue traversing that part of the web or, with a certain probability, jump out to a random page elsewhere. These surfers spend disproportionately more time at particularly good pages, which is what ranks those pages higher.
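The random-surfer fix can be sketched as building a modified matrix: with probability d the surfer follows a link, and with probability 1 - d they teleport to a uniformly random page. The value d = 0.85 below is the commonly cited damping choice, and the toy links are illustrative, not from the article.

```python
def google_matrix(links, n, d=0.85):
    """Build the dense matrix G = d*S + (1-d)/n, where S follows the
    links and treats a dangling page (no out-links) as linking to
    every page with equal probability."""
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        out = links.get(j, [])
        for i in range(n):
            if not out:                        # dead end: jump anywhere
                follow = 1.0 / n
            else:
                follow = (1.0 / len(out)) if i in out else 0.0
            G[i][j] = d * follow + (1 - d) / n
    return G

# Page 2 is a dead end; the teleportation term keeps rank flowing.
links = {0: [1, 2], 1: [2]}
G = google_matrix(links, n=3)
# Every column of G sums to 1, so the power method converges as before.
```

Because every entry of G is strictly positive, the modified chain has a unique equilibrium, which is exactly why the added probabilities make the computed PageRank eigenvector well defined.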

This relates directly to our discussions of PageRank in class and how search is done on the web. We discussed how to compute PageRank values on small networks and the idea of PageRank flowing through the links of the network as it tries to reach an equilibrium. We also discussed how dead ends affect the computation and how this can be addressed by returning portions of the PageRank to the top of the network so it can flow down again. This article goes into depth about how Google actually computes the PageRank values on such a large network of webpages. It digs deeper into the mathematical side of the process to help gain a better understanding of how search on the web works.

October 2014