Machine Learning vs. Page Ranking (2006)
Source: http://courses.cse.tamu.edu/caverlee/csce470/beyondpagerank.pdf
Technology moves fast, and it’s no surprise that PageRank, Google’s breakthrough algorithm for its main product, has since been supplanted by more refined technologies. In this paper, Microsoft researchers (in 2006!) showed that their machine learning-based approach outperformed PageRank at static ranking. Static ranking is a query-independent, non-personalized ordering of pages, basically what Google looks like if we aren’t logged in and if we assume Google isn’t tracking us regardless, which I’m not so sure is true in this day and age. Anyway, in 2006, PageRank and HITS (hubs and authorities) were supposedly the hottest ranking algorithms. The problem with PageRank, as the researchers explain, was that it could easily be manipulated: just create pages that point to the page you want ranked higher. If you understood the relatively simple algorithm, you could beat Google at the time. Another example of manipulating search results shows up in the textbook, where two pages link only to each other and hog the PageRank score.
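To see how cheap the trick is, here’s a toy power-iteration PageRank in Python. This is my own sketch (not anything from the paper): a tiny three-page web, then the same web plus a handful of throwaway pages all pointing at the page we want to boost.

```python
# Illustrative sketch of PageRank manipulation via a link farm (made-up example).
import numpy as np

def pagerank(links, n, d=0.85, iters=100):
    """links: dict node -> list of nodes it points to; n: number of nodes."""
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - d) / n)
        for src, outs in links.items():
            if outs:
                # distribute src's score evenly over its out-links
                share = d * pr[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:
                # dangling node: spread its score evenly over all pages
                new += d * pr[src] / n
        pr = new
    return pr

# Honest 3-page web: pages 0 and 1 point to 2, page 2 points back to 0.
honest = {0: [2], 1: [2], 2: [0]}
print(pagerank(honest, 3))   # page 2 comes out on top (~0.49 vs ~0.46 for page 0)

# Same web plus a 5-page "link farm" (nodes 3..7) all pointing at page 0.
farm = {0: [2], 1: [2], 2: [0], **{i: [0] for i in range(3, 8)}}
print(pagerank(farm, 8))     # page 0 now outranks page 2, purely from junk inlinks
```

A few junk pages nobody ever visits are enough to reorder the results, which is exactly the weakness the Microsoft researchers were going after.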
Cue machine learning. The Microsoft researchers thought ML would be a good replacement for PageRank: a learned model is more complex and therefore harder for spammers to reverse-engineer and game, and there was a boatload of search data available at the time. The researchers applied RankNet using four groups of features: page-level, domain-level, inlinks, and popularity. (Page features are things found in the page itself, like the number of words or the most frequent word; domain features are aggregated across all pages in a domain; inlink features are based on the links pointing to the page; and popularity is how many users visited the page.) ML problems need a dataset with a ground truth to train on, so the researchers asked a bunch of human judges to rate webpages pulled from a bunch of searches. They then trained a two-layer neural network on that data to produce a static score, and thus a ranking, for each webpage.
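For the curious, here’s roughly what a RankNet-style pairwise setup looks like in PyTorch. This is my own reconstruction, not the paper’s code: the feature vectors, dimensions, and hyperparameters are made up, but the core idea matches RankNet, score two pages and push the human-preferred page’s score above the other via a cross-entropy loss on the score difference.

```python
# Minimal RankNet-style pairwise trainer (illustrative reconstruction, not the paper's code).
import torch
import torch.nn as nn

class TwoLayerScorer(nn.Module):
    """Two-layer network producing a single static-rank score per page."""
    def __init__(self, n_features: int, hidden: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ranknet_loss(score_better, score_worse):
    # RankNet models P(better ranks above worse) = sigmoid(s_better - s_worse)
    # and trains with cross-entropy against the human-judged ordering (target = 1).
    return nn.functional.binary_cross_entropy_with_logits(
        score_better - score_worse, torch.ones_like(score_better))

# Toy training loop on random "pages"; in the real setup the 8 features would be
# drawn from the page, domain, inlink, and popularity groups described above.
n_features = 8
model = TwoLayerScorer(n_features)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_better = torch.randn(256, n_features)   # features of the preferred pages
x_worse = torch.randn(256, n_features)    # features of the less-preferred pages

for step in range(100):
    opt.zero_grad()
    loss = ranknet_loss(model(x_better), model(x_worse))
    loss.backward()
    opt.step()
```

Once trained, the network is run on every page once to get its static score, so ranking at query time is just a sort, no per-query link analysis needed.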
It turns out that their method outperformed PageRank, ~67% vs. ~57%, in pairwise accuracy, i.e., the fraction of human-judged page pairs that the model orders correctly. The researchers finish the paper by discussing how to improve the method and the next steps for static ranking. This seemed like a pretty big step forward, but I think Google must’ve figured something out later on, since none of us really use Bing…
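For reference, the pairwise accuracy metric is simple enough to sketch in a few lines (my own illustration, with made-up scores and judged pairs):

```python
# Pairwise accuracy: share of judged pairs (better, worse) ordered correctly by the model.
def pairwise_accuracy(scores, judged_pairs):
    """scores: dict page -> model score; judged_pairs: list of (better, worse) pages."""
    correct = sum(scores[better] > scores[worse] for better, worse in judged_pairs)
    return correct / len(judged_pairs)

# Example: 2 of 3 judged pairs ordered correctly -> ~0.67
scores = {"a": 0.9, "b": 0.4, "c": 0.6}
print(pairwise_accuracy(scores, [("a", "b"), ("a", "c"), ("b", "c")]))
```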