Manipulating PageRank — Adversarial IR and the rise of the Google-bomb

Peter Hamilton’s “Google-bombing – Manipulating the PageRank Algorithm” surveys a variety of adversarial information retrieval (IR) strategies, that is, techniques used to manipulate information retrieval algorithms into prioritizing specific pages and documents. In particular, it discusses the so-called “Google Bomb”, a set of such strategies aimed at Google’s PageRank algorithm. Hamilton explains some basic Google Bomb strategies and speculates on how Google has adjusted PageRank in response.

As discussed in class, PageRank assigns pages different weights based primarily on two criteria: how frequently they are linked to, and the weights of the pages that provide these links. Pages that PageRank ranks highly are linked to frequently, and by highly weighted sources. Google Bombers manipulate PageRank by filling documents with a large number of common search terms and setting up large, heavily cross-referential groupings of these pages. Because every page in such a grouping passes weight to the others, the grouping artificially inflates the perceived quality of its pages and, in turn, the ranking assigned to them.
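To make the effect of a cross-referential grouping concrete, here is a minimal sketch of the textbook PageRank iteration applied to a tiny made-up web. It is not Google's production algorithm or anything taken from Hamilton's paper; the page names, the damping factor of 0.85, the ring-shaped link farm, and the function name `pagerank` are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web: "spam" has no inbound links at all.
base = {
    "news": ["wiki"],
    "wiki": ["news"],
    "blog": ["news", "wiki"],
    "spam": ["news"],
}

# Hypothetical link farm: ten pages in a ring, each also linking to "spam".
farm = {f"farm{i}": ["spam", f"farm{(i + 1) % 10}"] for i in range(10)}

base_rank = pagerank(base)
farm_rank = pagerank({**base, **farm})
```

In this toy graph, "spam" and "blog" start out with the same rank because neither has inbound links. Once the farm is added, "spam" collects rank from every farm page and overtakes "blog", even though no legitimate page links to it.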

Also important is the PageRank algorithm’s assumption that the text near a given link is related to the content of the page it points to. By identifying the terms that are commonly used when linking to a specific page, PageRank is able to assemble a list of queries that should point to this page. Google Bombers exploit this assumption by filling each page in a grouping of interconnected spam pages with long strings of text unrelated to the content of the pages in the grouping. If done effectively, PageRank is manipulated into displaying these irrelevant pages, instead of relevant content, when given a query related to these strings of text.
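The link-text assumption can be sketched as a small index from link terms to the pages those links point at. This is only an illustration of the idea described above, not how Google actually stores or scores link text; the function names `build_anchor_index` and `lookup`, the `.example` hosts, and the sample phrases are hypothetical.

```python
from collections import Counter, defaultdict


def build_anchor_index(links):
    """Map each link-text term to a count of the pages it is used to link to.

    links is an iterable of (source_page, link_text, target_page) tuples.
    """
    index = defaultdict(Counter)
    for _source, link_text, target in links:
        for term in link_text.lower().split():
            index[term][target] += 1
    return index


def lookup(index, query):
    """Return target pages ordered by how often the query terms link to them."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    return [page for page, _count in scores.most_common()]


# One honest link, plus twenty spam pages that all use the same unrelated
# phrase to link to the same target.
links = [("shop.example", "cheap widgets catalogue", "widgets.example")]
links += [
    (f"spam{i}.example", "cheap widgets", "bio.example") for i in range(20)
]

index = build_anchor_index(links)
results = lookup(index, "cheap widgets")
```

Here the biography page comes first for a query it has nothing to do with, purely because many interlinked pages describe it with that phrase.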

Hamilton explains that a first step toward thwarting the Google Bomb is to increase the rankings of pages that can be confirmed (or are at least more likely) to be reputable. For example, content hosted under top-level domains that are harder to register (e.g. .edu or .gov websites) can generally be assumed to be more consistently reputable than content hosted on .com or .org domains. Another strategy for deterring the Google Bomb is to compare the terms used in linking and linked pages: if these differ drastically, it is likely that the link text manipulation strategy described above is in play. Reducing the ranking of sites that appear to use this strategy is a step towards ensuring accurate search results.
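The term-comparison idea can be sketched with a simple set overlap between the words of the linking page and the words of the linked page. The Jaccard measure, the 0.2 threshold, and the names `term_overlap` and `looks_manipulated` are assumptions made for this example; the source does not say which measure or cutoff Google uses.

```python
import re


def terms(text):
    """Lowercase a text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_overlap(linking_text, linked_text):
    """Jaccard overlap between the terms of the linking and linked pages."""
    a, b = terms(linking_text), terms(linked_text)
    union = a | b
    if not union:
        # Two empty pages give no evidence of a mismatch.
        return 1.0
    return len(a & b) / len(union)


def looks_manipulated(linking_text, linked_text, threshold=0.2):
    """Flag a link whose two pages share almost no vocabulary.

    The threshold is an arbitrary illustrative value, not a known setting.
    """
    return term_overlap(linking_text, linked_text) < threshold


target = "Biography of a university professor of physics"
spam_flagged = looks_manipulated("cheap widgets buy cheap widgets now", target)
honest_flagged = looks_manipulated(
    "A short biography of the physics professor", target
)
```

A real system would need to weight rare terms more heavily and tolerate legitimate links between pages on different topics, so a flag like this would more plausibly lower a ranking than remove a page outright.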

https://pdfs.semanticscholar.org/2bce/4885f4d27923acc283af760027cec94ccbdf.pdf
