Google-bombing – Manipulating PageRank Algorithm and Solutions
As we learned in class, Google uses a secret Information Retrieval (IR) algorithm for its search engine. In addition to the in-link counting we did in class, Google’s ranking also considers the text used in links. If a majority of links use a specific phrase when linking to a target page, querying that phrase will include the target page in the results list. However, Google-bombing, a form of adversarial IR, can decrease the accuracy of search results by manipulating the output of PageRank.
A Google-bomb is the result of an intentional set of actions that increases the target page’s rank and disrupts the accuracy of the system. Here’s how it works: suppose there are four hubs and four authorities. We can achieve a Google-bomb by adding a fake hub and a fake authority and linking the fake hub to all five authorities, the four real ones and the fake one (like Question 4c in Homework4). When we apply the hub update, the hub score of the fake hub will increase rapidly (possibly surpassing the real hubs) because it points to all four real authorities, which have high authority scores. In the next round, the authority score of the fake authority will be boosted by the fake hub. As we do more updates, the fake authority may rank above some of the real authorities and show up higher in related search results, even though it provides nothing valuable to surfers.
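To make the walkthrough above concrete, here is a minimal sketch of the hub and authority updates on a toy graph. The node names, the specific links between the real hubs and authorities, and the number of rounds are all assumptions made for illustration; they are not taken from the homework question.

```python
# Toy hub/authority computation. The graph below is hypothetical.
REAL_EDGES = [
    ("H1", "A1"), ("H1", "A2"),
    ("H2", "A2"), ("H2", "A3"),
    ("H3", "A3"), ("H3", "A4"),
    ("H4", "A1"), ("H4", "A4"),
]
# The fake hub links to every real authority and to the fake authority.
FAKE_EDGES = [("FakeHub", a) for a in ("A1", "A2", "A3", "A4", "FakeAuth")]


def hits(edges, rounds=20):
    """Run `rounds` of authority and hub updates, normalizing each round."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(rounds):
        # Authority update: sum the hub scores of pages that link to n.
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        # Hub update: sum the authority scores of pages that n links to.
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        # Normalize so the scores stay comparable across rounds.
        a_total, h_total = sum(auth.values()), sum(hub.values())
        auth = {n: v / a_total for n, v in auth.items()}
        hub = {n: v / h_total for n, v in hub.items()}
    return hub, auth


hub_scores, auth_scores = hits(REAL_EDGES + FAKE_EDGES)
for name in sorted(auth_scores, key=auth_scores.get, reverse=True):
    print(f"{name:8s} hub={hub_scores[name]:.3f} auth={auth_scores[name]:.3f}")
```

In this toy graph the fake hub ends up with the top hub score, and the fake authority goes from having no in-links at all to holding a positive authority score. It does not overtake the real authorities here, since each of them also receives the fake hub's vote.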
The rise of Google-bombing was observed throughout the early 2000s. While initially considered a prank, it has found serious applications in political and commercial circles, where it is used to associate competitors, rivals, and enemies with negative or derogatory terms. By 2007, Google had neutralized many of the known Google-bombing key phrases by modifying the PageRank algorithm. Despite Google’s secrecy, we can reasonably guess at some of the algorithmic changes used to defuse Google-bombs.
Linker Reputability: When an author repeatedly links to the same page, the rank of the target still increases, regardless of whether the page is relevant. This general problem can be addressed by evaluating the reputability of the linker. Certain websites could be given default reputability rankings (e.g., government (.gov) and education (.edu) sites might be treated as more reliable than a .com page). Reputability can also be defined by a threshold, where only the most popular pages are considered reliable, on the grounds that most users visit them and rely on them for information.
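One way to picture linker reputability is a weighted in-link count. The sketch below is only a guess at the idea, not Google's actual method: the weight values, the TLD table, and the rule that each linking host counts once are all assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical default reputability by top-level domain.
TLD_WEIGHT = {"gov": 1.0, "edu": 1.0}
DEFAULT_WEIGHT = 0.25  # assumed weight for everything else, e.g. .com


def linker_weight(url):
    """Return the assumed reputability weight of the site behind `url`."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_WEIGHT.get(tld, DEFAULT_WEIGHT)


def weighted_inlink_score(linking_urls):
    """Sum reputability weights, counting each linking host only once."""
    weight_by_host = {}
    for url in linking_urls:
        host = urlparse(url).hostname or ""
        # Repeated links from the same host overwrite, not accumulate.
        weight_by_host[host] = linker_weight(url)
    return sum(weight_by_host.values())


# One blog linking three times versus two distinct trusted sites.
repeat_linker = [
    "http://blog.example.com/post1",
    "http://blog.example.com/post2",
    "http://blog.example.com/post3",
]
trusted_linkers = [
    "https://www.example.edu/reading-list",
    "https://agency.example.gov/resources",
]
print(weighted_inlink_score(repeat_linker))
print(weighted_inlink_score(trusted_linkers))
```

Under these assumptions, the blog that links three times contributes less than a single trusted site, so repeating a link no longer raises the target's score.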
Link Text Analysis: If a majority of incoming links have the same text but that text is in no way related to the page, the authors can artificially increase the ranking of the page for that text. To combat the abuse of link text, first check the target page for the terms in the link text. If the target page contains none of the link terms, it is likely that the information in the link has nothing to do with the target page, and therefore the link should not be included in the link analysis for that page. Alternatively, a history of link text could be maintained, tracking the types of link terms used when linking to the target page. If there is a sudden shift in both the number of links and the “average” link text, it may be a Google-bomb.
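The first check described above, comparing link terms against the target page, can be sketched in a few lines. This is a simplified illustration: the tokenizer, the 0.5 threshold, and the example page and link texts are all assumptions, and it does not cover the link-history idea.

```python
import re


def terms(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevant_links(page_text, link_texts):
    """Keep only link texts that share at least one term with the page."""
    page_terms = terms(page_text)
    return [link for link in link_texts if terms(link) & page_terms]


def looks_like_bomb(page_text, link_texts, threshold=0.5):
    """Flag the page if more than `threshold` of its links are unrelated."""
    if not link_texts:
        return False
    kept = relevant_links(page_text, link_texts)
    unrelated_fraction = 1 - len(kept) / len(link_texts)
    return unrelated_fraction > threshold


# Hypothetical target page and the text of its incoming links.
page = "Tony's Pizzeria: menu, opening hours, and directions"
incoming = ["worst pizza ever"] * 3 + ["pizzeria menu"]

print(relevant_links(page, incoming))
print(looks_like_bomb(page, incoming))
```

A real system would need more than exact term overlap: common words such as "and" would count as a match here, and related forms such as "pizza" and "pizzeria" would not, so stop-word removal and stemming are natural next steps.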
Looking back over the development of the Internet, the Google-bomb is one of many phenomena in which individuals find bugs in a system or take advantage of it for their own interests (or maybe just to be cool). While such exploits certainly cause problems, they also point developers toward ways to improve their products and push them to think harder about issues like system safety and information privacy when building future ones.
Source: https://pdfs.semanticscholar.org/2bce/4885f4d27923acc283af760027cec94ccbdf.pdf