Development of Anti-spam PageRank algorithms
Article: https://ieeexplore.ieee.org/document/7960003
In class, we explored web search algorithms, and specifically PageRank, the algorithm the Google search engine uses to rank web pages in search results. It rests on the assumption that a highly ranked page is one that is referenced often, and referenced by other highly ranked pages. In class and in the textbook, we looked into the details of the basic PageRank update rule and explored the idea of “cheating the system”: how a good understanding of PageRank can be exploited to give pages a higher ranking than they deserve.
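To make the "cheating" idea concrete, here is a minimal sketch of a scaled PageRank update in Python, run on a toy graph before and after a small link farm is pointed at one page. The graph, the page names, the damping value of 0.85, and the iteration count are all assumptions chosen for illustration; this is not code from the article or from any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Scaled PageRank on a dict mapping each page to the pages it links to.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the "teleport" portion.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to "target".
honest_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}

# Same web plus five hypothetical farm pages that all link to "target".
farmed_web = dict(honest_web)
for i in range(5):
    farmed_web[f"farm{i}"] = ["target"]

before = pagerank(honest_web)["target"]
after = pagerank(farmed_web)["target"]
print(f"target rank without farm: {before:.4f}")
print(f"target rank with farm:    {after:.4f}")
```

In this toy setup the farm pages have no legitimate in-links of their own, yet the small rank each one receives through the teleport term is funneled to the target, which is the effect a link farm is built to produce.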
This article, published in 2017 and presented at the IEEE/ACIS International Conference on Computer and Information Science, describes an algorithm that improves on PageRank in order to prevent websites from using link spamming and other forms of cheating to earn a higher rank than they deserve. The article discusses link farms, a common method of link spamming, described as “a group of meticulously organized web pages where many cheating pages hyperlink to other sites to increase the link popularity of other sites”. Existing algorithms that improve on PageRank with the same goal, such as TrustRank and BadRank, rely on the assumption that good, “non-cheating” sites link to other good sites, while bad spam sites link to other bad sites, as in a link farm. These algorithms therefore use a whitelist or blacklist to assess how likely a page is to be manipulating its rank, which lets them account for link farms and similar methods of web spamming. The new algorithm presented in the article combines strategies from other algorithms, including the whitelist and blacklist systems of TrustRank and BadRank, and shows promising results. That the article was published in 2017 also shows that research continues to advance in this area of information science and networking. As we gain a better understanding of the web and of how search engines operate, more methods of attacking or abusing the system will surface, prompting further research to combat them.
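The whitelist idea behind TrustRank can be sketched as a PageRank variant in which the teleport portion goes only to a hand-picked set of trusted seed pages, so trust can reach a page only through chains of links that start at a seed. The sketch below is a simplified illustration of that general idea, not the algorithm proposed in the article; the graph, the seed set, and the parameter values are hypothetical.

```python
def trust_scores(links, trusted_seeds, damping=0.85, iterations=50):
    """TrustRank-style scores: teleportation goes only to trusted seeds.

    `links` maps each page to the pages it links to; every link target
    must also appear as a key in `links`.
    """
    pages = list(links)
    seed_share = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed_share)
    for _ in range(iterations):
        # Only whitelisted seeds receive the teleport portion.
        new_trust = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # simplification: trust held by dead ends is dropped
            share = damping * trust[page] / len(outgoing)
            for target in outgoing:
                new_trust[target] += share
        trust = new_trust
    return trust


# Hypothetical web: a, b, c link among themselves; a separate link farm
# boosts "target", and no good page links into the farm.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "target": ["farm0"],
    "farm0": ["target"],
    "farm1": ["target"],
    "farm2": ["target"],
}

trust = trust_scores(web, trusted_seeds={"a"})
for page in sorted(trust, key=trust.get, reverse=True):
    print(f"{page:7s} {trust[page]:.4f}")
```

Because no trusted page links into the farm, the farm pages and their target never receive any trust in this sketch, no matter how densely they link to one another. A blacklist approach in the spirit of BadRank would work in the opposite direction, starting from known spam pages and penalizing pages that link to them.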