Networks, Links, Spammers and Search Engines
Given the importance of ranking high in search results, and the ability to reverse engineer some of the criteria search engines use to compute their rankings, it is not surprising that would-be spammers are continually finding new ways to manipulate the results. In their paper "An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach", Shahzad et al. describe "spamdexing" as an act by spammers to bring "illegally favourable importance or relevance" to their web pages, relative to the pages' true significance. In other words, spammers use various techniques to exploit search engine algorithms so that their own pages are ranked highly instead of the legitimate pages users are actually searching for. Two prominent types of spamming are link-based spam and content-based spam, with link-based spamming being particularly relevant to PageRank as it was discussed in class.

In link-based spamming, it is common for spammers to exchange reciprocal links, where two spammers' sites link to each other. Spammers can engage in reciprocal linking with many other spammers, creating a large component of fraudulent sites all connected to one another. Beyond merely exchanging links, the owner of a site x with high PageRank can also sell links to another spammer's site y, linking x to y in exchange for a payment. As detailed in the paper, services offering links for payment often sell other social media boosts as well, such as Instagram or Twitter followers. As one might imagine, the practice of paying for backlinks is strictly banned by search engines.

To detect spammers based on their links, the authors suggest using a "paid-links database". This database includes the sites of any services providing paid backlinks, so when assessing the legitimacy of a site, one could check whether it has any backlinks from a paid-link site in the database; if so, it would be considered fraudulent. The one potential problem I see with this approach is that it allows owners of paid-link services to sabotage legitimate site owners by linking to their sites without permission. Legitimate sites would thus be caught in the crossfire if no further checks were performed to determine which sites are not fraudulent. Given a site with both outgoing and incoming links to other fraudulent sites, a more certain judgement about that site's authenticity could be made. Despite these minor misgivings, the general idea of having a web crawler traverse a network of linked sites is promising, as having links both to and from other fraudulent sites is rather unlikely for a legitimate site.
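To see concretely how a farm of reciprocally linked sites can inflate the rank of a spammer's page, here is a small PageRank sketch. Everything in it is my own illustration rather than something taken from the paper: the toy graph, the page names, and the use of the networkx library are assumptions made purely for demonstration.

    import networkx as nx

    # A tiny toy web: a few legitimate pages that link to one another
    # organically, plus a "link farm" of spam pages that all reciprocally
    # link to each other and to a target page the spammer wants boosted.
    G = nx.DiGraph()
    G.add_edges_from([
        ("news", "blog"), ("blog", "news"),   # legitimate reciprocal pair
        ("news", "wiki"), ("blog", "wiki"),   # both point at a genuinely popular page
    ])

    spam_farm = [f"spam{i}" for i in range(20)]
    for a in spam_farm:
        G.add_edge(a, "target")               # every farm page boosts the target
        for b in spam_farm:
            if a != b:
                G.add_edge(a, b)              # reciprocal links inside the farm

    ranks = nx.pagerank(G, alpha=0.85)
    for page in ("wiki", "target"):
        print(page, round(ranks[page], 4))

In this toy example the spammer's "target" page ends up with a higher PageRank score than the genuinely popular "wiki" page, even though no legitimate site links to it; its rank comes entirely from the dense web of fraudulent reciprocal links.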
Though the structure of the network that any given web page resides in is very important to understanding that page's relevance to a given search query, network analysis alone is insufficient to capture more specific features of sites. Therefore, search engines like Google also analyze the content of websites, including their HTML markup and CSS styles, in order to provide accurate search results. While spam websites can be identified merely by their position in a network, further analysis of their content can provide more evidence of their spamminess. Shahzad et al., in developing a model for detecting spam, identify many of these potential signs of spam. Some of these factors include a high ratio of links to content, the lack of a favicon (the icon you see in the corner of a tab when you visit a site), hidden links in headers, footers, and toolbars, and the presence of "spammy keywords". While a frequent internet user is likely to pick up on these signs of spam, it is still in search engines' interest to limit it, so as not to waste users' time or allow tech-illiterate users to be scammed.

Whereas many of these signs of spam are immediately visible to humans, other acts of search engine manipulation are hidden in the code and explicitly banned by Google. Foremost among these techniques is the use of hidden text or links, which can be accomplished, for example, by placing white text against a white background. A web crawler that does not check how the page is actually rendered would pick up the text even though users never see it, so the content fed into the search algorithm would not match the user's experience when browsing the site. Such clever manipulations are rampant given the importance of high search rankings, and thus there is likely to be a continual arms race between the Googles of the world and the spammers.
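As a rough illustration of how a few of these content-based signals could be extracted in practice, here is a short Python sketch. It is not the authors' actual feature extractor: the keyword list, the white-on-white heuristic, and the use of BeautifulSoup are placeholder assumptions chosen just to make the idea concrete.

    from bs4 import BeautifulSoup

    SPAMMY_WORDS = {"free", "casino", "winner", "cheap", "pills"}  # illustrative only

    def spam_signals(html: str) -> dict:
        """Compute a few simple content-based spam signals from raw HTML."""
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ", strip=True)
        links = soup.find_all("a")

        # 1. Ratio of links to visible text: spam pages are often mostly links.
        link_ratio = len(links) / max(len(text.split()), 1)

        # 2. Favicon declared? Legitimate sites almost always have one.
        has_favicon = any(
            "icon" in (tag.get("rel") or []) for tag in soup.find_all("link")
        )

        # 3. Crude hidden-text check: inline style where white text sits on a
        #    styled background (the classic white-on-white trick).
        hidden_text = any(
            "color:#fff" in (tag.get("style") or "").replace(" ", "").lower()
            and "background" in (tag.get("style") or "").lower()
            for tag in soup.find_all(True)
        )

        # 4. Count of "spammy" keywords in the visible text.
        words = [w.lower().strip(".,!") for w in text.split()]
        spam_word_count = sum(w in SPAMMY_WORDS for w in words)

        return {
            "link_ratio": link_ratio,
            "has_favicon": has_favicon,
            "hidden_text_suspected": hidden_text,
            "spammy_word_count": spam_word_count,
        }

A real detector would combine many more features and feed them into a trained classifier, but even crude checks like these capture the intuition that spam pages tend to look structurally different from legitimate ones.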
Sources
Citation for Journal Article
Asim Shahzad, Nazri Mohd Nawi, Muhammad Zubair Rehman, Abdullah Khan, “An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach”, Complexity, vol. 2021, Article ID 6625739, 18 pages, 2021. https://doi-org.proxy.library.cornell.edu/10.1155/2021/6625739
Google’s Policies on Spam
https://developers.google.com/search/docs/essentials/spam-policies#hidden-text-and-links