Crawling Across The Web
As we discussed in class, Google uses the PageRank algorithm to rank pages in its search results, drawing on a variety of factors, including link-based measures such as hub and authority scores, which depend on the number of links to and from a site. But how does Google collect all this information about the state of the World Wide Web? One of the biggest challenges facing search engines is how rapidly the web evolves. Sites that exist today may not exist tomorrow, and not every site that will exist tomorrow exists today. Further, site content, and in particular links to other sites, may change even more frequently. For this reason, it is not useful to collect a “snapshot” of the web at a given moment and keep using it as a reference for any extended period. The snapshot would become inaccurate very quickly.
To get around this, search engines deploy what are known as crawlers or spiders. A crawler is a bot that travels through the web and records any changes it observes in pages, links, and data since the last time it passed that location. It starts on a single web page and maintains a list of the pages linked from it that have changed since its last visit. It then visits each page in the list in turn, appending the changed links it finds on those pages to the list as well.
If Google had only one crawler, it would take an extraordinarily long time to pass through any significant portion of the web, and by the time it finished, most of the results it had gathered would be outdated anyway. So Google deploys many crawlers simultaneously, which it calls Googlebots. They all start from different locations and traverse the web through the associative memory layout we discussed in class, by following links on the pages they visit (for the web developers, Google’s crawler looks at both ‘href’ and ‘src’ links). Although Google does not state how frequently its Googlebots visit a site (this depends on several properties of the site), it does give an upper bound, stating that “For most sites, Googlebot shouldn’t access your site more than once every few seconds on average,” a rate I found surprisingly high.
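To make the crawling loop described above concrete, here is a minimal sketch in Python. It is not Google's implementation. It assumes a toy, in-memory "web" (the TOY_WEB dictionary and its example domains are made up), uses a queue rather than recursion, and simplifies change tracking: it visits every reachable page and flags the ones whose content or links differ from the previous crawl, using a hash as the fingerprint.

```python
from collections import deque
import hashlib

# Hypothetical toy web: URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "a.example": ("home", ["b.example", "c.example"]),
    "b.example": ("news", ["c.example"]),
    "c.example": ("about", ["a.example"]),
}


def fingerprint(content, links):
    """Hash a page's text and links so later crawls can detect changes."""
    payload = content + "\n" + "\n".join(sorted(links))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def crawl(seed, fetch, last_seen):
    """Crawl outward from `seed` by following links, breadth first.

    `fetch(url)` returns (content, links), or None if the page is gone.
    `last_seen` maps URL -> fingerprint recorded by the previous crawl.
    Returns (seen, changed): the new fingerprints, and the URLs whose
    fingerprint is new or different, in the order they were visited.
    """
    frontier = deque([seed])  # pages waiting to be visited
    visited = {seed}          # pages already queued, to avoid loops
    seen = {}
    changed = []
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue  # the page no longer exists (or never did)
        content, links = page
        digest = fingerprint(content, links)
        seen[url] = digest
        if last_seen.get(url) != digest:
            changed.append(url)
        for link in links:
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return seen, changed


if __name__ == "__main__":
    # First crawl: nothing has been seen before, so every page is "changed".
    seen, changed = crawl("a.example", TOY_WEB.get, {})
    print(changed)
```

Passing the fingerprints from one crawl as last_seen to the next is what lets the second pass report only the pages that changed in between. Running several such crawls from different seed pages would be the analogue of deploying many crawlers at once.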
Crawling thus takes advantage of the associative memory arrangement of the web to keep Google’s map of the web up to date, and Google then uses that map to provide the best search results it can.
Source: https://support.google.com/webmasters/answer/182072