Preferential Crawlers
A web crawler is “an internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.” Among its many uses, it supports universal search engines, vertical search engines, and business intelligence. One type of crawler, the preferential crawler, is selectively biased toward the most relevant pages, or the pages with the largest PageRank. Two examples of preferential crawling algorithms that I will discuss in this post are breadth-first search (BFS) and PageRank ordering. BFS, which is implemented with a queue, is a type of graph traversal that finds pages along the shortest paths; it exhaustively visits all the links in the order encountered. PageRank ordering instead uses a priority queue sorted by an estimate of page importance, such as keyword relevance or PageRank, a concept we covered in class.
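The BFS crawl described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: `get_links` is a hypothetical function standing in for an HTTP fetch plus link extraction.

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: visit pages in the order their links are found.

    `get_links(url)` is a hypothetical callback returning a page's
    outgoing links; a real crawler would fetch and parse the page here.
    """
    queue = deque([seed])      # FIFO queue gives breadth-first order
    visited = {seed}           # avoid re-queueing pages already seen
    order = []                 # pages in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order
```

Because the queue is first-in-first-out, pages one link away from the seed are all visited before any page two links away, which is exactly the shortest-path property mentioned above.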
In 1998, Cho, Garcia-Molina, and Page conducted a series of crawls on 179,000 pages from the stanford.edu domain using four different ordering metrics: breadth-first, backlink count, partial PageRank calculations, and random. Cho et al. determined that ordering by partial PageRank calculations was the most effective at steering the crawler toward the “hot” pages — pages with the highest PageRank, or those with a high number of links pointing to them — with breadth-first search a close second.
In 2001, Najork and Wiener conducted a similar study on 328 million pages using two tools, Mercator and the Connectivity Server 2, to measure the quality of pages discovered under breadth-first ordering. However, instead of the conventional normalization in which all the PageRanks sum to one, as we have done in class, they scaled the PageRanks to sum to the number of nodes in the graph — 351 million — so that a page has an average PageRank of 1. Najork et al. found that the average PageRank of pages crawled on the first day was 12.1, dropping to 1.8 on the second day and declining to 0.6 by the last day. They therefore concluded that breadth-first ordering finds the highest-quality pages during the early stages of the crawl.
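The normalization Najork and Wiener used can be shown with a small power-iteration PageRank. This is a sketch under the stated convention — scores sum to N so the average page has rank 1 — not the implementation from either paper; the damping factor and the uniform handling of dangling pages are standard assumptions on my part.

```python
def pagerank_avg_one(graph, damping=0.85, iters=50):
    """Power-iteration PageRank scaled so the scores sum to N,
    i.e. the average page has rank exactly 1 (the Najork/Wiener
    convention), instead of the usual sum-to-1 normalization.

    `graph` maps each node to a list of its outgoing links.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 for u in nodes}            # start every page at the average
    for _ in range(iters):
        # Teleport term: (1 - d) per page, so it contributes (1 - d) * n total.
        new = {u: 1.0 - damping for u in nodes}
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        for u in nodes:
            if graph[u]:
                share = rank[u] / len(graph[u])
                for v in graph[u]:
                    new[v] += damping * share
        for u in nodes:
            new[u] += damping * dangling / n
        rank = new
    return rank
```

Each iteration preserves the invariant that the ranks sum to n, so the average stays at 1; a page with rank 12.1, like the first-day pages in the study, is simply 12.1 times more important than an average page.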
Sources:
- http://en.wikipedia.org/wiki/Web_crawler
- http://www.cis.uni-muenchen.de/~yeong/Kurse/ss09/WebDataMining/kap8_rev.pdf
- http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
- http://www10.org/cdrom/papers/208/