Limitations of Web Crawlers and How They Affect Search Engine Performance

https://www.google.com/search/howsearchworks/

During lecture and in the textbook, one of the topics that came up was how search engines rank the results of a particular query. The methods that were discussed (Hubs and Authorities and PageRank) both require at least a partial understanding of the overall structure of the internet. I was really interested in how a search engine obtains this graph or network structure in the first place, so I did a bit of digging to see what would come up. While most search engines are rather tight-lipped about the specifics, Google provides a general high-level overview of how it finds webpages on the internet.

The way Google indexes websites on the internet is actually more straightforward than one might imagine: it uses a “web crawler,” a program that visits webpages and discovers new ones either through a sitemap the site provides or simply by following the hyperlinks on each page. This makes sense, because the internet is essentially a directed graph with pages as nodes and hyperlinks as edges. The link information is also essential when applying the search ranking algorithms (Hubs and Authorities and PageRank), which use the pattern of links between pages to help determine how they should be ranked for a query.
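Google does not publish its crawler's code, but the basic idea is a breadth-first traversal of the link graph: start from some seed pages, fetch each one, extract its links, and add any unseen targets to the frontier. Below is a minimal sketch in Python, where the seed URL, page limit, and timeout are placeholder choices, and a real crawler would also respect robots.txt, rate limits, and sitemaps.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    """Breadth-first crawl from a seed URL, returning the link graph
    as an adjacency list {page: [pages it links to]}."""
    graph, frontier, seen = {}, deque([seed]), {seed}
    while frontier and len(graph) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that fail to load or parse
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links and keep only http(s) targets.
        out = []
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).scheme in ("http", "https"):
                out.append(target)
        graph[url] = out
        for target in out:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return graph
```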

Of course, such a method of discovering the network structure can never be exhaustive; some pages are bound to slip through the cracks. I realized that, because of the bow-tie structure of the web, some nodes can never be visited at all, depending on where the web crawler begins its search. The disconnected components with no links to or from the main component will certainly not be found, but there are also nodes in the “IN” set that a crawler will never reach, because there are no links from the main strongly connected component of the internet back to those pages. The same applies to the tendrils that stem from nodes in the “IN” set.
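To make that reachability point concrete, here is a toy bow-tie graph (the page names and link structure are made up purely for illustration). A crawl seeded inside the core strongly connected component can never discover the “IN” pages or their tendrils, because nothing in the core links back to them.

```python
from collections import deque

# A toy bow-tie: "in1" and "in2" link into the core, the core links
# out to "out1", and "tendril" hangs off "in1". No page links back
# into the IN set, so a crawl seeded inside the core never sees it.
web = {
    "in1":     ["core_a", "tendril"],
    "in2":     ["core_b"],
    "tendril": [],
    "core_a":  ["core_b", "out1"],
    "core_b":  ["core_a"],
    "out1":    [],
}

def reachable(graph, seed):
    """Pages a crawler starting at `seed` can ever reach by following links."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        page = frontier.popleft()
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

print(reachable(web, "core_a"))  # {'core_a', 'core_b', 'out1'}: IN and its tendril are missed
print(reachable(web, "in1"))     # seeding inside IN reaches everything except "in2"
```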

Knowing this, how does it affect the performance of the search engine? Based on what I have learned about the search ranking algorithms, I believe the impact would be small. The reason is that the number of links pointing to any node in the “IN” set is necessarily limited compared to a node in the main strongly connected component, since the rest of the web does not link back into “IN.” If Google uses Hubs and Authorities, a node in the “IN” set will have few good hubs pointing to it, because most hubs sit in the main strongly connected component, which has no links back into “IN”; its authority score would therefore be low, and it would not have ranked highly even if it had been crawled. Similarly, under PageRank, nodes in the main strongly connected component receive far more incoming links, and hence far more rank, than any node in the “IN” set, which again would not rank highly. Therefore, the web crawler's inability to find nodes in the “IN” set should not drastically affect search results, and the search engine can still retain reasonable performance.
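As a rough sanity check of that intuition, running a plain power-iteration PageRank on the same toy bow-tie graph (this is the textbook formulation with uniform teleportation, not Google's actual formula) shows the core pages accumulating most of the rank, while the “IN” pages and the tendril get little more than the teleportation baseline, since no page links to them.

```python
# Toy bow-tie graph from the crawling sketch above.
web = {
    "in1": ["core_a", "tendril"], "in2": ["core_b"], "tendril": [],
    "core_a": ["core_b", "out1"], "core_b": ["core_a"], "out1": [],
}

def pagerank(graph, damping=0.85, iters=100):
    """Plain power-iteration PageRank with uniform teleportation.
    Dangling pages (no out-links) spread their rank over all pages."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = graph[p]
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                for t in pages:  # dangling node: distribute its rank evenly
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(f"{page:8s} {score:.3f}")
# The core pages end up with the highest scores; "in1", "in2", and the
# tendril get almost nothing beyond the baseline, since no page links to them.
```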
