Crawling a spider’s web of data
In two decades the internet has grown from a small collection of academic hubs communicating with each other through hyperlinks into a full-blown entity that has permeated every part of our lives, and it is now an integral tool not only for information retrieval but also for entertainment and social networking. Billions of web pages interlink across a network connected through fiber-optic cables, each one accessible with a simple click of the mouse. But the question that plagued the minds of early adopters of the internet was how to organize all this information and retrieve the relevant parts. Enter the search engine.
A search engine uses automated programs called “crawlers” that travel from web page to web page through links, making a note not only of the links but also of the content of each page. The vast trove of content and links is then indexed, much like the index at the back of a book. An index is a giant catalogue or inventory of websites containing a copy of every web page and file that a crawler finds, and it is dynamically updated to account for any changes. Once an index is built, users like you and me query it for information that is relevant to us, and it is this retrieval algorithm that is the secret sauce of every search engine.
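To make the crawl-and-index idea concrete, here is a minimal sketch in Python (not from the original article): it treats a handful of hypothetical pages as an in-memory “web”, follows links breadth-first, and builds a simple inverted index mapping each word to the pages that contain it.

```python
from collections import deque, defaultdict

# A toy "web": page name -> (text content, list of outgoing links).
# These pages and links are purely hypothetical, for illustration only.
TOY_WEB = {
    "home":  ("welcome to the toy web",        ["about", "news"]),
    "about": ("about this tiny toy web site",  ["home"]),
    "news":  ("news about the web and search", ["home", "about"]),
}

def crawl_and_index(start_page):
    """Breadth-first crawl from start_page, building an inverted index."""
    index = defaultdict(set)      # word -> set of pages containing it
    visited = set()
    queue = deque([start_page])
    while queue:
        page = queue.popleft()
        if page in visited or page not in TOY_WEB:
            continue
        visited.add(page)
        text, links = TOY_WEB[page]
        for word in text.split():  # note the content of each page
            index[word].add(page)
        queue.extend(links)        # travel from page to page through links
    return index

if __name__ == "__main__":
    index = crawl_and_index("home")
    print(sorted(index["web"]))    # pages where the word "web" appears
```

A real crawler fetches pages over HTTP, respects site policies, and stores far richer data than a bag of words, but the loop of “follow the links, record the content” is the same.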
Google, the most popular search engine, revolutionized the way search engines retrieve information by implementing the PageRank algorithm. Consider the internet as a vast network with nodes corresponding to web pages and edges corresponding to links. A Markov model represents this network with a square matrix P whose entry p_ij is the probability of moving from state i (page i) to state j (page j). Google’s insight is that when one page links to another, it is effectively casting a vote for that page. The more votes a page receives, the more important it must be, and the importance of the page casting the vote determines how much the vote itself counts. From these votes Google calculates each page’s importance. Through this simplified analogy, Google computes the PageRank vector and uses it to rank websites in order of relevance when a query is processed. The incredible thing is that the ranking for any given query is returned in a matter of milliseconds.
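The voting idea translates directly into a small computation. The sketch below (my own illustration, not code from the paper) runs power iteration on a hypothetical three-page graph, using the commonly used normalized form of PageRank in which the ranks sum to 1 and the damping factor is the customary 0.85.

```python
# Minimal PageRank via power iteration on a toy link graph (illustrative only).
# links[i] lists the pages that page i links to; A, B, C are hypothetical pages.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the "votes": each page q linking to p passes on a share of
            # its own rank, split evenly among its outgoing links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

print(pagerank(links))  # C collects votes from both A and B, so it ranks highest
```

Dangling pages with no outgoing links are ignored here for simplicity; a production implementation has to redistribute their rank so the computation still corresponds to a proper Markov chain.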
This scholarly article, written by Larry Page and Sergey Brin, the founders of Google, explains their PageRank method mathematically and serves as the cornerstone of link analysis. It’s interesting to note how ideas from combinatorics, graph theory, probability, and computer science blend together to create one of the most useful and heavily used products of our time. Modeling the internet as a network was an insight that paved the way for revolutionary work in link analysis, and continued research in this area only means more creative destruction in a field that is bursting with growth.
http://infolab.stanford.edu/~backrub/google.html
