The Academic Paper That Started Google
https://www.sciencedirect.com/science/article/pii/S016975529800110X
In 1995, Larry Page met Sergey Brin. At the time, both were Ph.D. students at Stanford University. The two began collaborating on a research project nicknamed “BackRub”, whose goal was to convert web pages’ backlink data into a measure of each page’s importance. Without knowing it at the time, Page and Brin had developed the PageRank algorithm that became the original Google Search algorithm. By early 1998, Google had indexed around 24 million web pages. While the home page still marked the project as “BETA”, an article on Salon.com argued that Google’s algorithm gave far better results than the leading search engines of the time, such as Hotbot, Excite.com, and Yahoo!. In April 1998, Page and Brin published a research paper on the topic titled “The Anatomy of a Large-Scale Hypertextual Web Search Engine”.
In the introduction, Page and Brin note the problems with the popular search engines of the time. For example, while Yahoo! had built a powerful index, search engines that relied on keyword matching returned “too many low-quality matches”. Advertisers could also exploit keyword matching to gain people’s attention by populating web pages with “invisible” keywords. According to Page and Brin, Google is different. Its main goal is to “improve the quality of web search engines” by making use of both link structure and anchor text. However, Page and Brin recognized that creating a search engine that scales to the web of 1998 would be difficult, because fast web-crawling technology is necessary to collect web pages and keep them up to date. In addition, databases must be used efficiently to store and index the pages, as “queries must be handled quickly, at a rate of hundreds to thousands per second”.
Page and Brin coin “PageRank” as an objective measure of a page’s citation importance that corresponds with a person’s subjective idea of importance. It uses the network structure of the Web to calculate a quality ranking for each web page. The founders define it as follows: given that page A has pages T1…Tn that point to it through citations, that C(A) is the number of links going out of page A (so C(Ti) counts the outgoing links of Ti), and that d is a damping factor between 0 and 1 (usually set to 0.85), the PageRank of page A is:
PR(A) = (1 – d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
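To make the formula concrete, here is a minimal Python sketch that applies it repeatedly to a tiny made-up link graph until the values settle. The function name pagerank, the three-page graph, the starting value of 1.0 per page, and the fixed iteration count are all assumptions chosen for illustration; this is not the implementation described in the paper.

```python
def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    `links` maps each page to the list of pages it links to
    (assumed to contain no duplicate links). Hypothetical helper
    for illustration only.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Assumed starting point: every page begins with a rank of 1.0.
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum PR(T) / C(T) over every page T that links to A.
            incoming = sum(
                pr[t] / len(links[t])
                for t in pages
                if a in links.get(t, ())
            )
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph (made up): A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ends up ranked highest because both A and B point to it, while B ranks lowest because it receives only half of A's vote. Since every page here has at least one outgoing link, the ranks also add up to the number of pages, which is a handy sanity check for this form of the formula.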
At the time, the algorithm had been applied to Google’s database of around 24 million web pages at http://google.stanford.edu. The full raw HTML text of all the pages was also available in its repository.
The rest of the paper provides a low-level view of Google’s system architecture and its performance, including its code stack, storage process, data structures, and search-processing algorithm. The full paper can be viewed at http://infolab.stanford.edu/~backrub/google.html. It is around a 15-minute read, and I highly recommend it to everyone with an interest in databases who wants to understand the original algorithm that led to Google becoming one of the biggest tech companies in the world.
The concept of PageRank taught in INFO 2040 represents the calculation that Page and Brin developed at a very high level and in a visually intuitive way. It allows us to understand how components of the web interact with each other, and how PageRank resembles the way people judge importance by deferring to authorities. Google succeeded because Page and Brin understood networks long before other tech companies in 1998. The academic paper that they published reflects that understanding.