The vastness of Google’s page rank algorithm

http://www.ams.org/samplings/feature-column/fcarc-pagerank

The article I read from the American Mathematical Society discusses in depth how Google ranks web pages so that you can best find what you’re searching for. It’s no easy task, as the search engine is essentially finding your very specific needle in a vast haystack. The article begins by laying out the grand problem that Google, and any other search engine, faces: “Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you’d like to find it in a matter of seconds. How would you go about doing it?” There’s a vast mathematical background, but here are the basics of how Google works its magic, as laid out in the article.

Google’s algorithm is centered on the mathematics of linear algebra. First we create a matrix H, called the hyperlink matrix. Each entry is based on the links pointing to a particular page, but it also depends on how many other pages the linking page points to. If we consider each page to have one vote, then when page A links to pages B and C, each of B and C receives ½ of a vote toward its importance. This is done because if a strong web page with a high rank links to another page, we take that other page to be very important as well; but if a highly ranked page links to many different pages, then each of those pages may not be as important. We then collect every page’s rank into a second array, and we need to find an equilibrium point between it and H, because the results of one calculation feed into the next: a page’s rank depends on the ranks of the pages linking to it. This sets up a cyclic pattern that repeats until some sort of equilibrium is found. Roughly 100 repetitions of the cycle should be enough to trust the rankings, since by then the values approach an asymptote. This type of discussion is mostly what we learned in class, but there is even more.
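To make the cycle concrete, here is a minimal sketch (not Google’s actual code) of the idea: build the hyperlink matrix H for a tiny made-up four-page web, give every page an equal starting rank, and repeatedly feed the ranks back through H until they settle. The link structure and the 100-pass count are illustrative assumptions.

```python
# A tiny made-up web: page j maps to the list of pages it links to.
# Page 0 links to pages 1 and 2, so each receives 1/2 of page 0's vote.
links = {
    0: [1, 2],
    1: [2],
    2: [0],
    3: [0, 1, 2],
}
n = 4

# Hyperlink matrix: H[i][j] = 1/(number of links on page j)
# if page j links to page i, and 0 otherwise.
H = [[0.0] * n for _ in range(n)]
for j, outs in links.items():
    for i in outs:
        H[i][j] = 1.0 / len(outs)

# Start with equal rank for every page, then feed the ranks back
# through H about 100 times until the values stop changing.
rank = [1.0 / n] * n
for _ in range(100):
    rank = [sum(H[i][j] * rank[j] for j in range(n)) for i in range(n)]

print([round(r, 3) for r in rank])  # → [0.4, 0.2, 0.4, 0.0]
```

Notice that page 3, which nothing links to, ends up with rank 0: with no incoming votes, its importance drains away as the cycle repeats, which is exactly the behavior the voting scheme intends.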

Another big aspect of Google’s ranking method lies in its interpretation of the H matrix described above. We can think of it in terms of probability: imagine starting at a certain web page, randomly selecting a hyperlink on that page, and repeating the process on each page we come to. If we then keep track of how much time we spend at each page, we can interpret a web page’s PageRank as the fraction of time that a random surfer spends on that page. An issue arises if we eventually land on a web page that does not link to anything else. To fix this problem of the dangling node, we choose a page at random to jump to so that the chain can continue. There are a wide variety of mathematical ways to analyze this, but the overall goal is again to find a proper page ranking.
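The dangling-node fix can be sketched the same way. This is an illustrative assumption about how the patch might look, not Google’s code: a page with no outgoing links would give H an all-zero column and let rank leak out of the system, so we instead pretend the random surfer jumps from it to a page chosen uniformly at random.

```python
n = 3
# Page 2 is a dangling node: it has no outgoing links.
links = {0: [1, 2], 1: [2], 2: []}

H = [[0.0] * n for _ in range(n)]
for j, outs in links.items():
    if outs:
        for i in outs:
            H[i][j] = 1.0 / len(outs)
    else:
        # Dangling-node patch: treat the page as linking to every
        # page (itself included) with equal weight 1/n.
        for i in range(n):
            H[i][j] = 1.0 / n

# Same equilibrium-seeking cycle as before.
rank = [1.0 / n] * n
for _ in range(100):
    rank = [sum(H[i][j] * rank[j] for j in range(n)) for i in range(n)]

print([round(r, 3) for r in rank])  # → [0.182, 0.273, 0.545]
```

With the patch, every column of H sums to 1, so the total rank stays at 1 instead of draining into the dangling page, and the chain still reaches a sensible equilibrium.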

Google is very complicated because of the demands of its users. We ask for all sorts of data without thinking about all that we are actually asking. When we do any search, Google sifts through billions of entries to find what looks best to us, and it’s pretty accurate. It takes a lot of work and constant tweaking, but it creates a vast connected library where it’s easy to find anything and everything.
