


Jon Kleinberg – The HIT(S) Man

Jon Kleinberg is a member of the Cornell faculty whom we should all greatly appreciate for his contribution to our understanding of the internet. Working around the same time that PageRank was being developed at Stanford, he came up with one of the most important network algorithms employed on the web: the HITS algorithm. A paper by Stoll et al. (2006) attempts to explain the complex mission of identifying hubs and authorities in a network using the HITS method. This is no easy task, for one simple reason: the network is continually changing. In essence, if there were just one authority, there would only be a single search engine. The paper written by Chakrabarti et al. (1999) describes a concise algorithm for calculating authority and hub scores for websites using the results of a single search engine, which is an already reduced form of the internet, but it nonetheless employs the same general principles that would be used for a complex network of search engines. It goes something along these lines:

You first search for a specific term of interest using a specific search engine and take the first 200 web pages that the search returns. The next step is to find every web page that links to or from those 200 pages. This expanded collection is termed our root set and is treated as a sample of the internet. One can then begin calculating the hub and authority scores, much in the same way that we discussed extensively in class. The initial hub score for any page is the number of pages in the root set that the page references. The initial authority score for any page is the number of pages in the root set that link to that page. This is the most difficult portion of the task because it requires an extensive amount of data acquisition about the network and the links going to and from each of the “nodes” or sites. Once the initial scores are assigned, we can begin calculating the updated authority and hub scores using the exact same method we used in class: a page's authority score becomes the sum of the hub scores of the pages that link to it, and its hub score becomes the sum of the authority scores of the pages it links to. This update rule is then run until the normalized scores begin to stabilize.
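To make the update rule concrete, here is a minimal sketch in Python on a made-up four-page graph. Everything in it is an assumption for illustration: the function name hits, the page names A through D, starting every score at 1, and normalizing by dividing each score by the total. It skips the hard part (collecting the 200 search results and their links) and only shows the repeated hub and authority updates.

```python
def hits(links, max_iter=100, tol=1e-9):
    """Toy hub/authority iteration.

    links: dict mapping each page to the pages it links to
    (a hypothetical hand-built graph, not real search data).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumption: every score starts at 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(max_iter):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {
            p: sum(new_auth[d] for d in links.get(p, ()))
            for p in pages
        }

        # Normalize so each set of scores sums to 1 (guard against 0).
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        new_auth = {p: v / a_total for p, v in new_auth.items()}
        new_hub = {p: v / h_total for p, v in new_hub.items()}

        # Stop once the normalized scores have stabilized.
        change = sum(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
            for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break

    return hub, auth


# Made-up graph: A links to C and D, B links to C.
toy_links = {"A": ["C", "D"], "B": ["C"], "C": [], "D": []}
hub, auth = hits(toy_links)

for page in sorted(auth):
    print(page, "hub:", round(hub[page], 3), "authority:", round(auth[page], 3))
```

In this toy graph, C ends up with the highest authority score because both A and B point to it, and A ends up as the strongest hub because it points to both C and D. Because every hub score starts at 1, the first authority update simply counts each page's incoming links, which lines up with the initial authority score described above.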

This approach has been influential across the web, and link analysis of this kind is a substantial piece of how search engines rank sites, apart from their use of ad slots and auctions. The HITS algorithm was developed by Professor Jon Kleinberg at Cornell University. It is a separate algorithm from PageRank, which Larry Page and Sergey Brin developed at Stanford around the same time. HITS uses the format of analysis described above, starting from a sample of the first 200 web pages returned for any given search on a given search engine. I have been thinking about trying to work through it on real search results myself, but I do not believe I have the time or the coding knowledge to give you, the reader, such an example. (Yet the internet is full of geniuses who I am sure have done this for search terms ranging from “peanut butter” to “subatomic particles.”)

 

HITS algorithm: http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html

 

Hubs, Authorities, and Networks: Predicting Conflict Using Events Data: https://s3.amazonaws.com/academia.edu.documents/42340413/stollsubramanian.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1540520403&Signature=VoEQ4ftwgH3RSTDnnxFCDjI2Z1g%3D&response-content-disposition=inline%3B%20filename%3DHubs_authorities_and_networks_Predicting.pdf

 
