Beyond PageRank: Harmonic Centrality
First proposed by Italian researchers at the turn of the millennium, harmonic centrality is a measure of closeness that is distance-based, not priority-based, as PageRank is. Essentially, the harmonic centrality of a node is the sum of the number of nodes it has at a certain distance divided by that distance. For example. if a node has 25 neighbors, 50 pages at distance 2, and 75 at distance 3, its harmonic centrality would be 25 + 50/2 + 75/3 = 75. It’s dubious that such a simple measure would even have the potential to outperform PageRank. So why is it interesting to compare it to one of the most powerful measures of connectedness we know?
The first aspect that makes harmonic centrality a good candidate for search engines to use is its relative computational simplicity. Software already exists that approximates harmonic centrality for extremely large graphs, and since updating the graph is as simple as adding or subtracting some easily calculated number from the scores of nodes, it certainly seems that it would scale better than PageRank, which requires iterative updating.
Aysun Akarsu ran a number of importance measures, including harmonic centrality and PageRank, on both toy example graphs and larger, established datasets of web graphs. She found that PageRank tended to rank a very small number of pages highly, with the other pages ranked extremely low. Harmonic centrality, however, had a much more multimodal (containing multiple maxima) distribution. Moreover, the two measures had very little overlap when it came to deciding which websites were important as well as their ordering when it came to less popular pages (not in the top 10,000). In terms of computational complexity, the two are asymptotically the same, but harmonic centrality performs much better on smaller graphs.
Inspired by Akarsu’s work, Dawn Anderson contacted others in the field of information retrieval, including the aforementioned Italian researchers. They noted that harmonic centrality seems to return more accurate/relevant results than PageRank; this may speak more to the inherent vulnerabilities of link-based (as opposed to click-based) measures of importance in an age of prolific junk/spam/nonworking links. While the bow-tie structure of the web still holds, it’s possible that being able to find a node within the web is more important than understanding the structure of it, since most websites are dynamically connected. Whether or not harmonic centrality is the way to go, as the internet evolves, it’s probable that the ways we explore it need to evolve as well.