The Profile of Today’s Web
This article reports the findings of a data analysis of the PageRank of approximately 2 billion links. Common Crawl, a non-profit organization, crawls the Web each month and saves a snapshot of what it finds. It then publishes this snapshot and its associated data for anyone to use.
To conduct this analysis, the article adopted several definitions similar to those from lecture. For instance, an “edge” was defined as a link from one site to another. PageRank was described as depending on the links a site receives rather than the links it gives out. The article analyzed Common Crawl’s data from May, June, and July 2019 and found some surprising results.
One result showed a correlation between a site’s number of incoming links and its PageRank. The article’s graph depicts this correlation, and it is apparent that most sites have a low PageRank. This makes sense because a high PageRank signifies a prominent site, while a low PageRank signifies a less significant one. Because a small subset of sites is very important and a much larger subset is not, these results are consistent with the idea discussed in lecture that a high PageRank is associated with high importance and trust.
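To make the link between incoming links and PageRank concrete, here is a minimal sketch of PageRank computed by power iteration on a toy graph. It is not the article's method or Common Crawl's code: the damping factor of 0.85, the iteration count, and the site names in `toy_edges` are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (source, target) edges.

    The damping factor and iteration count are illustrative assumptions.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with rank spread evenly over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def in_degree(edges):
    """Count incoming links per node (nodes with none get 0)."""
    counts = {}
    for src, dst in edges:
        counts.setdefault(src, 0)
        counts[dst] = counts.get(dst, 0) + 1
    return counts


# Hypothetical toy graph: several small sites link to one popular site.
toy_edges = [
    ("a", "popular"),
    ("b", "popular"),
    ("c", "popular"),
    ("popular", "a"),
    ("d", "a"),
]

ranks = pagerank(toy_edges)
degrees = in_degree(toy_edges)

# Print sites from highest to lowest PageRank next to their in-link counts.
for site in sorted(ranks, key=ranks.get, reverse=True):
    print(site, degrees[site], round(ranks[site], 3))
```

In this toy graph the site with the most incoming links also ends up with the highest PageRank, and the sites with no incoming links share the lowest score. PageRank is not simply a link count, though: a link from a highly ranked site passes along more rank than a link from an obscure one.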
Another result suggested that the numbers of incoming and outgoing links for a given site do not depend on each other: the graph of incoming versus outgoing links shows no clear correlation. Most of the data points are clustered in the lower left, a few are spread along the bottom of the graph, and only one or two sit near the top. The study also said that sites with few incoming links tend to send out more links than sites with many incoming links.
This relates to the definitions of hubs and authorities from lecture: a hub is good if it links to many good authorities, and an authority is good if many good hubs link to it. The result is consistent with the idea that authorities are the nodes holding prominent results, so it is logical that they receive more links than they send. Likewise, hubs are pages that mainly hold links to other sites, so it is logical that they send more links than they receive.
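The hub and authority definitions above can be sketched as the iterative hubs-and-authorities (HITS) update. This is a minimal illustration rather than anything taken from the article: the page names in `toy_links` and the fixed iteration count are hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Iterative hub/authority scoring over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]

        # Hub score: sum of the authority scores of pages linked to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]

        # Normalize so the scores stay bounded across iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in hub.items()}

    return hub, auth


# Hypothetical toy graph: "directory" mostly sends links, "wiki" mostly receives them.
toy_links = [
    ("directory", "news"),
    ("directory", "wiki"),
    ("directory", "shop"),
    ("blog", "news"),
    ("blog", "wiki"),
    ("news", "wiki"),
]

hub_scores, auth_scores = hits(toy_links)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(auth_scores, key=auth_scores.get)
print("best hub:", best_hub)
print("best authority:", best_authority)
```

Here the page that only sends links comes out as the strongest hub, and the page that receives the most links from hubs comes out as the strongest authority, which mirrors the send-versus-receive pattern described above.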
In the end, the article concluded that most sites have a low PageRank and few incoming and outgoing links, and it presented the two results above as its main findings. These findings are important because they suggest that new sites added to the Web will most likely have a low PageRank themselves while directing more attention, through their links, toward sites that already have a high PageRank. The split in PageRank may therefore widen as more sites are added. It will be interesting to compare these results with the same analysis run at a later date.