The World Wide Web Becomes More Connected

In this class, we learned about the link structure of the Web. The Web can be viewed as a directed graph with web pages as the nodes and the hyperlinks between them as the directed edges. Broder et al. looked at a subset of the Web in 1999 and found a “bow-tie” structure, including a giant strongly connected component (SCC) and regions they called IN and OUT. By following links from pages in IN, one can eventually reach pages in the SCC, but no path leads from the SCC back to IN. Similarly, one can reach pages in OUT through links from the SCC, but not vice versa.
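To make the three regions concrete, here is a minimal Python sketch of how they could be identified on a small directed graph. It is not the procedure used in either study; the function names, the toy edge list, and the choice of a seed page that is assumed to lie inside the giant SCC are all made up for illustration.

```python
from collections import defaultdict, deque


def reachable(start, adjacency):
    """Return every node reachable from start by following edges in adjacency."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def bow_tie(edges, seed):
    """Split a directed graph into SCC, IN, and OUT relative to a seed node.

    Assumption: seed is a page that belongs to the giant SCC.
    """
    forward = defaultdict(list)   # page -> pages it links to
    backward = defaultdict(list)  # page -> pages that link to it
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    downstream = reachable(seed, forward)   # pages the seed can reach
    upstream = reachable(seed, backward)    # pages that can reach the seed
    scc = downstream & upstream             # reachable in both directions
    return {"SCC": scc, "IN": upstream - scc, "OUT": downstream - scc}


# Hypothetical toy Web: a, b, c link in a cycle; in1 links into it; out1, out2 hang off it.
toy_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("in1", "a"),
    ("c", "out1"), ("out1", "out2"),
]
regions = bow_tie(toy_edges, seed="a")
print(regions)
```

In the toy graph, the cycle a, b, c plays the role of the SCC, in1 can reach the cycle but cannot be reached from it, and out1 and out2 can be reached from the cycle but cannot get back.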

For their study of the graph structure of the Web, Broder et al. used Web crawls collected by the search engine AltaVista in 1999; those crawls resulted in a graph with 200 million pages and 1.5 billion links. A later study involved a larger graph: Meusel et al. used data from a 2012 Web crawl done by a non-profit called Common Crawl; the graph included 3.5 billion pages and 128.7 billion links. This study also identifies the components SCC, IN, and OUT in the graph.
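The page and link counts above imply an average number of links per page for each crawl. A quick check of that arithmetic in Python, using only the rounded figures quoted here (the function and variable names are just for illustration):

```python
def average_links_per_page(pages, links):
    """Average number of links per page: total links divided by total pages."""
    return links / pages


# (pages, links) as quoted in this post; both are rounded figures.
crawls = {
    "Broder et al. (1999)": (200e6, 1.5e9),
    "Meusel et al. (2012)": (3.5e9, 128.7e9),
}

for name, (pages, links) in crawls.items():
    print(f"{name}: {average_links_per_page(pages, links):.1f} links per page")
# Broder et al. (1999): 7.5 links per page
# Meusel et al. (2012): 36.8 links per page
```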

Both studies look at only a portion of the Web, but comparing the two, it seems that the structure of the Web changed in the years between them. The SCC of the 2012 data contains 51% of the graph’s nodes, compared with 27% in 1999. Also, the newer graph has an average of 36.8 links per page, compared with 7.5 links per page in the older study. Finally, Meusel et al. found that the average distance between pairs of nodes that have a path between them is approximately 12.8, down from 16.1 in the older study. All of these differences suggest that between the two crawls, the Web not only gained more pages but also became more connected.
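The "average distance" statistic only counts ordered pairs of pages where a path actually exists. As a rough illustration of that definition (not of how either study computed it), the Python sketch below runs a breadth-first search from every node of a tiny hypothetical graph. On a graph with billions of pages an exact all-pairs computation like this would be impractical, so treat it only as a way to see what the number means.

```python
from collections import deque


def average_connected_distance(adjacency):
    """Mean shortest-path length over ordered pairs (u, v), u != v, with a path u -> v.

    Assumption: every node appears as a key in adjacency, even if it has no out-links.
    """
    total_distance = 0
    connected_pairs = 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total_distance += sum(distance.values())
        connected_pairs += len(distance) - 1  # exclude the start node itself
    return total_distance / connected_pairs if connected_pairs else float("nan")


# Hypothetical three-page chain: a -> b -> c.
chain = {"a": ["b"], "b": ["c"], "c": []}
# The same pages with one extra link, a -> c.
chain_with_shortcut = {"a": ["b", "c"], "b": ["c"], "c": []}

print(average_connected_distance(chain))
print(average_connected_distance(chain_with_shortcut))
```

Adding the single extra link shortens the path from a to c, so the average over connected pairs goes down, which is the same direction of change reported between the 1999 and 2012 graphs.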

 

Common Crawl. http://commoncrawl.org/

Broder, Andrei, et al. “Graph structure in the web.” Computer networks 33.1 (2000): 309-320.

http://www.sciencedirect.com/science/article/pii/S1389128600000839

Meusel, Robert, et al. “Graph structure in the web—revisited: a trick of the heavy tail.” Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.

http://dl.acm.org/citation.cfm?id=2576928
