Approaches to Network Sampling
The advent of the internet and online social networks has yielded large-scale graphs with millions of nodes. There is huge potential in analyzing these networks for sociologists, epidemiologists, mathematicians, and others. However, with this possibility comes new challenges, including processing the enormous amount of data that comprises these networks. Because the task is so computationally intensive and time-consuming, it is often necessary to sample pieces of the network and draw representative conclusions from those subsets.
Several approaches to sampling have been explored in the network literature. One approach is to randomly chose a subset of nodes, and construct the graph consisting of only those nodes. However, this approach distorts the degree (number of adjacent edges) of the nodes, which may throw off the analysis. Another possible approach is to randomly select edges and construct the graph containing them. However, this results in sparsely connected graphs and favors nodes which have a high degree of edges.
When it comes to social network research, the sampling method is often constrained by the researcher’s access to the data. For this reason, some social network studies employ a crawl approach which starts with a single, given node and searches outward. The authors of Analysis of Topological Characteristics of Huge Online Social Networking Services utilized the “snowball approach,” a type of crawl, which starts at a node and then implements breadth-first-search (BFS). As the authors noted in their paper, the limitation of this is that hubs (dense areas with many connections) become oversampled as the BFS is likely to encounter them within the first few iterations.
There are many nuanced approaches to sampling networks which rely on probability and numerical factors. Ultimately, the ideal sampling method depends on which characteristics of the original graph (diameter, clustering coefficient, community structure) are most important to preserve. The challenge of accurately and efficiently sampling social network graphs for research is a good illustration of a major difference between theoretical and practical mathematics.
Sources:
Analysis of Topological Characteristics of Huge Online Social Networking Services
http://www2007.org/papers/paper676.pdf
Understanding Graph Sampling Algorithms for Social Network Analysis
http://cseweb.ucsd.edu/~longjin/Graph%20Sampling-Simplex11.pdf
Sampling from Large Graphs
http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf
Fast and Accurate Estimation of Shortest Paths in Large Graphs
http://www.francescobonchi.com/paper_7.pdf