Skip to main content



Bigtable: A Consequence of Distributed Computation on the Structure of the Web

Bigtable: a Consequence of Distributed Computation on the Structure of the Web

Assume we have some cluster of communication nodes in an network and we wish to perform large-scale computation in a reasonable amount of time to transform unstructured data into structured information. A MapReduce Job accomplishes this by using the functional programming paradigm of Map and Reduce functions, defined by the user, to perform parallel computation on the rather large data set. (Aside: If you wish to learn more about the MapReduce framework, please read my previous blog on the Google MapReduce Framework). In fact, it may very well be the case that Google uses its own MapReduce Framework to perform computational analysis on the structure of the Web. Performing computation on the structure of the Web has been the key focus of many companies that have developed any sort of search engine application. It is also the case that such companies use distributed storage systems to address the issue of storing such large data that is generated from the distributed computation on a large network.

A common question that might arise is the following: why would a company require a distributed storage system? Why not use a simple relational database to persist the structured results of distributed computation? In class, we discussed the Web and Google’s PageRank algorithm. But all the examples used in class to demonstrate the PageRank algorithm involved relatively small graphs. Such graphs are minuscule in comparison to the actual Web. One can easily use a relational database on a single node to store the the Pageranked results of computation on these small graphs. The problem, however, is that Google needs to handle all, if not, most of the webpages in the entire Web. It would be utterly unrealistic to shove the necessary petabytes  of data generated from performing computation on the structure of the Web down the throat of a single node with a miserable terabyte hard drive.

Fortunately, the ingenious engineers at Google have innovated the fascinating invention of Bigtable. Google Bigtable is essentially a distributed storage system for persisting large datasets. This solution came about when Google closely examined the future of computation at the distributed level. The engineers at Google recognized a concrete example that drove the innovation of BigTable. Google necessitated several copies of the large collection of web pages and their related information when considering their search engine model and Pagerank computation. Thus, they developed Bigtable to satisfy the need of storing the large structured data generated from computing the structure of the Web. They named such a table Webtable. In Webtable, the URLs uniquely define row keys and certain aspects of a webpage name columns. The most interesting aspect of BigTable is its three-dimensional structure that really captures the concept of respecting the structure of data. Relational databases can be typically viewed as being two dimensional. That is, they have rows and columns. And NoSQL databases often have the problem of harming the structure of enormous data sets. Google Bigtable combines the best of both worlds. It possesses the benefits of the NoSQL databases and relational databases. It is able to store an enormous amount of data, while maintaining structure of data guaranteed by relational databases. Google Bigtable has rows, columns, and each cell contains multiple versions of the same data. In other words, imagine tables superimposed upon tables. This effect creates the three-dimensional nature of Google Bigtable. Each version is uniquely identified with a timestamp. Hence, the result of several iterations of a page rank algorithm can be appropriately stored in the Bigtable.

Another fascinating aspect of Bigtable is its distributed nature, hence, the name of the original Google research paper: Bigtable: a Distributed Storage System for Structured Data. A Bigtable operates in a common pool of nodes that usually participates in some kind of distributed computation, such as the aforementioned MapReduce Job. Bigtable uses the communication network’s underlying cluster management system as a buttress for various distributed fault tolerance. This management system is often responsible for the scheduling of jobs, the management of resources on shared machines, the graceful handling of machine failures, and lastly, the monitoring of the overall health of such nodes in the communication network. For example, machine failures can cause the corruption of shared resources stored in Google Bigtable. Corrupt data is ultimately as good as no data. The Bigtable engine must effectively communicate with the underlying cluster management system to aptly distribute copies of Bigtable “tablets” among the nodes in the communication network. The Google Bigtable also utilizes advanced data structures such as bloom filters to speed up read operations and B+-trees for table organization. But such topics are much beyond the scope of this blog post. A curious reader will take the time to digest the brilliant research paper by renown Google engineers Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.

Ultimately, with the existence of Google Bigtable, it is no longer a mystery of how Google actually stores the results of their famous Pagerank computation. Due to its ingenious design, Bigtable can effectively store the results of several iterations of an ongoing Pagerank computation, as well other distributed computation, being performed on the structure of the Web.

Thank you.

– dbn777 (10 / 19 / 2015)

References:

Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

 

Comments

Leave a Reply

Blogging Calendar

October 2015
M T W T F S S
 1234
567891011
12131415161718
19202122232425
262728293031  

Archives