
Murder: Distributed, Large-Scale Code Deployment

Imagine running a popular web service that gets millions of hits per day. To maintain scalability, the operational load is distributed across thousands of servers, hosted in datacenters across the globe. All of these servers, known as nodes, run the same version of your system’s software, and thus need to be kept in sync. Now imagine trying to push a code update to all of these nodes. How does one automate such an operation? Further still, consider having multiple iterations of code and compiled binaries that need to be pushed out on a daily basis. How does one do that efficiently, all while maintaining an operational capacity with no perceivable downtime?

A few years ago, Twitter encountered exactly this situation. Like most companies, it used a Git-based deploy system: a private server sitting somewhere runs Git, a source code management and revision control tool. Developers at Twitter upload code changes to this server and, when ready to distribute them to all of Twitter’s front-end servers — the nodes — instruct those nodes to sync with this central server, herein referred to as the ‘head’.

Unfortunately, once Twitter grew past a few hundred servers, issues started cropping up — specifically, the issues of a centralized system trying to distribute the same content to thousands of clients. As an analogy, consider the head server to be a garden hose and all the client servers to be buckets. Filling a bucket corresponds to the central server uploading a copy of a file to one client. Obviously, the time to fill all the buckets — that is, to sync code updates to the client servers — is linear in the number of nodes. Once that number gets into the thousands, the nodes towards the end of the line may wait quite some time for their files. At this point, Twitter needed to speed things up dramatically, by several orders of magnitude. Stop-gap measures like replicating the Git server only provided constant-factor improvements: with multiple garden hoses, the time to fill all the buckets is still linear in the number of buckets.
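The linear scaling of the hose-and-buckets model is easy to see in a toy calculation. This is a sketch with made-up numbers — the node count, file size, and bandwidth below are illustrative, not Twitter’s actual figures:

```python
# Toy model of centralized deployment: one head server uploads the
# full file to each of N nodes in turn, so total time grows linearly.
def central_deploy_time(n_nodes, file_gb, uplink_gbps, replicas=1):
    """Seconds to push file_gb to n_nodes from `replicas` identical
    head servers, each with uplink_gbps of upload bandwidth."""
    per_node = file_gb * 8 / uplink_gbps   # seconds per transfer
    return per_node * n_nodes / replicas   # mirrors split the load evenly

# One head server vs. four mirrors: still linear in n_nodes.
print(central_deploy_time(2000, 1, 10))     # 1600.0 seconds
print(central_deploy_time(2000, 1, 10, 4))  # 400.0 seconds
```

Adding mirrors divides the total by a constant, but doubling the node count still doubles the deploy time — which is exactly the limitation described above.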

As a result, Twitter abandoned the centralized hierarchy entirely and switched to a distributed one, nicknamed ‘Murder’. After a few days of tinkering, they found that a typical code deployment that used to take 40–60 minutes now completed in under 12 seconds.

Murder — the term for a flock of crows — leverages the power of a popular P2P protocol known as BitTorrent, combined with the unique qualities of a datacenter environment: low-latency access, high bandwidth, and physical locality. It is the equivalent of modifying each bucket in the above analogy to also have its own hose, with one caveat: since we are dealing with digital media, it can be duplicated at no cost. These hoses thus have the special property that they ‘duplicate’ the water in the bucket rather than simply transferring it. Once a bucket receives water, it can immediately begin filling buckets that have yet to receive any. The key change is immediately apparent: distributing water just went from a linear, one-at-a-time operation to a concurrent one.

Equivalently, in the context of P2P, clients wishing to obtain a file are no longer constrained to relying on one central server. Instead, any client can receive chunks of a file from other clients who’ve already received those chunks themselves. As a result, in situations with many clients, or ‘peers’, distribution completes far faster: the number of complete copies can roughly double with each round of transfers, so total distribution time grows logarithmically in the number of peers rather than linearly. Each additional peer actually contributes to speeding up everyone else’s downloads, rather than delaying them. In a way, each peer also functions as a ‘seeder’ — someone who shares pieces of the file they’ve already received, much like the central server shares its complete copy of the file.
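The doubling effect can be sketched with a toy simulation. This is an idealized model — it assumes a single initial seeder, perfect pairing each round, and ignores chunking and bandwidth limits:

```python
# Toy model of P2P distribution: in each round, every node that
# already has the file uploads a copy to one node that doesn't,
# so the number of complete copies roughly doubles per round.
def p2p_rounds(n_nodes):
    """Rounds until all n_nodes hold the file, starting from 1 seeder."""
    have, rounds = 1, 0
    while have < n_nodes:
        have += have   # each holder serves one new peer this round
        rounds += 1
    return rounds

for n in (10, 1000, 100000):
    print(n, p2p_rounds(n))   # 4, 10, 17 rounds -- grows like log2(n)
```

Going from 1,000 nodes to 100,000 adds only seven rounds, whereas the centralized model would take 100 times longer.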

In Twitter’s case, their datacenter setup gave them an optimal environment for P2P. All of their peer servers are clustered together, giving them very high throughput rates to one another (i.e. a wide, high-pressure hose). Once the first few servers obtain chunks of the file, they can immediately begin uploading those chunks to their own peers, and so forth. This creates a clustered, wave-like effect, which is what allows for such a great reduction in deployment times.
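A detail that makes this wave-like pipelining possible is that BitTorrent splits the payload into fixed-size pieces, each individually fingerprinted, so peers can verify and re-share chunks before the whole file has arrived. Here is a minimal sketch of that idea — the piece size and payload are illustrative assumptions, not Murder’s actual settings:

```python
import hashlib

# Sketch of BitTorrent-style piecing: split a payload into fixed-size
# chunks and hash each one, so a peer can verify (and start re-sharing)
# any chunk independently of the rest of the file.
PIECE_SIZE = 512 * 1024  # 512 KiB; an assumed, illustrative value

def make_pieces(data: bytes, piece_size: int = PIECE_SIZE):
    """Return (piece, sha1-hex) pairs for each fixed-size chunk."""
    return [
        (data[i:i + piece_size],
         hashlib.sha1(data[i:i + piece_size]).hexdigest())
        for i in range(0, len(data), piece_size)
    ]

payload = b"x" * (3 * PIECE_SIZE + 100)  # stand-in for a deploy bundle
pieces = make_pieces(payload)
print(len(pieces))   # 4 pieces; the last one is a short tail
```

Because each piece carries its own hash, a receiving server never has to wait for the full file before becoming a useful uploader — which is precisely what sustains the wave.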

As web-based services grow to accommodate increasing usage across the world, many companies are running up against similar issues: they can invest millions in infrastructure fairly easily, but the problem becomes keeping everything in sync. With other tech companies beginning to adopt custom BitTorrent protocols as well, P2P may soon become the standard for far more than pirating movies.

Source: Fast datacenter code deploys using BitTorrent



2 Responses to “Murder: Distributed, Large-Scale Code Deployment”

  • as2447

    This is a very innovative application of P2P. It is funny that you connect P2P to piracy; that has been the most (in)famous use of the protocol. When it started out, though, P2P was a really promising protocol: it would utilize idle resources on computers to give more efficient file transfers, as well as more resilient networks.
    If you think about it, the Internet routing network is probably the biggest peer-to-peer network out there (with some centralization, e.g. root nameservers). Skype, until recently, followed a peer-to-peer model which allowed millions of users to make video calls while still maintaining low costs.

    Coming back to the class discussions, I couldn’t help but think about the analogue in people-to-people networks, another form of P2P. Unlike (non-malicious) computers, humans have a natural way of rephrasing, omitting, or modifying information as they pass it along.
    For instance, think of the TV as a centralized git server, while a social network like Facebook could be a P2P network. Thus, while Twitter finds that information transfer between nodes sped up with P2P, transfer through human P2P networks is actually slow and unreliable.

  • da253

    This was a great article! I had previously heard about this in passing, but I didn’t quite realize how cool and fast it was. The speed of deployment with BitTorrent is simply astonishing. I think you did a particularly good job of drawing an analogy to what’s happening and how it leads to such a speedup.

    This example demonstrates the power of network effects. Sure, the ‘network’ can be used for ‘evil’ (P2P sharing), but it can clearly be used for good (ridiculously fast software deployment). Networks and network effects, like much of scientific inquiry, are neither good nor evil, and I’m glad that P2P technology has gained some positive prominence.

    Besides the positive press for P2P, I’m glad that a different approach was able to yield such great results. It stands to reason that if the technology and scientific communities re-thought traditional approaches, much progress could be made. Specifically, if we re-thought traditional ‘linear’ approaches, perhaps great advances could follow. Besides innovative applications of P2P technology, asynchronous programming has gained prominence over the last several years. I imagine that game-theoretic and network-centric thinking can benefit other problems and fields as well.
