Learn your letters
Consider for a moment how traditional deduplication would work across multiple nodes. To return to our “block as letters” analogy, let’s assume that this blog is what I want to replicate. If Node 1 were trying to replicate data to Node 2 it would send over the hash of every single character in the blog. Node 2 would then reply with a list of the characters it didn’t have and Node 1 would supply them.
While this is a lot better than the “no deduplication at all” approach, simply shipping metadata for every character in the blog is inefficient, resource intensive and time consuming. We can do a lot better than this.
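To make that cost concrete, here is a minimal Python sketch of the flat exchange, treating each character as a block. The names `block_hash` and `flat_replicate`, and the choice of SHA-256, are assumptions made for illustration rather than details of any particular product.

```python
import hashlib


def block_hash(block: bytes) -> str:
    """Content hash used as the block's identity (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def flat_replicate(source_blocks, target_store):
    """Simulate traditional deduplicated replication.

    source_blocks: list of bytes held by Node 1, in order.
    target_store:  dict of hash -> bytes standing in for Node 2's block store.
    Returns (hashes_sent, blocks_sent).
    """
    # Node 1 ships one hash per block, whether or not Node 2 needs the block.
    offered = [block_hash(b) for b in source_blocks]

    # Node 2 replies with the hashes it has never seen.
    missing = {h for h in offered if h not in target_store}

    # Node 1 supplies each missing block exactly once.
    blocks_sent = 0
    for block, h in zip(source_blocks, offered):
        if h in missing:
            target_store[h] = block
            missing.discard(h)
            blocks_sent += 1
    return len(offered), blocks_sent


if __name__ == "__main__":
    node2 = {}
    text = [c.encode() for c in "the cat sat on the mat"]
    print(flat_replicate(text, node2))  # first replication
    print(flat_replicate(text, node2))  # nothing changed, same hash traffic
```

Running the same text through twice shows the weakness: the second pass transfers no blocks, yet Node 1 still sends one hash per character.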
Let's take a totally different approach to replication. Have Node 1 ask Node 2 whether it has a copy of this blog. If Node 2 has the blog, then job’s a good’un and no further communication is required.
Let’s say that Node 2 doesn’t have an exact copy of the finished draft of this blog. The next layer of metadata down from the full blog is at the paragraph level. Node 1 would send over a hash of each paragraph and Node 2 would reply with the paragraphs it didn’t have.
Node 1 would then remove a layer from those paragraphs and send over hashes of the individual sentences contained in them. Even if no sentences were the same, it's pretty much guaranteed that the next layer down – words – will find a match. Node 2 would probably discover that it had a copy of most of the words (there is no need to resend “the”, for example) and so would reply with a list of the words it doesn’t have.
Finally, Node 1 would respond with hashes of the characters in those words and Node 2 would respond with a list of the (presumably very few) characters that it did not have. At this point, Node 1 would send over those characters (which, remember, represent the actual unique data blocks) and Node 2 would have all the information required to assemble a copy of the final draft of this blog.
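The layer-by-layer exchange described above can be sketched in Python as a walk down a tree of hashes. Everything here is hypothetical scaffolding for the analogy: the `Obj` class, the `build_blog` splitter (which simply drops separators) and the `replicate` function are illustrative names, and Node 2 is reduced to a set of hashes it already holds.

```python
import hashlib
from dataclasses import dataclass, field


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Obj:
    key: str                                      # content hash of this object
    children: list = field(default_factory=list)  # empty for a character
    data: bytes = b""                             # payload, characters only


def leaf(ch: str) -> Obj:
    raw = ch.encode()
    return Obj(digest(b"leaf:" + raw), data=raw)


def group(children) -> Obj:
    # A parent's hash is derived from its children's hashes, in order.
    joined = "".join(c.key for c in children).encode()
    return Obj(digest(b"node:" + joined), children=list(children))


def build_blog(text: str) -> Obj:
    """blog -> paragraphs -> sentences -> words -> characters."""
    paragraphs = []
    for p in filter(None, text.split("\n")):
        sentences = []
        for s in filter(None, p.split(". ")):
            words = [group([leaf(c) for c in w]) for w in s.split(" ") if w]
            sentences.append(group(words))
        paragraphs.append(group(sentences))
    return group(paragraphs)


def replicate(root: Obj, remote: set):
    """Walk down one layer at a time. Returns (hashes_offered, chars_sent)."""
    hashes_offered = chars_sent = 0
    level = [root]
    while level:
        # Node 1 offers one hash per distinct object in this layer.
        distinct = {o.key: o for o in level}
        hashes_offered += len(distinct)
        # Node 2 replies with the objects it lacks; only those are opened up.
        next_level = []
        for key, obj in distinct.items():
            if key in remote:
                continue
            remote.add(key)
            if obj.children:
                next_level.extend(obj.children)
            else:
                chars_sent += 1  # ship the actual data block
        level = next_level
    return hashes_offered, chars_sent


if __name__ == "__main__":
    node2 = set()
    draft = "the cat sat. the dog sat\nthe end"
    print(replicate(build_blog(draft), node2))              # first copy
    print(replicate(build_blog(draft), node2))              # identical draft
    print(replicate(build_blog(draft + " again"), node2))   # small revision
```

In this toy model an identical draft costs a single hash, and a small revision only opens up the one paragraph, sentence and word that actually changed.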
At first glance it might seem like there is a lot more metadata being shipped around using the new model than the traditional deduplication model. This can be the case, but in practice it has proven not to be. Here is where analogies don’t work quite so well and looking at a real-world use case helps.
Consider the use case of making a copy of a virtual machine on a regular basis and sending that from New York to Tokyo. The first copy is going to be fairly large. It will contain all of the blocks unique to that VM.
If, for example, this is the first and only VM on either node then the first replication will have to send across the entire VM. One hundred per cent of the data blocks will be transferred and there will ultimately be more metadata used to communicate between the nodes about this transfer than there would have been under traditional deduplication. (Though the metadata exchange would be a fraction of the data proper in this scenario.)
Let’s say that this VM we transferred over was a Windows Server 2012 R2 VM configured with Microsoft SQL. We then create a new Windows Server 2012 R2 VM configured with IIS and set up our regular “snapshot and replicate”. This second VM will not transfer all blocks. Many of the blocks will be the same as the first VM and so for two VMs that were 30GB in size you might see only 3GB transferred for the second VM’s initial replication.
In this case, the nodes will communicate a lot less metadata during the second VM’s initial replication than they would under traditional deduplication, because a lot of the data blocks in the second VM follow the same patterns as those in the first VM.
Going back to our analogy for a second, think of these initial VM replications like having transferred a copy of this blog and then a short whitepaper based on this blog. Many of the paragraphs would be the same and certainly most of the words will be the same.
Where this new storage and replication scheme really comes into its own is what happens after those initial replications of the VMs have taken place.
Revising the analogy a little, let’s consider the initial replications to be first drafts. Every subsequent snapshot is another draft. Over time, the text may change substantially, but each iteration is really not that different from the one before. One paragraph out of 200 might have changed, and even then only a handful of words and some punctuation might ultimately have changed. No new words may ultimately have been created.
Suddenly, this method of considering not only individual blocks, but patterns of blocks, starts to deliver real benefits.
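A rough way to see the saving is to count hashes for two successive snapshots of the same image. The Python sketch below is a deliberately simplified model and not the actual on-disk format: it assumes fixed-size blocks, a fixed fan-out per layer, same-length snapshots and a position-by-position comparison. `BLOCK`, `FANOUT`, `build_levels` and `hashes_needed` are made-up names.

```python
import hashlib

BLOCK = 4    # bytes per block (tiny, purely for illustration)
FANOUT = 4   # children per parent object (assumed)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(image: bytes):
    """Return hash layers for an image: block hashes first, root last."""
    level = [h(image[i:i + BLOCK]) for i in range(0, len(image), BLOCK)]
    levels = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + FANOUT]))
                 for i in range(0, len(level), FANOUT)]
        levels.append(level)
    return levels


def hashes_needed(old: bytes, new: bytes):
    """Compare two same-length snapshots.

    Returns (flat_hashes, top_down_hashes, changed_blocks), where flat_hashes
    is one hash per block and top_down_hashes only descends into parents
    whose hash differs.
    """
    assert len(old) == len(new), "sketch assumes same-length snapshots"
    old_levels, new_levels = build_levels(old), build_levels(new)
    flat = len(new_levels[0])
    sent = 0
    suspects = [0]  # start with the single root object
    for depth in range(len(new_levels) - 1, -1, -1):
        sent += len(suspects)
        changed = [i for i in suspects
                   if old_levels[depth][i] != new_levels[depth][i]]
        if depth == 0:
            return flat, sent, len(changed)
        width = len(new_levels[depth - 1])
        suspects = [c for i in changed
                    for c in range(i * FANOUT, min((i + 1) * FANOUT, width))]


if __name__ == "__main__":
    first_draft = bytes(range(256))
    next_draft = bytearray(first_draft)
    next_draft[100] ^= 0xFF  # one byte edited between snapshots
    print(hashes_needed(first_draft, bytes(next_draft)))
```

With these toy sizes the gap is modest, but the flat count grows with the size of the image while the top-down count grows with the number of layers and the amount of change.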
Consider the above “snapshot and replicate” scenario. Node 1 is regularly taking snapshots of VMs and shipping them to Node 2. Traditional replication would have Node 1 keep a journal of all changes and send over every change since the last replication occurred. This isn't how replication works with the new storage approach.
If Node 1 and Node 2 had never replicated anything before then the exchange would simply start from the top. Node 1 would send over a list of the largest objects it has, Node 2 would say "sorry, I don't have any of those," and this exchange would continue down until Node 1 knew what Node 2 had and would then send over the missing blocks and their associated metadata.
Node 1 doesn’t need a block-for-block list of the data Node 2 is storing. It doesn’t even need a real time list of the high level data structures that Node 2 is storing. Node 1 only needs to care about what is stored on Node 2 when there is data to be exchanged. One benefit of this is that it is perfectly OK to have slightly out-of-date information about what is where elsewhere in the cluster.