Disaster recovery is complicated and usually expensive. It comes in many forms, and many companies mandate a minimum of off-site backups for various reasons, from regulatory compliance to risk aversion.
Disaster recovery planning is considered to be part of standard IT best practices today, but which solutions are appropriate for which use cases?
Full-blown disaster recovery involves beating every single aspect of one's IT plant into shape and being able to spin it all back up somewhere else after the worst has happened. This alone causes anxiety. None of us has IT plant that is bug free. None of us has 100 per cent automated every aspect of our IT.
There is always some system, some application or some configuration that we know to be fragile. We know it could be done better, but we either don't know how or we lack the political capital within the organisation to get resources devoted to the effort.
Everyone has outages – Amazon, Microsoft, Google, even critical banking infrastructure. Though precious few systems administrators will admit it, every one of us goes to bed never completely sure that if all of the IT under our control was turned off it would all come back up again.
So the topic makes us edgy. To make matters worse, the tools that exist today for disaster recovery are… well, they are kind of crap, aren't they?
They are either too fragile, cumbersome to manage, restricted to specific applications and platforms, outrageously expensive or all of the above.
Tools change. They evolve. Even best practices are evolving as new technologies become available and options once restricted to a select few become usable by the many.
The stuff of dreams
In a perfect world, expense would be no obstacle. You would stand up three data centres: two close enough together to count as a "metro cluster" and another far, far away so that a disaster that strikes the primary and secondary sites can't hurt the third.
If money were no object, you would have multiple redundant, independent, dedicated fibre-optic links between all three sites and your workloads would be configured for fault tolerance (or at least high availability).
Workloads would also be regularly snapshotted and backed up to protect against configuration errors and Oopsie McFumblefingers, ninja master of the delete key.
The number of organisations that can afford this is vanishingly small. You would think that where lives depend on IT it would be the norm.
Sadly, even there you are unlikely to see anything quite so grand. Large banks often have these sorts of setups. Some public cloud providers do. Not too many others.
Most companies can afford to run some of their workloads with the kind of redundancy listed above. They accomplish this either by running those workloads in the public cloud, or by owning disaster recovery sites capable of handling some workloads with the above level of redundancy, but not all.
This means we must make some hard decisions about the value of those workloads and the data they contain.
Rate your data
Stepping way back from the realm of large enterprises, banks and so forth, let's look at the sorts of decisions one of my SMB clients has had to take. In principle, they aren't that different from those an enterprise has to make; there are just fewer of them so they are easier to fit into a single article.
To make life easy I divide workloads into four categories. The first is those where not even a moment's data could possibly be lost. The second is workloads where it is perfectly acceptable to lose a day's worth of data.
The third is workloads where refreshing the image every year or so is okay; and the last category is where if the primary site were to burn down there is really no point in having that data at all.
These are all recovery point objectives (RPOs). They are concerned with "how much data can we lose?" The other major element to consider is the recovery time objective (RTO): how quickly does a given workload have to be back online?
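To make the four categories concrete, here is a minimal sketch of how they might be written down. The `Workload` class, the `category` function, the one-minute cut-off and the example entries are all my own assumptions for illustration, not something taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    # Maximum tolerable data loss; None means the data is worthless
    # once the primary site is gone, so no offsite copy is needed.
    rpo: Optional[timedelta]


def category(workload: Workload) -> int:
    """Return 1-4, matching the four categories described above."""
    if workload.rpo is None:
        return 4
    if workload.rpo <= timedelta(minutes=1):  # assumed cut-off for "not a moment's data"
        return 1
    if workload.rpo <= timedelta(days=1):
        return 2
    return 3


# Illustrative entries only; the names and RPO values are assumptions.
workloads = [
    Workload("accounting", timedelta(0)),
    Workload("website", timedelta(days=1)),
    Workload("vdi-desktop", timedelta(days=365)),
    Workload("plant-floor-controller", None),
]

for w in workloads:
    print(f"{w.name}: category {category(w)}")
```

The RTO is deliberately left out here; it would be recorded as a separate field rather than folded into the category.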
Let's work backwards. The SMB in question is a manufacturing plant. There is a series of workloads (and data attached to those workloads) that become absolutely useless if the plant no longer exists to make use of them. Half of the workloads and more than 95 per cent of the data in use at this SMB never need to go offsite.
There are several workloads – for example VDI virtual machines, the clockpunch virtual machine and so on – where nobody really cares if the virtual machines are backed up only once a year.
The only thing that ever changes on them is Windows Updates. The data they guard is critical, but backing up a "my documents" folder full of small files offsite or to the public cloud is child's play. It is also far less expensive than keeping a fully synchronised copy of the virtual machine itself.
In this case, pretty much any means of backing things up will do. Manually pull down a copy once a year, RAR it and toss it into some cloud storage? Works for me.
I use Data Deposit Box for these sorts of tasks but everyone will have their own favourite.
The user documents can be handled by something like OneDrive, and since most SMBs will be using Office 365 by this point, they probably already have access anyway.
There are a handful of virtual machines that need to be backed up every night. Here I would typically put virtual machines that contain websites. If you lose a day's worth of tinkering on the website, nobody cares except the poor fellow that has to redo the work.
Sometimes you are hosting files for clients and those might have to be uploaded again. None of that is a big deal. Real-time synchronisation is not a requirement so making sure we can recover these workloads is fairly easy.
If you have a secondary site of your own, this is where technologies like Hyper-V Replica, Veeam or VMware's Site Recovery Manager become very useful. They have a configurable delay on how old the copies on the secondary will be compared with the primary site, but a good rule of thumb is "about 15 minutes".
If you don't own a secondary site then you need to consider your options more carefully. SMBs with VMware will probably look to VMware's vCloud Air.
Larger organisations with VMware should also give Microsoft's InMage Scout some serious consideration. InMage has a fantastic reputation and my sources tell me Microsoft has been an excellent steward of the technology.
InMage Scout is also worth considering if you are a Microsoft shop with a secondary site, but with complicated orchestration requirements that need backing up.
If you are a Microsoft shop that doesn't have its own secondary site then it is hard to top Microsoft's Azure Site Recovery for ease of use. If you are American, there is no question: Microsoft has pretty much wrecked the competition here.
If you are not American, ask about encryption options (in flight and at rest) or ask Microsoft if anyone has set up a service provider cloud in your jurisdiction that you could replicate to.
There are others out there, of course. Companies such as Unitrends and Barracuda sell backup appliances that can send your backups to the cloud. Veeam is embracing the multi-service provider public cloud model.
Almost every other name, big and small, will have some method of getting data off your site using a private network or the internet.
Two workloads fall into the first category, where the disaster recovery copy needs to be up to the minute: the accounting package and the order-tracking package.
The same data can have different values in different scenarios. From a production standpoint, losing a day's worth of accounting information is not a problem: there are paper copies of everything, so if it all went splork the sales staff could go back to paper and pen for a day and they could assign some poor soul to enter all that info manually when the system came back.
The tax man, however, gets sniffy about even a day's worth of loss. This means that should the building burn down, we need a way to ensure that we still have up-to-the-minute data, and the paper copy thing is not really an option: that will burn down with the building.
The order tracking package is important for similar reasons. It might be possible to muddle through with paper and pen should an outage happen during production. Pitchforks and torches would be outside IT's door, however, even more so than if the accounting package went down.
If the building burns down, the data in the order tracking package is critical; it tells everyone what has been shipped, what has not, and thus which customers need to be notified that their orders won't be arriving.
If there are any guarantees or SLAs, some customers may have claims to make, and having the data to hand eliminates a whole lot of "he said, she said" from the proceedings.
The order tracking package is the more critical of the two. Customers will be asking questions the instant they hear there has been a fire, so making sure the website where they can check their orders is up helps lower tensions.
We are fairly lucky with this one: it is a modern LAMP application. MariaDB can be set up to replicate between two different sites at the database level, and rsync can keep the data files in lock step.
Toss one copy into a public cloud provider like Azure and Robert is your mother's brother. Azure can take snapshots, provide you with multiple sites, and even makes attaching a CDN easy so you can handle the onrush of worried clients.
Because it is a modern LAMP application, you can keep a copy on your site so if someone Code Spaces your Azure account, all is not lost.
The accounting package is another story entirely: it is a legacy Windows application. The cost of running it in the public cloud is prohibitive. A quick check with the cost estimator says I'd be looking at more than $5,500 a year.
Given the above-mentioned Code Spaces issue, I can at least double that cost (workloads running on public cloud services need backups too, no matter what the provider tells you).
That means I could actually buy an HA pair of servers for the accounting package and throw them away each year for the cost of the public cloud instance.
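The arithmetic behind that claim is short enough to write out. This is a rough sketch only: the $5,500 figure and the doubling come from the paragraphs above, while the function names and the hardware price in the usage line are hypothetical.

```python
CLOUD_INSTANCE_PER_YEAR = 5_500  # USD; the estimator said "more than" this
BACKUP_MULTIPLIER = 2            # "at least double" once the cloud copy is backed up too


def yearly_cloud_cost(instance_cost: float, multiplier: float = BACKUP_MULTIPLIER) -> float:
    """Yearly spend for the cloud instance plus a backup of it."""
    return instance_cost * multiplier


def break_even_years(hardware_cost: float, yearly_cloud: float) -> float:
    """How many years of cloud spend it takes to equal the hardware price."""
    return hardware_cost / yearly_cloud


yearly = yearly_cloud_cost(CLOUD_INSTANCE_PER_YEAR)
print(yearly)  # 11000
```

With a hypothetical HA pair priced at $11,000, `break_even_years(11_000, yearly)` comes out at one year, which is the "throw them away each year" scenario.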
In this case, the solution is a combination of running the workload off a five-bay ioSafe and using Azure Site Recovery. If the building burns down, it will be a few days before we can get in there and pull the drives, but that is probably okay; we mostly need to be up to the minute so that we have all the data the tax man wants. He can wait a week.
The copy in Azure Site Recovery will be 15 minutes behind, but it would be more than good enough to start putting information together for the insurance company while we wait for things to cool down enough to fetch the ioSafe from the rubble.
These two workloads are great examples of workloads with real-time RPOs but divergent RTOs. Some backup vendors try to commingle the requirements, usually using the acronym RPTO.
It is worth bearing in mind that these two needn't be the same, and you might save yourself rather a lot of money by taking the time to map RPO separately from RTO for your workloads – especially legacy workloads that might be expensive to run 24/7 in the public cloud.
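One way to keep the two apart is to check each recovery option against RPO and RTO as separate tests. The sketch below does that for the accounting and order-tracking examples. The class and function names are mine, and the specific durations (three days to dig out the ioSafe, an hour to fail over, an hour's tolerance for the order tracker) are assumptions standing in for "a few days" and similar phrases above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Requirement:
    rpo: timedelta  # how much data we can afford to lose
    rto: timedelta  # how long we can afford to be offline


@dataclass(frozen=True)
class Option:
    name: str
    data_lag: timedelta         # how far behind the copy is
    time_to_restore: timedelta  # how long until the copy is usable


def satisfies(option: Option, req: Requirement) -> bool:
    """An option qualifies only if it meets RPO and RTO independently."""
    return option.data_lag <= req.rpo and option.time_to_restore <= req.rto


# Requirements: up-to-the-minute data for both, very different urgency.
accounting = Requirement(rpo=timedelta(minutes=1), rto=timedelta(days=7))
order_tracking = Requirement(rpo=timedelta(minutes=1), rto=timedelta(hours=1))  # RTO assumed

# Options: durations other than the 15-minute lag are assumptions.
iosafe = Option("ioSafe in the rubble", timedelta(0), timedelta(days=3))
site_recovery = Option("Azure Site Recovery", timedelta(minutes=15), timedelta(hours=1))

for req_name, req in [("accounting", accounting), ("order tracking", order_tracking)]:
    for opt in (iosafe, site_recovery):
        print(f"{req_name} via {opt.name}: {satisfies(opt, req)}")
```

Under these assumptions only the ioSafe meets the accounting requirement, and neither option meets the order tracker's, which is why that workload gets database-level replication instead.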
Of course, there is more than workloads to worry about. A business is not just a series of virtual machines. The bigger the business, the more orchestration and automation matter.
Unless it is purely a services or IP-based company, a small business that burns down is not coming back for months. Disaster recovery for SMBs like this is mostly focused on keeping a few critical services running with the goal of being able to retrieve relevant data for insurance and regulatory purposes.
What about larger companies?
If the HQ of a retail chain with 1,000 outlets burns down, it can't simply halt operations across all outlets. The HQ is one location among many; all workloads that were running at HQ have to be lit up elsewhere.
Here not only do you have to worry about the RTO and the RPO of the various workloads, but the concept of orchestration becomes paramount.
When your primary site burns to the ground and you flip over to your secondary site, which workloads need to start up in what order? If your domain controller comes up after your security appliances, for example, bad things might well occur.
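At its simplest, a start-up order is a dependency graph sorted so that nothing boots before the things it relies on. Here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). Only the domain controller and security appliance come from the example above; the other workload names and the dependencies between them are hypothetical.

```python
from graphlib import TopologicalSorter

# Each workload maps to the set of workloads that must be running first.
# Names other than the domain controller and security appliance are made up.
depends_on = {
    "domain-controller": set(),
    "security-appliance": {"domain-controller"},
    "database": {"domain-controller"},
    "order-tracking-web": {"database", "security-appliance"},
}


def startup_order(dependencies: dict) -> list:
    """Return one valid boot order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = startup_order(depends_on)
print(order)
```

A real runbook carries far more than ordering (health checks, timeouts, network reconfiguration), but a circular dependency caught here is one less surprise on the day.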
In the SMB world, applications tend to be more isolated. A single application can be moved into the public cloud because it is largely a self-contained website. The accounting package and the access virtual machines could use a domain controller, but it is not the end of the world if the domain controller starts up late.
Solutions that merely back up the virtual machines are not going to cut it here. You need to be able to transfer your runbooks as well as your workloads.
Whether you are backing up to a second site you own or to Azure via Site Recovery, Microsoft seems to be taking the lead.
Assuming you are either using System Center or its InMage Scout appliance (if your virtualisation stack is VMware), you can do some really neat things with Microsoft's setup.
VMware is not quite there yet with vCloud Air, and Amazon is still playing catch-up. Google… well, it is not even in the running. I expect VMware to be caught up by VMworld next year, and Amazon probably shortly thereafter. Who knows what Microsoft will accomplish in that time?
Regardless of who is on top at any given time, the various players are competing fiercely not only to get you using their public and hybrid cloud offerings, but also to make the whole ordeal of disaster recovery far less frustrating.
Bump up the bandwidth
More traditional methods exist, of course. You could try rotating tapes, hard drives and other such things. I still have a few sites where there is so much data to be moved that the only feasible way is to have a courier show up every day.
You need to be careful about these, of course, as anything that relies on human beings to remember to do something is inherently fallible. Factor in the risk of missing a day or two's worth of backups in a row when you calculate your costs.
Disaster recovery setups don't remove the requirement for various layers of internal backups. Backups exist not just to deal with the building burning down, but also with someone accidentally deleting something they shouldn't have. Those vast enterprise tape libraries will be slow to disappear.
When looking at backups – for disaster recovery or not – bandwidth is a serious consideration. You may want to use something like Azure Site Recovery and be willing to pay the toll but just can't get a big enough pipe from your provider.
If you are serious about hybrid cloud backups, consider a dedicated line. At the SMB level this may mean a second ADSL connection. At the enterprise level it almost certainly means more fibre.
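A back-of-the-envelope calculation shows quickly whether a given line is big enough. This sketch assumes decimal gigabytes and megabits, and an 80 per cent usable-throughput figure that is purely my own guess; the link speeds in the usage lines are hypothetical.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to push data_gb over a link of link_mbps upload speed.

    efficiency is the assumed fraction of the line rate actually achieved.
    """
    bits_to_send = data_gb * 8 * 1000**3
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


# 100GB of changed data over a 1Mbit/s uplink versus a 100Mbit/s one.
print(round(transfer_hours(100, 1), 1))    # 277.8
print(round(transfer_hours(100, 100), 1))  # 2.8
```

If the nightly change set takes longer than a night to upload, the replication schedule is fiction no matter what the product brochure says.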
Look beyond the headline prices. Azure Site Recovery, for example, says it is $54 a month per virtual machine. But you also need to factor in the cost of storage and networking.
The subtext reads: "If you purchase this offering through the Microsoft Enterprise Agreement via a plan SKU, you will receive 100GB of GRS Blob Storage, 1M storage transaction and 100GB of egress, per instance included."
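To see how far the real bill can drift from the headline, here is a rough sketch of the sum. The $54 fee and the 100GB allowances come from the pricing text quoted above; the per-gigabyte overage rates are placeholders I made up, storage transactions are ignored, and the function name is hypothetical. Plug in the figures from your own quote.

```python
PER_VM_FEE = 54.0          # USD per protected virtual machine per month
INCLUDED_STORAGE_GB = 100  # per instance, under the plan quoted above
INCLUDED_EGRESS_GB = 100   # per instance, under the plan quoted above


def monthly_cost(vms: int, storage_gb_per_vm: float, egress_gb_per_vm: float,
                 storage_rate: float, egress_rate: float) -> float:
    """Estimate the monthly bill; the rates are USD per GB beyond the allowance."""
    extra_storage = max(0.0, storage_gb_per_vm - INCLUDED_STORAGE_GB)
    extra_egress = max(0.0, egress_gb_per_vm - INCLUDED_EGRESS_GB)
    per_vm = PER_VM_FEE + extra_storage * storage_rate + extra_egress * egress_rate
    return vms * per_vm


# Ten VMs, 250GB stored and 20GB of egress each; both rates are made-up placeholders.
estimate = monthly_cost(10, 250, 20, storage_rate=0.05, egress_rate=0.09)
print(f"${estimate:.2f}")
```

With those placeholder rates the ten machines come to roughly $615 a month rather than the $540 the headline price suggests.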
Similarly, remember your security and privacy training, and keep asking those pesky questions. The more instances you put into a single public cloud account, the more of your company is gated by a single administrative logon.
Never use the public cloud without a minimum of two-factor authentication. Try to break up your workloads into groups that can stand apart from each other and thus run in different accounts.
Don't run everything in one place; back up your data either from your site to a public cloud provider, or from one public cloud provider to another. Also remember that just because a public cloud provider has a data centre in your jurisdiction does not mean it won't be forced by a foreign government to give up data that resides in the data centre that is local to you.
If you are storing your data in a foreign country or with a service provider which has a significant legal attack surface in a foreign country, make sure you can encrypt everything.
Perhaps most importantly, test your disaster recovery plans. If they are so complicated that you fear testing them, throw them away and make better ones.
If you have been hesitant to try hybrid cloud disaster recovery solutions, now is emphatically the time to revisit that choice.
With Microsoft Azure Site Recovery and VMware's vCloud Air we now have two solid providers. We have reached that very important enterprise moment where there are two serious providers of the product or service and vendor lock-in is no longer a real concern.
DRaaS (disaster recovery as a service) is no longer just some cutesy acronym marketers use. It has earned its place as an everyday tool in our sysadmin toolkits.