We suck at backups. So let's not have a single point of failure any more
Avoid my nightmares and change your threat model
Sysadmin blog The more that I move away from front-line systems administration and towards being a data centre architect, the more I am absolutely convinced that systems administration can cause PTSD.
With the exception of a handful of clients, I am no longer the individual responsible for making the computers go. Despite this, I still wake up at least twice a month in a cold sweat, heart racing, from an all-too-real nightmare about data loss. Recently, these nightmares have gotten worse.
I've lived through a lot of data loss events. I've recovered data from failed RAIDs and had RAIDs I couldn't save. I can explain to you in painful detail what happens when the CacheCade SSDs on an LSI RAID card hit their write limit simultaneously, and I have pre-canned diatribes prepared for people using two-disk USB drives that default to RAID 0.
I have had power outages corrupt file systems. I have watched happily as storage software designed to provide high availability via replication worked flawlessly when one node failed, only to watch in horror as the failed node decided that it was "master" when it came back up and overwrote 1000 VMs with day-old corrupted data. I have seen a dozen ways customers can lose data in the public cloud.
None of it, not one of a career's worth of front-line horror stories, compares to my nightmare scenario. A scenario that, if it hasn't happened yet, I know will happen to someone soon.
Most backup designs are not good enough
My nightmare scenario can be summed up thusly: encryption malware gets hold of administrative privileges and manages to wipe out not only production storage, but the backups too. There are a few ways this can happen, but two scenarios keep jumping to the fore.
The first scenario is where the only backups a company has are held on production storage. Basically this revolves around the use of snapshots and clones as backups. Whether these snapshots/clones occur at the level of the storage device or through the virtualization software doesn't matter: if your only backups are on the same device as your production data, you're probably looking at a lot of risk.
Now, don't get me wrong here, snapshots and clones as backups are a good thing. They're quick, they take up next to no space and any vendor worth their salt can peer into the snapshots to do things like single-file-restore. The end result is more frequent backups with quicker restore times. This is good and useful. It just isn't enough.
Imagine, for example, that the encryption malware gets hold of the admin creds to your storage array or infects your hypervisor. In every storage solution I know of, this would let the malware encrypt both production storage and backups at the same time. No matter how badass or superior at operational security the sysadmins think they are, we're all human and it only takes one mistake for this to become possible.
The answer to this, of course, is to have some form of backups running from a separate unit. It would periodically read all data from the production units and create separate backups on a separate device, preferably one where the administrative password to the primary storage/hypervisor doesn't work. One compromised account can't wipe out production and backup storage.
Or can it?
What if the malware somehow infects the backup device? In a lot of cases the backup software has administrative rights to storage units and/or hypervisors in order to extract the data. The backup software also usually has full rights to the storage where it is writing its backups. In many shops, if you can infect the backup server you can ruin everything.
Role based administration and WORMs
It doesn't have to be this way. Your backup software doesn't need to have write privileges to your production storage or hypervisors. Nor does it need delete or overwrite permission on the destination media.
In theory, a good bit of backup software would read from the source and write to the destination and never need to overwrite, modify or delete anything. I am sure there are vendors who would disagree but, quite frankly, I don't care what they think. What matters is that I can sleep at night knowing that there is no single server or administrative account that, once compromised, can make all my data go away.
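To make the read-from-source, write-to-destination idea concrete, here is a minimal Python sketch. The function name `backup_once` and the `run_id` directory layout are hypothetical, invented for illustration; they don't come from any particular backup product. The sketch opens source files read-only and creates destination files in exclusive mode, so it never overwrites, modifies or deletes anything. Bear in mind that this only shows the behaviour of well-behaved software: the real protection has to come from the destination storage refusing overwrite and delete for the backup account.

```python
import shutil
from pathlib import Path


def backup_once(source: Path, dest_root: Path, run_id: str) -> list[Path]:
    """Copy every file under `source` into a fresh dest_root/run_id directory.

    Assumes dest_root lives outside `source`. Nothing that already exists
    at the destination is ever overwritten, modified or deleted.
    """
    run_dir = dest_root / run_id
    # exist_ok=False: refuse to reuse the directory of an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)

    written = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        target = run_dir / src.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        # "rb" never touches the source; "xb" is exclusive create and
        # raises FileExistsError instead of overwriting an existing file.
        with src.open("rb") as fin, target.open("xb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```

Calling `backup_once(Path("/data"), Path("/mnt/backups"), "2024-01-01")` twice with the same `run_id` fails on the second call rather than replacing the first run's files. The paths here are placeholders.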
Role-based administration is a thing. Good software uses it. Knowledgeable administrators use it. Write Once Read Many (WORM) setups can be created by combining role-based administration with filesystem permissions, or even by investing in WORM media.
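One way to reason about this is to write the role assignments down as a matrix and check that no single role can destroy every copy of the data. The following is a rough sketch only: the role names (`prod_admin`, `backup_service`, `retention_admin`), the store names and the `GRANTS` table are all hypothetical, and a real environment would pull this information from its storage, hypervisor and backup tooling.

```python
from enum import Flag, auto


class Perm(Flag):
    READ = auto()
    CREATE = auto()
    OVERWRITE = auto()
    DELETE = auto()


# Permissions that let an account make existing data go away.
DESTRUCTIVE = Perm.OVERWRITE | Perm.DELETE
NONE = Perm(0)

# Hypothetical role matrix: (role, data store) -> granted permissions.
GRANTS = {
    ("prod_admin", "production"): Perm.READ | Perm.CREATE | Perm.OVERWRITE | Perm.DELETE,
    ("prod_admin", "backup_store"): NONE,
    ("backup_service", "production"): Perm.READ,
    ("backup_service", "backup_store"): Perm.READ | Perm.CREATE,
    ("retention_admin", "production"): NONE,
    ("retention_admin", "backup_store"): Perm.READ | Perm.DELETE,
}


def destroyable_by(role, grants):
    """Return the stores on which `role` can overwrite or delete data."""
    return {
        store
        for (r, store), perms in grants.items()
        if r == role and perms & DESTRUCTIVE
    }


def single_points_of_failure(grants):
    """Return roles that, once compromised, can destroy every store."""
    all_stores = {store for (_, store) in grants}
    roles = {role for (role, _) in grants}
    return sorted(r for r in roles if destroyable_by(r, grants) == all_stores)
```

With the matrix above, `single_points_of_failure(GRANTS)` returns an empty list. Give the backup service overwrite rights on production and delete rights on the backup store, as many default installs effectively do, and it shows up in the result.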
Data protection is a multi-dimensional environment. We need to worry about hardware failure, the loss of an entire site, Oopsy McFumblefingers deleting the wrong thing, gross administrative errors deleting everything and, increasingly, about malware that can and will encrypt everything it can.
We need to worry about time to perform, recover and transmit backups. We need to worry about keeping multiple versions of backups and we need to worry about malicious encryption of data, whether it hits the backups directly or goes unnoticed for so long that the encrypted data is in all the versioned backups as well.
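The "went unnoticed for so long" problem is about finding the newest backup version that predates the encryption. As a rough illustration (a heuristic I am assuming here, not a technique named above), encrypted data tends to look like random bytes, so its Shannon entropy sits close to 8 bits per byte. The function names and the 7.5 threshold below are hypothetical, and the check has an obvious weakness: compressed or already-encrypted legitimate files score just as high, so treat this as a tripwire at best.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(data).values())


def newest_clean_version(versions, threshold=7.5):
    """Return the label of the newest version whose entropy is below `threshold`.

    `versions` is a list of (label, content) pairs ordered oldest to newest.
    Returns None if every version looks encrypted.
    """
    for label, content in reversed(versions):
        if shannon_entropy(content) < threshold:
            return label
    return None
```

Run against the successive versions of a single file, `newest_clean_version` walks backwards from the most recent copy and stops at the first one that still looks like ordinary data, which is the candidate to restore from.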
In short, we need to design all aspects of our networks – including our backups – with the idea in mind that our networks will eventually be compromised. Eggshell security is simply not viable. It's all necessary for a good night's sleep. ®