When planning for disaster recovery, our natural inclination is to focus on the technical design. We work to strike the perfect balance between controlling infrastructure spend and the required capacity.
Technical considerations are of course paramount – replication schedules based on delta changes and available bandwidth, the impact of synchronous versus asynchronous writes, calculations of recovery time and recovery point objectives – all to ensure that the required data and systems are available at the secondary site.
This is, after all, the primary purpose of the disaster recovery solution, and nobody would argue that getting the technical implementation right isn’t essential.
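As a rough illustration of the replication arithmetic mentioned above, here is a minimal sketch in Python. It assumes interval-based asynchronous replication in which a partially transferred batch is unusable. The function names, the 80% link efficiency and the sample figures are all hypothetical and are not taken from any particular product.

```python
def replication_seconds(delta_gb: float, bandwidth_mbps: float,
                        efficiency: float = 0.8) -> float:
    """Seconds needed to ship one batch of changed data to the secondary site.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable once protocol overhead and contention are accounted for.
    """
    delta_megabits = delta_gb * 8 * 1000  # decimal gigabytes -> megabits
    return delta_megabits / (bandwidth_mbps * efficiency)


def worst_case_data_loss_seconds(interval_s: float, transfer_s: float) -> float:
    """Worst-case age of the newest data safely held at the secondary site.

    A change made just after one batch starts is only protected once the
    NEXT batch has finished transferring: one interval plus one transfer.
    """
    return interval_s + transfer_s


def meets_rpo(interval_s: float, transfer_s: float, rpo_s: float) -> bool:
    """True if the schedule keeps up with itself and stays inside the RPO."""
    keeps_up = transfer_s <= interval_s  # otherwise batches pile up
    return keeps_up and worst_case_data_loss_seconds(interval_s, transfer_s) <= rpo_s


if __name__ == "__main__":
    # Hypothetical figures: 50 GB of changes per batch, 1 Gbps link,
    # replication every 15 minutes, 30 minute recovery point objective.
    transfer = replication_seconds(delta_gb=50, bandwidth_mbps=1000)
    print(f"transfer time per batch: {transfer:.0f}s")
    print("meets RPO:", meets_rpo(interval_s=900, transfer_s=transfer, rpo_s=1800))
```

The useful part is less the numbers than the check that a batch finishes before the next one is due. A schedule that cannot keep up with itself will quietly drift further outside the recovery point objective.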
It’s easy to get caught up in these finer technical details, though, and overlook some fundamental pitfalls that could turn your recovery into a bigger disaster than any problem your system was supposed to cope with.
Let’s examine some of these scenarios and consider how you can mitigate them.
This is a morbid scenario to open with, but the possibility is very real. More often than not, you’ll be forced to switch to a disaster recovery site because of an incident that lasts long enough to warrant the disruption of switching sites, but that leaves the primary site ultimately recoverable once it is over.
There is always the possibility that the disaster you’re recovering from could be a real killer, though. Your primary site could be subject to flood, fire or natural disaster. Nobody wants to admit it, but the absolute worst-case scenario is that your primary site is reduced to a smoking hole in the ground.
What happens if your sysadmins were in the building at the time of the disaster? What happens if they were hauling ass to the data centre together to put out the fire, and were in a car wreck en route? What happens if they were on the plane that reduced your data centre to smoking wreckage?
Sure, you might say this is ridiculous, but when you’re planning for a disaster you should always expect the worst. Back in the real world, your team could be out sick or away on a vacation or conference, and ultimately non-contactable.
If your sysadmin team is large then having two or three of them incapacitated might not be the end of the world, but there’s always the chance you might find yourself with all of your sysadmins unavailable to implement the disaster recovery plan. So what happens then?
It helps to have nominated seconds within your organisation. You might have some of the technical skills required and not even realise it – perhaps a DBA or developer with the requisite abilities, or a senior manager for whom it used to be their bread and butter – and having your ‘deputies’ arranged in advance can save vital time and confusion in the event of an incident.
If you don’t have the skills in the team, then being able to bring a consultant or contractor in at short notice could be a lifesaver in a crunch. A former member of the team could step in to save the day – if they left on good terms – or a sysadmin borrowed from a supplier or an agency might make all the difference.
Whether the knight in shining armour is internal or external, though, they certainly won’t have access at the beginning of the incident, and given that your sysadmins would ordinarily take care of that, you need to have a fall-back plan there too. A “break glass” admin account, with its credentials locked in a safe or kept secret by a member of senior management (to ensure it isn’t tampered with), could give your hero quick access into the system.
You’ll need to make sure someone outside of the sysadmin team is able to control the access lists to the disaster recovery site, too – nothing would be worse than having a ready-to-go failover system and being kept away from it, stuck on the wrong side of the security fence.
It goes without saying, but the best disaster recovery system in the world could be rendered somewhat pointless without proper documentation to support it, particularly if you’re relying on one of the aforementioned deputised sysadmins to save the day.
It may sound like a stereotype, but IT folk are notorious for not documenting their processes well. Whether it’s down to innocent absent-mindedness or a cynical desire to protect their own position through knowledge-siloing, there will be very few among us who could honestly say they literally couldn’t document any better.
When disaster strikes and you need to shift your entire service wholesale from one location to another, the quality of your documentation is crucial, and could reduce hours (or even days) of downtime to a matter of minutes.
You need to document every step of switching from your primary to secondary site, in the most excruciating detail. It sounds obvious (and tedious) but this cannot be overstated.
You need to consider: what steps do you take to access the disaster recovery site? How can you check that all data and services have been replicated before you allow customers access? What method do you use to bring databases back online at the secondary site?
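One way to answer the "has everything replicated?" question is to write the check down as a short script rather than leaving it to judgement on the day. The following is a minimal sketch only. `ReplicaStatus`, `cutover_blockers` and the sample systems are hypothetical, and how you actually obtain the last-replicated timestamp and the health result depends entirely on your own replication and monitoring tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of one system at the secondary site."""
    name: str
    last_replicated_at: datetime  # when the replica last caught up
    service_healthy: bool         # result of your own health check


def cutover_blockers(replicas, now, max_lag):
    """Return human-readable reasons NOT to let customers in yet."""
    blockers = []
    for replica in replicas:
        lag = now - replica.last_replicated_at
        if lag > max_lag:
            blockers.append(
                f"{replica.name}: replica is {int(lag.total_seconds())}s behind "
                f"(limit {int(max_lag.total_seconds())}s)"
            )
        if not replica.service_healthy:
            blockers.append(f"{replica.name}: service health check failed")
    return blockers


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed for the example
    replicas = [
        ReplicaStatus("orders-db", now - timedelta(minutes=2), True),
        ReplicaStatus("file-store", now - timedelta(hours=3), True),
        ReplicaStatus("web-frontend", now - timedelta(minutes=1), False),
    ]
    problems = cutover_blockers(replicas, now, max_lag=timedelta(minutes=15))
    for problem in problems:
        print("BLOCKED:", problem)
    if not problems:
        print("All checks passed: safe to open the secondary site to customers.")
```

An empty list means the documented conditions for opening the site have been met. Anything else gives a stand-in operator a concrete reason to stop and escalate.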
Don’t take it for granted that the person following the instructions will be you, or even that they will have the full technical skill-set that you do. You might be fully aware, for example, that you need to drop and recreate the security principals on the database (since the database has been restored on a new server), but your makeshift DBAs might not know that in the midst of a full-blown emergency, and could find themselves stumped.
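One way to keep that level of detail honest is to record each step with its reason and a way of verifying it, so that someone without your background knows why they are doing it and how to tell it worked. The sketch below is only an illustration of that idea: `RunbookStep` and `render_runbook` are hypothetical names, and the step text is a placeholder rather than a real procedure.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    """One failover step, written for someone who is not the usual sysadmin."""
    action: str  # exactly what to do
    reason: str  # why it is needed, for readers without the background
    verify: str  # how to tell that it worked before moving on


def render_runbook(steps):
    """Render the steps as numbered plain text suitable for printing."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"Step {number}: {step.action}")
        lines.append(f"  Why: {step.reason}")
        lines.append(f"  Check: {step.verify}")
    return "\n".join(lines)


if __name__ == "__main__":
    steps = [
        RunbookStep(
            action="Drop and recreate the security principals on the restored database.",
            reason="The database has been restored on a new server.",
            verify="(placeholder) An application account can log in and run a test query.",
        ),
    ]
    print(render_runbook(steps))
```

Printing the result and keeping a paper copy alongside the break-glass credentials means the instructions remain available even if the systems that normally host your documentation are part of the outage.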
When stressing the importance of this to colleagues and clients, I find myself re-telling the same cautionary tale. On one occasion, a customer who hosted their hardware and systems on their own site chose to initiate an impromptu disaster recovery test: without warning to my technical team or their own business, one of the directors walked into the server room and pulled the power supply from every single production device.
Unfortunately, they were running a complex set of Oracle RAC clusters, and our Oracle DBA happened to be on holiday for a fortnight. We were up the proverbial river of excrement without any means of propulsion, and worse, we were worried that even bringing the finicky Oracle servers back online in situ could have resulted in corruption and data loss following the somewhat abrupt ‘disaster’ they were subjected to.
Luckily, our Oracle specialist had left us a folder tantalisingly named ‘In case of emergency’, which nobody had had any reason to open before that day. When we opened it, we found in-depth instructions (with screenshots and full commands laid out in order) covering not just how to bring the Oracle cluster online at the secondary site, but how to fail it back to the primary site again when we were done.
I’m not ashamed to say I jumped in the air with joy on finding those instructions, and I hugged the (rather surprised) DBA to within an inch of his life when he returned from holiday.