Deploying disaster-proof apps may be easier than you think
When the next cloud datacenter fails, will your software go with it?
Interview In the wake of Google and Oracle's UK datacenter meltdowns earlier this week, many users undoubtedly discovered that deploying their apps in the cloud doesn't automatically make them immune to failure.
But according to an upcoming report from the Uptime Institute, building applications that can survive these kinds of cloud outages – heat induced or otherwise – doesn't have to be difficult or even all that expensive.
Whether an application remains operational or not in the event of a cloud outage depends entirely on how it's been deployed, Uptime Institute analyst Owen Rogers told The Register in an exclusive interview.
This shouldn't come as a surprise to most, yet Rogers says half of enterprises surveyed were under the impression application resiliency was the cloud provider's responsibility.
"There's still a lack of clarity about who takes ownership of the resiliency issue when it comes to cloud," he explained, adding that while the public cloud providers offer many tools for building resilient applications, the onus is on the user to implement them.
An analysis of these tools showed that achieving high degrees of resiliency was a relatively straightforward prospect – especially when the cost of lost business and cloud SLAs were taken into consideration.
For the report, Rogers investigated seven scenarios for deploying stateless applications in virtual machines in the cloud, each with varying degrees of fault tolerance.
On its own, a VM running in the cloud provides zero protection in the event of a service, zone, or regional outage.
For those that aren't familiar, cloud regions provide services to large geographic areas and are typically made up of two or more independent datacenters, referred to as availability zones.
By deploying a load balancer and a second VM in an active-active configuration, where traffic is routed across both instances, Rogers claimed customers could achieve basic protection against application faults at a cost just 43 percent higher than a single VM.
And for those that can tolerate a 15-minute downtime, an active-failover approach – where the second VM is spun up upon the first's failure – cost just 14 percent more than the baseline.
However, this approach only provides protection in the event of an application fault, and won't do the user any good if the entire zone experiences an outage.
The good news, according to Rogers, is it doesn't cost any more to employ either approach across multiple availability zones within a cloud region, but offers substantially better resiliency – in the neighborhood of 99.99 percent.
"The cloud providers have made it really easy to build across availability zones," he said. "You would have to almost find a reason not to use availability zones considering they're so easily configured, and they provide a good level of resiliency."
In this style of deployment, the application would survive anything short of a complete regional failure. Overcoming that rare occurrence is a bit trickier, Rogers explained.
For one, traffic between cloud regions is often subject to ingress and egress charges. In addition, load balancers alone aren't enough to route the traffic, and an outside service – in the case of Uptime's testing, a DNS server – was required.
"The load balancer can easily distribute traffic between two active virtual machines in the same region," Rogers explained. "When you want to balance it across different regions, that's when you have to use the DNS server."
The investigation explored several approaches to multi-region deployment. The first involved mirroring a zone-based, active-active deployment across two regions, using DNS to distribute traffic between them. The approach offered the greatest resiliency – six nines of availability – but it did so at the highest cost: roughly 111 percent greater than a standalone VM.
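To put the availability figures in perspective, the short sketch below converts an availability percentage into the downtime it permits per year. It is plain arithmetic rather than anything taken from the Uptime report; the function name and the 365.25-day year are assumptions made for illustration.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the downtime it allows per year. The 365.25-day year is an assumption.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at this availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, percent in [("four nines", 99.99), ("six nines", 99.9999)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label} ({percent}%): {minutes:.2f} minutes/year "
          f"({minutes * 60:.0f} seconds)")
```

On those assumptions, four nines allows a little under an hour of downtime a year, while six nines allows roughly half a minute.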
Rogers also looked at what he called a "warm standby" approach, which used a zone-based, active-active configuration in the primary region and a standalone VM in the failover region. The deployment offered availability and resilience similar to the mirrored regional deployment, at a cost 81 percent higher than the baseline.
Finally, for those that want to hedge their bets against a regional failure, but are willing to contend with some downtime, a regional active-failover approach could be employed. In this scenario, if the primary region failed, the DNS server would reroute traffic to the failover region and trigger the backup VM to spin up. The deployment was also the least expensive multi-region approach explored in the report.
However, the report cautions that if the application was under pressure at the time of the outage, a single VM in the failover region may not be sufficient to cope with the traffic.
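The regional active-failover logic described above can be sketched roughly as follows. This is a simplified model, not the setup used in Uptime's testing: the `Region` class, the `route` function, and the `start_vm` callback are all hypothetical names, and a real deployment would rely on a managed DNS service's health checks rather than hand-written code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Region:
    """Hypothetical model of one cloud region in an active-failover pair."""
    name: str
    address: str              # address DNS hands out for this region
    healthy: bool = True      # result of the latest health check
    vm_running: bool = False  # whether the application VM is up


def route(primary: Region, failover: Region,
          start_vm: Callable[[Region], None]) -> str:
    """Return the address DNS should answer with for the next lookup."""
    if primary.healthy:
        return primary.address
    # Primary region is down: spin up the backup VM once, then reroute.
    if not failover.vm_running:
        start_vm(failover)  # hypothetical hook into the provider's tooling
        failover.vm_running = True
    return failover.address


started = []  # records which regions had a VM started, for illustration
primary = Region("primary", "203.0.113.10", vm_running=True)
backup = Region("failover", "198.51.100.20")

print(route(primary, backup, lambda r: started.append(r.name)))
primary.healthy = False  # simulate a regional outage
print(route(primary, backup, lambda r: started.append(r.name)))
```

The sketch leaves out the delay while the backup VM boots, which is the downtime this approach accepts, and it does nothing about the capacity concern raised in the report.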
"The load balancer always provides a day-to-day level of resiliency, but the DNS level of resiliency is far more of an emergency type," Rogers said.
Because of this, he argues multi-zone resiliency is likely the sweet spot for most users, while a multi-region approach should be carefully considered to determine whether the benefits outweigh the added complexity.
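For a rough side-by-side view, the premiums quoted above can be applied to a baseline bill. In the sketch below the percentages come from Rogers's figures as reported here, while the baseline of 100 per month is a made-up placeholder. The regional active-failover option is left out because no figure for it is given.

```python
# Hypothetical monthly cost of a single VM; swap in a real figure.
BASELINE_MONTHLY_COST = 100.0

# Cost premium over a single VM, in percent, as quoted in this article.
# Multi-zone options carry the same premium as their single-zone versions.
PREMIUM_PERCENT = {
    "single VM": 0,
    "active-failover, single zone": 14,
    "active-active, single zone": 43,
    "active-failover, multi-zone": 14,
    "active-active, multi-zone": 43,
    "warm standby, multi-region": 81,
    "mirrored active-active, multi-region": 111,
}


def monthly_cost(baseline: float, premium_percent: float) -> float:
    """Apply a percentage premium to the baseline cost."""
    return baseline * (100 + premium_percent) / 100


for scenario, premium in PREMIUM_PERCENT.items():
    cost = monthly_cost(BASELINE_MONTHLY_COST, premium)
    print(f"{scenario:<40} +{premium:>3}%  {cost:8.2f}")
```

Because these workloads were stateless, the figures exclude any data replication charges.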
SLAs aren't an insurance policy
What customers should not do is expect SLAs to make up for downtime resulting from a lack of application resiliency.
"The compensation you get is not necessarily proportional to the amount you spent," Rogers explained. "You'd obviously assume the more you spend on the cloud, the greater your compensation will be if something goes wrong – but that's not always the case."
Different cloud services have different SLAs, he said, adding that customers may be surprised to find that in the event of a failure they're only compensated for the service that actually went down, not the application as a whole.
"SLA compensation is poor and is highly unlikely to cover the business impacts that result from downtime," Rogers wrote in the report. "When a failure occurs, the user is responsible for measuring downtime and requesting compensation – it is not provided automatically."
That's not to say SLAs are completely worthless. But customers should think of them as an incentive for cloud providers to maintain reliable services, not as an insurance policy.
"It's almost like they're [SLAs] used as a mechanism to punish the cloud provider rather than to refund you for the damage."
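Since the report notes that users have to measure downtime themselves, a minimal sketch of that bookkeeping might look like the following. It assumes a hypothetical monitor that probes the application at a fixed interval and counts each failed probe as one interval of downtime. Providers define downtime in their own SLA terms, so this counting rule is an assumption and would not necessarily match what a claim requires.

```python
from datetime import timedelta


def measured_downtime(probe_results: list[bool],
                      interval: timedelta) -> timedelta:
    """Count each failed probe as one full interval of downtime."""
    failed = sum(1 for ok in probe_results if not ok)
    return failed * interval


def measured_availability(probe_results: list[bool]) -> float:
    """Return the share of successful probes as a percentage."""
    return 100 * sum(probe_results) / len(probe_results)


# Hypothetical log: ten one-minute probes, three of which failed.
probes = [True, True, False, False, False, True, True, True, True, True]
print(measured_downtime(probes, timedelta(minutes=1)))
print(f"{measured_availability(probes):.1f}%")
```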
More to be done
Rogers's analysis of cloud resilience is far from over. The report is the first in a series that aims to address the challenges associated with highly available cloud workloads from multiple angles.
"We still haven't scratched the surface of how all these other cloud services are actually going to have reliable and assured levels of resilience," he said.
For example, the workloads in this report were stateless – meaning that data didn't need to be synced between the zones or regions, which would have changed the pricing structure once ingress and egress charges were taken into account.
"If there is data transferring from one region to another – for example, because a database is replicating to a different region – that can have a massive cost implication," Rogers explained.
However, as it pertains to VMs, he said, "once you've done the bare minimum, the cost incremental to add more resiliency is fairly slim." ®