Analysis With Amazon now recovered from a four-hour outage that brought a large portion of the internet to a grinding halt, analysts are looking back to see what lessons companies can learn from the ordeal.
The system breakdown – or as AWS put it, "increased error rates" – knocked out a single region of the AWS S3 storage service on Tuesday. That in turn brought down AWS's hosted services in the region, preventing EC2 instances from launching, Elastic Beanstalk from working, and so on. In the process, organizations from Docker and Slack to Nest, Adobe and Salesforce.com had some or all of their services knocked offline for the duration.
According to analytics firm Cyence, S&P 500 companies alone lost about $150m (£122m) from the downtime, while financial services companies in the US dropped an estimated $160m (£130m).
The epicenter of the outage was one region on the east coast of America: the US-East-1 facility in Virginia. Because it is cheaper than other regions and familiar to application programmers, that one location is an immensely popular destination for companies that use AWS for their cloud storage and virtual machine instances.
Because so many developers centralize their code there, it took out a chunk of the web when it fell over. Startups and larger orgs find it cheaper and easier to use US-East-1 than any of the other regions AWS provides. It's Amazon's oldest location, and the one developers are most familiar with.
Coders are, ideally, supposed to spread their software over multiple regions so any failures can be absorbed and recovered from. This is, to be blunt, too difficult to implement for some developers; it introduces extra complexity which means extra bugs, which makes engineers wary; and it pushes up costs.
For instance, for the first 50TB, S3 storage in US-East-1 costs $0.023 per GB per month compared to $0.026 for US-West-1 in California. Transferring information between apps distributed across multiple data centers also costs money: AWS charges $0.010 per GB to copy data from US-East-1 to US-East-2 in Ohio, and $0.020 to any other region.
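To put those prices in perspective, here is a rough back-of-the-envelope sketch in Python. The per-GB rates are the ones quoted above; the function names, the 1TB = 1,024GB conversion, and the 1,000GB transfer volume are our own assumptions for illustration, not anything AWS publishes in this form.

```python
# Rates quoted in the article (US dollars). Everything else is an assumption.
S3_PRICE_PER_GB_MONTH = {"us-east-1": 0.023, "us-west-1": 0.026}
TRANSFER_PRICE_PER_GB = {"us-east-2": 0.010}  # copying out of us-east-1
TRANSFER_PRICE_OTHER_REGION = 0.020           # any other destination region

GB_PER_TB = 1024  # assumption: binary terabytes


def monthly_storage_cost(region, terabytes):
    """Monthly S3 storage bill for the given volume, at the first-50TB rate."""
    return S3_PRICE_PER_GB_MONTH[region] * terabytes * GB_PER_TB


def transfer_cost_from_us_east_1(destination, gigabytes):
    """One-off cost of copying data from us-east-1 to another region."""
    rate = TRANSFER_PRICE_PER_GB.get(destination, TRANSFER_PRICE_OTHER_REGION)
    return rate * gigabytes


east = monthly_storage_cost("us-east-1", 50)
west = monthly_storage_cost("us-west-1", 50)
print(f"50TB in us-east-1: ${east:,.2f} per month")
print(f"50TB in us-west-1: ${west:,.2f} per month")
print(f"Monthly premium for us-west-1: ${west - east:,.2f}")

# Hypothetical volume: copying 1,000GB out of us-east-1.
print(f"1,000GB to us-east-2: ${transfer_cost_from_us_east_1('us-east-2', 1000):,.2f}")
print(f"1,000GB to us-west-1: ${transfer_cost_from_us_east_1('us-west-1', 1000):,.2f}")
```

On those assumptions, a full 50TB works out to roughly $1,178 a month in Virginia against $1,331 in California, before any replication traffic is counted. Keeping a second copy in another region means paying for the storage twice, plus the transfer fees.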
Then there are latency issues, too. It obviously takes time for packets from US-East-1 to reach US-West-1. In the end, it's easier to plonk your web application and smartphone app's backend in one friendly region, and ride out any storms. It's rare for a whole region to evaporate.
"Being the oldest region, and the only public region in the US East coast until 2016, it hosts a number of their earliest and largest customers," said IDC research director Deepak Mohan. "It is also one of their largest regions. Due to this, impacts to the region typically affect a disproportionately high percentage of customers."
Cost was a big factor, says Rob Enderle, principal analyst at the Enderle Group. "The issue with public cloud providers – particularly aggressively priced ones like Amazon – is that your data goes to the cheapest place. It is one of the tradeoffs you make when you go to Amazon versus an IBM SoftLayer," Enderle said.
"With an Amazon or Google you are going to have that risk of a regional outage that takes you out."
'Pouring one hundred gallons of water through a one gallon hose'
While those factors made the outage particularly difficult for customers who had come to rely on the US-East-1 region for their service, even those who had planned for such an occurrence and set up multiple regions were likely caught up in the outage. After US-East-1's cloud buckets froze and services vanished, some developers discovered their code running in other regions was unable to pick up the slack for various reasons.
"It is hard to say exactly what happened, but I would speculate that whatever occurred created enough of an issue that multiple sites attempted to fail over to other zones or regions simultaneously," Charles King, principal analyst with Pund-IT, told El Reg.
"It's like trying to pour one hundred gallons of water through a one gallon hose, and you end up with what looks like a massive breakdown."
The takeaway, say the industry analysts, is that companies should consider building redundancy into their cloud instances just as they would for on-premises systems. This could come in the form of setting up virtual machines in multiple regions or sticking with the hybrid approach of keeping both cloud and on-premises systems. And, just as with backups, companies should test that their failovers actually work.
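As a minimal sketch of what a failover drill might look like, the Python below tries a list of regions in order and falls back when one is down. Every name in it (RegionUnavailable, fetch_with_failover, make_fake_fetch) is hypothetical, and the fetcher is a stand-in that simulates an outage rather than a call into any real AWS SDK. It also says nothing about whether the fallback region has the capacity to absorb the extra load, which is the failure mode King describes.

```python
class RegionUnavailable(Exception):
    """Raised by the hypothetical fetcher when a region cannot serve a request."""


def fetch_with_failover(key, regions, fetch):
    """Try each region in order; return (region, value) from the first that answers."""
    failed = []
    for region in regions:
        try:
            return region, fetch(region, key)
        except RegionUnavailable:
            failed.append(region)  # remember the failure and try the next region
    raise RegionUnavailable(f"all regions failed for {key!r}: {failed}")


def make_fake_fetch(down):
    """Build a stand-in fetcher that simulates an outage in the regions in `down`."""
    def fetch(region, key):
        if region in down:
            raise RegionUnavailable(region)
        return f"{key} served from {region}"
    return fetch


# The drill: pretend the primary region has gone dark and check the fallback answers.
regions = ["us-east-1", "us-west-1"]
region, body = fetch_with_failover(
    "index.html", regions, make_fake_fetch(down={"us-east-1"})
)
print(region, "->", body)

# If every region is down, the caller gets an explicit error, not a silent hang.
try:
    fetch_with_failover("index.html", regions, make_fake_fetch(down=set(regions)))
    total_outage_detected = False
except RegionUnavailable:
    total_outage_detected = True
print("total outage detected:", total_outage_detected)
```

The point of running a drill like this on a schedule, rather than trusting the design, is the same as with restoring from backups: the fallback path is code that rarely runs in production, so it is where bugs tend to hide.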
"I think we have grown accustomed to the idea that the cloud has become a panacea for a lot of companies," King said. "It is important for businesses to recognize that the cloud is their new legacy system, and if the worst does occur the situation can be worse for businesses using cloud than those choosing their own private data centers, because they have less visibility and control."
While the outage will probably do little to slow the move of companies into cloud services, it could give some a reason to pause, and that might not be a bad thing.
"What this emphasizes is the importance of a disaster recovery path, for any application that has real uptime requirements, be it a consumer-facing website or an internal enterprise application," said IDC's Mohan.
"The biggest takeaway here is the need for a sound disaster recovery architecture and a plan that meets the needs and constraints of the application. This may be through usage of multiple regions, multiple clouds, or other fallback configurations." ®