Amazon's EC2 contract promises its infrastructure cloud will provide 99.95 per cent "uptime" over the course of a year. But that doesn't mean the company will dish out credits in the wake of the outage that affected some users for as many as four days, if not more.
Though the EC2 service level agreement says users will be eligible to receive credits if the service doesn't meet a 99.95 per cent "annual uptime percentage" within a particular geographical region, this only applies to users who have spread their applications across multiple "availability zones" – subsections of Amazon's regional services designed not to fail at the same time.
The outage did hit multiple zones in EC2's East Region – served up from at least one facility in Northern Virginia – but it appears that multiple zones were affected for only about three hours.
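A back-of-the-envelope check shows why three hours matters. The Python sketch below works out how much downtime a 99.95 per cent annual target leaves room for, and where a three-hour outage would leave the figure. It assumes a 365-day service year and no other qualifying downtime during that year. The function names are illustrative only.

```python
# Assumption: a 365-day service year; a leap year would shift the figures slightly.
HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(sla_percent):
    """Hours of downtime per year that still meet the given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - sla_percent) / 100.0


def uptime_percent(downtime_hours):
    """Annual uptime percentage after the given hours of downtime."""
    return 100.0 - 100.0 * downtime_hours / HOURS_PER_YEAR


budget = downtime_budget_hours(99.95)   # roughly 4.38 hours
after_outage = uptime_percent(3.0)      # roughly 99.966 per cent
print(f"budget: {budget:.2f} h, uptime after a 3 h outage: {after_outage:.3f}%")
```

On these assumptions, a three-hour multi-zone outage on its own would stay inside the yearly allowance of a little under four and a half hours.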
Amazon has yet to provide details about the outage, and many third-party commentators have failed to realize that the service level agreement is more complex than it seems. The availability zone setup continues to cause confusion, in part because people don't actually read SLAs, but also because Amazon has yet to describe how the zones are designed and how they operate.
At 1:41am Pacific time on Thursday, Amazon said in a post on its status page that it was investigating connectivity issues with its Elastic Compute Cloud (EC2) service, which provides on-demand access to processing power across the net. According to one status message, the problem began with a "network event" that caused the service to re-mirror a large number of Elastic Block Storage (EBS) volumes in the East Region. Elastic Block Storage provides storage that's independent of particular server instances on EC2.
Amazon divides EC2 into multiple geographic regions, and some regions – including the East Region – are divided into multiple "availability zones". Amazon has always said that these zones are protected from each other's outages. "Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones," the company's website reads. But the East Region outage spread across multiple zones.
Some felt that Amazon had broken its promise over availability zones. But the particulars of the service-level agreement add a new twist to this discussion. "'Annual Uptime Percentage' is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of 'Region Unavailable'," the agreement reads. "'Region Unavailable'...means that more than one Availability Zone in which you are running an instance, within the same Region, is 'Unavailable' to you."
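The quoted definition can be read as a simple counting rule. The Python sketch below is one interpretation of it, not Amazon's own method. The data layout (a set of five-minute period indices per zone), the zone labels, the function names and the 365-day service year are all assumptions made for illustration.

```python
# Assumption: a 365-day service year, split into five-minute periods.
PERIODS_PER_YEAR = 365 * 24 * 12


def region_unavailable_periods(zone_outages):
    """Return the periods in which more than one of the customer's zones was down.

    zone_outages maps a zone label (hypothetical) to the set of five-minute
    period indices during which that zone was unavailable to the customer.
    Only zones in which the customer runs an instance should be listed.
    """
    counts = {}
    for periods in zone_outages.values():
        for period in periods:
            counts[period] = counts.get(period, 0) + 1
    return {period for period, n in counts.items() if n > 1}


def annual_uptime_percentage(zone_outages):
    """100% minus the percentage of periods that count as 'Region Unavailable'."""
    bad = len(region_unavailable_periods(zone_outages))
    return 100.0 - 100.0 * bad / PERIODS_PER_YEAR


# A customer running in one zone only, down for four full days:
single_zone = {"zone-a": set(range(4 * 24 * 12))}
print(annual_uptime_percentage(single_zone))
```

Under this reading, the single-zone customer never has more than one of its zones down at once, so no period counts as "Region Unavailable" and the calculated uptime stays at 100 per cent however long the zone is out.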
According to Amazon's status messages, multiple availability zones experienced problems for about three hours on Thursday, then the problem was isolated in the zone where it began. John Engates, the chief technology officer at Rackspace, which operates a cloud service similar to Amazon's, believes Amazon is unlikely to provide many credits in the wake of the outage.
"More than one availability zone would have to go down for you to receive a credit, and you have to be down for a considerable amount of time," Engates told us during a conversation at this week's OpenStack design summit in Santa Clara, California. "I really doubt they'll pay a lot on credits."
Rackspace's Cloud Servers service does not provide a setup analogous to Amazon's availability zones. The Rackspace service-level agreement guarantees uptime for particular components within each service region, including its network, its data center infrastructure, and individual hosts. The company operates separate data centers in Texas, Chicago, and London.
Judging from Amazon's status messages, Engates says, he believes that Amazon's outage spread across multiple availability zones because the company was using availability zones to mirror Elastic Block Storage data for other zones. "Rather than replicating data within a zone, I think they were replicating between zones," he said. "And it seems that when they had a failure in one zone, traffic waterfalled into the other zones. It's like if there was a fire in a hotel. We would have to evacuate to the hotel across the street, and there may not be enough room in the hotel across the street for everyone to get a room."
It appears that the outage affected only those who were using Amazon's Elastic Block Storage service.
Engates says that Amazon's cloud service and its service-level agreement are set up in such a way that users must ensure redundancy across zones – if not across entire regions. "You have to think about how to allocate your application across multiple resources to maximize that SLA," he said. "Those that did so – Netflix is one example of a big customer – did not experience the same kind of outages as people who were very localized. You could put some of the blame on Amazon, but some of the blame on the customer."
Yes, multiple zones were hit by the outage. But Amazon does not promise 100 per cent availability. The company has said, however, that it is unable to restore EBS volumes for some customers. About 0.07 per cent of EBS volumes in the East Region, a status message indicates, "will not be fully recoverable". ®