What is it with cloud status pages not reflecting reality?
Is AWS down? It depends who you ask
Analysis Internet services in the US on Thursday were far more stable than those in Ukraine and Russia, but even so reports of problems surfaced.
DownDetector.com, which tracks service outages using reports from individuals along with real-time data analysis, showed spikes reflecting connectivity issues for Amazon Web Services at about 1700 UTC for at least some netizens.
And yet as was the case on Tuesday this week, when a few people similarly complained on DownDetector that AWS was briefly unavailable, an Amazon spokesperson today insisted all is and was well with AWS.
"I can confirm there are no issues with AWS services," an Amazon spokesperson told The Register. The internet giant said there was no breakdown on Tuesday nor on Thursday, adding: "We have not had a single service event this week."
You can see as much from the AWS status page: no recent events are noted and every listed service shows a green check icon to indicate normal operations.
How then to reconcile reports of problems with the insistence there are no problems? Amazon believes DownDetector.com's data is unreliable. Luke Deryckx, CTO of Ookla, which owns DownDetector.com, reportedly said that although Tuesday's outage was not widespread, there were "a small number of credible reports that some AWS users had issues with their platform."
What gives? Is this a point of semantics or magnitude? Is the internet so elaborate that small transient breakdowns, with out-of-sight underlying causes, deserve no explanation or acknowledgement?
At the time this article was filed, the page signaled "many users reporting issues" for AWS while the AWS status page reported everything was okay. And there have been similar efforts of this sort, such as stop.lying.cloud.
"Outages are inconvenient, of course, but they're inevitable for any major service, something is always going to break eventually – these are complex and constantly changing systems under pressure from heavy usage," Perry told The Register.
"The status page situation is much more frustrating though. It's very common to see status pages fail to match reality, and I made statuspagestatuspage.com specifically because for big tech services like Slack, GitHub and AWS the delay in updating the status page has become such a running joke."
Perry said it's absurd that small sites like downdetector.com can tell, to some degree, when AWS is having issues but AWS cannot.
"I strongly suspect this is because the published status is linked to contractual SLAs [Service Level Agreements] for enterprise clients, and financial penalties in those contracts for the service provider," said Perry. "These SLAs create disincentives to proactively update the status of course, but worse they really discourage many useful improvements for issue detection that would fix this completely.
"Publishing any indication of downtime has a major and direct financial impact, so automated anomaly detection and reporting is out, crowdsourced reporting is out, and everything has to run at the speed of manual confirmation by somebody high up enough in the company to sign off on the consequences."
Perry pointed to the GitHub status page as an example, noting that it used to show automated statistics about performance, failures per second, and other metrics, such as recent views.
"That was removed a few years ago, stripping it down to a simple manually controlled red-yellow-green status for their set of services, which reliably lags behind the reality," he said.
Behind the curtain
There's some support for Perry's contention that AWS isn't being as upfront as it could be in reporting what's going on. In a post to Hacker News in 2017, infrastructure engineer Nick Humrich described his experience working as a software engineer on the Elastic Beanstalk team at AWS in 2015.
"When I was on an AWS team, posting a 'non-green' status to the status page was actually a manager decision," he wrote. "That means that it is in no way real time, and it's possible the status might say 'everything is ok' when it's really not because a manager doesn't think it's a big enough deal.
"Also there is a status called green-i which is the green icon with a little 'i' information bubble. Almost every time something was seriously wrong, our status was green-i, because the 'level' of warning is ultimately a manager's decision. So they avoid yellow and red as much as possible. So the statuses aren't very accurate. That being said, if you do see a yellow or red status on AWS, you know things are REALLY bad."
In an email to The Register, Humrich affirmed the accuracy of his post.
"Does SLA avoidance explain it? Maybe, though if so, it's indirect," he said.
"The reality is it's explained by two aspects: first, the page is a human decision, not automated. Second, managers want to look good. AWS is a massive place and 'number of outages' is probably the best 'measure' of a team."
The deeper question, he said, is why the page isn't automated.
"Maybe because Amazon assumes on-call teams are responding quickly enough. Maybe it's because they learn that SLA is lower? Hard to know, or prove. I don't think SLA avoidance would really be too big of a factor, though, because large enterprise companies almost always get their money back if they ask for it. Doesn't even require much proof."
Corey Quinn, chief cloud economist of The Duckbill Group, a cloud service consultancy, told The Register that there are two sets of issues with the AWS status page.
"First, it's just a useless 'sea of green dots' that doesn't tell us anything useful; that's why I built stop.lying.cloud," he said, explaining that this service strips cruft from the AWS status page and escalates the severity to better reflect reality.
"Second, it's hard to comprehend the scale of hyperscale cloud providers. US-East-1 in Virginia is something like a hundred or so buildings spread across a number of different towns, but it's expressed as six availability zones. If an entire building falls off the internet, some customers see their service explode, others have no idea that anything's wrong whatsoever. The question isn't 'is AWS down?' so much as it is 'at this point in time, how down is it?'
"At scale, something is always broken; building durable and reliable systems that can survive those breakages is what the game is all about. The communication challenge is that when the service you use is down for your environment, the sea of green dots is infuriating. Conversely if they showed every outage they experienced on their status page, it'd be an equally useless but far more alarming sea of red dots instead."
Quinn illustrated the problem with excessive transparency by pointing to how Slack used to provide highly detailed outage data, which Reuters then used as the basis for a story questioning Slack's stability.
He also argued that SLA compliance isn't a plausible explanation for why people perceive problems that the AWS status page does not reflect. Enterprises, he said, have access to their own uptime metrics and generally just want things to work.
"SLA credits are effectively useless for companies," he said. "It's not enough money to move the needle on the lost business opportunity for most companies."
Quinn said AWS' approach to making its status data more meaningful to customers has been providing an account-specific "Personal Health Dashboard" that offers a far more granular view of current service status. But, he added, this isn't well-known because it requires a login and is customer-specific.
The existence of performance monitoring services suggests cloud providers aren't always up front about what's going on. After all, there would be no need to involve a third-party service to verify service availability and SLA compliance if cloud companies reported everything completely accurately.
Monitoring service Ably, for example, contends, "Amazon AWS flat out lies" on its status page.
Malik Zakaria, managing director at managed IT service provider ExterNetworks, described the issue in more diplomatic terms.
In a phone interview with The Register, he said, "When we talk to our customers, SLA monitoring and compliance monitoring is required to ensure that the SLA agreement is being met."
There are times, he said, when service outages occur and go unreported, and under its agreements his company helps resolve those. But 99.99 per cent of the time, he said, the company is able to restore service within agreed-upon times.
In general, Zakaria said, the system works very well, and the support staff at cloud providers are very good and work around the clock.
The Register spoke at length with an Amazon spokesperson about the aspersions being made against the AWS status page. Amazon disputed the notion that SLA concerns affect its status reporting, and questioned the accuracy of crowd-sourced reporting that may reflect service disruptions linked to ISPs or network-layer providers.
"Third parties speculating on AWS availability almost always get it wrong," an Amazon spokesperson told us.
"Just this week, Downdetector walked back its own false reporting by saying, 'we do not believe there was a widespread service issue on AWS’s platform.' The AWS Service Health Dashboard (SHD) is the only reliable source of AWS availability data, providing customers with timely and accurate information on AWS services and regions.
"It is not connected to our Service Level Agreements (SLAs) in any way. Our SHD provides more details and transparency on service availability than any other cloud provider." ®
Heroku, which is hosted by AWS, suffered a partial outage at about 1700 UTC on Thursday, around the same time complaints against AWS rose on DownDetector. The Rust programming language's Crates.io, which relies on Heroku and Amazon's cloud, was temporarily down at 1700 UTC also, citing problems with an infrastructure provider.