Cloud computing was meant to solve the reliability problem, but in practice, it still has a long way to go. Is that an endemic problem with the complexity of cloud computing, or a problem with the way people use it?
Cloud infrastructures are meant to be resilient, because they tend to use lots of cheap servers and scale out. The idea, following the "pets vs cattle" theory, is that if one server dies, you just get rid of it and spin up another, because with that much redundancy no single machine matters.
On the surface, then, it seems like cloud infrastructure should be bulletproof. But of course, clouds are about more than just servers. They’re about network traffic that misbehaves, software applications that don’t do what they are meant to, and administrators who don’t either.
Back in 2011, Amazon’s Elastic Block Store was borked after someone routed traffic to a backup network that didn’t have enough capacity. That caused a ‘remirroring storm’ as EBS nodes lost contact with their replicas and tried to create new ones, spawning yet more network traffic which in turn knocked out other nodes.
“We sure learned our lesson,” said Amazon, or something to that effect. So the world was shocked – shocked, I say – in 2012, when it borked again, this time due to a software bug that stopped a DNS propagation, and last September, when the DynamoDB service that manages metadata for its NoSQL data services fell over and knocked out a bunch of other services, too.
In December, Office 365 went tooltips-up, following some misrouted requests to Azure Active Directory, which in turn maxed out system resources in the European region, leading to a four-hour outage.
We could go on. Citing cloud outages (cloutages?) is like shooting fish in a barrel. But all of the ones cited here have one thing in common: cascading effects. One small problem snowballed, creating a series of larger ones with a wide-reaching effect on services.
Academics have been mulling the cascading effects problem in cloud infrastructure for a while. In 2012, Bryan Ford (then an assistant professor at Yale, now at École polytechnique fédérale de Lausanne in Switzerland) published a paper on the topic at the 4th USENIX Workshop on Hot Topics in Cloud Computing.
He described cloud security risks as the tip of a far bigger iceberg, in which a collection of complex dependencies between hardware resource pools, load balancing and other reactive mechanisms could create feedback loops that send cloud infrastructures spinning out of control until they fail.
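To make the feedback loop concrete, here is a toy model of it in Python. It is a sketch under loud assumptions rather than anything taken from Ford's paper: the function name, the capacities and the equal-share load balancing are all invented for illustration. Surviving nodes split a fixed load evenly, and any node pushed past its capacity drops out, which raises the share for everyone left.

```python
def simulate_cascade(capacities, total_load, initially_failed):
    """Toy cascade model (illustrative assumptions only).

    Load is split equally among surviving nodes. Any node whose share
    exceeds its capacity fails, and the load is re-split in the next round.
    Returns the indices of the nodes still alive once things settle.
    """
    alive = [i for i in range(len(capacities)) if i not in initially_failed]
    while alive:
        share = total_load / len(alive)
        overloaded = [i for i in alive if share > capacities[i]]
        if not overloaded:
            break  # stable: every survivor can carry its share
        alive = [i for i in alive if i not in overloaded]
    return alive


# Four nodes comfortably share a load of 100 (25 each)...
print(simulate_cascade([30, 35, 55, 120], 100, set()))
# ...but lose the biggest one and the rest topple one after another.
print(simulate_cascade([30, 35, 55, 120], 100, {3}))
```

In the second call the share climbs from 25 to 33.3, then 50, then 100, taking out one more node each round. Give every node more headroom (say, a capacity of 60 apiece) and the same single failure is absorbed without a cascade.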
Preparing the bedroom for Mr. Cock Up
The problem may get worse as different cloud providers’ structures rely on each other. Ford described infrastructures that seem independent, but that share resource dependencies. These could range from network peering down to sharing, at the most aggregated point of some labyrinthine physical network, the same fibre optic cables. These kinds of dependencies might create correlated failures that bring portions of the cloud crashing down, along with the business services that run on them.
The more complex a network becomes, the harder it is to predict how such conditions will emerge inside of it and knock it off track, according to Mark Peacock, principal and IT transformation practice leader at Hackett Group, which handles strategic consulting and best practices implementation for global firms.
“There is a mathematical complexity. It’s difficult to get to the root cause, because you’re stacking a lot of independent actors on top of each other. It becomes less empirical, less predictable,” says Peacock.
In short, predicting failures in large, complex systems with lots of non-obvious interdependencies isn’t an exact science. That’s a problem for administrators who like to think about things in deterministic ways.
In an ideal world, you test for problem A, and the conditions that cause it can be proven not to exist. There are no dragons. In a non-deterministic world, you can only give the dragon a probability score. It may pounce from anywhere to fry your infrastructure, and the best you can hope for is an informed guess about whether it’s in the shadows.
That’s a scary world to live in. What can you do about it?
Netflix is one of the companies often cited as a survivor of cloud outages. When an AWS region falls over, Netflix mostly keeps on going – though there have been exceptions. It has a couple of ways of doing this. The first is that it replicates its data and processing not just across multiple availability zones, but across multiple regions - the East and West coast. It uses traffic shaping and DNS to route user requests where they need to go, meaning that if a region dies, it can easily fail over.
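The routing decision behind that kind of failover can be sketched in a few lines of Python. This is not Netflix's code: the function, the "east" and "west" labels and the health map are hypothetical, and a real setup would make the decision in DNS or a traffic-management layer rather than in a helper function.

```python
def pick_region(preferred, healthy, fallbacks):
    """Return the region that should serve a request, or None if all are down.

    preferred: the region the user would normally be sent to
    healthy:   dict of region name -> bool (assumed to come from monitoring)
    fallbacks: dict of region name -> ordered list of alternative regions
    """
    for region in [preferred, *fallbacks.get(preferred, [])]:
        if healthy.get(region, False):
            return region
    return None


fallbacks = {"east": ["west"], "west": ["east"]}

# Normal day: East coast users stay in the East.
print(pick_region("east", {"east": True, "west": True}, fallbacks))   # east
# East falls over: the same users are sent West instead.
print(pick_region("east", {"east": False, "west": True}, fallbacks))  # west
```

The catch is that the fallback region only helps if the data and processing are already replicated there, which is the expensive part.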
The second way it does this is to use a variety of tools to deliberately induce failure. Its Chaos Monkey tool will knock out a server. Another algorithm, Chaos Kong, knocks whole AWS regions out of service, for Netflix’s purposes, just to prove that the service can still perform. The company actually lets this stuff loose on production infrastructure, which is pretty ballsy.
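The principle is simple enough to sketch, with the heavy caveat that what follows is not Chaos Monkey itself. `Cluster` and `chaos_round` are made-up names, and the "servers" are just strings in memory. The point is the shape of the test: kill something at random, then check the service still answers.

```python
import random


class Cluster:
    """In-memory stand-in for a pool of interchangeable servers (hypothetical)."""

    def __init__(self, names):
        self.alive = set(names)

    def kill(self, name):
        self.alive.discard(name)

    def handle_request(self):
        # The service works as long as at least one server is still alive.
        if not self.alive:
            raise RuntimeError("no servers left")
        return "ok"


def chaos_round(cluster, rng):
    """Kill one randomly chosen live server and return its name."""
    victim = rng.choice(sorted(cluster.alive))
    cluster.kill(victim)
    return victim


cluster = Cluster(["web-1", "web-2", "web-3"])
rng = random.Random(42)  # seeded so a failing run can be replayed

victim = chaos_round(cluster, rng)
print("killed", victim, "-> service says", cluster.handle_request())
```

Run against a pool of one, the same test fails as soon as the lone server is shot, which is exactly the kind of weakness the exercise is meant to flush out before a real outage does.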
Do all this right, and the cloud is still a solid bet for resilience, according to former Netflix cloud architect Adrian Cockcroft, who now works at Battery Ventures. “There's a lot more ability to distribute systems in the cloud because you can fire up machines all over the world,” Cockcroft says.
The problem is that most people don’t distribute systems that widely, according to Kamesh Pemmaraju, vice president of product marketing at OpenStack consulting and support vendor Mirantis. “In the AWS failure, many companies didn’t even use the other regions that Amazon provided,” he told us.
Insuring against the butterfly effect
That comes at a cost. Cockcroft suggests it adds about another quarter to the cost of your technical hosting – in the Netflix case, putting a few thousand Cassandra nodes across each region - but it’s a good insurance policy, he explained.
To protect your software against the unpredictability of the cloud, don’t trust what’s happening at the infrastructure layer. Pemmaraju advised monitoring the infrastructure and being ready to act automatically if it goes down. This requires some kind of automated monitoring that operates at the application layer. In short: your software should know if the computing rug’s being pulled from under it.
“It requires intelligence – understanding of what happens in the infrastructure. If a network goes down I need to know ASAP that it’s down. That information is relayed to the application in a timely manner so that the application can take appropriate remedial action,” Pemmaraju advised. “Netflix already had built-in redundancy at the application layer because that's best practice.”
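What that relay might look like inside an application is sketched below. It is an illustration only: the `HealthMonitor` class, its failure threshold and the callback are assumptions, not anything Pemmaraju or Mirantis described. The application feeds in the result of each check on a dependency, and after enough consecutive failures the monitor fires a remedial action exactly once, until the dependency recovers.

```python
class HealthMonitor:
    """Counts consecutive failed checks per dependency (hypothetical sketch).

    When a dependency reaches `threshold` failures in a row, `on_down` is
    called once with its name. A successful check resets the count.
    """

    def __init__(self, threshold, on_down):
        self.threshold = threshold
        self.on_down = on_down
        self.failures = {}
        self.down = set()

    def report(self, name, ok):
        if ok:
            self.failures[name] = 0
            self.down.discard(name)
            return
        self.failures[name] = self.failures.get(name, 0) + 1
        if self.failures[name] >= self.threshold and name not in self.down:
            self.down.add(name)
            self.on_down(name)  # remedial action, e.g. switch to a replica


actions = []
monitor = HealthMonitor(threshold=2, on_down=actions.append)

monitor.report("network", ok=False)  # one blip: no action yet
monitor.report("network", ok=False)  # second failure in a row: act
print(actions)  # ['network']
```

Requiring more than one failure before acting is a deliberate choice: an application that fails over on every blip can itself become one of the reactive mechanisms that feeds a cascade.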
You may not be able to control the hidden, weird potential cascading feedback loops in the cloud, but you can at least control how your applications react to them. ®