Chaos is good for you, says first 'state of chaos engineering' report

Spend more on resilience, get more resilience: who would've thunk it?


The 2021 State of Chaos Engineering report from Gremlin, based on a survey of 400 companies, has shown a correlation between high availability and frequent use of chaos engineering.

Chaos engineering is the practice of deliberately injecting faults into a service to test its resilience. AWS is introducing its own version of this, called Fault Injection Service, early this year, according to CTO Werner Vogels at last month's re:Invent.
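To make the idea concrete, here is a minimal sketch of fault injection in Python. It is not how Gremlin or the AWS service works; the names `with_fault_injection`, `InjectedFault`, `fetch_price` and `price_or_default` are all hypothetical, invented for illustration. The wrapper makes a chosen fraction of calls fail on purpose, so you can check that the calling code copes.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos experiment."""


def with_fault_injection(func, failure_rate, rng=random.random):
    """Wrap func so that a fraction of calls fail on purpose.

    failure_rate is a probability between 0.0 and 1.0; rng is any
    zero-argument callable returning a float in [0.0, 1.0).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")

    def wrapper(*args, **kwargs):
        # Fail before touching the real service, as an outage would.
        if rng() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a real downstream service.
    return {"widget": 9.99}.get(item, 0.0)


def price_or_default(fetch, item, default=0.0):
    """The behaviour under test: degrade gracefully when the call fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default
```

Passing `rng` in as a parameter is a design choice for this sketch: it lets a test force every call to fail, or none, rather than relying on chance.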

The new survey from chaos engineering company Gremlin, with contributions from Dynatrace, Epsagon, Grafana Labs, LaunchDarkly and PagerDuty, all companies in the DevOps and observability space, attracted respondents from a range of company sizes from small businesses to enterprises.

Over 70 per cent of responses were from software and services companies or techies in banking and financial services. 58 per cent said they run more than half their workloads on public cloud, with AWS grabbing 38 per cent of that business, followed by Google and Microsoft at 12 per cent each. Asked "what is your database", the respondents identified MySQL or PostgreSQL at 22 per cent each, MongoDB at 16 per cent, DynamoDB at 14 per cent and Cassandra at 5 per cent. We are not sure what happened to the Oracle or SQL Server users, or how the survey handled those who use more than one database manager (perhaps the majority in most organisations).

According to the survey, 57.5 per cent of respondents achieve more than 99.5 per cent "average availability", though most also report between one and 10 "high severity incidents" per month, with mean time to resolution (MTTR) of over one hour in more than 75 per cent of cases. Bad things that happen include faulty code deployed, dependency issues, and configuration errors, in that order of frequency.

The heart of the report is the section which looks at what the top performers, with 99.99 per cent availability and an MTTR of under one hour, did differently from others. No surprise (given the source of the survey) that this group performed more chaos experiments than the others, with 23 per cent of them performing a chaos experiment weekly or daily, versus 10.8 per cent who did the same in the lowest performing group (less than 99 per cent availability).
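For a sense of what those availability figures mean in practice, the arithmetic can be sketched in a few lines of Python. The 30-day period is our assumption, since the report's exact definition of "average availability" is not given here, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the downtime, in minutes, implied by an availability figure.

    Assumes availability is measured over period_days of continuous
    service (30 days by default, an assumption for illustration).
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for figure in (99.0, 99.5, 99.99):
        minutes = allowed_downtime_minutes(figure)
        print(f"{figure} per cent: about {minutes:.1f} minutes per 30 days")
```

On that basis, 99.99 per cent leaves roughly 4.3 minutes of downtime in a 30-day month, 99.5 per cent about 3.6 hours, and 99 per cent about 7.2 hours.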

We would be wary, though, of concluding much from this statistic. Those top performers also showed more use of autoscaling (65 per cent vs 43 per cent), more use of DNS failover or elastic IPs (49 per cent vs 24 per cent), more use of load balancers (77 per cent vs 71 per cent) and more use of multi-region resilience (38 per cent vs 19 per cent active-active and 46 per cent vs 30 per cent active-passive).

In an active-active configuration, deployments in multiple regions are live and can take over in the event of failure, whereas in active-passive a deployment in another region stands ready for use if needed.

Another resilience technique is the circuit-breaker pattern: if a service becomes overloaded or fails, further calls are automatically cut off for a while, either failing fast or being switched to an alternative service, rather than making repeated retries which may cause further problems. Of the best performers, 32 per cent use circuit breakers, compared to only 16 per cent of the worst performers.
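A bare-bones circuit breaker might look like the Python sketch below. It is illustrative only: the class name, the threshold of three failures and the 30-second timeout are our assumptions, not anything taken from the report. After enough consecutive failures the breaker "opens" and refuses calls without touching the struggling service; once the timeout has passed it lets a trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, so allow a trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                # Open the circuit, or re-open it after a failed trial.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would catch `CircuitOpenError` and serve a fallback, such as a cached page or an alternative service, which is where the "switching" described above comes in.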

Other popular techniques mentioned are database replication, retry logic (applications that do not give up on the first failure but retry), selective rollouts such as deploying a feature enhancement to a subset of customers, caching pages in case dynamic content fails, and performing regular health checks. In all these cases the story is the same: those who use these techniques have less downtime, on average, than those who do not.
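Retry logic is the simplest of these to show in code. The Python sketch below retries a failing call a limited number of times, doubling the wait between attempts; the function name `retry` and its default values are hypothetical, chosen for illustration rather than drawn from the report.

```python
import time


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying on failure with exponential backoff.

    Waits base_delay seconds after the first failure, then doubles the
    wait each time. Re-raises the last error once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: give up and surface the error
            sleep(base_delay * 2 ** attempt)
```

Capping the attempts and backing off between them matters: unlimited, immediate retries are exactly the sort of thing that piles more load on a service that is already struggling.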

There is a cost to all these resiliency measures, first in the cloud infrastructure to implement them, and second in the administration time and skill to configure them. Another way of interpreting the report, then, is that those who invest more in resiliency get better resiliency. The data does show that greater use of chaos engineering correlates with higher resiliency, but we do not know how it rates as a factor in this, versus all the other measures taken by the top performers.

It is wrong, though, to think that the larger organisations like chaos more. The greatest use, according to the report, is in businesses with 5,001-10,000 employees, over 70 per cent of which use the technique to some extent. Both larger and smaller organisations have lower usage.

Do organisations risk injecting chaos into production systems? A substantial minority (34 per cent) do just that. Otherwise, chaos tests are reserved for staging or development systems. Caution is understandable, since injecting faults into production is not without risk.

That said, there is limited validity to resilience tests on a development system, which is unlikely to be at the same scale as production or have the same pattern of usage. 11 per cent of respondents are put off chaos engineering by "fear that something might go wrong."

Resilience is hard and resilience is also expensive. Chaos engineering is becoming more popular, as the forthcoming AWS service demonstrates, but it is no substitute for all the other common techniques; it is not designed to replace them, but rather to show that they work.®


Biting the hand that feeds IT © 1998–2021