Fancy doing a little... experiment? Chaos comes to AWS as Fault Injection Simulator goes live

Complete with nervous warnings: how much chaos is too much?

AWS has rolled out its Fault Injection Simulator (FIS), designed to introduce deliberate faults into its cloud services so that users can test the resilience of their applications.

Chaos engineering is useful for discovering what actually happens in the event of a failure, observing the principle that administrators cannot know whether something like a failover system will work as expected until there is an actual outage.

It can also be used to test the impact of things like services that are slow to respond, or which encounter bad data, or which run out of memory. The idea is not only to avoid catastrophe but also to measure the outcome and its potential impact on a business.

The 1-2-3 of AWS Fault Injection Simulator – but the scope of the service is limited

The AWS Fault Injection Simulator (FIS) was introduced at the company's virtual re:Invent conference late last year and is now available – though in its first iteration it has limited scope.

Services targeted are EC2 (virtual machine instances), ECS (containers), EKS (Kubernetes) and RDS (relational databases). Using FIS involves creating what AWS calls an “Experiment”, or rather, an experiment template from which experiments can be run.

Experiments have actions – such as reboot-instance or inject-api-internal-error – and targets, such as a specific VM instance or EKS nodegroup. The greatest flexibility comes from specifying AWS Systems Manager command documents, called SSM documents, in YAML or JSON format, though these are fiddly to author and require an SSM agent on target resources.
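To make the action-and-target pairing concrete, here is a minimal sketch of an experiment template built as a plain Python dictionary. The field names follow our understanding of the FIS API's template shape, and the action ID `aws:rds:reboot-db-instances`, resource type `aws:rds:db` and target key `DBInstances` should be checked against the current FIS action reference. The ARNs, the `demo-db` and `reboot-db` labels and the function name are all made up for illustration.

```python
import json


def build_reboot_template(db_instance_arn, role_arn):
    """Return a dict shaped like an FIS experiment template (a sketch only).

    Field names are assumed from the FIS API; verify before real use.
    """
    return {
        "description": "Reboot one RDS instance to observe application recovery",
        # IAM role that FIS assumes to act on the resources (hypothetical ARN).
        "roleArn": role_arn,
        # Targets: a named group of resources the experiment may touch.
        "targets": {
            "demo-db": {
                "resourceType": "aws:rds:db",
                "resourceArns": [db_instance_arn],
                "selectionMode": "ALL",
            }
        },
        # Actions: what to do, and which named target to do it to.
        "actions": {
            "reboot-db": {
                "actionId": "aws:rds:reboot-db-instances",
                "targets": {"DBInstances": "demo-db"},
            }
        },
        # No automatic stop condition in this bare-bones sketch.
        "stopConditions": [{"source": "none"}],
    }


template = build_reboot_template(
    "arn:aws:rds:us-east-1:123456789012:db:demo-mysql",  # placeholder ARN
    "arn:aws:iam::123456789012:role/demo-fis-role",      # placeholder ARN
)
print(json.dumps(template, indent=2))
```

The dictionary is only data; nothing here talks to AWS. In principle it could be passed as keyword arguments to the `create_experiment_template` call on boto3's `fis` client, along with a client token, but that step is left out so the sketch runs without credentials.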

The Reg tries it out

We set up a small experiment, targeting a MySQL database on AWS RDS (Relational Database Service), and encountered some frustrations: the range of actions was limited and the console for creating experiments did not indicate which action applied to which service. When we tried to inject a “wait” delay to simulate the database being slow to respond, we got an error “no actions are associated with this target.” We then changed the action to Reboot Db Instance and it worked. There was also the usual complexity around AWS Identity and Access Management roles and policies.

All this done, we were able to run an experiment successfully, but not before being given a couple of warnings. The experiment “might perform destructive actions on your AWS resources”, said the dialog, encouraging us to “review the best practices and planning guidelines.”

The link to those guidelines took us to a documentation page that was not all that illuminating, though it did suggest that “you first complete a planning phase and a test in a pre-production or test environment.”

There was also strong encouragement to link an experiment to CloudWatch monitoring (another AWS service) in order to set stop conditions, so that if something went wrong it could be terminated early. It would seem that AWS has concerns about newly enthused experimenters causing more chaos than anticipated.
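A stop condition is, as far as we can tell, just a short list attached to the experiment template, naming the CloudWatch alarms that should halt the run. The sketch below assumes the `source` and `value` field names and the `aws:cloudwatch:alarm` and `none` source values from the FIS API; the helper name and the alarm ARN are hypothetical.

```python
def stop_conditions(alarm_arns):
    """Build a stopConditions list for an FIS template (a sketch only).

    With no alarms, FIS expects an explicit "none" source rather than
    an empty list (assumption based on the FIS API documentation).
    """
    if not alarm_arns:
        return [{"source": "none"}]
    # One entry per CloudWatch alarm; any alarm firing stops the experiment.
    return [
        {"source": "aws:cloudwatch:alarm", "value": arn}
        for arn in alarm_arns
    ]


# Placeholder alarm ARN, for illustration only.
conditions = stop_conditions(
    ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:demo-error-rate"]
)
print(conditions)
```

The alarm itself still has to be created and tuned in CloudWatch; this only wires an existing alarm to the experiment.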

Are you sure you know what you are doing? AWS warns of the risks.

The cost of the service is $0.10 per action-minute, plus of course the cost of any extra resources provisioned for an experiment, if there are any. As with almost all AWS services, there is an FIS API, and one possibility is to run FIS experiments as part of a DevOps pipeline.
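For a rough idea of what that pricing means, the following sketch multiplies out action-minutes at the quoted $0.10 rate. It assumes each action is billed for the minutes it runs and ignores any rounding rules AWS may apply, as well as the cost of the resources themselves; the function name is ours.

```python
from decimal import Decimal

# Rate quoted in the article: $0.10 per action-minute.
PRICE_PER_ACTION_MINUTE = Decimal("0.10")


def estimate_fis_cost(action_minutes):
    """Estimate the FIS charge for a list of per-action run times in minutes."""
    # Convert via str so float inputs such as 2.5 stay exact.
    total_minutes = sum(Decimal(str(m)) for m in action_minutes)
    return (total_minutes * PRICE_PER_ACTION_MINUTE).quantize(Decimal("0.01"))


# Three actions running five minutes each: 15 action-minutes.
print(estimate_fis_cost([5, 5, 5]))  # prints 1.50
```

At that rate an experiment would have to run a lot of actions for a long time before the FIS fee itself, rather than the outage it causes, became the worry.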

It is worth noting that chaos engineering on AWS resources can be done by other means, whether it is via a DIY script in a cron job on an EC2 instance, or a service from a third-party specialist such as Gremlin. The initial AWS effort will not worry the specialists too much, though according to AWS technical evangelist Jeff Barr "We'll be adding support for additional services and additional actions throughout 2021," so expect improvements soon. ®

Biting the hand that feeds IT © 1998–2021