A new component of AWS Systems Manager aims to assist with handling incidents.
AWS Systems Manager (formerly called SSM – Simple Systems Manager) was introduced in 2017, though it really goes back to EC2 Systems Manager, launched late 2016.
"It started as a way to manage your EC2 instances," AWS evangelist Julien Simon told The Register, "for example, checking the patching status of your EC2 fleet. If you have a couple, you can do it manually, but if you have 100 instances, Linux and Windows, it's pretty difficult."
The current Systems Manager has capabilities including organising AWS resources, scheduling maintenance tasks, gathering automated inventory data, and viewing summaries of metrics and alarms. Admins can also run commands and scripts on multiple instances via SSM documents.
SSM depends on agents installed on EC2 instances, the code for which is open source on GitHub.
Last week AWS added a new feature to Systems Manager, called Incident Manager. Why Incident Manager? "The main reason is customers are looking for a fully integrated solution on AWS," said Simon.
The idea is that incidents are triggered by alarms in CloudWatch, the AWS monitoring service, or by an event sent through the EventBridge event bus. These trigger response plans in Incident Manager, where a response plan is a combination of contacts, an escalation plan, and a runbook.
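To make the moving parts concrete, here is a minimal Python sketch of that flow. It is not the AWS data model or API: the names `ResponsePlan`, `Incident`, and `handle_trigger`, and the trigger keys, are hypothetical and chosen only for illustration. It shows a trigger from an alarm or event being matched to a response plan, which then engages contacts and can escalate in stages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ResponsePlan:
    """Illustrative stand-in for a response plan (hypothetical fields)."""
    name: str
    contacts: List[str]                 # engaged as soon as the incident opens
    escalation_stages: List[List[str]]  # who to bring in at each later stage
    runbook: str                        # name of the runbook to start


@dataclass
class Incident:
    title: str
    plan: ResponsePlan
    engaged: List[str] = field(default_factory=list)

    def escalate(self, stage: int) -> List[str]:
        """Engage anyone in the given stage who is not already involved."""
        newly = [c for c in self.plan.escalation_stages[stage]
                 if c not in self.engaged]
        self.engaged.extend(newly)
        return newly


# A trigger is (source, name), e.g. ("cloudwatch-alarm", "HighCPU").
Trigger = Tuple[str, str]


def handle_trigger(trigger: Trigger,
                   plans: Dict[Trigger, ResponsePlan]) -> Optional[Incident]:
    """Open an incident if a response plan is registered for this trigger."""
    plan = plans.get(trigger)
    if plan is None:
        return None  # no plan registered, so no incident is opened
    incident = Incident(title=f"{trigger[1]} ({trigger[0]})", plan=plan)
    incident.engaged.extend(plan.contacts)
    return incident
```

In this sketch, the alarm or event only selects a plan. Who gets contacted and which runbook starts are properties of the plan itself, which mirrors the combination described above.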
An example given by AWS creates an event when CloudWatch detects high CPU usage on an EC2 VM. The runbook in the AWS solution contacts responders and then offers instructions on attaching the instance to an auto-scaling group.
This would be a good solution if an ecommerce site is overwhelmed with customers trying to place orders, but not so good if an errant process is consuming the CPU for no good reason; one hopes that the responders would pick this up rather than blindly scaling up and giving more money to AWS. A typical runbook has steps for triage, diagnosis, mitigation and recovery, including a post-incident report.
Who's it for?
Incident Manager is not intended for end users, and these are technical incidents rather than support cases. That said, there is an API, so it would be possible to have support cases in some other system trigger incidents if that were appropriate. There is also integration with OpsCenter, another part of Systems Manager, which is a manager for operational work items (OpsItems), essentially a list of admin tasks. Incident Manager generates OpsItems for the tasks it identifies. These OpsItems can be synchronised to third-party ticketing systems like Jira and ServiceNow.
Can Incident Manager be fully automated so that problems are fixed automatically? Sadly, this is unlikely, even though a runbook is capable of executing commands. "For sophisticated problems, only a human expert will figure it out," said Simon.
That said, an automated process could "collect the logs, make sure the maintenance page is up on the website, the early steps," he added.
The question perhaps is how much value Incident Manager adds over simply having CloudWatch alarms notify engineers. CloudWatch is already capable of triggering actions such as rebooting an EC2 instance, or configuring auto-scaling.
"We always build a basic first version and then we listen to customers," said Simon, who when asked what he would like to see in future told us he hoped for more sophisticated rules with conditions – "a little bit more intelligence, a little more flexibility."
"My advice for customers would be, we're trying to build best practices for incident resolution, and this is designed by the large-scale incident management team at Amazon. The people who actually deal with those big problems when something breaks inside Amazon and AWS," Simon added.
He also recommended chaos engineering as a means of testing incident response, for example by using the AWS Fault Injection Simulator introduced a couple of months ago.
It is apparent that Incident Manager will only perform well when carefully configured, and that the challenge is fine-tuning CloudWatch alarms to hit the sweet spot of alerting engineers when necessary, but only when necessary. The second challenge is figuring out exactly what has gone wrong, since in many cases fixing issues is relatively easy, once the root cause has been established. ®
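One common way to approach that sweet spot is to alarm only when a threshold is breached in several of the most recent evaluation periods, rather than on a single spike. The Python sketch below illustrates the idea in isolation. The function name `in_alarm` and the sample figures are hypothetical, and it does not call CloudWatch or reproduce its exact evaluation rules.

```python
from typing import Sequence


def in_alarm(datapoints: Sequence[float], threshold: float,
             m: int, n: int) -> bool:
    """Return True when at least m of the last n datapoints exceed threshold."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    window = datapoints[-n:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= m


# A single CPU spike in five periods does not page anyone...
spike_only = in_alarm([35.0, 92.0, 40.0, 38.0, 41.0],
                      threshold=80.0, m=3, n=5)

# ...but sustained high CPU in most of the last five periods does.
sustained = in_alarm([85.0, 91.0, 88.0, 60.0, 95.0],
                     threshold=80.0, m=3, n=5)
```

Raising `m` relative to `n` trades slower detection for fewer false alarms, which is the balance the paragraph above describes.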