It's been a long time coming but AWS has at last enabled an interactive serial console for de-borking VMs
Handy in an emergency, but only for Nitro instances and requires work in advance
AWS has introduced the "interactive EC2 Serial Console", enabling troubleshooting of virtual machines when normal SSH access is not working, with one user gushing: "I have been waiting 10 years for this moment."
The purpose of serial console access is to enable troubleshooting when an SSH connection is impossible, for example because of an out-of-memory condition. "It provides a one-click, text-based access to an instance's serial port as though a monitor and keyboard were attached to it," said the AWS post. Previously, admins could retrieve the serial console log using the get-console-output command, but could not enter any commands.
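The old read-only behaviour looks something like the following AWS CLI sketch; the instance ID is a placeholder, and the command only returns buffered log output, with no way to type anything back:

```shell
# Fetch the read-only serial console log for an instance.
# The instance ID is a placeholder; substitute your own.
aws ec2 get-console-output \
    --instance-id i-0123456789abcdef0 \
    --output text
```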
Back in January 2011, a user reported a case on the AWS forum (login required) where the console output read: "Continue to wait; or Press S to skip mounting or M for manual recovery."
Unfortunately, "there is no way for me to hit 'S'," he said.
Reasons he gave for needing an interactive console included boot failures where the SSH daemon did not start, firewall or network misconfigurations that blocked all access, broken networking on the instance, and denial-of-service attacks. This person was building a base instance for a system image, which is the kind of case where fatal errors are more likely.
Admins confronted with an inaccessible EC2 (Elastic Compute Cloud) VM may have another option, which is to stop the instance, detach the storage, mount the storage on a working instance, and edit or recover the files from there. This is not always possible, though.
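That recovery flow can be sketched with the AWS CLI roughly as below, assuming an EBS root volume; all IDs, the device name, and the mount point are placeholders, and the exact device name seen inside the rescue instance varies by instance type:

```shell
# Recover files from a broken instance's EBS root volume.
# All IDs and device names below are placeholders.

# 1. Stop the broken instance (this interrupts service).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# 2. Detach its root volume.
aws ec2 detach-volume --volume-id vol-0123456789abcdef0

# 3. Attach the volume to a healthy instance as a secondary device.
aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0fedcba9876543210 \
    --device /dev/sdf

# 4. On the healthy instance (via SSH), mount it and inspect the files.
sudo mkdir -p /mnt/rescue
sudo mount /dev/xvdf1 /mnt/rescue
```

Afterwards the volume can be unmounted, detached, and reattached to the original instance as its root device.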
If the VM uses instance store storage, this cannot be detached. It also requires interruption of service. "I had a customer once that erased their SSH keys, and had a running database cluster on EC2 that they couldn't get access to anymore. That was... fun," said a user on Hacker News, looking forward to the new feature.
If a VM uses ephemeral storage that is not designed for persistent data, why troubleshoot a VM rather than simply deleting it and creating a new one? There may still be good reasons such as analysing the fault or recovering more quickly.
"I used to work on GCE [Google Compute Engine]," said another user, making the point that if a VM has a faulty image where an out-of-memory condition is killing the SSH daemon: "If you replace your instance with another one, you just get another OOM kill."
Several restrictions apply to the interactive serial console, the most severe being that instances must use the Nitro system, a combination of network, hypervisor, and security hardware which AWS has adopted for many but not all of its EC2 instance types. Second, AWS users do not have permission to use the interactive serial console by default. Third, the user you log in as on the serial console must have a password set. This is often not the case by default with AWS instances, where key pairs are typically used instead. Setting a password will not be possible when troubleshooting an otherwise inaccessible VM, so it must be done in advance.
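The advance preparation might look roughly like this sketch; the account-level toggle is the documented enable-serial-console-access command, while the user name is an assumption (ec2-user is the Amazon Linux default, but yours may differ), and an IAM policy granting serial console access to the relevant users is also needed:

```shell
# One-off preparation, done while the instance is still healthy.

# 1. Enable serial console access at the account level (Nitro only).
aws ec2 enable-serial-console-access

# 2. On the instance itself, set a password for the login user,
#    since key-pair-only accounts cannot log in on the serial console.
#    (ec2-user is the Amazon Linux default; substitute your own user.)
sudo passwd ec2-user
```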
The interactive serial console is also available for Windows instances, where it enables access to the Special Administration Console (SAC), part of the Emergency Management Services (EMS) tools. These have to be enabled on the Windows instance in advance. If enabled, admins can get access to a range of troubleshooting commands and PowerShell.
Most AWS users will never need this feature. SSH access does not often fail, and the range of use cases is relatively narrow. But the warm welcome from those who do need it makes it surprising that it has taken so long to implement, and a shame that it is restricted to Nitro instance types. ®