Google's cloud thought it was out of memory for a couple of hours last week.
The outage wasn't a biggie – it hit last Wednesday, June 7th, and meant that “7.7% of active applications on the Google App Engine service experienced severely elevated latency; requests that typically take under 500ms to serve were taking many minutes.” Google says users would have seen timeout errors or waited a while for apps to do their thing.
The cause? Google says “The incident was triggered by an increase in memory usage across all App Engine appservers in a datacenter in us-central.” Those servers are “... responsible for creating instances to service requests for App Engine applications” and if their memory usage increases to unsustainable levels they shut down some instances “so that they can be rescheduled on other appservers in order to balance out the memory requirements across the datacenter.”
Fair enough: a sick server offloading workloads sounds reasonable. But Google says that moving instances around puts some extra load on servers' CPUs, which isn't usually a problem that users notice. Yet “in isolated cases” when there's lots of other traffic hitting the same data centre, things can get tricky.
And that's what happened during the June 7th incident: Google says “increased load and memory requirement from scheduling new instances combined with rescheduling instances from appservers with high memory usage resulted in most appservers being considered 'busy' by the master scheduler.”
Requests to reschedule instances “needed to wait for an available instance to either be transferred or created before they were able to be serviced, which results in the increased latency seen at the app level.”
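For the curious, the pile-up Google describes can be sketched as a toy model. This is not Google's scheduler: the `AppServer` class, the `rebalance` function, the memory and CPU units and the fixed `start_cost` are all hypothetical, chosen only to show how evictions plus already-busy servers leave instances with nowhere to go.

```python
from dataclasses import dataclass, field


@dataclass
class AppServer:
    # All limits and loads are made-up units for illustration only.
    memory_limit: int
    cpu_limit: int          # CPU load at which the scheduler treats the server as "busy"
    cpu_load: int = 0
    instances: list = field(default_factory=list)  # memory used by each instance

    def memory_used(self):
        return sum(self.instances)

    def is_busy(self):
        return self.cpu_load >= self.cpu_limit

    def can_take(self, size):
        return not self.is_busy() and self.memory_used() + size <= self.memory_limit


def rebalance(servers, start_cost):
    """Evict instances from servers over their memory limit, then try to
    place each one on a different, non-busy server with room for it.
    Returns the sizes of the instances that found no home and must wait."""
    evicted = []
    for server in servers:
        while server.memory_used() > server.memory_limit:
            evicted.append((server, server.instances.pop()))

    waiting = []
    for origin, size in evicted:
        target = next(
            (s for s in servers if s is not origin and s.can_take(size)), None
        )
        if target is None:
            waiting.append(size)
        else:
            target.instances.append(size)
            target.cpu_load += start_cost  # assumed CPU cost of starting an instance
    return waiting


def make_datacenter(cpu_load):
    # One server over its memory limit, one with spare memory.
    return [
        AppServer(memory_limit=10, cpu_limit=10, cpu_load=cpu_load, instances=[4, 4, 4]),
        AppServer(memory_limit=10, cpu_limit=10, cpu_load=cpu_load, instances=[4]),
    ]


print(rebalance(make_datacenter(cpu_load=2), start_cost=3))   # [] (the evicted instance is rehomed)
print(rebalance(make_datacenter(cpu_load=10), start_cost=3))  # [4] (every server is busy, so it waits)
```

On a quiet day the evicted instance lands on the second server and nobody notices. Push the CPU load up to the busy threshold and the same eviction has nowhere to go, which is the waiting that users saw as latency.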
Interestingly, Google's remediation actions for the problem mention “re-evaluating the resource distribution in the us-central datacenters where App Engine instances are hosted.” Those are The Register's italics, because the passage suggests Google's bit barns have different resource levels and individual quirks rather than relying on homogeneous building blocks. ®