Google promises proper patch preparation after new cloud outage
Network egress issues hit Google Compute Engine for the second time in three weeks
Google Compute Engine (GCE) users experienced a brownout over the weekend, after an incident that bears plenty of likeness to a worse outage that took down the service in February.
The February FAIL came about when “The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information.”
This new outage, which started at 9:55 AM Pacific Standard Time on Saturday March 7th, was caused by “packet loss on egress network traffic” and meant users experienced symptoms ranging from “... no visible impact, to unusually slow responses, to timeouts attempting to contact the VM.”
Things were back to normal 43 minutes later and Google says virtual machines stayed up throughout. The cause of the mess this time was a botched patch.
Google's offered the following explanation of the incident:
The root cause of the packet loss was a configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM. The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner.
This outage was less severe than the February incident, and Google says its engineers are “investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production.” The company's response is sterner this time around, and reads as follows.
Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident. Additionally, Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior.
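For readers wondering what applying a change to “a small fraction of VMs at a time” might look like, here is a minimal sketch in Python. It is not Google's tooling, which the company has not published. The function name `staged_rollout`, the callbacks `apply_change`, `is_healthy` and `rollback`, and the 5 per cent default batch size are all assumptions made for illustration.

```python
def staged_rollout(vm_ids, apply_change, is_healthy, rollback, fraction=0.05):
    """Apply a change to a small fraction of VMs at a time.

    Hypothetical sketch: the callbacks and the default fraction are
    illustrative assumptions, not any vendor's real API.
    Returns (succeeded, touched), where touched lists every VM that
    received the change before the rollout finished or was halted.
    """
    # Each batch covers roughly `fraction` of the fleet, and at least one VM.
    batch_size = max(1, round(len(vm_ids) * fraction))
    touched = []

    for start in range(0, len(vm_ids), batch_size):
        batch = vm_ids[start:start + batch_size]

        # Push the change to this batch only.
        for vm in batch:
            apply_change(vm)
        touched.extend(batch)

        # Check the batch before widening the exposure any further.
        if not all(is_healthy(vm) for vm in batch):
            # Undo the change everywhere it landed, newest first.
            for vm in reversed(touched):
                rollback(vm)
            return False, touched

    return True, touched
```

The point of the design is that a bad change is caught after it has reached one or two batches rather than the whole fleet. In practice the health check would need to look at the same signals that gave the game away in this incident, such as egress packet loss, rather than simply whether the VM is up.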
This new outage was brief and minor. But Google's clearly been caught on the hop with its patching procedures.
Google and its ilk are looked to as the experts in computing at hyperscale. Outages like this suggest we've all got a lot to learn. ®