Google may have taken this whole 'serverless' thing too far: Outage caused by bandwidth-killing config blunder
Engineers struggled to restore own packet-starved systems during four-hour SNAFU
Google says its four-hour wobble across America and some other parts of the world on Sunday was caused by a bungled reconfiguration of its servers.
The multi-hour outage affected Google services including Gmail, YouTube, Drive, and Cloud virtual-machine hosting, and knocked out apps like Uber and Snapchat that rely on the web goliath's systems. During the kerfuffle, affected netizens noticed their connections to the G mothership were slow, unreliable, or in some cases down completely.
According to veep of engineering Benjamin Treynor Sloss on Monday, the problems began when Googlers tried to push new configuration settings to a handful of servers in one of Google's data center regions. Somehow, that update ended up also being applied to many other servers in several other regions, causing them to give up more than half their network bandwidth.
That in turn starved systems of packets, ultimately leading to services appearing to fall over or slow down for users.
"In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region," said Sloss. "The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not."
The Google 24x7 VP went on to explain how, despite clocking the traffic crush almost immediately, the engineers were unable to quickly fix up the broken configurations due to the network overload. In the meantime, they deprioritized less latency-sensitive traffic to let interactive and latency-sensitive packets through as a priority.
"The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam," he continued.
"Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.
"The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts."
As a result, what should have been a quick fix ended up taking hours to fully resolve.
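For the curious, the kind of triage Sloss describes can be sketched in a few lines of Python. This is a toy model under our own assumptions, not Google's networking code: the flow names, bandwidth figures, and the `triage` function are all hypothetical, and the only idea borrowed from the statement is that small, latency-sensitive flows are let through first when capacity shrinks.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str                # hypothetical label, for illustration only
    demand: float            # bandwidth the flow wants, in arbitrary units
    latency_sensitive: bool


def triage(flows, capacity):
    """Admit latency-sensitive flows first, then fit in whatever else still fits."""
    admitted, dropped = [], []
    remaining = capacity
    # Latency-sensitive flows sort first; within each group, smaller flows go first.
    for flow in sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand)):
        if flow.demand <= remaining:
            admitted.append(flow.name)
            remaining -= flow.demand
        else:
            dropped.append(flow.name)
    return admitted, dropped


# Made-up traffic mix: two bulky transfers and two small interactive flows.
flows = [
    Flow("bulk-backup", 50.0, False),
    Flow("video-upload", 30.0, False),
    Flow("search-query", 5.0, True),
    Flow("mail-sync", 10.0, True),
]

full = triage(flows, capacity=100.0)      # normal capacity: everything fits
degraded = triage(flows, capacity=40.0)   # more than half the capacity gone

print("full capacity:", full)
print("degraded:", degraded)
```

With the made-up numbers above, everything fits at full capacity, while at 40 units only the two small interactive flows get through and the bulk transfers are dropped.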
During this time, Sloss estimated that Google Cloud storage systems suffered a 30 per cent drop in traffic – perhaps due to the deprioritization of less latency-sensitive traffic, we reckon – while YouTube views dropped 2.5 per cent for an hour, and some search queries were slowed. Around one per cent of Gmail users (about 15 million accounts, give or take) experienced connectivity difficulties with their webmail during the outage.
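As a back-of-the-envelope check on that Gmail figure, here is the arithmetic in Python. Both inputs are the rough numbers quoted above, and the variable names are ours; the result is simply what those two numbers imply, not an official user count.

```python
affected_accounts = 15_000_000   # "about 15 million accounts, give or take"
affected_percent = 1             # "around one per cent" of Gmail users

# Integer arithmetic keeps the result exact.
implied_total = affected_accounts * 100 // affected_percent

print(f"Implied Gmail accounts: {implied_total:,}")
```

In other words, one per cent being roughly 15 million accounts implies a total in the region of 1.5 billion.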
Now, Sloss said, his team is performing a full postmortem study on the incident in hopes of crafting new procedures and policies to prevent a similar outage from occurring during future configuration rollouts.
"We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event," he said. ®
While the Chocolate Factory's status page reports that, right now, everything is fine and dandy, if you scroll to the very bottom there's a warning box with the following message, which we spotted as we were preparing to publish this article:
Intermittent IO errors with Google Compute Engine Persistent Disk in us-east4-b and us-east4-c. Affected customers may observe IO errors on Persistent Disks attached to instances in us-east4-b and us-east4-c. Mitigation work is currently underway by our Engineering Team and the rate of errors is decreasing. We will provide another status update by Tuesday, 2019-06-04 12:45 US/Pacific with current details.