Google has revealed the cause of its very unwelcome Gmail outage, and on The Register's reading of the situation it boils down to forgetting to take an obsolete version of software out of production.
Google’s shorter explanation for the mess is: “Credential issuance and account metadata lookups for all Google user accounts failed. As a result, we could not verify that user requests were authenticated and served 5xx errors on virtually all authenticated traffic.”
The longer version, detailed in Google’s incident report, kicks off by revealing “The Google User ID Service maintains a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. It stores account data in a distributed database, which uses Paxos protocols to coordinate updates. For security reasons, this service will reject requests when it detects outdated data.”
The G-cloud uses “an evolving suite of automation tools to manage the quota of various resources allocated for services” and the User ID service was moved to a new quota system in October.
The changeover wasn’t perfect. Google admitted: “[parts of the] previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0 (zero). An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident.”
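The mechanism Google describes can be sketched in a few lines. This is purely illustrative: the function name, quota figures, and grace-period length are all assumptions, not anything from the incident report. The point is that a grace period only defers enforcement of a bogus zero-usage reading rather than catching it.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the failure mode: a leftover reporter from the old
# quota system feeds the automation a usage figure of zero, and a grace
# period merely delays the automated cut rather than preventing it.

GRACE_PERIOD = timedelta(days=30)  # assumed length; Google did not specify


def allowed_quota(reported_usage: int, current_quota: int,
                  migration_time: datetime, now: datetime) -> int:
    """Return the quota an automated system would grant.

    While the grace period holds, enforcement is deferred and nothing
    visibly breaks. Once it expires, the automation trusts the reported
    figure -- so a stale zero reading means a quota of zero.
    """
    if now - migration_time < GRACE_PERIOD:
        return current_quota  # enforcement deferred: no visible impact
    return max(reported_usage, 0)  # after expiry, zero usage -> zero quota


migrated = datetime(2020, 10, 1)
# During the grace period the User ID service keeps its quota:
print(allowed_quota(0, 10_000, migrated, datetime(2020, 10, 15)))  # 10000
# Once the period expires, the bogus zero reading slashes it to nothing:
print(allowed_quota(0, 10_000, migrated, datetime(2020, 12, 14)))  # 0
```

The design flaw, in this framing, is that the enforcement path never asks whether the reported figure is plausible before acting on it.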
While Google checks for quota messes, the tests in place at the time of the incident “did not cover the scenario of zero reported load for a single service”.
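The missing safeguard Google alludes to might look something like the sketch below. The function and thresholds are invented for illustration; the idea is simply that a service which has never been idle suddenly reporting exactly zero load is far more likely a broken reporter than a real drop, and automation should refuse to cut quota on that basis.

```python
# Hypothetical sketch of the check Google says its tests did not cover:
# before an automated quota reduction, treat a reported load of exactly
# zero for a single service as implausible telemetry and refuse to act.

def safe_to_reduce_quota(service: str, reported_usage: int,
                         historical_min: int) -> bool:
    """Reject quota cuts driven by implausible usage readings."""
    if reported_usage == 0 and historical_min > 0:
        # A service whose observed load has never been zero suddenly
        # reporting zero is almost certainly a reporting bug.
        return False
    return True


print(safe_to_reduce_quota("user-id-service", 0, historical_min=5_000))      # False
print(safe_to_reduce_quota("user-id-service", 4_200, historical_min=5_000))  # True
```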
“As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.”
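That last step follows from the "reject outdated data" behaviour described earlier: once the Paxos leader could no longer write, replica data aged past whatever freshness bound the service enforces, and lookups began failing. A rough illustration, with an invented staleness bound (the report gives no figure):

```python
from datetime import datetime, timedelta

# Hypothetical illustration of rejecting outdated account data: with the
# leader unable to write, records only get older, so every authentication
# lookup eventually trips the freshness check and surfaces as a 5xx.

MAX_STALENESS = timedelta(minutes=1)  # assumed bound; not from the report


def lookup_account(record_written_at: datetime, now: datetime) -> str:
    if now - record_written_at > MAX_STALENESS:
        # For security, stale account metadata is treated as unusable, so
        # the caller sees an error rather than a possibly-revoked credential.
        raise RuntimeError("503: account data outdated")
    return "ok"
```

Under this model the outage is self-sustaining until writes resume: no individual component is "down", but the security check correctly refuses to vouch for data it cannot trust.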
So down it went at 03:46 PT on the 14th of December. Google engineers were paged two minutes later, identified the root cause and a fix by 04:08, then disabled the quota system in one data centre at 04:22. Five minutes later they liked what they saw from that change, so made the same change across the G-cloud.
Error rates returned to cloud-as-usual level at 04:33, but Google Calendar kept sending errors “up to 05:21 due to a traffic spike following initial recovery.” And some Gmail users “experienced errors for up to an hour after recovery due to caching of errors from identity services.”
While there’s plenty of egg on Google’s face over this one, you’d probably rather be in the ad giant’s position of being able to sort this out in 45 minutes than Facebook’s position of just having flubbed a three-month deadline.
That deadline was set by the European Union with a September 10th adjustment to the ePrivacy Directive, which expanded it to cover messaging services. Such services had until today, December 21st, to change some aspects of their operations.
“In order to comply with the law, we needed to adjust the way our services work, such as further segregating messaging data from other parts of our infrastructure,” Facebook said today, in a post titled “Changes to Facebook Messaging Services in Europe”.
Those changes have reduced the functionality of Messenger and Instagram.
“We prioritized core features, like text messaging and video calling, and have made sure the majority of our other features are available,” the post says. “However, some advanced features like polls that require the use of message content to work may be disrupted as we make changes to align with the new privacy rules. We’re working to bring back features that we can as quickly as possible.”
Which sounds an awful lot like Facebook’s code and infrastructure are complicated enough that the time since September 10th wasn’t enough to do the whole job. The Register hopes it’s sorted before Friday, because no developer should be forced to miss Christmas. Not even Facebook developers. ®