Google has posted more details about its 50-minute outage yesterday, while promising a "full incident report" to follow. It was authentication that broke, a failure reminiscent of Microsoft's September cloud outage caused by an Azure Active Directory failure.
In an update to its Cloud Status dashboard, Google said that: "The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn't verify that user requests were authenticated and served errors to our users."
Not mentioned is the fact that the same dashboard showed all green during at least the first part of the outage. Perhaps its checks did not attempt to authenticate against the services, which were otherwise running OK. As is so often the case, Twitter proved more reliable for status information.
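To illustrate how a dashboard could stay green while sign-ins fail, here is a minimal sketch of a status probe that separates "the service answers" from "a signed-in request succeeds". Nothing here reflects how Google's dashboard actually works; `ProbeResult`, `probe`, `dashboard_status` and the colour names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    reachable: bool      # the service answered an anonymous liveness ping
    authenticated: bool  # a request that requires sign-in also succeeded


def probe(ping: Callable[[], bool], signed_in_request: Callable[[], bool]) -> ProbeResult:
    """Run both checks; skip the signed-in check if the service is unreachable."""
    reachable = ping()
    authenticated = reachable and signed_in_request()
    return ProbeResult(reachable, authenticated)


def dashboard_status(result: ProbeResult) -> str:
    """Map a probe result to a (hypothetical) dashboard colour."""
    if not result.reachable:
        return "red"
    if not result.authenticated:
        return "amber"  # service is up, but users cannot sign in
    return "green"
```

A probe that only called `ping` would report green throughout an authentication-only failure, which is one possible explanation for what users saw.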
Services affected included Cloud Console, Cloud Storage, BigQuery, Google Kubernetes Engine, Gmail, Calendar, Meet, Docs and Drive.
"Many of our internal users and tools experienced similar errors, which added delays to our outage external communication," the search and advertising giant confessed.
No better than you auth to be
Authentication is a tricky problem for the big cloud platforms. It is critically important and cannot be fudged; security trumps resilience. Rival Microsoft has had persistent issues keeping Azure Active Directory up and running, most severely on 28 September, when a bad update caused a three-hour partial but global outage affecting Office 365 and Azure.
Other lesser failures continue: just yesterday, "a subset of customers using Azure Active Directory may have experienced high latency and/or sign in failures while authenticating," said the latest update.
Google has a better track record in this respect than Microsoft and its outage was shorter, but the latest incident shows that it is not immune.
The company has also said that its automated quota management system was the cause of the error. Should such a critical service be subject to quotas? "Yes, it absolutely makes sense. You never know when a typo in a config file or bug in your job-management code will go bonkers and try to take over all resources," opined a commenter on Hacker News who said they used to work in Site Reliability Engineering at Google.
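The commenter's argument can be sketched in a few lines: a quota caps what any one job can take, so a typo in a request cannot swallow the whole pool. This is an illustrative sketch only, not Google's quota system; the function `grant` and its parameters are hypothetical.

```python
def grant(requested: int, quota: int, free: int) -> int:
    """Return the units granted to a job.

    The grant never exceeds the job's quota or the capacity still free,
    so a runaway or mistyped request is capped rather than honoured.
    """
    if requested < 0 or quota < 0 or free < 0:
        raise ValueError("requested, quota and free must be non-negative")
    return min(requested, quota, free)


# A config typo asks for 1,000,000 units instead of 100.
# With a quota of 500 and 10,000 units free, only 500 are granted.
print(grant(1_000_000, quota=500, free=10_000))
```

The same cap cuts the other way: if the quota itself is set too low, as Google's description of this incident suggests, a healthy service is starved of capacity.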
Cloud services in general may be more reliable, on average, than on-premises services, but the impact when they fail is huge. It is in all our interests that efforts to further improve their resilience succeed. ®