Google has revealed that the wheels almost literally fell off some of its servers.
A late Friday post about the virtues of its site reliability engineering (SRE) teams told the story of a “recent incident” in which its uptime squad found “evidence of packet loss, isolated to a single rack of machines”.
On closer inspection the servers in said rack were found to be rife with CPU throttling and some border gateway protocol weirdness to boot.
After plenty of remote probing by the SRE team failed to diagnose the problem, a Googler was despatched to endure the indignities of meatspace and inspect the problem rack with their actual eyes.
And here’s what they found:
“The wheels (castors) supporting the rack had been crushed under the weight of the fully loaded rack,” wrote Google Cloud Solutions Architect Steve McGhee. “The rack then had physically tilted forward, disrupting the flow of liquid coolant and resulting in some CPUs heating up to the point of being throttled.”
The rack was duly propped back up and McGhee says Google has since performed “a systematic replacement of all racks with the same issue, while avoiding any customer impact” and also considered how to better transport and install its kit.
The post is of course self-promotion for how seriously Google takes its quest for uptime. But it is nonetheless interesting for revealing two of Google’s internal aphorisms. One states that “all incidents should be novel” – they should never occur more than once. The other posits “At Google scale, million-to-one chances happen all the time.”
The Register suggests the first is applicable anywhere. The second, thankfully, is hardly ever a problem for our readers – until they move into a hyperscale cloud.
One more thing to note: the post includes a photo of the leaning rack, a rare image of a Google bit barn's innards even if it reveals very little. ®