Google has admitted that some customers running Persistent Disks in its europe-west1-b zone have been forced to recover data from snapshots, with a combination of lightning and old storage kit to blame.
The outage hit last Friday and left some users unable to connect Persistent Disks, which are disks that exist independently of a virtual machine, for several hours. Problems persisted across the weekend.
Google's now published its analysis of the fault and says that on August 13th, “four successive lightning strikes on the electrical systems of a European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone.” [We've since been told by Google folk that post isn't correct, and that the local grid, not Google's bit barn, was hit by lightning.]
“Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain,” Google 'fesses up.
“In almost all cases the data was successfully committed to stable storage, although manual intervention was required in order to restore the systems to their normal serving state. However, in a very few cases, recent writes were unrecoverable, leading to permanent data loss on the Persistent Disk.”
About five per cent of disks in the data centre recorded “at least one I/O read or write failure” during the incident. Read failures persisted into Monday for about 0.05 per cent of users, and Google now says just 0.000001 per cent of disk space has proved impossible to recover.
Which isn't a bad result, even if plenty of customers were inconvenienced, especially as either snapshots or other backups would have allowed restoration.
“This outage is wholly Google's responsibility,” the document continues, but then goes on “... to highlight an important reminder for our customers: GCE instances and Persistent Disks within a zone exist in a single Google datacenter and are therefore unavoidably vulnerable to datacenter-scale disasters.”
In other words, should lightning strike twice, you should remember that a datacentre in the hand can't beat two in the bush.
Google's confessional also says the company “has an ongoing program of upgrading to storage hardware that is less susceptible to the power failure mode that triggered this incident. Most Persistent Disk storage is already running on this hardware.” The company adds that it's conducted a review of the incident and “Several opportunities have been identified to increase physical and procedural resilience.” ®