Google gave some of its cloud customers a rotten weekend by breaking a bunch of virtual machines.
As detailed in this incident report, the company first noticed problems at nearly beer o’clock on Friday afternoon, June 15th, Pacific Time – just after midnight on Saturday for European users and early Saturday morning in Asia.
The problem was described as “Google Compute Engine VM instances allocated with duplicate internal IP addresses”.
By 17:11 the company said “We believe that customers can work around the issue by launching then stopping f1 micro instances until no more duplicate IP addresses are obtained. We are awaiting confirmation that the provided workaround works for customers.”
By 20:03 the company had a better handle on the mess, telling users that “Instances that were stopped at any time between 2018-06-14 08:42 and 2018-06-15 13:40 US/Pacific may fail to start with networking. A newly allocated VM instance has the same IP address as a VM instance which was stopped within the mentioned time period.”
That advisory suggested the matter was serious because it said the next update wouldn’t land until 03:30 on Saturday the 16th.
By now Google also had a mitigation: “instances should be recreated, that is a delete (without deleting the boot disk), and a create.”
Google delivered its promised update promptly, at 03:33, with news that it was still working on the problem and that “Google Cloud Engine VMs that have an internal IP that is not assigned to another VM within the same project, region and network should no longer see this issue occurring, however instances where another VM is using their internal IP may fail to start with networking.”
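For readers wondering whether their own fleet is caught up in this, the condition Google describes (two VMs sharing an internal IP within the same project, region and network) is simple to check against an inventory export. What follows is a rough sketch rather than anything Google has published: the `Instance` record, its field names and the `find_ip_conflicts` helper are all hypothetical, and you would need to populate the list from your own tooling.

```python
from collections import defaultdict
from typing import NamedTuple


class Instance(NamedTuple):
    # Hypothetical inventory record; these field names are illustrative
    # and do not mirror any Google API schema.
    name: str
    project: str
    region: str
    network: str
    internal_ip: str


def find_ip_conflicts(instances):
    """Return internal IPs held by more than one VM in the same scope.

    The scope is (project, region, network), matching the wording of
    Google's status update. The result maps each conflicting
    (project, region, network, internal_ip) key to a sorted list of
    the instance names that share it.
    """
    groups = defaultdict(list)
    for inst in instances:
        key = (inst.project, inst.region, inst.network, inst.internal_ip)
        groups[key].append(inst.name)
    # Keep only the keys that more than one instance maps to.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Made-up sample data: web-1 and web-2 clash, db-1 is in another region.
    inventory = [
        Instance("web-1", "demo-project", "us-central1", "default", "10.128.0.2"),
        Instance("web-2", "demo-project", "us-central1", "default", "10.128.0.2"),
        Instance("db-1", "demo-project", "europe-west1", "default", "10.128.0.2"),
    ]
    for key, names in find_ip_conflicts(inventory).items():
        print(key, names)
```

Any VM that turns up in the result would be a candidate for the delete-and-recreate treatment Google recommends, with the boot disk kept.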
And then after lunch on the 16th, Google declared the problem mostly fixed.
“The issue with Google Compute Engine VM instances being allocated duplicate internal IP addresses has been resolved for all affected projects as of Saturday,” the company stated.
But the pain isn’t over for users, because the mitigation advice for affected VMs remains the same: “to delete (without deleting the boot disk), and recreate the affected VM instances.”
We’ve asked Google to detail the incident, as we’d like to know how many VMs were impacted and what caused the matter. And also how it managed to mess up IP address management, which has not been a problem of note for many years!
If the company brings us more information, we’ll update this story. ®