Google Cloud outage caused by automation failure that saw admins bring a link online manually ... and fail
Sysadmins, fat thumbs, and BGP black-holes conspire
A mistaken peering advertisement from a European network took Google Cloud's europe-west1 region offline last week for around 70 minutes.
The slip-up happened when an unnamed network owner connected a new peering link to Google and, in the process, advertised reachability for far more destinations than the link could handle.
As The Chocolate Factory explains in this post, most of the lost traffic carried destination addresses in eastern Europe and the Middle East.
“The peer's network signalled that it could route traffic to many more destinations than Google engineers had anticipated, and more than the link had capacity for. Google's network responded accordingly by routing a large volume of traffic to the link. At 11:55, the link saturated and began dropping the majority of its traffic”, the post states.
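The mechanics here are standard BGP behaviour: route selection follows reachability and policy, not bandwidth, so a preferred path attracts traffic regardless of whether the link behind it can carry the load. A minimal sketch of that behaviour, with hypothetical names, prefixes, and local-preference values (none of which come from Google's report):

```python
# Hypothetical sketch of BGP best-path selection: routes are installed on
# preference alone, and nothing checks the next-hop link's capacity.

from dataclasses import dataclass

@dataclass
class Route:
    prefix: str        # destination prefix, e.g. "203.0.113.0/24"
    next_hop: str      # which link the route points at
    local_pref: int    # BGP prefers the highest local preference

class Router:
    def __init__(self):
        self.rib = {}  # prefix -> best Route

    def receive_advertisement(self, route: Route):
        # Install the route if it beats the current best path; note there
        # is no capacity check anywhere in this decision.
        best = self.rib.get(route.prefix)
        if best is None or route.local_pref > best.local_pref:
            self.rib[route.prefix] = route

    def forward(self, dest_prefix: str) -> str:
        return self.rib[dest_prefix].next_hop

router = Router()
# Existing path via a well-provisioned transit link
router.receive_advertisement(Route("203.0.113.0/24", "transit-link", local_pref=100))
# A new peer advertises the same destinations; peering links are commonly
# given higher local preference than transit, so traffic shifts at once.
router.receive_advertisement(Route("203.0.113.0/24", "new-peer-link", local_pref=200))
print(router.forward("203.0.113.0/24"))  # → new-peer-link
```

Once the shift happens, any traffic beyond the new link's capacity is simply dropped, which is the saturation Google's report describes; the safety checks the automation would have run are what normally catch the mismatch before it bites.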
That kind of error, Google's report continues, would usually be caught by automated safety checks, but “the automation was not operational due to an unrelated failure, and the link was brought online manually, so the automation's safety checks did not occur”.
“To prevent recurrence of this issue, Google network engineers are changing procedure to disallow manual link activation”, the post notes.
Route announcement errors are a continuing headache on the Internet.
In June, Telekom Malaysia mis-advertised routes to Level 3, causing the US provider to sling much of its traffic onto a network that couldn't cope and so dropped the packets.
However, there are times when a mis-routing makes engineers suspicious that it was deliberate, such as in 2010 when China Telecom hijacked US military and government traffic via BGP, and in 2013, when traffic started heading for Belarus and Iceland.
BGP's problem is that the protocol trusts route announcements, putting a premium on trustworthy network operators and robust processes. ®