Content delivery network CloudFlare has apologized in part for the massive outages its customers experienced yesterday, but placed the blame squarely on the shoulders of Tier 1 provider Telia.
In a blog post, the company's Network Engineering Manager Jérôme Fleury put up a post-mortem of the incident – and of an incident a few days earlier – which showed massive packet losses in the US and Europe after a Telia engineer seemingly misconfigured a router.
CloudFlare was particularly badly affected since it runs roughly half its traffic through the Swedish company. "Telia used to be our most reliable Tier 1 provider," Fleury said in response to a reader comment. "We still want to trust them. It happens to many companies, including us, to have bad runs."
That was a more measured response than the one from CloudFlare CEO Matthew Prince, who announced yesterday that the company would be "deprioritizing them until we are confident they've fixed their systemic issues."
Packet-loss graphs posted to the blog show a 12-minute collapse in traffic a few days earlier, on June 17. CloudFlare said that while it was investigating the problem, it suddenly cleared up.
But then on June 20, at 12:10 UTC, Telia fell over again – this time with much more severe consequences. The packet-loss graph for Telia's AS1299 is not pretty.
"Typically, transit providers are very reliable and transport all of our packets from one point of the globe to the other without loss," Fleury writes, "that's what we pay them for." He then adds in an animated map of traffic that shows quite how bad – and fast – the impact was.
"Because transit providers are usually reliable, they tend to fix their problems rather quickly," he pleads. "In this case, that did not happen and we had to take our ports down with Telia at 12:30 UTC."
The upshot was a massive spike in 522 errors for an hour and a lot of angry CloudFlare customers.
It is at this point that the company apologizes. Not for the outage but for tardy communications.
"Our customers understandably expect prompt, accurate information and want the impact to stop as soon as possible," the post reads.
"In today's incident, we identified weaknesses in our communication: the scope of the incident was incorrectly identified in Europe only, and our response time was not adequate." Which is a nice way of saying there were lots of people calling and asking what the hell was going on.
"We want to reassure you that we are taking all the steps to improve our communication, including implementation of automated detection and mitigation systems that can react much more quickly than any human operator."
So what is CloudFlare doing to make sure people aren't stranded again? Well, aside from deprioritizing Telia until it sorts out what problems it has, the company is planning to pull an automated packet-loss warning system it has for its remote network locations into its main network.
"We have been working on building a mechanism (which augments BGP) to proactively detect packet loss and move traffic away from providers experiencing packet loss."
As it acknowledges, that is a risky move given how easily such automation can be triggered and how immediate its impact on traffic is.
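To make that risk concrete, here is a minimal Python sketch of loss-based failover with hysteresis. It is not CloudFlare's mechanism, which the post does not describe in detail, and it does not touch BGP. The class `LossMonitor`, the function `preferred_providers`, the window size and both thresholds are hypothetical, chosen only for illustration. The idea is that a provider is pulled when its recent probe loss crosses a high threshold, and restored only once loss falls below a much lower one, so a borderline link does not flap in and out of service.

```python
from collections import deque


class LossMonitor:
    """Tracks recent probe results for one transit provider (hypothetical sketch)."""

    def __init__(self, window=20, trip_at=0.10, clear_at=0.02):
        # Each entry is True if the probe was answered, False if it was lost.
        self.results = deque(maxlen=window)
        self.trip_at = trip_at      # loss rate at which the provider is pulled
        self.clear_at = clear_at    # loss rate at which it is restored
        self.healthy = True

    def loss_rate(self):
        """Fraction of probes lost within the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def record(self, answered):
        """Record one probe result and return the updated health state."""
        self.results.append(answered)
        # Wait for a full window so a single early loss cannot trip the monitor.
        if len(self.results) < self.results.maxlen:
            return self.healthy
        rate = self.loss_rate()
        if self.healthy and rate >= self.trip_at:
            self.healthy = False
        elif not self.healthy and rate <= self.clear_at:
            self.healthy = True
        return self.healthy


def preferred_providers(monitors):
    """Return the names of providers currently considered usable."""
    return [name for name, monitor in monitors.items() if monitor.healthy]
```

The gap between `trip_at` and `clear_at` is the design choice that matters here: with a single threshold, a provider hovering around it would be withdrawn and restored over and over, and each switch would itself disturb traffic.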
It will also "invest in greater resiliency" through more interconnection and increased failover capacity. In other words, big buffers and more contracts with Tier 1 providers.
As for Telia, it is not exactly used to receiving a lot of attention. It has promised network operators a report in due time but has so far said nothing about its failure except for a single tweet. ®
"Sorry for recent outages! Extra checks and balances in place. Now, full attention on working directly with customers to sort out." – Telia Carrier (@TeliaCarrier), June 21, 2016