
This article is more than 1 year old

CloudFlare apologizes for Telia screwing you over

Unhappy about massive outage

Content delivery network CloudFlare has apologized in part for the massive outages its customers experienced yesterday, but placed the blame squarely on the shoulders of Tier 1 provider Telia.

In a blog post, the company's Network Engineering Manager Jérôme Fleury published a post-mortem of the incident – and of an incident a few days earlier – which showed massive packet losses in the US and Europe after a Telia engineer seemingly misconfigured a router.

CloudFlare was particularly badly affected since it runs roughly half its traffic through the Swedish company. "Telia used to be our most reliable Tier 1 provider," Fleury said in response to a reader comment. "We still want to trust them. It happens to many companies, including us, to have bad runs."

That was a more measured response than the one from CloudFlare CEO Matthew Prince, who announced yesterday that the company would be "deprioritizing them until we are confident they've fixed their systemic issues."

Packet-loss graphs posted to the blog show a 12-minute collapse in traffic a few days earlier, on June 17. CloudFlare said that while it was investigating the problem, it suddenly cleared up.

But then on June 20, at 12:10 UTC, Telia fell over again – this time with much more severe consequences. The packet-loss graph for Telia's AS1299 is not pretty.

"Typically, transit providers are very reliable and transport all of our packets from one point of the globe to the other without loss," Fleury writes, "that's what we pay them for." He then adds an animated map of traffic that shows just how bad – and fast – the impact was.

"Because transit providers are usually reliable, they tend to fix their problems rather quickly," he pleads. "In this case, that did not happen and we had to take our ports down with Telia at 12:30 UTC."

The upshot was a massive spike in 522 errors for an hour and a lot of angry CloudFlare customers.

Comms

It is at this point that the company apologizes. Not for the outage but for tardy communications.

"Our customers understandably expect prompt, accurate information and want the impact to stop as soon as possible," the post reads.

"In today's incident, we identified weaknesses in our communication: the scope of the incident was incorrectly identified in Europe only, and our response time was not adequate." Which is a nice way of saying there were lots of people calling and asking what the hell was going on.

"We want to reassure you that we are taking all the steps to improve our communication, including implementation of automated detection and mitigation systems that can react much more quickly than any human operator."

So what is CloudFlare doing to make sure people aren't stranded again? Well, aside from deprioritizing Telia until it sorts out what problems it has, the company is planning to pull an automated packet-loss warning system it has for its remote network locations into its main network.

"We have been working on building a mechanism (which augments BGP) to proactively detect packet loss and move traffic away from providers experiencing packet loss."

As it acknowledges, that is a risky move given how easily such automated systems can be triggered and how immediate their impact is.
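CloudFlare's post gives no implementation detail, so what follows is only a rough sketch of the general idea, not the company's mechanism. It assumes each provider is probed regularly and that a separate, higher loss threshold is used to pull traffic than to restore it, which is one common way to stop a single blip from flapping routes back and forth. The names `LossMonitor` and `pick_providers`, the window size, and the thresholds are all made up for illustration.

```python
from collections import deque


class LossMonitor:
    """Tracks recent probe results for one transit provider (hypothetical helper)."""

    def __init__(self, window=20, trip=0.05, clear=0.01):
        self.results = deque(maxlen=window)  # True means the probe was lost
        self.trip = trip      # loss rate at which traffic is moved away
        self.clear = clear    # loss rate required before traffic returns
        self.drained = False

    def record(self, lost):
        """Record one probe outcome and return whether the provider is drained."""
        self.results.append(lost)
        if len(self.results) < self.results.maxlen:
            return self.drained  # not enough samples to judge yet
        rate = sum(self.results) / len(self.results)
        if not self.drained and rate >= self.trip:
            self.drained = True
        elif self.drained and rate <= self.clear:
            self.drained = False
        return self.drained


def pick_providers(monitors):
    """Return providers still eligible for traffic; never drain all of them."""
    healthy = [name for name, m in monitors.items() if not m.drained]
    # If every provider looks bad, keep using all of them rather than none.
    return healthy or list(monitors)
```

The gap between `trip` and `clear` is the design choice that matters here: the wider it is, the slower the system is to send traffic back to a provider that has only just recovered. The fallback in `pick_providers` guards against an over-eager detector taking every link out at once.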

It will also "invest in greater resiliency" through more interconnection and increased failover capacity. In other words, big buffers and more contracts with Tier 1 providers.

As for Telia, it is not exactly used to receiving a lot of attention. It has promised network operators a report in due time but has so far said nothing about its failure except for a single tweet. ®

 
