Cloudflare coughs, half the internet catches a cold
Outage leaves users staring at error pages while recovery crawls along
Updated Internet services provider Cloudflare is suffering a major outage that has knocked chunks of the web offline – including The Register.
The company acknowledged problems at 1148 UTC on November 18, stating: "Some services may be intermittently impacted." After a long half-hour, it reckoned systems were returning to normal, but "customers may continue to observe higher-than-normal error rates" as engineers continue to investigate and fix the underlying issue.
Cloudflare provides security and infrastructure for a substantial chunk of websites. As such, X (formerly Twitter) and even El Reg were either knocked offline or malfunctioned as the outage continued. Even that stalwart of system uptime, Downdetector, reported "Please unblock challenges.cloudflare.com to proceed" at one point.
Cloudflare has yet to confirm the cause of the outage – we will issue an update when it does – but it follows hot on the heels of problems at AWS and Azure, and is a reminder for enterprises that a service is only as good as the weakest link in the chain... and that weakest link might not reveal itself until it breaks.
The problem appears to be global, and the company was forced to do the equivalent of turning its WARP access in London off and on again as engineers worked to deal with the glitch. WARP is similar to a VPN, except it routes traffic through Cloudflare's network. If the network is having a bad day, turning off WARP seems a sensible option.
At 1309 UTC, Cloudflare announced it had identified the root cause and a fix was being implemented. It did not, however, give an estimate for when sites would stop flipping between available and unavailable, seemingly at random.
A Cloudflare spokesperson told The Register: "We saw a spike in unusual traffic to one of Cloudflare's services beginning at 1120 UTC. That caused some traffic passing through Cloudflare's network to experience errors.
"We do not yet know the cause of the spike in unusual traffic. We are all hands on deck to make sure all traffic is served without errors. After that, we will turn our attention to investigating the cause of the unusual spike in traffic." ®
Updated to add at 1555 UTC, November 18
A Cloudflare spokesperson told The Register that the incident began at 1120 UTC and was fully resolved at 1430 UTC. They said: "The root cause of the outage was a configuration file that is automatically generated to manage threat traffic. The file grew beyond an expected size of entries and triggered a crash in the software system that handles traffic for a number of Cloudflare's services.
"To be clear, there is no evidence that this was the result of an attack or caused by malicious activity.
"We expect that some Cloudflare services will be briefly degraded as traffic naturally spikes post incident but we expect all services to return to normal in the next few hours.
"Given the importance of Cloudflare's services, any outage is unacceptable. We apologize to our customers and the internet in general for letting you down today. We will learn from today's incident and improve."
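For readers wondering how an oversized, automatically generated file can be kept from taking a service down, here is a minimal sketch of one common defence: check the size before adopting the file, and keep the last known-good configuration if the check fails. This is not Cloudflare's code, and Cloudflare has not said this is how it will fix the problem. The class name, the rule names, and the limit of five entries are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical limit chosen for illustration; not a figure from Cloudflare.
MAX_ENTRIES = 5


@dataclass
class ConfigLoader:
    """Holds the last configuration that passed validation (hypothetical)."""

    max_entries: int = MAX_ENTRIES
    active: Optional[List[str]] = None

    def apply(self, entries: List[str]) -> bool:
        """Adopt `entries` if within the limit; otherwise keep the old config.

        Returns True if the new config was adopted, False if it was rejected.
        """
        if len(entries) > self.max_entries:
            # Reject rather than crash: traffic keeps flowing on the old rules.
            return False
        self.active = list(entries)
        return True


loader = ConfigLoader()
loader.apply(["rule-a", "rule-b"])            # within the limit, adopted
oversized = [f"rule-{i}" for i in range(10)]  # a generated file that grew too large
accepted = loader.apply(oversized)            # rejected; previous config stays active
```

The design choice is to fail closed on the new file rather than on the whole service: a rejected update leaves yesterday's rules in place, which is usually a smaller problem than no service at all.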