Vanishing power feeds, UPS batteries, failover fails... Cloudflare explains that two-day outage
A little peek behind the control panel, analytics curtain
Cloudflare has explained how it believes it suffered that earlier multi-day control plane and analytics outage.
The executive summary is: a datacenter used by Cloudflare apparently switched with little or no warning from operating on two utility power sources, to one utility source and emergency generators, to just UPS batteries, to nothing, all within the space of hours and minutes. And then Cloudflare found out the hard way that its failover plans from that datacenter to other facilities didn't work quite as expected.
In a postmortem report in the aftermath of the downtime, CEO Matthew Prince walked through the outage, which ran from 1143 UTC on Thursday, November 2, to 0425 UTC on Saturday, November 4, from his CDN outfit's point of view.
During the IT breakdown, Cloudflare's analytics services, including logging, were disrupted or straight-up unavailable, as was the control plane; that's the customer-facing interface for all of its services. We're told the control plane was mostly restored by 1757 UTC on Thursday using a disaster recovery facility. Cloudflare's main network and security duties continued as normal throughout the outage, even if customers couldn't make changes to their services at times, Prince said.
In reality, the batteries started to fail after only four minutes
At the heart of the matter is a US datacenter we're told is run by Flexential. Cloudflare calls that facility PDX-DC04 or PDX-04, and it's one of three separate centers in Hillsboro, Oregon, in which Cloudflare primarily houses the servers that provide its control plane and analytics services.
According to Cloudflare, PDX-04 has two 12.47kV utility power feeds from Portland General Electric (PGE), a collection of emergency generators, and a bank of UPS batteries. If both utility feeds cut out, the generators can take on the full datacenter load, and the batteries are good for 10 minutes.
At 0850 UTC on Thursday, one of those utility lines went dead. As Prince put it, PGE had "an unplanned maintenance event affecting one of their independent power feeds into the building." PDX-04 at this point still had its second utility feed, but decided to spin up its generators anyway.
We're told by Prince that "counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power," and so the cloud outfit had no heads-up that things were potentially about to go south and that contingencies should be readied.
"We haven't gotten a clear answer why they ran utility power and generator power," he added. We've reached out to Flexential, too, and haven't heard back.
It could be that Flexential didn't trust the second line, fearing PGE might have another "unplanned maintenance event," and so the operators brought the generators online just in case. Cloudflare, claiming it still hasn't received all the answers it wanted from Flexential, speculated the datacenter firm had a deal with PGE in which the biz would use its on-site generators to feed additional power into the grid when needed, with the utility company assisting in the maintenance and operation of those generators. That may have led to the generators being started early.
Whatever the reason, a little less than three hours later at 1140 UTC (0340 local time), a PGE step-down transformer at the datacenter – thought to be connected to the second 12.47kV utility line – experienced a ground fault. That caused the facility to cut off the second line and the generators as a safety precaution.
At that point, according to Cloudflare, PDX-04 was running on no outside utility lines and no generators.
That meant Flexential had to turn to its UPS batteries, which Prince said were supposed to provide 10 minutes of power, hopefully enough time to bring the generators back up and maintain an uninterrupted supply. "In reality, the batteries started to fail after only four minutes based on what we observed from our own equipment failing," he said.
By that, he means at 1144 UTC – four minutes after the transformer ground fault – Cloudflare's network routers in PDX-04, which connected the cloud giant's servers to the rest of the world, lost power and dropped offline, like everything else in the building.
At this point, you'd hope the servers in the other two datacenters in the Oregon trio would automatically pick up the slack, and keep critical services running in the absence of PDX-04, and that was what Cloudflare said it had designed its infrastructure to do. Some non-critical things would temporarily halt but critical stuff would continue to work.
Here's where things get a bit fuzzy because according to Prince, "generally that worked as planned." However, we're also told "unfortunately" a subset of those services had dependencies that were "exclusively running in PDX-04." It sounds as though there were two analytics services in particular that only ran in PDX-04, and other services depended on them. With no PDX-04, those interconnected services weren't coming back easily.
"We had never tested fully taking the entire PDX-04 facility offline. As a result, we had missed the importance of some of these dependencies," wrote Prince, and we appreciate the honesty.
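The kind of hidden dependency Prince describes – a service that looks redundant on paper but leans on something pinned to one building – can in principle be surfaced by auditing a deployment inventory before an outage does it for you. Here's a minimal, hypothetical sketch in Python; the data model and names are ours for illustration, not Cloudflare's actual tooling:

```python
def at_risk(services, facility):
    """Return every service that runs only in `facility`, or that
    transitively depends on one that does.

    `services` maps a service name to a dict with:
      "sites": set of facilities the service runs in
      "deps":  set of service names it depends on
    """
    # Services hosted exclusively in the facility under test.
    risky = {name for name, info in services.items()
             if info["sites"] == {facility}}
    # Propagate: anything that depends on a risky service is risky too.
    changed = True
    while changed:
        changed = False
        for name, info in services.items():
            if name not in risky and info["deps"] & risky:
                risky.add(name)
                changed = True
    return risky

# Hypothetical inventory echoing the outage: an analytics service
# lived only in PDX-04, and an apparently redundant service leaned on it.
inventory = {
    "analytics-a": {"sites": {"PDX-04"}, "deps": set()},
    "logging":     {"sites": {"PDX-02", "PDX-04"}, "deps": {"analytics-a"}},
    "control":     {"sites": {"PDX-01", "PDX-02"}, "deps": set()},
}
```

Running `at_risk(inventory, "PDX-04")` flags both `analytics-a` and `logging`, even though `logging` appears multi-site on paper – which is essentially the gap Cloudflare says its testing missed.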
Ultimately, when the lights went out at PDX-04, recovery was not as swift and graceful as hoped, and some services were left unreliable or unavailable, as customers came to discover.
Got the power
We're told the generators finally started coming back online at 1248 UTC – so much for that 10-minute UPS window. When Flexential went to flip on the circuit breakers to Cloudflare's portion of the datacenter, the breakers were found to be faulty, we're told, perhaps due to the ground fault, perhaps for some other reason.
Flexential didn't have enough working parts on hand to replace the breakers, it's claimed, leaving Cloudflare's servers and network equipment offline.
That led to Cloudflare deciding at 1340 UTC to fail over some tasks to systems in Europe, which in turn triggered a "thundering herd problem" in which a rush of API calls from clients overwhelmed those servers, hampering the recovery.
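A thundering herd of this sort – every client retrying at once the moment a service reappears – is typically tamed on the client side with exponential backoff plus random jitter, so retries spread out over time instead of arriving in synchronized waves. A minimal sketch, purely our illustration and not Cloudflare's client code:

```python
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: wait a random time between 0 and
    min(cap, base * 2**attempt) seconds, so recovering clients don't all
    hammer the API at the same instant."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(request, max_attempts=6):
    """Invoke `request`, retrying with jittered backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(backoff_delay(attempt))
```

The jitter matters as much as the backoff: without it, clients that failed together retry together, and each retry wave can knock the recovering service straight back over.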
The control plane services came back online, allowing customers to intermittently make changes, and were fully restored from the failover site about four hours later, according to the cloud outfit. Other services had to wait longer, some until PDX-04 was fully powered up again.
It seems to us that Cloudflare was able to balance at least some of its services, particularly the control plane stuff, across its two other Oregon datacenters after PDX-04 went dark, and eventually failed a portion of that control plane work over to Europe to ease the burden. And other services, particularly the analytics, were returned gradually as the biz negotiated its way through the storm.
When a datacenter fails, cloud services shouldn't go offline to the extent they did at Cloudflare, and Prince copped to several blunders in that regard.
"We believed that we had high-availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically," Prince said.
"And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable."
"The redundancy protections we had in place worked inconsistently depending on the product," the CEO added.
As a result of the downtime, Prince said Cloudflare is making several operational changes, including removing dependencies on core datacenters for control plane configuration, testing the "blast radius" of system failures to minimize the number of services impacted by a failure at a single facility, implementing more rigorous testing of "all datacenter functions," and making improvements to its datacenter auditing.
"I am sorry and embarrassed for this incident and the pain that it caused our customers and our team," Prince said. "This will have my full attention and the attention of a large portion of our team through the balance of the year." ®