Cloudflare outage caused by techie pulling out the wrong cables

Single point of failure, imprecise instructions and not enough labels are a bad, bad mix

Thu 16 Apr 2020 // 02:57 UTC

Cloudflare has admitted that a four-and-a-bit-hour outage today was caused by someone pulling out cables that should have been left in place, but which were yanked because techies were given unhelpfully imprecise instructions.

The incident started with some “planned maintenance at one of our core data centers” that saw techies told “to remove all the equipment in one of our cabinets.”

Cloudflare said the cabinet in question “contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet.”

But there was more to this cabinet than met the eye:

“The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel.”

It turned out that patch panel was a single point of failure for Cloudflare’s data centre. Or as Cloudflare has explained in its incident report: “Starting at 1531 UTC and lasting until 1952 UTC, the Cloudflare Dashboard and API were unavailable because of the disconnection of multiple, redundant fibre connections from one of our two core data centers.”
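
For readers wondering how "multiple, redundant" links can still share a single point of failure, here is a minimal sketch in Python. It is not Cloudflare's tooling: the topology, the node names (`core-dc`, `patch-panel`, `fibre-a`, `fibre-b`, `other-dcs`) and the helper functions are all hypothetical, made up for illustration. The idea is simply to take each component out of service in turn and check whether the data centre can still reach the outside world.

```python
from collections import deque


def reachable(graph, start, removed):
    """Return the nodes reachable from `start` when `removed` is out of service."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(graph, source, target):
    """List nodes whose removal cuts every path from `source` to `target`."""
    return sorted(
        node
        for node in graph
        if node not in (source, target)
        and target not in reachable(graph, source, removed=node)
    )


# Hypothetical topology: two redundant fibre links that both
# terminate on the same patch panel.
topology = {
    "core-dc": ["patch-panel"],
    "patch-panel": ["core-dc", "fibre-a", "fibre-b"],
    "fibre-a": ["patch-panel", "other-dcs"],
    "fibre-b": ["patch-panel", "other-dcs"],
    "other-dcs": ["fibre-a", "fibre-b"],
}

print(single_points_of_failure(topology, "core-dc", "other-dcs"))
# prints ['patch-panel']
```

Losing either fibre alone leaves a path intact, so neither is flagged. Losing the panel they share cuts both at once, which is roughly the situation the incident report describes.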

The company scrambled to sort things out, but that took time because cables weren’t clearly labelled. Coronavirus-caused off-site working didn’t help matters either.

The company has had the good grace to not throw its techies under a bus, writing that it needs a process change along these lines: “While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched.”

At least customers were merely disrupted, rather than damaged, as all configuration data was preserved during the incident.

The company is nonetheless “very sorry for the disruption”, wrote Cloudflare CTO John Graham-Cumming. ®
