How four rotten packets broke CenturyLink's network for 37 hours, knackering 911 calls, VoIP, broadband

FCC delivers postmortem after blunder cripples US fiber links


A handful of bad network packets triggered a massive chain reaction that crippled the entire network of US telco CenturyLink for roughly a day and a half.

This is according to the FCC's official probe [PDF] into the December 2018 super-outage, during which CenturyLink's broadband internet and VoIP services fell over and stayed down for a total of 37 hours. This meant subscribers couldn't, among other things, call 911 over VoIP at the time – which is a violation of FCC rules, and triggered a formal investigation.

"This outage was caused by an equipment failure catastrophically exacerbated by a network configuration error," America's communications regulator said in its summary of its inquiry, published yesterday.

"It affected communications service providers, business customers, and consumers who directly or indirectly relied upon CenturyLink’s transport services, which route communications traffic from various providers to locations across the country, resulting in extensive disruptions to phone service, including 911 calling."

CenturyLink has six long-haul networks that make up the backbone of its digital empire, interconnecting regions of America. These networks use Infinera-built nodes to switch packets over high-speed optic fiber: data flowing into each node is directed to other nodes, ultimately pumping VoIP, regular internet traffic, and more, across the nation as needed.

We're told four malformed network packets were the root cause of the outage: they were generated by a switching module in a node in Denver, Colorado, for reasons still unknown, and sent on to other nodes. The broken packets all had the following qualities:

1. a broadcast destination address, meaning that the packet was directed to be sent to all connected devices;

2. a valid header and valid checksum;

3. no expiration time, meaning that the packet would not be dropped for being created too long ago; and

4. a size larger than 64 bytes.
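
Stated as a test, those four traits might look something like the sketch below. This is an illustration only: the `Packet` fields, the Ethernet-style broadcast address, and the `matches_storm_profile` and `drop_storm_packets` names are all hypothetical, since the report describes the management channel as proprietary and does not give its packet format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: an Ethernet-style broadcast address stands in for whatever
# the proprietary management channel actually uses.
BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass(frozen=True)
class Packet:
    dst: str              # destination address
    header_ok: bool       # header parsed as valid
    checksum_ok: bool     # checksum verified
    ttl: Optional[int]    # None models "no expiration time"
    size: int             # size in bytes


def matches_storm_profile(pkt: Packet) -> bool:
    """Return True only if the packet has all four traits listed above."""
    return (
        pkt.dst == BROADCAST
        and pkt.header_ok
        and pkt.checksum_ok
        and pkt.ttl is None
        and pkt.size > 64
    )


def drop_storm_packets(packets: List[Packet]) -> List[Packet]:
    """Keep every packet that does not match the profile."""
    return [p for p in packets if not matches_storm_profile(p)]
```

A packet that looks valid, is addressed to everyone, and never expires is the one that gets dropped here; a broadcast packet with a finite `ttl` passes through untouched.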

Each dodgy packet would arrive at a node, get rejected and be passed along a chain of filters until it was injected into a management channel and handed to all connecting nodes. Here's a flow diagram, courtesy of the FCC, showing how the corrupted packets ended up being forwarded on to all neighboring nodes, and so on and so on, producing a growing chain reaction of corrupted packets...

"Due to the packets’ broadcast destination address, the malformed network management packets were delivered to all connected nodes. Consequently, each subsequent node receiving the packet retransmitted the packet to all its connected nodes, including the node where the malformed packets originated," the FCC said in its report.

"Each connected node continued to retransmit the malformed packets across the proprietary management channel to each node with which it connected because the packets appeared valid and did not have an expiration time. This process repeated indefinitely."

As you might imagine, the exponentially growing storm of packets soon overwhelmed CenturyLink's optic-fiber backbone, causing regular traffic to stop flowing: VoIP phones stopped working, internet access slowed to a halt, and so on. Folks in New Orleans were first to spot their connections stalling, at roughly 0356 EST on December 27.
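
To get a feel for that growth, here is a toy model of the flooding rule the FCC describes, where every node resends each copy it receives to all of its neighbors, including the one that sent it. It is a sketch under loud assumptions: the four-node full mesh, the `simulate_flood` function, and the `hop_limit` parameter are invented for illustration and do not reflect CenturyLink's actual topology or equipment.

```python
from collections import Counter
from typing import Dict, List, Optional


def simulate_flood(
    neighbors: Dict[str, List[str]],
    origin: str,
    seeds: int,
    rounds: int,
    hop_limit: Optional[int] = None,
) -> List[int]:
    """Return the number of packet copies in flight after each round.

    hop_limit=None models packets with no expiration; an integer models
    a hop count that is decremented on every forward and dropped at zero.
    """
    # Maps (node, remaining_hops) -> number of copies sitting at that node.
    in_flight = Counter({(origin, hop_limit): seeds})
    history = []
    for _ in range(rounds):
        next_round = Counter()
        for (node, hops), count in in_flight.items():
            if hops is not None and hops == 0:
                continue  # expired: this copy is not forwarded
            new_hops = None if hops is None else hops - 1
            for peer in neighbors[node]:
                next_round[(peer, new_hops)] += count
        in_flight = next_round
        history.append(sum(in_flight.values()))
    return history


# Hypothetical topology: four nodes, each connected to the other three.
mesh = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

print(simulate_flood(mesh, "A", seeds=4, rounds=5))               # [12, 36, 108, 324, 972]
print(simulate_flood(mesh, "A", seeds=4, rounds=5, hop_limit=2))  # [12, 36, 0, 0, 0]
```

In this toy mesh, four seed packets with no expiry triple in number every round, while the same packets with a hop limit of two are gone by round three.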

Here is where things went from really, really bad to terrible: the nodes along the fiber network were so flooded, they could not be reached by their administrators to troubleshoot the issue. It wasn't until some 15 hours later that the techies could finally track down the single errant node in Colorado responsible for sparking the deluge, not that replacing it helped. The packet tsunami was still washing back and forth, knocking nodes over.

"At 2102 on December 27, CenturyLink network engineers identified and removed the module that had generated the malformed packets," the report noted. "The outage, however, did not immediately end; the malformed packets continued to replicate and transit the network, generating more packets as they echoed from node to node."

It would be another three hours before CenturyLink's network admins could begin to get through to the other nodes, and get them to kill off the spread of bad packets. It took until 1130 on December 28 to get visibility of the network back, and it wasn't until 2336 that all nodes had been restored. On December 29, just after midday, CenturyLink finally declared the crisis over.

"The event caused a nationwide voice, IP, and transport outage on CenturyLink’s fiber network. CenturyLink estimates that 12,100,108 calls were blocked or degraded due to the incident," the FCC said.

"Where long-distance voice callers experienced call quality issues, some customers received a fast-busy signal, some received an error message, and some just had a terrible connection with garbled words."

The outage also knackered local governments and telcos that relied on the CenturyLink network for portions of their services. State governments in Illinois, Kansas, Minnesota, and Missouri all had portions of their networks down for roughly 36 hours thanks to CenturyLink, and phone services sold by Comcast, Verizon, TeleCommunication Systems, General Dynamics IT, and West Safety Services – including 911 call centers – saw connectivity interrupted for some or all of the outage period.

As to what can be done to prevent similar failures, the FCC is recommending CenturyLink and other backbone providers take some basic steps, such as disabling unused features on network equipment, installing and maintaining alarms that warn admins when memory or processor use is reaching its peak, and having backup procedures in the event networking gear becomes unreachable.
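
As a rough idea of what such an alarm involves, the sketch below fires once processor or memory use has stayed at or above a threshold for several samples in a row, so a single spike does not page anyone. The function name `first_alarm_index`, the 90 per cent threshold, and the three-sample window are illustrative assumptions, not figures from the FCC's recommendations.

```python
from typing import Optional, Sequence


def first_alarm_index(
    samples: Sequence[float],
    threshold: float = 0.9,
    consecutive: int = 3,
) -> Optional[int]:
    """Return the index of the sample at which the alarm first fires.

    The alarm fires when `consecutive` samples in a row are at or above
    `threshold`. Returns None if that never happens.
    """
    run = 0  # length of the current streak of high samples
    for i, value in enumerate(samples):
        run = run + 1 if value >= threshold else 0
        if run >= consecutive:
            return i
    return None


# Utilization as a fraction of capacity, sampled at regular intervals.
cpu = [0.50, 0.95, 0.60, 0.91, 0.92, 0.97, 0.99]
print(first_alarm_index(cpu))  # 5
```

The lone spike at index 1 is ignored; the alarm trips at index 5, the third high sample in a row.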

"Currently, CenturyLink is in the process of updating its nodes’ Ethernet policer to reduce the chance of the transmission of a malformed packet in the future," the report notes. "The improved ethernet policer quickly identifies and terminates invalid packets, preventing propagation into the network. This work is expected to be complete in Fall, 2019."

The report did not mention any possible fines or penalties against CenturyLink. ®
