Updated A bug in software pushed out by Cloudflare resulted in failures at the heart of the web's infrastructure, according to a report published this week by the Internet Systems Consortium (ISC).
ISC runs the so-called F root server, one of the world's 13 root DNS servers, labeled A through M. These are the central computers that underpin the global internet: they ensure, for instance, that when you visit theregister.com, you are directed to the correct system serving our homepage.
On January 23 this year, ISC received a report of a breakdown with .net domains. When it investigated, it discovered crucial A and AAAA records, which glue .net domain names to their IPv4 and IPv6 network addresses, were missing.
In essence, all internet addresses ending in .net – one of the internet’s largest registries with 13.4 million domain names – vanished from ISC's F root machine. Any browser, app, computer or device that ultimately relied on the F root machine to connect to websites and services would, in a worst-case scenario, have been unable to reach those systems via their .net addresses.
The issue wasn’t restricted to just ISC's F root, either; the report [PDF] said similar problems were experienced by the E root, run by NASA.
ISC quickly figured out – within five minutes, according to its timeline – that the issue lay with internet nodes it operates in partnership with Cloudflare, and escalated the issue to the web infrastructure business. Cloudflare also acted quickly: within 21 minutes it had identified that a specific code release, designed to fix a bug that it had introduced four hours earlier, was responsible.
Here's where the report takes a hard left into the fragile world of BGP: the Border Gateway Protocol used by the internet's sprawling galaxy of networks to automatically exchange routes and maintain connections with one another. How BGP is involved in a DNS root zone issue is not clear, and we've asked Cloudflare for a more detailed explanation.
Regardless, it took nearly two hours to withdraw a BGP announcement that was causing the problem, something ISC notes should have happened faster. “In retrospect, we should have initiated the withdrawal of the route prefixes from BGP as soon as it was identified that incomplete / incorrect data was being served,” the report stated under “lesson learned.”
It continued: “The withdrawal of routes did not go as smoothly as expected and Cloudflare and ISC have agreed to perform regular tests to exercise that function... The test suite has been updated to include tests for missing glue, and ISC and Cloudflare will work to devise further conformance tests.”
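To give a rough idea of what a "missing glue" test might look like, here is a minimal sketch in Python. It is not ISC's or Cloudflare's test suite, which the report does not publish. The Record type and the missing_glue function are hypothetical names invented for this illustration, the IP addresses come from the ranges reserved for documentation, and the check makes the simplifying assumption that every listed name server should carry both an A and an AAAA record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One resource record, reduced to the fields this check needs."""
    name: str   # owner name, e.g. "a.gtld-servers.net."
    rtype: str  # "NS", "A", "AAAA", ...
    value: str  # target host name or IP address


def normalize(name: str) -> str:
    """Lower-case a domain name and make sure it ends with a dot."""
    name = name.lower()
    return name if name.endswith(".") else name + "."


def missing_glue(authority, additional):
    """Return {name server: [missing record types]} for a referral.

    authority  -- NS records from the referral's authority section
    additional -- records from its additional section (the glue)

    Assumption for this sketch: every name server is expected to have
    both A and AAAA glue. A real test would apply bailiwick rules.
    """
    have = {(normalize(r.name), r.rtype) for r in additional}
    problems = {}
    for ns in authority:
        if ns.rtype != "NS":
            continue
        host = normalize(ns.value)
        absent = [t for t in ("A", "AAAA") if (host, t) not in have]
        if absent:
            problems[host] = absent
    return problems


# Illustrative data only: addresses are from documentation ranges.
AUTHORITY = [
    Record("net.", "NS", "a.gtld-servers.net."),
    Record("net.", "NS", "b.gtld-servers.net."),
]
HEALTHY_GLUE = [
    Record("a.gtld-servers.net.", "A", "192.0.2.1"),
    Record("a.gtld-servers.net.", "AAAA", "2001:db8::1"),
    Record("b.gtld-servers.net.", "A", "192.0.2.2"),
    Record("b.gtld-servers.net.", "AAAA", "2001:db8::2"),
]

if __name__ == "__main__":
    print(missing_glue(AUTHORITY, HEALTHY_GLUE))  # nothing missing
    print(missing_glue(AUTHORITY, []))            # all glue stripped
```

With the healthy data, the function reports nothing. With an empty additional section, which is roughly what the report describes, it flags every name server as lacking both record types.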
Hello money, goodbye stability
Thanks to the way that the world's DNS works, with information cascading down through a distributed hierarchy of name servers, redundancy provisions, and caches – and globally updated every few hours to every few seconds – the impact on netizens was absolutely minimal. With the E and F roots temporarily knackered, browsers and apps would have found other ways to look up .net addresses.
However, the situation is serious in large part because a fundamental underpinning of the public internet’s global addressing system was knocked over through a minor software update by one private organization.
That update was carried out by Cloudflare, a commercial entity that uses a mix of open and closed software. The internet has achieved such a remarkable degree of uptime, despite decades of exponential growth, thanks to its tradition of open-source software, carefully checked and tested updates, and maintainer organizations that are kept separate from commercial considerations.
As one veteran internet engineer, Bill Woodcock, noted on Twitter: “What happens when critical functions of the public Internet are co-opted for private benefit? Transparency and accountability are lost, infrastructural spending cut, things break.”
The issue is not an academic one either. Woodcock sounded the alarm recently over the proposed sale of .org to an unknown private equity company – his company provides technical back-end services for the internet registry. Given the profit motive of the proposed purchaser, he concluded that there was likely to be a significant cutback on technical spending, putting the stability of the critical registry at risk. He sent a letter to DNS overseer ICANN about the issue, recommending that it stop the proposed sale.
He’s not the only internet engineer concerned, either. Bert Hubert, whose company produces open-source DNS software, noted with respect to the ISC report that “closed source Cloudflare software had a bug which caused closed source Akamai to break over at a large US cable access provider.”
Hubert has recently been very vocal about his concerns that Firefox will use Cloudflare as the default provider for its secure DNS protocol, DoH: something that happened for all Firefox users in the US this morning.
If a software bug in closed Cloudflare software can cause a root server to vanish an entire, significant piece of the internet, then it is all too possible – in fact, likely – that at some point a similar issue will cause Firefox users to lose their secure DNS connections. And that could cause them to lose the internet altogether (it would still be there, but most users would have no idea what the cause was or how to get around it).
There is a famous phrase often repeated by internet engineers, and originally coined by EFF co-founder John Gilmore, that “the Net interprets censorship as damage and routes around it.” That statement has taken on much broader meaning and is often employed by engineers to basically say “don’t worry about it, the internet breaks all the time.” And it does, every second, and mends itself almost immediately.
But with the growing commercialism of the internet, and with private and profit-driven companies increasingly inserting themselves into the foundational layer of the internet’s infrastructure, the report by ISC over this F root incident may well be a warning of what is coming.
We've asked Cloudflare for comment and will update this story when it gets back. ®
Updated to add
In a conversation with El Reg after this article was published, Cloudflare's spokespeople rejected any notion that the closed nature of its software was to blame. "This was very much an edge case," Cloudflare's distinguished engineer Martin Levy told The Register. "It's not fair to look at this in binary terms: that open source is good, and closed bad. We put an enormous amount of software into the open source world."
He added Cloudflare put its code-to-be-deployed through "extreme testing but we hadn't noticed this special case," and that any disruption was "highly localized." The code change was, to put it simply, an improvement in character encoding handling for a particular customer that had an unexpected knock-on effect on the root servers.
To us, it seems a BGP route break caused the root servers to abandon their gTLD A and AAAA records, possibly because they were fetching those details from another system they could no longer reach. See the final two pages of the report PDF.
Also, this affected all domain names – not just .net – handled by the F and E root servers, though .net stood out to ISC because it's a rather large and important corner of the web.
Full disclosure: The Register is a Cloudflare customer.