Stop shaming service providers for outages, argues APNIC chief scientist
Tech companies should behave like the aviation industry and detail failures to improve safety for all
The chief scientist of the Asia-Pacific Network Information Centre (APNIC), the region's internet registry, has called for operators of digital infrastructure to share more info about their outages.
In a post that opens by considering the major outage at Australian telco Optus, APNIC chief scientist Geoff Huston opined: "If this were a bank heist the site would no doubt be saturated with investigators from the police force."
But the cause of the Optus incident was a "routing heist."
"So where are the routing police to investigate the incident?" he asked. "How can we understand the exact nature of the triggers for this outage and identify if there was some level of contributory negligence from the network operator or their suppliers that amplified a minor issue of a route leak into a major issue that impacted millions of consumers?"
Huston lamented that routing police don't exist, and argued that internet governance organizations, standards bodies, and network operator groups are not appropriate entities to take on the task.
He's also not keen on national regulators doing the job, given the internet crosses borders and governments do not.
The aviation industry, he suggested, offers a better model.
"We need to respond to outages and related incidents in the internet in a way that does not immediately attempt to sweep it under the closest rug and deny that anything untoward ever happened at all," he wrote. He cited the airline industry as "a case in point where the object of an investigation is not necessarily to apportion blame, but to unearth the root causes and potentially propose measures that … operators can adopt that would prevent a recurrence of the mishap."
"It would be good if all service providers in the public internet spent the time and effort post-rectification of operational problems to produce detailed and thorough outage reports as a matter of standard operating procedure," Huston argued.
Doing so is "not about apportioning blame or admitting liability." Rather, sharing incident analyses is "all about positioning these services as the essential foundation of the public digital environment and stressing the benefit of adopting a common culture of open disclosure and constant improvement as a way of improving the robustness of these services."
"It's about appreciating that these days these services are very much within the sphere of public safety and their operation should be managed in the same way," he added. ®