After config error takes down Rogers, it promises to spend billions on reliability
Routers flooded with internet traffic in filter blunder, watchdog told
Canadian telecom giant Rogers will spend C$10 billion ($7.7 billion) to ensure that day-long outage earlier this month doesn't happen again, its CEO has said.
We have also learned that, according to Rogers, the IT breakdown was caused by an early morning configuration update that jammed too much traffic through the ISP's central routers, leaving them unable to function properly for wired and wireless customers.
Tony Staffieri made the spending pledge on Sunday in a letter shared via Rogers' website. He also described a four-point "enhanced reliability" action plan that he hopes will "restore [customer] confidence in Rogers and earn back [their] trust." The ISP offers cellular, cable, and broadband internet services in Canada, and has roughly 10 million wireless subscribers alone.
We're told that Rogers will make its offerings more robust by "physically separating our wireless and internet services to create an 'always on' network." Staffieri said the move would prevent broadband internet customers from losing service in the event of a wireless outage. In other words, if the cellular network core goes down, it won't take out wired internet connectivity, and presumably vice-versa.
In addition, Staffieri said Rogers is partnering with unnamed "leading technology firms" to do a full review of its network. Rogers said the report will be made available to the wireless industry "for the benefit of every Canadian."
Rogers' multi-billion-dollar investment will come over the next three years. Staffieri said additional oversight and testing as well as "greater use of artificial intelligence" are all on the table to improve service. Staffieri didn't elaborate, and Rogers spokespeople were unable to comment further.
Outside of its own operations, Rogers said it is taking steps to ensure 911 call centers, which Rogers subscribers were unable to reach during the outage, remain accessible in future during any network downtime. "We have made meaningful progress on a formal agreement between carriers to switch 911 calls to each other's networks automatically," Staffieri said.
Canadian officials aren't happy
On Saturday, July 9, one day after the mega-outage, Rogers issued a memo from Staffieri that said the outage had been largely resolved, and attributed the cause to "a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning."
That didn't satisfy Canada's communications watchdog, which in a letter to Rogers on July 12 said the outage bore striking similarities – and justifications – to a screw-up in April 2021 that knocked Rogers' services offline.
"Rogers has publicly attributed the cause of this [July 2022] service outage to a maintenance upgrade in its core network. This is reminiscent of another significant network outage in April 2021 that Rogers similarly attributed to a software update," said Fiona Gilfillan, an executive director at the Canadian Radio-television and Telecommunications Commission (CRTC).
In her letter, Gilfillan said the CRTC wanted "comprehensive information" about the lead-up to the outage, what happened during and after it, and the provider's plans to prevent another IT breakdown.
Rogers responded [.DOCX] to the CRTC last Friday. The copy available on the CRTC's website is redacted.
However, some new details were left public, such as the admission that an update was made to the configuration of Rogers' routers that allowed an overwhelming amount of internet traffic to pass through the equipment. This caused the ISP's core devices to fail, we're told.
As we suspected, this sounds like a BGP blunder. Here's the relevant passage:
The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. As a result, the Rogers network lost connectivity to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.
"While every effort was made to prevent and limit the outage, the consequence of the coding change affected the network very quickly," the response added.
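For readers who want to see the mechanics, here is a minimal sketch in Python of how deleting a route filter can flood a core router in the way the response describes. It is an illustration only, not Rogers' configuration: the table capacity, the address blocks, and every function name below are assumptions invented for the example.

```python
from ipaddress import ip_network

# Hypothetical routing-table capacity of a core router (illustrative number).
TABLE_CAPACITY = 1_000


def make_filter(allowed_blocks):
    """Build a route filter that accepts only routes inside the allowed blocks."""
    allowed = [ip_network(block) for block in allowed_blocks]

    def accept(route):
        return any(route.subnet_of(block) for block in allowed)

    return accept


def accept_all(route):
    """Behaviour once the filter has been deleted: every route passes."""
    return True


def redistribute(routes, route_filter):
    """Return the routes a distribution router would pass on to the core."""
    return [route for route in routes if route_filter(route)]


def core_is_overloaded(advertised, capacity=TABLE_CAPACITY):
    """True when the core is handed more routes than its table can hold."""
    return len(advertised) > capacity


# Synthetic stand-in for "all possible routes to the Internet":
# every /20 carved out of 10.0.0.0/8.
all_routes = list(ip_network("10.0.0.0/8").subnets(new_prefix=20))

# With the filter in place, only routes inside one assumed internal block pass.
filtered = redistribute(all_routes, make_filter(["10.0.0.0/16"]))

# With the filter deleted, everything is propagated into the core.
unfiltered = redistribute(all_routes, accept_all)

print(len(filtered), core_is_overloaded(filtered))
print(len(unfiltered), core_is_overloaded(unfiltered))
```

In this toy model the filtered feed stays well under the assumed capacity, while the unfiltered feed exceeds it several times over, which is the failure pattern the response outlines.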
Once the team figured out what went wrong, they "began the process of restarting all the internet gateway, core and distribution routers in a controlled manner to establish connectivity to our wireless (including 9-1-1), enterprise and cable networks which deliver voice, video and data connectivity to our customers. Service was slowly restored, starting in the afternoon and continuing over the evening.
"Although Rogers continued to experience some instability issues over the weekend that did impact some customers, the network had effectively recovered by Friday night."
We're also told that the config update was the sixth phase of a seven-part maintenance job that had been going on for weeks. The update was rolled out early in the morning to cause minimal disruption, though in this case, it knocked out the whole ISP.
"At 4:43AM EDT, a specific coding was introduced in our distribution routers which triggered the failure of the Rogers IP core network starting at 4:45AM," the note added. ®
Bootnote
The outage also took out Rogers' radio stations. Some were off the air for minutes, others many hours. Some resorted to rather ad-hoc solutions to resume transmission. For instance, CHST-FM fell off the airwaves all morning.
"During that time, evergreen programming was aired from an MP3 player at the base of the transmitter until our engineering team was able to establish a connection between the studio and transmitter site using an alternate Internet connection," the note disclosed.