Updated
Backbone provider Level3 says an outage that knocked out VoIP service for much of the US Tuesday morning was the result of improperly configured equipment.
It seems the outage, which smashed call services offline for much of the country, was not the result of any fiber cuts or facility damage, but rather some classic bad switch settings.
As the network provider returned operations to normal, customers received a technical note from Level3, seen by The Register, describing the issue and its resolution:
Root Cause: Calls were not present in Level 3 voice switches, impacting voice services in multiple markets of the United States.
Fix Action: Configuration adjustments were implemented to allow calls to reach the switches and complete appropriately.
Reason for Outage (RFO) Summary: The Voice NOC investigated reports of voice service impact throughout multiple markets in the United States. The equipment vendor was engaged for assistance with investigations. No inbound or outbound calls were present in voice switches. The Voice NOC was able to implement configuration changes that were successful in allowing calls to reach the switches and process accurately. The Voice NOC will continue to fully evaluate the incident and appropriate actions will take place to ensure incidents of this nature do not recur in the future. Services are currently restored and stable.
Corrective Actions: We know how important these services are to our customers. As an organization, we are putting processes in place to prevent issues like this from recurring in the future.
To us, this sounds like Level3 – or a partner – misconfigured its network equipment to drop voice traffic. We asked the carrier if that was the case, and we were told the following:
On October 4, our voice network experienced a service disruption affecting some of our customers in North America due to a configuration error. We know how important these services are to our customers. As an organization, we’re putting processes in place to prevent issues like this from recurring in the future. We were able to restore all services by 9:31am Mountain time.
Make of that what you will. ®
Updated to add
Here's some more detail on the cock-up from Level 3 – the backbone biz has forwarded to us this advisory it sent out to customers:
Repair Area: Human Error Occurrence
Repair Action: Human Error
Reason for Outage (RFO) Summary: On October 4, 2016 at 14:06 GMT, calls were not completing throughout multiple markets in the United States. Level 3 Communications’ call center phone number, 1-877-4LEVEL3, was also impacted during this timeframe, preventing customers from contacting the Technical Service Center via that phone number. The issue was reported to the Voice Network Operations Center (NOC) for investigation. Tier III Support was engaged for assistance isolating the root cause. It was determined that calls were not completing due to a configuration limiting call flows across multiple Level 3 voice switches. At 15:31 GMT, a configuration adjustment was made to correct the issue, and Inbound and outbound call flows immediately restored for all customers. Investigations revealed that an improper entry was made to a call routing table during provisioning work being performed on the Level 3 network. This was the configuration change that led to the outage. The entry did not specify a telephone number to limit the configuration change to, resulting in non-subscriber country code +1 calls to be released while the entry remained present. The configuration adjustments deleted this entry to resolve the outage.
Level 3 Communications knows how important these services are to customers. As an organization, this incident is being evaluated at the highest levels to prevent reoccurrence. Process has been put in place to alert this specific Provisioning team of how this incident could have been avoided. Access restrictions have been made to mitigate the possibility of large-scale configuration changes, and a future process for these types of provisioning activities will be evaluated to involve additional technical support. System tools are being investigated to place additional guardrails against this type of trouble.
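For the curious, here is a rough sketch of how a routing entry with no telephone number can swallow an entire country code. It is our own illustration, not Level 3's switch software: the `RouteEntry` type, the `lookup` and `validate` functions, the longest-prefix matching rule, and the `min_digits` threshold are all assumptions made up for the example, and the phone numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEntry:
    # Hypothetical model: prefix is country code plus number, digits only.
    prefix: str
    action: str  # "route" completes the call, "release" drops it


def lookup(table, dialed):
    """Return the action of the most specific (longest) matching prefix.

    Assumption: calls with no matching entry are routed normally.
    """
    digits = dialed.lstrip("+")
    matches = [e for e in table if digits.startswith(e.prefix)]
    if not matches:
        return "route"
    return max(matches, key=lambda e: len(e.prefix)).action


def validate(entry, min_digits=7):
    """Hypothetical guardrail: reject release entries that are too broad."""
    if entry.action == "release" and len(entry.prefix) < min_digits:
        raise ValueError(
            f"release entry {entry.prefix!r} is too broad "
            f"(fewer than {min_digits} digits)"
        )
    return entry


# A subscriber with its own, more specific entry.
table = [RouteEntry(prefix="13035550100", action="route")]

# The intended change: release calls to one specific number.
intended = RouteEntry(prefix="1" + "2125550123", action="release")

# The slip: the number is left blank, so only the country code remains.
mistake = RouteEntry(prefix="1" + "", action="release")

table.append(mistake)
print(lookup(table, "+12125550199"))  # any +1 number without its own entry
print(lookup(table, "+13035550100"))  # the subscriber's more specific entry
```

Under these assumptions, the blank entry matches every +1 number that lacks a more specific entry of its own, which is at least consistent with the advisory's remark about non-subscriber +1 calls. Passing `mistake` through `validate` before appending it would raise a `ValueError`, which is the sort of guardrail the advisory hints at.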