IBM blames 'external' network provider, incorrect routing, traffic flood for its two-hour cloud outage
No data loss or attack detected. But aren't hyperscale clouds supposed to be more resilient than this?
IBM has blamed a third party for yesterday's hours-long outage of its entire cloud. And while it says no data loss or attack was detected, it's still not a good look: major clouds are supposed to be more resilient than this.
A brief notice on the IT titan's cloud status page offered the following explanation for the breakdown:
IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services on Tuesday, June 9. All services have been restored.
A detailed root cause analysis is underway. An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing, resulting in severe congestion of traffic and impacting IBM Cloud services and our data centers. Mitigation steps have been taken to prevent a reoccurrence. Root cause analysis has not identified any data loss or cybersecurity issues.
That's rather vague verbiage, but it is consistent with the sort of traffic flood that can happen when an inadvertent Border Gateway Protocol (BGP) blunder directs packets to the wrong place. BGP hijacking and misconfiguration are known problems, and you'd think an outfit like IBM would be alert to that sort of error and have defenses or countermeasures in place to mitigate it.
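For readers wondering how a single bad announcement can drag traffic off course, here is a minimal sketch of longest-prefix-match route selection, the rule routers use to choose between overlapping routes. It is an illustration only: IBM has not confirmed BGP was involved, the prefixes come from a range reserved for documentation, and the names `best_route`, `legitimate-origin`, and `leaking-provider` are made up for the example.

```python
import ipaddress


def best_route(routes, destination):
    """Return the next hop whose prefix is the longest match for destination.

    routes maps prefix strings to hypothetical next-hop labels.
    Returns None when no prefix covers the destination.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    # The most specific (longest) prefix wins, whoever announced it.
    return max(matches, key=lambda m: m[0])[1]


# Hypothetical routing table using the 203.0.113.0/24 documentation range.
routes = {"203.0.113.0/24": "legitimate-origin"}
before = best_route(routes, "203.0.113.10")

# A more specific prefix is announced by mistake and accepted unfiltered.
routes["203.0.113.0/25"] = "leaking-provider"
after = best_route(routes, "203.0.113.10")
unaffected = best_route(routes, "203.0.113.200")

print(before, after, unaffected)
```

In this toy table, addresses covered by the mistaken /25 switch to the wrong next hop while the rest of the /24 is untouched, which is why operators typically filter the prefixes they accept from peers and providers.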
Yet IBM may not be great at handling traffic spikes: when it ran Australia's e-census it mistakenly identified a flood of inbound connections as a denial-of-service attack, and had to pull the plug on a router to sort things out.
Another possibility is that a supplier to IBM's cloud messed things up. We know that IBM uses Akamai for its content delivery network, offers Cloudflare-as-a-service, and has an expansive relationship with AT&T. The Register does not suggest any of those companies had a role in the outage, but all do have sufficient scale to play a part in a global outage.
Whatever the cause, hyperscale clouds are supposed to be sufficiently resilient to handle unexpected nastiness of this sort. That IBM succumbed is not reassuring, given the tech goliath says it is now a cloud business. That's a tricky position to sustain given its cloud remains rather less elegant to operate than rivals' services. ®