Flash drive meltdown fingered in Swedish IT blackout
Tieto's EMC VNX5700 array sparked 5-day disarray - new claim
Tieto's five-day outage disaster started with multiple failures of its EMC VNX5700 array's FAST Cache, according to a Finnish source close to the matter.
Tieto is a major IT services organisation across Scandinavia and the Nordic region – although it also provides services globally – and pulls in net sales of SEK17bn (£1.59bn). Its large customer base in Sweden means that its five-day outage in November caused chaos for IT services across the country. The stoppage was caused by failures in an EMC storage array and compounded by an inadequate disaster recovery plan involving NetWorker tape backup files which could not be read. The circumstances were not clear and seemed to involve a VNX array being upgraded to an NS480 (Celerra) system for flash, which is logical nonsense.
El Reg has been sent a Tieto slide deck (PDF) describing why the service provider migrated from its Celerra NS480 to a VNX5700 and the resulting performance improvements: namely lower latency and more IOPS. This deck is in Swedish but Google Translate gets around that little problem.
Based on the translated slide deck text, the story goes like this: in the 2010/2011 period, with an EMC Celerra NS480 array, Tieto saw its storage challenges as performance, response time, scalability and capacity. So it migrated from RAID (4 + 1) groups to Thick Pools composed of 60 disks and began to segment data types into Fibre Channel and NAS. The next step was to install EMC's FAST Cache with four 200GB SSDs and the cache license, which paid off: response times were more than halved to less than 20ms. However, the NS480 CPUs were maxed out.
Tieto upgraded to a VNX5700, but retained the 4 x 200GB SSD capacity, the FAST Cache license and the 60-disk Thick Pool, although the disks changed from 450GB FC drives to 600GB SAS ones. In each pool, 14 x 1.04GB chunks were created and only FC block access was allowed. The outcome was a boost in IOPS and a further reduction in latency as shown in the chart.
Chart showing IOPS increase and latency decrease with move from NS480 to FAST Cache and then VNX5700
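For a sense of scale, the disk counts quoted in the deck can be turned into raw capacity figures. The following is a back-of-the-envelope sketch only: it assumes decimal gigabytes and ignores RAID protection, hot spares and formatting overhead, none of which the deck spells out, so usable capacity would be lower. The function and variable names are ours, not Tieto's or EMC's.

```python
def raw_capacity_gb(disk_count: int, disk_size_gb: int) -> int:
    """Raw capacity in decimal GB, before RAID, hot spares or formatting overhead."""
    return disk_count * disk_size_gb


# Figures taken from the translated slide deck.
ns480_pool_gb = raw_capacity_gb(60, 450)    # 60-disk pool of 450GB FC drives
vnx_pool_gb = raw_capacity_gb(60, 600)      # 60-disk pool of 600GB SAS drives
fast_cache_raw_gb = raw_capacity_gb(4, 200) # four 200GB SSDs for FAST Cache

# Relative growth in raw pool capacity after the drive swap.
pool_growth_pct = 100 * (vnx_pool_gb - ns480_pool_gb) / ns480_pool_gb

print(f"NS480 pool raw capacity:   {ns480_pool_gb / 1000:.1f} TB")
print(f"VNX5700 pool raw capacity: {vnx_pool_gb / 1000:.1f} TB")
print(f"FAST Cache raw SSD space:  {fast_cache_raw_gb} GB")
print(f"Raw pool growth:           {pool_growth_pct:.1f}%")
```

On these assumptions, each pool's raw capacity grew by a third while the flash in front of it stayed the same size.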
So here we have the basic VNX5700 array setup in which the hardware failures that led to the five-day debacle took place. EMC won't comment on any details, having referred us to the Tieto statement seen in our article yesterday. Our source said, for what it's worth: "What basically happened (in my understanding from Twitter rumours) is that Tieto had multiple SSD failures on [its] VNX5700 array Fast Cache, this resulted in data loss."
What needs to be stressed is that Tieto's DR processes were dreadfully inadequate and obviously untested for the eventuality of such a failure. Lawsuits over data loss and business interruptions at Tieto's affected customers are bound to follow. ®