HPE says 3PAR problem that broke Australia was a one-off. Probably
Crack team trying to figure out what went wrong at the tax office
Hewlett Packard Enterprise says the 3PAR storage SNAFU that took the Australian Taxation Office offline for a week does not appear to be a systemic problem.
The ATO went down hard last week, and many online services that citizens, businesses and the bean-counting industry rely on disappeared from the internet. The organisation quickly pointed the finger at HPE storage and defended its choice of kit, saying the storage was acquired “after a lengthy and thorough selection process, and was seen to be ‘state-of-the-art’ at the time.”
The ATO also ’fessed up that, “What compounded the problem beyond the initial failure was the subsequent failure of our back-up arrangements to work as planned. The failure of our back-up arrangements meant that restoration and resumption of data and services has been very complex and time consuming.”
HPE’s now told The Register that it’s got a team at the ATO conducting “its own root-cause analysis investigation to determine why storage hardware went offline, preceding a series of events that led to the broader system outage experienced by the ATO.”
“We refrain from speculation on possible causes while the investigation is underway, and at this time, HPE does not believe that other customers are at risk,” the company added.
The Register’s sources say user chatter is settling on a story of sudden, simultaneous failure of multiple disks, perhaps because components in a disk drawer failed. On that account, losing a lot of disks at once disturbed a RAID set, and the rest is history.
The ATO yesterday announced all its services are back, after previously announcing an independent review of the incident. The reviewers are charged with delivering:
- A definitive description of the failure and its root cause.
- Factors leading to the outage and contributing to the duration, scale, and scope of the outage.
- Adequacy of back-up and contingency strategies and arrangements and explanation as to why fail-over to our secondary site did not work.
- Adequacy of restoration and resumption procedures of technology, infrastructure and applications.
- Whether there is anything unique or unusual in the physical and technical ATO technology infrastructure and/or architecture and/or environment that suggests there is a high risk of a repeat or like failure.
- Adequacy, speed and robustness of the critical event response provided by various vendors and other non-ATO entities.
- Any other observations or advice on improvements to IT systems to prevent a re-occurrence of the issues, along with the indicative costs and benefits for any improvement options to be considered.
The review is due in 2017 and The Register will analyse it upon release in case there are lessons in there for 3PAR users everywhere. And also, to be honest, because it will be good sport to see this mess dissected. ®