The HPE 3PAR SANs that twice failed at the Australian Taxation Office had warned of impending trouble for months, but HPE decided the arrays were in no danger of major failure. Combined with the decision to place the SANs' recovery software on the SANs themselves, and HPE's configuration of the arrays for speed rather than resilience, the failures proved difficult to recover from even though no data was lost.
And now the ATO is much more interested in cloud.
That's the gist of the Australian Taxation Office's (ATO's) report (PDF) into the twin outages in December 2016 and February 2017 that took out most of its online services.
The ATO's timeline of the incident starts in early 2015 with the decision to replace an EMC array with 3PAR kit, in part because it “was supported by HPE operating procedures and technical expertise.” The ATO engaged HPE to create a “turnkey solution”: the company was to build and operate the SAN.
Between November 2015 and May 2016 HPE's people designed and implemented the SAN, but HPE:
- Did not include available, automated technical resilience and data/system recovery features (such as 3PAR Recovery Manager and Peer Persistence)
- Did not define or test “recovery procedures for applications in the event of a complete SAN outage”
- Did not define or verify “Processes for data reconciliation in the event of an outage of this nature”
The report also offers this observation:
Sufficient detail on design and/or implementation choices related to technical resilience and recovery capacity was not presented by HPE to the relevant ATO governance forum(s) to allow them to fully appreciate, communicate and mitigate the resultant business risk.
The ATO takes some of the responsibility for letting that happen, saying its “associated governance was not robust and relied heavily on HPE recommendations.” It also admits that “Full automated fail‑over for the entire suite of applications and services in the event of a complete Sydney array failure had not been considered to be cost‑effective.”
So even though it had a backup site, it couldn't flick the switch to use it.
Once the main SAN was up and running, it generated 159 alerts between May and November 2016. The report says that a contractor named Leidos that the ATO uses for “problem management” recorded 77 issues pertaining to the components that later failed. HPE escalated some of those incidents for further investigation at its US labs, a probe that “highlighted the potential consequences but not likelihood of a major incident.” HPE also replaced some of the cables.
But those efforts didn't stop the errors and the ATO says the fact they kept appearing “indicated these actions did not resolve the potential SAN stability risk.” That risk was again made obvious in November 2016, when the SAN experienced a two-to-three-hour outage. But the ATO soldiered on until December 11th, when it went down, hard.
At 11:27PM that night, “Excessive errors were observed on two data paths leading to a changed state (changed from normal operations) on multiple drives across two drive cages (cages 12 and 13).” The SAN then tried to relocate data on the relevant drives and hard reset itself, but nothing worked.
A dozen drives were later restarted and found to be in “erroneous states”, leaving the SAN without sufficient capacity to retain its desired n-1 parity.
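The “n-1” here refers to single-failure redundancy: an array with that protection level can lose one drive per parity group and rebuild onto a spare, but not a dozen drives at once. A toy model of the constraint (the numbers and function are illustrative, not 3PAR's actual layout):

```python
# Toy model of single-parity (RAID-5-style) redundancy: an array
# tolerates one failed drive per parity group, provided a spare
# exists to rebuild onto. Parameters here are illustrative only.

def can_rebuild(total_drives: int, failed_drives: int, spares: int) -> bool:
    """Data survives while failures stay within single-parity limits
    and there are enough spares to rebuild onto."""
    return failed_drives <= 1 and spares >= failed_drives

# One failed drive with spares available: the array rebuilds.
assert can_rebuild(total_drives=12, failed_drives=1, spares=2)

# Twelve drives in erroneous states at once: redundancy is exhausted
# and the array cannot restore its desired parity level.
assert not can_rebuild(total_drives=12, failed_drives=12, spares=2)
```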
Conflicting priorities in the small hours of the night
By now it was early on the 12th and the ATO felt this was a Priority 1 event. The report says HPE “did not make this categorisation at this time” and only did so between 6AM and 7AM.
The report says that the SAN used wide-striped disks, to “improve performance by reading and writing blocks of data to and from multiple drives at the same time, preventing single-drive performance bottlenecks.” But those disks failed after the complete SAN outage, due to a “firmware issue”.
The report names no disk vendors. Whatever the source of the problem, “a small number of drives temporarily and in some cases permanently prevented access to a significant amount of application data.”
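For readers wondering how a handful of sick drives could lock up so much data: wide striping spreads each volume's blocks across many physical drives, so every drive ends up holding a slice of many volumes. A minimal sketch of the idea (the drive count and round-robin mapping are illustrative, not 3PAR's actual chunklet layout):

```python
# Minimal sketch of wide striping: logical blocks are placed
# round-robin across many physical drives, so sequential I/O hits
# several spindles in parallel. NUM_DRIVES is an assumed figure.

NUM_DRIVES = 8

def drive_for_block(logical_block: int) -> int:
    """Map a logical block number to the drive that holds it."""
    return logical_block % NUM_DRIVES

# Blocks 0..7 land on eight different drives, so an eight-block read
# proceeds on all drives at once. The flip side: a fault on any one
# drive now touches a slice of every widely striped volume.
placement = [drive_for_block(b) for b in range(8)]
```

That flip side is exactly what the report describes: great performance in normal operation, but a few misbehaving drives blocking access to “a significant amount of application data”.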
Happily, all data was restored. But the ATO was not fully operational again for eight days because recovery tools were stored on the same SAN that had just failed so spectacularly.
HPE's fat-fingered February
The cause of the February incident looks more straightforward: the ATO's report mentions “further remedial work by HPE on these SAN fibre optic cables” and says that “during one cable replacement exercise, we were informed that data cards attached to the SAN had been dislodged.”
And down it went again.
The cables were replaced in late March and the report says “SAN alerts ceased completely once the new fibre optic cables were installed.” HPE has since installed completely new 3PAR SANs at the ATO.
The ATO is quite hard on itself in the report, promising to get better at defining its continuity requirements and hardening existing infrastructure.
An HPE spokesperson sent us the following statement:
As previously stated, this issue was triggered by a rare set of circumstances within the ATO’s system that has never been seen anywhere else in the world. HPE has treated this as a top corporate priority and has worked closely with the ATO to address the issue and help them enhance their storage system’s performance and resilience in the process.
The company also tells us that “Our initial reviews have been focused on identifying what triggered the outage so we could resolve the ATO system issues as quickly as possible. HPE will be conducting additional analysis in our labs later this year.”
But the ATO may not care about the outcome of that analysis, because the report says that the February 2nd incident was ameliorated by the fact it had moved its website to the cloud, which made for less downtime than it experienced during the December incidents. It has also pledged to “engage in new technology to enhance performance and resilience (for example, the use of the cloud environment)”.
HPE's failure to heed its own SANs' warnings may therefore cost it many clients who also decide the cloud is safer than any SAN.
There are still plenty of loose ends here, foremost among them just how cabling could cause an outage of this magnitude. Why HPE didn't simply replace cables that kept producing errors is another matter we imagine many readers would like resolved: cables aren't big-ticket items, and since the ATO had a failover site, HPE should have been able to find a change window.
The ATO has also stayed schtum about the manufacturer of the disks: The Reg imagines readers will be keen to know which company's kit gets corrupted firmware when SANs crash. And of course there's also the question of how HPE let itself build a configuration with the recovery tools on the array. We'll let you know if and when those loose ends can be made neat. ®