Labyrinth of 371 legacy systems hindered hospital's IT meltdown recovery
Guy's and St Thomas' in London spent two months getting back on its feet after heatwave fried datacenter
Last summer's datacenter outage at one of the UK's largest hospitals took two months to completely rectify because of the complexity associated with 371 legacy IT systems, a new report has found.
Guy's and St Thomas' NHS Foundation Trust suffered an IT outage at the peak of last summer's heatwave, when temperatures hit 40°C (104°F), causing two linked datacenters to fail simultaneously. Each had been designed as backup for the other.
The failure resulted in most of the clinical IT systems at the trust's London hospitals and related community services becoming unavailable to users, forcing staff to employ a paper-based system to keep records and find information.
The trust incurred £1.4 million ($1.7 million) in out-of-plan spending on technology services to respond to the incident. This included a cloud-hosted environment to provide resilience for data backups and a third-party specialist recovery service to image and extract data from disks corrupted during the datacenter failure.
The report identified one event of moderate harm caused to a patient, and evidence of more cases may yet come to light.
The impact on hospital staff was severe. "The incident took a heavy toll on staff, who reported fatigue, stress and an adverse impact on morale. In particular, this affected frontline clinical and operational staff, who worked tirelessly to provide safe patient care, and also the IT team who worked tirelessly, often around the clock, to recover critical IT systems under immense pressure," according to a board report published last week.
The trust declared a critical incident on July 19, 2022. Despite the best efforts of IT staff, it was not lifted until September 21, more than two months later, although the core clinical systems were recovered in six weeks.
"There was widespread frustration with how long it took to recover core clinical IT systems: several weeks rather than hours or days. This was not a reflection on the effort or professionalism of the Trust's IT team, but demonstrated the limited number of individuals who had a detailed understanding of the Trust's legacy IT systems which were too numerous, complex and inter-linked to be recovered quickly," the report said.
Guy's and St Thomas' has 371 legacy IT systems that support patient records, patient administration, clinical services and infrastructure across Guy's Hospital, St Thomas' Hospital, Evelina London Children's Hospital and the trust's community services. The outage affected electronic patient records, electronic prescribing systems, electronic ordering for investigations and e-Notation.
Responsibility for the datacenter and clinical systems was divided between the trust's in-house Data, Technology & Informatics Directorate (DT&I), its in-house estates and facilities management group, IT service provider Atos, storage area network manufacturer NetApp, and Secure IT, a third-party company responsible for servicing the datacenter air conditioning.
The Guy's datacenter was constructed in 2007, while St Thomas' was built in 2012. The IT infrastructure was updated between 2015 and 2016.
The report pointed to a combination of factors: suboptimal cooling systems, ageing technological infrastructure, and overly complex and distributed roles and responsibilities for managing elements of the datacenter environment.
Linked to the last point was a cooling response on the day of the incident that was insufficient, in both speed and scale, to mitigate extreme ambient temperatures, the report said.
For example, preparations had been made in advance of July 19 to hose down the condensers at the St Thomas' site. However, problems with a hose connector meant this was delayed and not as effective as it could have been. Later that day, temperatures inside the datacenter were recorded at 50°C (122°F).
There was also a failure to recognize the shared environmental risk of having two closely located datacenters each act as backup for the other.
"Given the relative proximity of the two datacenters, it could have been foreseen that an environmental cause, such as a heatwave, could affect the two datacenters simultaneously. There had been previous concerns about the cooling systems at both datacenters: the ventilation at St Thomas' was known to be sub-optimal, though significant mitigations had been put in place following a trip-switch activation during a previous heatwave. The Guy's datacenters cooling system was approaching the end-of-life, though the manufacturer had confirmed earlier in the year that it was still just within its expected operating lifespan," the report said.
The failure of both datacenters at the same time left some of the backup servers in a conflicted state, which neither the internal IT department nor Atos could resolve. Disaster recovery software vendor Zerto was contacted to troubleshoot and identified a workaround for the affected server groups: a time-consuming manual process of extracting and copying files.
The outage also revealed problems with the level of technical knowledge required for disaster response and recovery. The report noted that Atos staff "required technical direction by DT&I staff when managing system shutdown in Guy's datacenter." ®