The IT cockup at the National Air Traffic Services (NATS) that grounded hundreds of flights in December occurred because both of its System Flight Server (SFS) channels went down, an independent report has revealed.
"The disruption on 12 December 2014 arose because – for the first time in the history of the SFS – both channels failed at the same time," said the NATS System Failure 12 December 2014 – Interim Report.
The cockup led to 120 flights being cancelled and 500 being delayed by 45 minutes, affecting 10,000 passengers in total.
In December, sources told El Reg that the problem happened when an IBM S/390 mainframe running the flight data processing system fell over.
The incident started with the failure at 14:44 of a computer system used to provide data to Air Traffic Controllers to assist in their decision-making when managing the traffic flying at high level over England and Wales, said the report.
At 14:55, all departures from London airports were stopped, and at 15:00 so were all departures from European airports that were due to route through the affected UK airspace. The systems were restored at 20:10, it said.
Every operational role performed within London Area Control has a unique identifier known as an Atomic Function, a label that ensures the SFS supplies the appropriate information and communication capabilities to each workstation.
But a latent defect meant the maximum number of Atomic Functions had been incorrectly set to 151 rather than the correct figure of 193.
"The primary SFS believed that it had more active Atomic Functions than the maximum capacity, a situation that should not be allowed to occur," said the report. "When an error of this kind occurs SFS is programmed to shut down in order to prevent the risk of supplying corrupt data to controller workstations. When responsibility transferred to the secondary SFS the command to enter Watching Mode was replayed triggering the same error."
NATS has denied accusations that the problem occurred because the body "skimped" on its IT investment. The body became a public-private partnership in 2001.
The final report will be published before 14 May 2015 and will address the wider issues behind the failure including the panel’s views on the root causes lying behind the incident. ®