Analysis London's airspace was effectively shut down on Friday afternoon after a flight data server fell over, the National Air Traffic Services (NATS) has confirmed to The Register after multiple sources gave us specific details of the cockup.
Hundreds of flights were cancelled or diverted after NATS was forced to restrict the airspace over the capital for less than an hour. Operations are now up and running, and NATS says investigations are continuing – but a couple of well-placed sources familiar with the situation have explained what went wrong.
Air traffic control systems are enormously complicated and riddled with failsafe systems; when you're directing flying tubes with hundreds of people and highly combustible jet fuel, nobody at NATS just wings it. Today's problems were down to a combination of a server and data connection failure triggering the automatic safety systems that make sure everybody gets onto the ground safely, albeit a little late in some cases.
Safety in duplication
To keep the skies running smoothly, the air traffic control (ATC) system uses dual feeds of data. First, every aircraft has to file a flight plan showing exactly when, where, and at what altitude it intends to fly over the UK. This data is stored on a server, dubbed the flight data processing system.
At the same time, all flights in UK airspace are tracked on radar and that information is sent to a central flight server. This flight server matches the actual progress of air traffic on radar with the planned information from the flight data processing system and feeds that data to operations controllers directing air traffic.
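As a rough illustration of that matching step, here is a minimal Python sketch. It is not NATS code: the record fields, the callsign-based lookup, and the 300ft tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightPlan:
    """Hypothetical filed plan: who is flying and at what altitude."""
    callsign: str
    planned_altitude_ft: int


@dataclass(frozen=True)
class RadarTrack:
    """Hypothetical radar observation of one aircraft."""
    callsign: str
    altitude_ft: int


def correlate(plans, tracks, tolerance_ft=300):
    """Pair each radar track with its filed plan and flag mismatches.

    tolerance_ft is an illustrative value, not an operational figure.
    """
    by_callsign = {plan.callsign: plan for plan in plans}
    report = []
    for track in tracks:
        plan = by_callsign.get(track.callsign)
        if plan is None:
            report.append((track.callsign, "no flight plan"))
        elif abs(track.altitude_ft - plan.planned_altitude_ft) > tolerance_ft:
            report.append((track.callsign, "off planned altitude"))
        else:
            report.append((track.callsign, "as planned"))
    return report
```

Feeding it one plan and two tracks, one of them for an aircraft with no plan on file, returns a status per track that a controller's display could then draw on.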
On Friday afternoon an IBM S/390 mainframe running the flight data processing system fell over, according to sources familiar with the matter.
It hasn't been confirmed whether it was a hardware or software flaw, but one well-placed source said the machine had never had a hardware failure before, so software was more likely. There is a backup flight data processing system, which kicks in within seconds if the primary fails.
"Invariably someone puts a flight plan wrong and it borks the system," one source told El Reg on condition of anonymity.
"If the same data goes into the backup server it will sometimes fall over on same processing problems, and start switching back and forth with the main server. When we get a switchover then the first thing that's usually done is to shut down the backup processor."
Doing that takes time, but the engineers at NATS are used to sorting out problems quickly. They have to be: once the flight data processing system fails, a countdown begins before emergency safety measures take control.
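The back-and-forth switching the source describes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real system: the `process` function, the blank-plan failure condition, and the switchover limit are all hypothetical.

```python
def process(plan: str) -> str:
    """Stand-in for flight plan processing; rejects a blank plan."""
    if not plan.strip():
        raise ValueError("malformed flight plan")
    return plan.upper()


def run_with_failover(plan: str, max_switchovers: int = 4):
    """Try the active server; on failure, hand the same plan to the other.

    Returns (server, result, switchovers). Because both servers run the
    same logic on the same input, a bad plan fails on each in turn until
    the switchover limit (an illustrative value) is reached.
    """
    active = "primary"
    switchovers = 0
    while switchovers < max_switchovers:
        try:
            return active, process(plan), switchovers
        except ValueError:
            # Swap roles and retry with the identical input.
            active = "backup" if active == "primary" else "primary"
            switchovers += 1
    return None, None, switchovers
```

A well-formed plan is handled by the primary with no switchovers. A blank one bounces between the two servers until the limit is hit, which is why a redundant pair offers little protection against bad input that both copies process identically.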
If the flight data processing system is down for more than eight minutes, the flight server alerts controllers that the data it is getting is stale. Aircraft can travel a long way in those few minutes, so controllers then focus on radar and start to shut down flights in a process described as "graceful degradation."
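The eight-minute countdown amounts to a staleness check on the flight plan feed. A minimal sketch in Python follows, assuming a simple timestamp comparison; the class name, the mode labels, and the clock handling are invented for illustration, and only the eight-minute figure comes from our sources.

```python
from dataclasses import dataclass

STALE_AFTER_SECONDS = 8 * 60  # the eight-minute threshold cited by sources


@dataclass
class FeedWatchdog:
    """Hypothetical tracker of when the flight plan feed last delivered data."""
    last_update: float  # seconds on any monotonically increasing clock

    def record_update(self, now: float) -> None:
        """Note that fresh data arrived at time `now`."""
        self.last_update = now

    def is_stale(self, now: float) -> bool:
        """True once the feed has been silent for more than the threshold."""
        return now - self.last_update > STALE_AFTER_SECONDS

    def mode(self, now: float) -> str:
        """Illustrative operating mode for the controllers' display."""
        if self.is_stale(now):
            return "radar only, reduce inbound flow"
        return "normal"
```

Passing the clock reading in as an argument, rather than reading the system time inside the class, keeps the check easy to test: a watchdog last updated at zero reports "normal" at 479 seconds and the degraded mode at 481.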
This involves reducing the flow of aircraft into UK airspace, either by directing flights elsewhere or by bringing them down as quickly and safely as possible at their intended destination. Given the number of lives involved, aircraft and passenger safety always takes precedence over convenience.