The biggest British Airways IT meltdown WTF: 200 systems in the critical path?

It's not the velociraptor you can see that kills you

One of the key principles of designing any high availability system is to keep everything non-vital out of it: only apps or functions that genuinely must be there should sit in the critical path – sometimes referred to as KISS (Keep It Simple, Stupid).

High availability – or reliability – is always technically challenging at whatever level of the system it is achieved, be it hardware or software. The colossal systems failure at British Airways has been blamed on a "power surge" trigger followed by a messaging system failure.

However, within the comments of the BA chief executive there is one telling statement:

"Tens of millions of messages every day that are shared across 200 systems across the BA network and it actually affected all of those systems across the network."

Sorry for the text speak, but WTF? How does it require 200 systems to issue a boarding pass, check someone in and pass their security details on to the US – even if they aren't flying there? Buried deep in The Register's comments on the article is a claim from an allegedly former BA employee that this is in fact the case: all of these systems are required for BA to function. How did BA get to the point that there are 200 systems in the critical path?

The problem with current IT systems is that, even with no high availability elements in the path, they are hugely reliable once an initial burn-in period has passed. Failures even in this setting are sufficiently rare that, unless you look at IT systems as a whole, it can seem like they never occur.

So, sure, we need this new function in the path – just add another server (virtual machine) and off we go; maybe spread it over a couple of data centres while we're at it. Can't be a problem – we've never seen a failure, so why do those IT guys keep telling me I have to spend millions on re-factoring the system to ensure it is highly available?

Another organisation that struggled to internally communicate the true nature of the reliability risk they were facing was NASA – and the consequences of that were even more visible than BA's. This also demonstrated a spectacularly poor understanding of the nature of risk on the part of senior management.

During the Rogers Commission's investigation into the Challenger disaster, Richard Feynman examined the NASA approach to estimating failure rates. NASA's management believed that the risk of shuttle failure was "necessarily" one in 10⁵ (100,000). This figure seemed "fantastical" to Feynman and so he estimated the failure rate himself and obtained a figure of one in 100.

Moreover, once he involved NASA's engineers in the calculation, the figure came in at between one in 50 and one in 200. How could there be such a disconnect between the engineers' view of the failure rate of the system they designed, and the management's view of the system they commissioned?

In the case of the shuttle, many engineers had raised the issue that ultimately led to the failure, but their warnings fell on deaf ears. Indeed, it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising – often having neither an engineering nor a scientific background.

NASA were well aware of the exposure that a failed space shuttle, likely to be both explosive and public, would cause. Indeed, from a risk consequence perspective, the outcome was regarded as having similarly negative connotations to the assassination of a president. So how could they get it so wrong? There are almost no organisations – actually there are none – that like or encourage prophets of doom.

So what, if any, are the parallels for large-scale IT systems?

For many organisations (citing practically all UK government utterances on IT issues as evidence), senior management has practically no meaningful IT knowledge beyond the ability to press the buttons on their smartphone or tablet. Within the IT function, senior management figures are generally chosen – by non-technical managers – for their management rather than their technical abilities.

How many organisations, BA included, have a detailed model of why their systems are fit for purpose? Just as the space shuttle was "necessarily" good for one failure in 10⁵, how many IT systems are claimed to be five nines on the basis of a box-and-line diagram showing the presence of duplicate resilient systems? Are the models used by IT management to understand the underlying failure rate of their systems any better than the ones used by NASA management to arrive at their necessary 99.999 per cent?
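The arithmetic behind those box-and-line diagrams is worth spelling out. A minimal sketch, using purely illustrative numbers: duplicated systems look superb on paper precisely because the calculation assumes the copies fail independently – the assumption a common-mode event such as a power surge breaks.

```python
# Sketch of the naive redundancy arithmetic behind a box-and-line diagram.
# All figures here are illustrative assumptions, not BA's actual numbers.

def parallel_availability(a: float, copies: int = 2) -> float:
    """Availability of `copies` redundant units, assuming independent
    failures: the function is up if any one copy is up."""
    return 1 - (1 - a) ** copies

a = 0.99  # each unit a modest "two nines"
print(parallel_availability(a))  # duplication appears to give four nines

# But if a common-mode failure (power, network, shared software bug)
# takes out both copies, say, 0.1 per cent of the time, availability
# is capped near three nines no matter what the diagram shows:
p_common = 0.001
print((1 - p_common) * parallel_availability(a))
```

The diagram's four nines and the real three nines differ by an order of magnitude of downtime – which is exactly the gap between a management slide and an engineering model.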

It is unlikely that any of the 200 systems BA needs to be functional to keep operating is a simple computational unit. Each of these sub-systems will itself have complex internal interdependencies between the servers, network, storage and software that come together to deliver the function. The sheer number of potential points of failure that BA was exposed to is hard to believe. Fortunately, by default they fail very, very rarely, so it is easy to believe that failure simply cannot occur.

It is clear that BA is suffering from criticality bloat. They have permitted systems to be added to the critical business path willy-nilly. The systems fail so rarely that surely this cannot be a problem – but what about the system you add to the critical delivery path but don't know about?
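The cost of criticality bloat compounds quietly. A minimal sketch, assuming (purely for illustration) that each system in the chain fails independently with the same availability: even very reliable individual systems multiply out to a fragile whole once 200 of them all have to be up.

```python
# Illustrative sketch: availability of n systems in series, assuming
# independent failures and a uniform per-system availability.
# The numbers are hypothetical, not BA's actual figures.

def series_availability(per_system: float, n: int) -> float:
    """Whole-chain availability when all n systems must be up at once."""
    return per_system ** n

one = 0.9999  # a single system at "four nines" sounds excellent...
print(f"1 system:    {one:.4%}")
# ...but 200 of them in the critical path compound the unavailability:
print(f"200 systems: {series_availability(one, 200):.4%}")  # ~98.02%

# Roughly 2 per cent unavailability is ~173 hours of outage a year:
downtime_hours = (1 - series_availability(one, 200)) * 365 * 24
print(f"expected downtime: {downtime_hours:.0f} hours/year")
```

Every system added to the critical path shaves a little more off the product – which is why the KISS principle insists on keeping everything non-vital out of it.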

When confronted with complexity, people have an inevitable tendency to retreat into hope and historic belief. One consequence of this is the conviction that if an event hasn't happened yet, it is very unlikely ever to happen. In probability circles this kind of reasoning is related to the gambler's fallacy, the basis of a significant fraction of the earnings currently achieved on the web – a great example is the so-called guaranteed winning "doubling" (martingale) strategy for roulette.
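The doubling strategy makes a neat worked example of why "it hasn't failed yet" is no guarantee. A minimal simulation sketch – bankroll, table limit and round count are all illustrative assumptions, and the win probability 18/37 is a European-roulette even-money bet:

```python
import random

# Sketch: the "guaranteed" doubling (martingale) strategy on a
# European roulette even-money bet. Stake, table limit and session
# length are illustrative assumptions.

def martingale_session(bankroll=1000, base_bet=1, table_limit=500,
                       rounds=200, rng=None):
    rng = rng or random.Random()
    bet = base_bet
    for _ in range(rounds):
        if bet > min(bankroll, table_limit):
            break  # the rare ruinous losing streak: can't double any more
        if rng.random() < 18 / 37:
            bankroll += bet
            bet = base_bet   # pocket the win, return to the base stake
        else:
            bankroll -= bet
            bet *= 2         # "double after every loss"
    return bankroll

rng = random.Random(42)
results = [martingale_session(rng=rng) for _ in range(10_000)]
print(sum(r > 1000 for r in results) / len(results))  # most sessions end ahead...
print(sum(results) / len(results))  # ...yet the mean reflects the house edge
```

Most simulated sessions show a small profit – exactly the "it never fails" experience – while the rare streak that exhausts the bankroll or hits the table limit quietly eats the lot. Substitute "systems" for "spins" and the parallel with criticality bloat is hard to miss.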

For any IT-dependent organisation – which in reality is pretty much every organisation these days – a fundamental question should be: why does the organisation believe its IT is sufficiently robust to allow it to meet its operational goals? What is the evidence that belief is based on? How has that evidence been validated? Is there a predictive model – not a picture on a slide deck – of why the system as a whole stays up?

Just like velociraptors, it's not the one you can see that kills you. ®

Biting the hand that feeds IT © 1998–2022