Visa has said a “very rare” partial network switch failure in one of its two data centres led to the fiasco earlier this month that caused millions of transactions in Europe to be declined.
The outage, which lasted for about 10 hours on Friday, June 1, sparked panic among European pub-goers, as about 10 per cent of the 51.2 million transactions attempted during that time failed.
In a letter late last week [PDF] to the UK’s Treasury Committee – which asked the firm to explain itself – Visa's European boss Charlotte Hogg explained the error, as well as issuing the usual “unreserved” apologies for the outage.
“We take seriously our important role in supporting financial stability in the UK,” she said. “A disruption to our processing that impacts consumers at any time is unacceptable, let alone during a busy Friday afternoon.”
The letter pointed the finger at a “very rare” component fault within a network switch that prevented a backup switch from taking over at one of Visa’s two data centres in the UK. That interfered with the two centres' mechanisms for communicating with each other, we're told, generating a backlog of work that overwhelmed the system.
It took about 10 hours for normal service to resume. The outage affected about 1.7 million UK cardholders, with around 9 per cent of transactions initiated on UK-issued cards failing to process.
The firm has launched a number of reviews and is also in the process of migrating its European systems to a more resilient global processing system: VisaNet.
In more detail...
Hogg explained that both of Visa's two UK-based data centres are constantly processing transactions, and that each is able to handle all of Visa’s European transactions should the other centre completely fail or become overloaded. In order to do this, they must be continuously synchronised, so transactions can be immediately routed to either site for processing. They stay synchronised by exchanging messages.
There are a number of backup systems, Hogg added, as well as two core network switches for directing the flow of transactions. However, the aforementioned switch equipment fault in the primary centre prevented a backup switch from activating, and stopped the two sites from working together.
“In this instance, the switch being used in our primary data centre experienced a very rare, partial failure, which impacted the secondary site and prevented it from automatically processing all transactions, as it should have,” Hogg said.
“As a result, it took far longer than it normally would to isolate the system at the primary data centre; in the interim, the malfunctioning system at the primary data centre continued to try to synchronise messages with the secondary site.
“This created a backlog of messages at the secondary data centre, which, in turn, slowed down that site's ability to process incoming transactions.”
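To see why a flood of synchronisation messages would hurt transaction processing, here is a toy model of the secondary site. It is a rough sketch, not Visa's system: the function name, the shared work budget, the proportional sharing rule and every number are assumptions made up for illustration. Unserved sync messages queue up as backlog, while unserved transactions are simply counted as declined.

```python
def simulate(ticks, capacity, txn_rate, sync_rate):
    """Toy model of one site with a fixed work budget per tick.

    Hypothetical assumptions: sync messages and transactions share the
    same budget in proportion to demand, and both rates are positive.
    Returns (fraction of transactions served, final sync backlog).
    """
    backlog = 0.0
    served = 0.0
    for _ in range(ticks):
        sync_pending = backlog + sync_rate
        demand = sync_pending + txn_rate
        # If demand exceeds capacity, every class of work is scaled back.
        scale = min(1.0, capacity / demand)
        served += txn_rate * scale
        # Sync messages that were not handled stay queued for the next tick.
        backlog = sync_pending * (1.0 - scale)
    return served / (ticks * txn_rate), backlog


# Invented numbers: a healthy peer sends 30 sync messages per tick,
# a malfunctioning one keeps retrying and sends 80.
healthy_served, healthy_backlog = simulate(100, capacity=100, txn_rate=60, sync_rate=30)
faulty_served, faulty_backlog = simulate(100, capacity=100, txn_rate=60, sync_rate=80)

print(f"healthy peer: {healthy_served:.0%} served, backlog {healthy_backlog:.0f}")
print(f"faulty peer:  {faulty_served:.0%} served, backlog {faulty_backlog:.0f}")
```

In this sketch the healthy case serves every transaction with no backlog, while the faulty case leaves a standing backlog and a much lower share of transactions served. The real failure mode will have been messier, but the shape is the same: the longer the broken side keeps talking, the less the working side gets done.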
Specifically, she noted:
A component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.
The payment processor had to take a series of steps to fully shut off the faulty switch, isolate the primary data centre and stop the message backlog, including turning off all software applications at the primary site and cleaning up message backlogs at the secondary site by both manual and automatic means.
However, it took until 1910 local time – the cockup having been noticed at 1435 – to fully deactivate the system causing the transaction failures at the primary data centre, by which time the second data centre had begun processing almost all transactions normally. Normal service had resumed at both data centres by Saturday morning at 0045.
Nine per cent fail rate
During this time, there were two peak periods of disruption in the UK, when an average of 35 per cent of transactions failed to process: ten minutes between 1505 and 1515, and 50 minutes between 1740 and 1830. The fail rate was 7 per cent at all other times in the UK.
Overall, the biz reported that 27.6 million Visa transactions were initiated on 16 million UK-issued cards, with 2.4 million (9 per cent) of these transactions failing - slightly less than the European fail rate of 10 per cent.
However, Visa added that a number of people successfully retried their transaction; taking this into account, it said, the overall rate of transactions that failed to process dropped by approximately half.
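For those who like to check the sums, the quoted figures can be run through a few lines of Python. The variable names are ours, and treating "approximately half" as exactly 0.5 and applying it to the UK figure is an assumption, so read the last number as a ballpark only.

```python
# Figures as quoted from Visa's letter in this article.
uk_initiated = 27_600_000   # transactions initiated on UK-issued cards
uk_failed = 2_400_000       # of which failed to process
eu_attempted = 51_200_000   # transactions attempted across Europe
eu_fail_rate = 0.10         # "about 10 per cent"

uk_fail_rate = uk_failed / uk_initiated
eu_failed = round(eu_attempted * eu_fail_rate)

# Assumption: "dropped by approximately half" is taken as exactly 0.5.
uk_rate_after_retries = uk_fail_rate * 0.5

print(f"UK fail rate: {uk_fail_rate:.1%}")
print(f"European failures implied: {eu_failed:,}")
print(f"UK fail rate after retries (rough): {uk_rate_after_retries:.1%}")
```

The UK rate works out a shade under 9 per cent before rounding, and the European figure implies roughly five million declined transactions.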
Hogg said the payment giant was working with the switch hardware manufacturer to figure out why the device failed when it did, and was taking steps that will allow it to isolate and remove a failing component in a more automated and timely way in future.
In addition, Hogg noted that Visa is migrating its European processing onto its global system, VisaNet - a process that is due to complete by the end of 2018.
VisaNet has a different technical architecture with multiple data centres, she said. It has “significantly” more capacity and scale, and is better at detecting and recovering from partial malfunctions.
Anticipating the next obvious question, Hogg added: “It is worth noting that the incident on 1 June was in no way related to this migration, which has been underway since February and has been going well, following a robust migration plan.”
Visa has also asked international accountancy firm EY to review the incident and is offering compensation to people affected.
The Treasury Committee said it was satisfied with Visa's answers, but expected to see the findings of the review. ®