Monzo, a UK online banking startup, suffered an outage lasting more than an hour on Friday, thanks to a four-month-old Kubernetes bug.
The Fatal Flaw, as the event might be titled by author Lemony Snicket, took down a complete production cluster, according to Oliver Beattie, head of engineering for Monzo, "through a very unfortunate series of events."
Customers saw incoming payments delayed and outgoing payments fail during this period. Monzo essentially operates as an internet-based bank, accessible through its smartphone app, that offers current accounts, budgeting tools, spending warnings, and so on.
On Monday, Beattie posted an analysis of the incident and laid the blame on a Kubernetes bug and an incompatibility with related software.
Monzo's stack, Beattie explained, relies on Kubernetes for cluster orchestration, the distributed database etcd, and linkerd, software that manages cluster routing and load balancing.
Two weeks prior to the outage, Monzo's platform team upgraded its etcd cluster to a new version and expanded it from three nodes to nine. In so doing, they set the stage for the outage. On Thursday, an engineering team deployed a new feature for account holders, but started seeing issues and scaled the service down to zero replicas, leaving it registered as a Kubernetes service with no pods running behind it.
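That zero-replica state is worth dwelling on: scaling a deployment down removes its pods (and hence its endpoints), but the service record itself stays registered. A toy Python model — an assumption-laden sketch, not Monzo's actual stack or the real Kubernetes API — illustrates the state the cluster was left in:

```python
# Toy stand-in for Kubernetes service/endpoint records (hypothetical,
# for illustration only -- real clusters use the Kubernetes API).
class Registry:
    def __init__(self):
        self.services = {}  # service name -> list of pod IPs (endpoints)

    def deploy(self, name, pod_ips):
        self.services[name] = list(pod_ips)

    def scale_to_zero(self, name):
        # Scaling down removes the pods/endpoints,
        # but not the service entry itself.
        self.services[name] = []

registry = Registry()
registry.deploy("new-feature", ["10.0.0.7", "10.0.0.8"])
registry.scale_to_zero("new-feature")

print("new-feature" in registry.services)  # True: the service persists
print(registry.services["new-feature"])    # []: but it has no endpoints
```

It was this combination — a service that exists but resolves to nothing — that later tripped up linkerd.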
On Friday, around 14:10 BST, a change was made to a service used for processing payments. At that point, customers began experiencing payment failures. Two minutes later, the change was rolled back but the problems persisted.
By 14:18, Monzo's engineers traced the problem to
linkerd. The software wasn't receiving updates from Kubernetes about where new pods were running on the network and was routing requests to IP addresses that were no longer valid.
At 14:26, they decided to restart the several hundred
linkerd instances running on the backend, in the belief that doing so would fix the issue across the board. But they couldn't: the kubelets running on the cluster's nodes were unable to fetch configuration data from the Kubernetes apiservers.
Suspecting additional issues affecting either Kubernetes or
etcd, they restarted three
apiserver processes. Come 15:13, all the
linkerd pods had restarted. But the banking app's services were not receiving any requests. It was, by this point, a full platform outage.
At 15:27, the engineers noticed
linkerd logging a
NullPointerException while trying to read the service discovery response from the
apiservers. They realized the failure to parse empty responses was due to an incompatibility between the versions of Kubernetes and
linkerd being run.
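The failure mode is easy to reproduce in miniature. The following Python sketch is a hypothetical analogue, not linkerd's actual code: a discovery client that assumes every service record carries at least one address will blow up on the empty record left behind by the zero-replica service.

```python
# Hypothetical sketch of the failure mode (not linkerd's real parser):
# a naive client dereferences the address list without checking whether
# it is empty -- the analogue of the NullPointerException linkerd
# logged when reading an empty service discovery response.
def pick_backend(record):
    return record["addresses"][0]

healthy = {"name": "payments", "addresses": ["10.0.1.4"]}
empty = {"name": "new-feature", "addresses": []}  # zero replicas

print(pick_backend(healthy))  # routes normally: 10.0.1.4
try:
    pick_backend(empty)
except IndexError as exc:     # Python's stand-in for an NPE
    print(f"discovery parse failed: {exc!r}")
```

One empty record is enough to break the parse — which is why, in a patched or compatible version, empty responses have to be handled explicitly rather than dereferenced.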
To restore service, they turned to an updated version of
linkerd being tested in the company's staging environment. After deploying the version upgrade, they realized they could avoid the error that arose from trying to parse services with no replicas by deleting those empty services. That allowed
linkerd to resume its service discovery and the platform started to recover.
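The workaround can be sketched in the same toy terms — again a hypothetical model, not Monzo's actual remediation commands: pruning the service records that have no endpoints lets a naive discovery pass complete instead of failing on the first empty entry it meets.

```python
# Toy sketch of the workaround (illustrative only): discovery that
# picks the first address of every service fails on an empty record,
# so the empty services are deleted before discovery runs.
def discover(records):
    return {r["name"]: r["addresses"][0] for r in records}

records = [
    {"name": "payments", "addresses": ["10.0.1.4"]},
    {"name": "new-feature", "addresses": []},  # scaled to zero replicas
]

# Workaround: drop services with no replicas behind them.
pruned = [r for r in records if r["addresses"]]
print(discover(pruned))  # {'payments': '10.0.1.4'} -- discovery succeeds
```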
Beattie said his team "found a bug in Kubernetes and the
etcd client that can cause requests to timeout after cluster reconfiguration of the kind we performed the week prior. Because of these timeouts, when the service was deployed,
linkerd failed to receive updates from Kubernetes about where it could be found on the network."
Restarting the linkerd instances compounded the problem, he said, because it revealed an incompatibility between specific versions of
linkerd and Kubernetes.
"I want to reassure everyone that we take this incident very seriously; it’s among the worst technical incidents that have happened in our history, and our aim is to run a bank that our customers can always depend on," Beattie concluded. "We know we let you down, and we’re really sorry for that."
The frank mea culpa appears to have been well-received by customers, with a number of them voicing appreciation for the detailed disclosure and explanation. ®