Software dev 101: 'The best time to understand how your system works is when it is dying'
Architect for failure, sure, but know that it will never be easy
QCon London At the QCon Developer conference underway in London, William Hill's R&D Engineering Lead Gavin Stevenson told attendees that they should celebrate IT failures.
"The best time to understand how your system works is when it is dying," he said.
QCon is a vendor-neutral event focused on large-scale software development and architecture, and is relatively hype-free.
Stevenson's underlying point is that examining how an application fails when under stress is more illuminating than simply observing it working. Failure identifies the limitations of the system. That said, his team works in R&D, rather than on production systems, and the real goal is to avoid failure.
William Hill is a Java shop; but its next-generation bet settlement system under development is written in Erlang, which is designed for concurrency. "The syntax is simple," said Stevenson, "and the supervisor hierarchy makes it really nice to work with."
Supervisors, part of Erlang's OTP (Open Telecom Platform) library, manage child processes and restart them when they crash, adding resilience.
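For readers who have not met the supervisor idea, here is a loose Python analogy. It is a minimal sketch, not Erlang and not OTP's API: the names `Supervisor`, `max_restarts` and `make_flaky_worker` are invented for illustration, and a real OTP supervisor watches separate processes and limits restarts over a time window rather than with a simple counter.

```python
class Supervisor:
    """Toy supervisor: re-runs a child callable when it raises (hypothetical names)."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self, child, *args):
        while True:
            try:
                return child(*args)
            except Exception as exc:
                # Give up once the restart budget is spent, as a real
                # supervisor would escalate to its own parent.
                if self.restarts >= self.max_restarts:
                    raise RuntimeError("restart limit reached") from exc
                self.restarts += 1


def make_flaky_worker(failures):
    """Build a worker that crashes `failures` times, then succeeds."""
    state = {"left": failures}

    def worker(bet_id):
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("simulated crash")
        return f"settled {bet_id}"

    return worker


sup = Supervisor(max_restarts=3)
result = sup.run(make_flaky_worker(2), 42)
```

Here the worker crashes twice and the supervisor restarts it both times before the call finally succeeds; a worker that keeps crashing beyond the limit makes the supervisor itself fail.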
Stevenson's team decided to use an in-memory database for performance. They tested the system by using a log of all the bets placed for last year's Grand National, over 6.2 million, and replaying them as fast as possible.
"Our app failed, which was brilliant," he said. There was "massive contention" in the database and excessive memory consumption, over 50GB.
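The replay approach is simple to picture. The sketch below is an assumption about its general shape, not William Hill's harness: `replay` and `handle` are hypothetical names, and the bets are plain dictionaries rather than real log records.

```python
import time


def replay(bets, handle):
    """Feed every logged bet to `handle` with no pacing; return (count, seconds)."""
    start = time.perf_counter()
    count = 0
    for bet in bets:
        handle(bet)  # the system under test; no sleep between bets
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


# Stand-in for a recorded log and a system under test.
log = [{"id": i, "stake": 5} for i in range(1000)]
settled = []
count, elapsed = replay(log, settled.append)
```

Dividing `count` by `elapsed` gives the sustained rate; watching memory and queue depth while it runs is what exposes the kind of contention Stevenson described.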
The redesign used sharding (a technique for partitioning the data), load-balanced supervisors, distributed logging using Apache Kafka, and multiple betting engines, and it avoided spawning a new Erlang process for every bet. The result was a resilient, scalable system that could process 6 million bets in 20 minutes, or about 5,000 a second.
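Sharding is the easiest of those pieces to illustrate. The following is a minimal sketch assuming hash-based partitioning by bet key; the talk did not say how William Hill chooses its shards, and `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """In-memory store split into independent dictionaries, one per shard."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"bet-{i}", {"stake": i})
```

Because each key always lands on the same shard, work on different shards need not contend for the same data, which is the point of partitioning in the first place.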
Stevenson's team also relies on Docker containers for deployment. "Everything we do in R&D, it's Docker," he said; though they have struggled with container load-balancing and orchestration. "There isn't a brilliant solution," he said, though they are looking at Docker Swarm, a product for clustering Docker engines.
"It's a reactive microservice-based architecture," said Stevenson. "Probably. Nobody seems to agree what microservices is."
Making your application fail, then, is a handy tool for application development; but only one small piece in the wider task of designing resilient systems.
At an "open space" discussion which followed Stevenson's talk, William Hill's relatively clean-room development story, in the comfort of R&D, seemed remote from the reality facing many businesses.
One attendee, in the financial services industry, lamented the many dependencies in the system he managed, any one of which could stop things working. The core problem was a legacy back-end system including IBM's WebSphere MQ, SOAP web services and JDBC (Java) database calls. "It's 30 years of legacy," he said. "When will we get the budget to fix it? Not in my lifetime."
Nor is today's rush towards microservices architecture a complete solution. Each microservice is a dependency, and what happens when one breaks? "You have to dig into why it doesn't work, how do you react quickly?" asked an attendee.
Even if you think you know how it should be done, implementing today's best practice in the real world is a huge challenge. ®