This article is more than 1 year old
Cloudflare's outage was human error. There's a way to make tech divinely forgive
Don't push me 'cos I'm close to the edge. And the edge is safer if you can take a step back
Opinion Edge is terribly trendy. Move cloudy workloads as close to the user as possible, the thinking goes, and latency goes down, as do core network and data center pressures. It's true – until the routing sleight-of-hand breaks that diverts user requests from the site they think they're getting to the copies in the edge server.
If that happens, everything goes dark – as it did last week at Cloudflare, edge lords of large chunks of web content. It deployed a Border Gateway Protocol policy update, which promptly took against a new fancy-pants matrix routing system designed to improve reliability. Yeah. They know.
It took some time to fix, too, because in the words of those in the know, engineers "walked over each other's changes" as fresh frantic patches overwrote slightly staler frantic patches, taking out the good they'd done. You'd have thought Cloudflare of all people would be able to handle concepts of dirty data and cache consistency, but hey. They know that too.
What's the lesson? It's not news that people make mistakes, and the more baroque things become, the harder they are to guard against. It's just that what gets advertised on BGP isn't just routes but things crapping out, and when you're Cloudflare that's what the C in CDN becomes. It's not the first time it's happened, nor the last, and one trusts the company will hire a choreographer to prevent further op-on-op stompfests.
Yet if it happens, and keeps happening, why aren't systems more resilient to this sort of problem? You can argue that highly dynamic and structurally fluid routing mechanisms can't be algorithmically or procedurally safeguarded, and we're always going to live in the zone where the benefits of pushing just a bit too hard for performance is worth the occasional chaotic hour. That's defeatist talk, soldier.
There's another way to protect against the unexpected misfire, other than predicting or excluding. You'll be using it already in different guises, some of which have been around since the dawn of computer time: state snapshotting. No matter what a computing device is doing, it's going from one state to another and those states absolutely define its functioning. Take a snapshot and you're freezing that state in time; return to that state and it's as if nothing that happened after that point never happened. The bad future has been erased. God-like power.
It seems so mundane. The concept is behind ctrl-Z, backups, journaling file systems, auto-save, speculative execution, and much more besides. We almost never think of all these as the same idea, because they're all individual bodges we keep reinventing. Can the concept be generalized, so we can engineer it into our systems as a fundamental property? If we did, what would that look like?
- AI's most convincing conversations are not what they seem
- TSMC and China: Mutually assured destruction now measured in nanometers, not megatons
- Only Microsoft can give open-source the gift of NTFS. Only Microsoft needs to
- Safari is crippling the mobile market, and we never even noticed
When Apple named its backup system Time Machine, it came the closest of any high-profile tech company to making the concept explicit. It's immediately obvious what the state is that's being snapshotted – files, static collections of bits, are nothing but state. Make a copy of this, store it with a label and a timestamp, and you're done.
Journaling file systems are smarter, they understand that a file can change rapidly when it's in use and a cloddish copy every time a byte changes is impractical. So they keep a journaling data structure of changes made between a file opening and closing. If things go wrong, the last copy can have the journal applied and 'Bang, we're back.' Would those Cloudflare engineers have appreciated a "Bang! You're back!" button? Would we? Form a queue.
The same is true of undo mechanisms in editors, where the journaling data is kept locally, which makes stepping back and forward much easier: a journaling file system works across all applications, but doesn't give that level of control. You can pick apart all sorts of state snapshotting and find different mechanisms, depending on what works best for each. In fact, there may even be extra benefits.
Is it actually universally true that state can be retrieved? The universe thinks so, with quantum information theory saying just that - although the cost is another matter. More prosaically, can state in a distributed asynchronous system – looking at you, Cloudflare, looking at you, BGP – be saved?
Turns out it can. the best known example is the Chandry-Lamport algorithm. It boils down to messages, timing and leaving it to each entity to decide how to save and restore its state, but if you get that right, it's generally applicable.
That "getting it right" for things like time and messaging in a distributed system is non-trivial shouldn't put us off. This will never be a snippet of code you can include to add time travel to anything, but it does point the way to the sort of standardized, API-able option that very different but co-operative components could use to provide common functionality. There'll be complexity and resource costs, but why not push the envelope on robustness for a change?
As for unexpected benefits, that's up to us. Imagine a browser with a time dial, where you can skip backwards and forwards along the timeline of your tabbed URLs. We've all hunted uselessly for that long-closed tab – history and bookmarks? Pah – even if we have a good idea of when we last used it. But a browser could just regularly dump state to a time-series database for itself or other tools to use. It'd be fun to see that as a 3D visualization, and besides, it would just be us doing for ourselves what Google does to us anyway.
That we don't think this way is a hangover from the days when storage and CPU were too expensive to waste on just-in-case insurance. Those days are gone, we spend both on trivia with the abandon of a Victorian libertine. Yet we actually have the makings of a Doctor Who-style Tardis for our universe of information. Let's start building it. They certainly need one at Cloudflare. ®