CrowdStrike meets Murphy's Law: Anything that can go wrong will
And boy, did last Friday's Windows fiasco ever prove that yet again
Opinion CrowdStrike's recent Windows debacle will surely earn a prominent place in the annals of epic tech failures. On July 19, the cybersecurity giant accomplished what legions of hackers could only dream of – bringing millions of Windows systems worldwide to their knees with a single botched update.
As a veteran tech journalist, I've seen my fair share of software snafus. Heck, I went hand-to-hand with the grandpa of all network blow-ups – the Morris Worm – in 1988 when I was a sysadmin. Even so, I can't help but marvel at the sheer scale and impact of this blunder. CrowdStrike, a company valued at over $70 billion and trusted by countless organizations to protect their digital assets, inadvertently became the source of one of the largest IT outages in history.
The fallout from this debacle was staggering – thousands of flights canceled, healthcare services disrupted, and 911 systems knocked offline. It's a stark reminder of how deeply intertwined our digital infrastructure has become and how vulnerable it can be to a single point of failure.
Let's break down the cascade of errors that led to this fiasco.
In the beginning, Microsoft enabled CrowdStrike's Falcon security software to run in ring 0, the most privileged level, where the Windows kernel itself runs. Any problem at this low level will likely cause a Blue Screen of Death (BSOD). Meanwhile, Microsoft reportedly wants to blame the European Commission – no, really – for requiring it to grant third-party software vendors this level of access.
You know, I think with all of Microsoft's developers and lawyers, they could come up with a better, legal way to avoid this kind of foul-up and let software companies compete equally. It's not rocket science.
Microsoft doesn't want any of the blame, but it deserves some of it. For far too long, we've placed too many vital IT eggs in the Windows basket. When that basket falls, so does much of the economy.
Returning to CrowdStrike, the company claims a "logic error" in a routine sensor configuration update caused the meltdown. But for a company of CrowdStrike's caliber, such a fundamental mistake is inexcusable. This wasn't some obscure edge case – it was a critical failure in its core functionality.
It wasn't even a code problem. This wasn't a software update per se. The villain of this piece was a Falcon configuration file called a channel file. One simple file, which should have contained nothing more than data to update a security setting, ended up causing one BSOD after another.
How did such a catastrophic bug pass quality assurance? CrowdStrike admitted: "Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data [and] were deployed into production." When your software has deep hooks into millions of Windows systems, your testing should be bulletproof. Clearly, CrowdStrike's testing protocols need a massive overhaul.
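For what it's worth, here is a minimal sketch of the kind of belt-and-braces check I have in mind: the program that consumes a configuration file validates it again at load time and falls back to the last known good copy if anything looks off. Everything in it is an assumption for illustration. JSON stands in for whatever format a real product uses, and the schema and the names `validate` and `load_with_fallback` are hypothetical. It is not CrowdStrike's channel file format or its Content Validator.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"version": int, "rules": list}


def validate(raw: bytes) -> Optional[Dict[str, Any]]:
    """Return the parsed config if it is well-formed, otherwise None."""
    try:
        config = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # Not valid UTF-8 or not valid JSON (JSONDecodeError is a ValueError).
        return None
    if not isinstance(config, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(config.get(name), expected_type):
            return None
    return config


def load_with_fallback(raw: bytes, last_known_good: Dict[str, Any]) -> Dict[str, Any]:
    """Use the new config only if it validates; otherwise keep the old one."""
    config = validate(raw)
    return config if config is not None else last_known_good
```

The point of the design is that a bad file costs you one missed update, not a crashed machine.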
We also now know, as security expert Kevin Beaumont pointed out on Mastodon: "The key takeaway – channel updates are currently deployed globally, instantly." I always send major patches to all my customers simultaneously and wait to see what happens next. Doesn't everyone? Who are these people, and why does anyone let them do security work?
There's a simple concept called canary testing. You may have heard of it. Like the proverbial canary in a coal mine, you first test whether a new space – or program – is safe by trying it on a canary – or a small group of users – and then, if all's well, let everyone else in.
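If you'd like to see how little machinery that takes, here is a rough sketch of a staged rollout. The stage percentages, the failure threshold, and the names `staged_rollout`, `deploy`, and `is_healthy` are all my own assumptions for illustration. This is not how CrowdStrike, or anyone else in particular, ships updates.

```python
import math
from typing import Callable, List, Sequence


class RolloutHalted(Exception):
    """Raised when a stage reports too many unhealthy hosts."""

    def __init__(self, percent: int, updated: List[str]):
        super().__init__(f"halted at {percent}% stage after {len(updated)} hosts")
        self.percent = percent
        self.updated = updated


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[int] = (1, 10, 50, 100),  # cumulative percentages (assumed)
    max_failure_rate: float = 0.0,             # tolerate no failures by default
) -> List[str]:
    """Deploy to widening slices of hosts, stopping at the first bad stage."""
    updated: List[str] = []
    for percent in stages:
        # Number of hosts that should be updated once this stage completes.
        target = math.ceil(len(hosts) * percent / 100)
        ring = list(hosts[len(updated):target])
        for host in ring:
            deploy(host)
            updated.append(host)
        # Check only the hosts touched in this stage before widening further.
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            raise RolloutHalted(percent, updated)
    return updated
```

With 1,000 hosts and an update that breaks every machine it touches, this sketch stops after the first 10. The other 990 never see the bad file.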
Let's not forget that CrowdStrike's initial response was slow and inadequate. Users were left scrambling for answers while critical infrastructure faltered. Even today, almost a week later, I still have friends having trouble with their Delta flights.
This serves as a sobering wake-up call for the rest of us in the tech industry. As we rush to secure our systems against external threats, we must not overlook the potential for self-inflicted wounds. Rigorous testing, fail-safe mechanisms, and a healthy dose of humility are essential when dealing with critical systems.
In the end, CrowdStrike's Windows fiasco is a textbook example of Murphy's Law in action – anything that can go wrong will go wrong. It's a painful lesson but one that we would all do well to learn from. After all, in cybersecurity, your next big threat might just be an update away. ®