CrowdStrike blames a test software bug for that giant global mess it made
Something called 'Content Validator' did not validate the content, and the rest is history
CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week.
A Wednesday update to its remediation guide added a preliminary post incident review (PIR) that offers the antivirus maker's view of how it brought down 8.5 million Windows boxes.
The explanation opens by detailing that CrowdStrike's Falcon Sensor ships with "sensor content" that steers and defines its threat-detection engine's capabilities. This behavioral-based software is also updated with "rapid response content" that uses the sensor content to detect and handle specific emerging malware and other unwanted system activity. This rapid response content is delivered to users in those channel files you've been hearing about.
The base sensor content includes what's called "template types," which are blocks of code that can be customized and used by rapid response content to identify malicious stuff on a system. As such, these rapid response updates are known as "template instances" because they are "instantiations of a given template type."
Thus, the sensor content defines a bunch of code templates, and the rapid response content customizes the action of those templates so that the sensor software can detect, observe, and prevent specific system activity.
As CrowdStrike puts it, template instances configure how template types operate during runtime.
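To make the type-versus-instance split concrete, here is a minimal Python sketch. Every name in it (`TemplateType`, `TemplateInstance`, `pipe_name_matcher`, the `pipe_prefix` setting) is our own invention for illustration; CrowdStrike has not published its internal structures, and the real thing lives in driver code, not Python.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout; this is not CrowdStrike's actual code.

@dataclass(frozen=True)
class TemplateType:
    """Ships in sensor content: generic detection logic with named parameters."""
    name: str
    parameters: tuple[str, ...]
    matcher: Callable[[dict, dict], bool]  # (event, settings) -> matched?

@dataclass(frozen=True)
class TemplateInstance:
    """Ships in rapid response content: concrete settings for one template type."""
    template: TemplateType
    settings: dict

    def matches(self, event: dict) -> bool:
        # The instance only supplies configuration; the type supplies the logic.
        return self.template.matcher(event, self.settings)

def pipe_name_matcher(event: dict, settings: dict) -> bool:
    # Flag events whose named pipe starts with the configured prefix.
    return event.get("pipe", "").startswith(settings["pipe_prefix"])

ipc_type = TemplateType("ipc", ("pipe_prefix",), pipe_name_matcher)
instance = TemplateInstance(ipc_type, {"pipe_prefix": "evil_"})
```

Calling `instance.matches({"pipe": "evil_c2"})` applies the type's logic with the instance's settings, which is the sense in which an instance configures how a type behaves at runtime.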
In February 2024, CrowdStrike introduced and distributed a new "inter-process-communication (IPC) template type" for rapid response content to use, which the vendor designed to detect "novel attack techniques that abuse Named Pipes." The IPC template type passed testing on March 5, and a rapid response template instance was released to use it.
Three more IPC template instances were deployed between April 8 and April 24. All ran without crashing 8.5 million Windows machines – although, as we reported earlier this week, some Linux machines had problems with CrowdStrike's Falcon around May and June.
On July 19, CrowdStrike introduced two more IPC template instances. One included "problematic content data," but made it into production anyway, because of what CrowdStrike described as "a bug in the content validator."
The post doesn't detail the content validator's role; we'll assume it's supposed to do what the name suggests and likely in an automated manner.
Whatever the validator does or is supposed to do, it did not stop the dodgy July 19 template instance from reaching customers. The test software should have detected that the content update was broken, but approved it anyway because the validator itself was buggy.
CrowdStrike thus assumed the July 19 channel file release would be OK; the tests had after all passed the IPC template type delivered in March, and subsequent related IPC template instances, without a hitch on Windows.
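For the sake of argument, a pre-release gate of the kind the name "Content Validator" implies might look like the sketch below. The preliminary review doesn't say what the problematic data was or what the real validator checks, so both checks here (value count and empty values) and both function names are assumptions of ours.

```python
def validate_instance(declared_fields: list[str], supplied_values: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the content may ship."""
    problems = []
    # Assumed check 1: the instance supplies exactly as many values as the type declares.
    if len(supplied_values) != len(declared_fields):
        problems.append(
            f"expected {len(declared_fields)} values, got {len(supplied_values)}"
        )
    # Assumed check 2: no supplied value is empty.
    for name, value in zip(declared_fields, supplied_values):
        if value == "":
            problems.append(f"field {name!r} is empty")
    return problems

def release(declared_fields: list[str], supplied_values: list[str]) -> bool:
    """Ship only when the validator reports no problems."""
    return not validate_instance(declared_fields, supplied_values)
```

The point of the sketch is the failure mode: if `validate_instance` has a bug and returns an empty list for bad input, `release` happily says yes, and nothing downstream gets a second look.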
History tells us that was a very bad assumption. As we concluded in our earlier analysis of the crash, Falcon loaded and parsed the new content and was confused by the broken template instance, which "resulted in an out-of-bounds memory read triggering an exception" within CrowdStrike's Windows driver-level code, which would bring down the whole box.
On reboot, the machine would start up and crash all over again. CrowdStrike's Falcon suite runs at the operating system level to give it good visibility for its threat detection operations. When its content interpreter is misled into accessing memory it shouldn't, however, as happened here with the bad data, it has the potential to take out the OS and running applications with it.
"This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash," Team CrowdStrike said.
On around 8.5 million machines.
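What graceful handling of bad content could look like, in miniature: check the index before reading, and turn a bad reference into an error the caller can survive. This is a Python sketch with made-up names (`ContentError`, `read_field`, `evaluate`), not a reconstruction of Falcon's interpreter. Python would eventually raise its own `IndexError` on a too-large index, though it silently wraps negative ones; native driver code gets no such safety net, which is why the explicit check matters there.

```python
class ContentError(Exception):
    """Raised when content refers to data that is not there."""

def read_field(values: list[str], index: int) -> str:
    # Check the index before touching the data, and fail in a way callers can handle.
    if not 0 <= index < len(values):
        raise ContentError(f"field index {index} is out of range for {len(values)} values")
    return values[index]

def evaluate(values: list[str], wanted_index: int) -> bool:
    """Return True if the rule fires; skip, rather than crash, on malformed content."""
    try:
        return read_field(values, wanted_index) == "match"
    except ContentError:
        return False  # reject the bad content and keep the host running
```

With this shape, `evaluate(["match"], 3)` quietly returns `False` instead of reading past the end of the data.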
There have been calls for CrowdStrike to scrutinize its releases for errors prior to distribution; well, it tried and failed. The incident report includes promises to test future rapid response content more rigorously – we recommend sandboxing releases if it's not already doing that – plus stagger releases, offer users more control over when to deploy them, and provide release notes.
You read that right: Release notes. Be still your beating heart.
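As for staggering releases, the idea is simple enough to sketch in a few lines of Python. The stage fractions, the `deploy` and `healthy` callbacks, and the function name are all assumptions of ours; the report promises staggered deployment without describing how it will work.

```python
def staged_rollout(hosts, deploy, healthy, stages=(0.01, 0.10, 1.0)):
    """Deploy to growing fractions of hosts; halt after the first unhealthy stage.

    Returns the list of hosts that received the update.
    """
    done = 0
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[done:target]  # only the hosts not yet updated
        for host in batch:
            deploy(host)
        done = max(done, target)
        if not all(healthy(h) for h in batch):
            break  # stop here: the damage is limited to the hosts updated so far
    return hosts[:done]
```

With 200 hosts and an update that breaks every machine it touches, this stops after the first two hosts instead of all 200.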
The report also includes a pledge to release a full root cause analysis once CrowdStrike has finished its investigation.
Take all the time you want: Some of us are still busy rebuilding machines you broke. ®