CrowdStrike's Blue Screen blunder: Could eBPF have saved the day?
Grafana Labs CTO looks at the options
Interview The CrowdStrike chaos was caused by software running riot in the Windows kernel after an update tripped up the code. eBPF is a useful tool for kernel tracing and observability, but could it have mitigated the CrowdStrike incident?
"It's interesting," Tom Wilkie, CTO of observability specialist Grafana Labs, tells The Register, "because there was a vulnerability in the eBPF runtime that caused a similar outage that was also triggered by CrowdStrike in a certain Red Hat kernel."
Wilkie is referring to an incident in June, when Red Hat warned its customers of a Linux kernel bug that caused CrowdStrike's user-land eBPF-based Falcon Sensor code to crash the machine. Then, ironically enough, a few short weeks later, a broken kernel-level Falcon update made and distributed by CrowdStrike left 8.5 million Windows computers across the world stuck in a blue-screen boot loop.
In other words, computers were previously knocked out by a programming error in the Linux kernel's eBPF implementation code, triggered by CrowdStrike's product anyway. That doesn't inspire confidence.
eBPF allows applications to run in a virtual machine (VM) in the Linux kernel, permitting developers to add capabilities at runtime without having to write and load kernel-level modules or add code to the actual kernel, rebuild, and redeploy it. The theory goes that an eBPF program can't crash the kernel because it runs in a sandbox and is safety-checked by a verifier. Because of the low level at which some programs need to run, it's a popular way of implementing observability and security.
Yet it was CrowdStrike's eBPF program to monitor the system for threats that ran into a Linux kernel bug and caused boxes to panic. Meanwhile, work to implement the technology for Windows is ongoing.
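To make the verify-before-load idea concrete, here is a toy sketch in Python. It is not the Linux eBPF verifier, which tracks register state and is far more involved. The instruction set, the `Insn` type, the limits, and the `verify` function are all invented for illustration. It shows only the principle: inspect the whole program statically and refuse to load anything that might touch memory outside its stack or loop forever.

```python
from dataclasses import dataclass

# Toy limits chosen for this sketch, not taken from any real kernel.
MAX_INSNS = 4096
STACK_SIZE = 512


@dataclass(frozen=True)
class Insn:
    op: str       # hypothetical opcodes: "load", "store", "jump", "exit"
    arg: int = 0  # stack offset for load/store, relative target for jump


def verify(program):
    """Statically check a toy program; return a list of rejection reasons."""
    if not program:
        return ["empty program"]
    errors = []
    if len(program) > MAX_INSNS:
        errors.append("program too long")
    for pc, insn in enumerate(program):
        if insn.op in ("load", "store"):
            # Memory safety: every access must stay inside the stack.
            if not 0 <= insn.arg < STACK_SIZE:
                errors.append(f"insn {pc}: stack offset {insn.arg} out of bounds")
        elif insn.op == "jump":
            # Termination: this toy forbids all backward jumps outright.
            if insn.arg < 0:
                errors.append(f"insn {pc}: backward jump")
            elif pc + 1 + insn.arg >= len(program):
                errors.append(f"insn {pc}: jump past end of program")
        elif insn.op != "exit":
            errors.append(f"insn {pc}: unknown opcode {insn.op!r}")
    if program[-1].op != "exit":
        errors.append("program does not end with exit")
    return errors


def load(program):
    """Refuse to 'load' any program the verifier rejects."""
    errors = verify(program)
    if errors:
        raise ValueError("; ".join(errors))
    return "loaded"
```

A program that passes yields an empty list of errors; anything else is refused before it ever runs. A checker like this is only as trustworthy as its own implementation, which is the caveat Wilkie raises.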
"So eBPF might be the solution," Wilkie continued, "but it has also been a historical cause of these problems. I mean, fundamentally, injecting code into running kernels is a risky activity. That was the problem CrowdStrike had. And you can still have bugs in eBPF; the safety guarantees offered by the eBPF runtime and the eBPF verifier are not perfect.
"The concept of eBPF is good, but the implementation – like all implementations – has bugs. Now, could you catch something like the CrowdStrike incident with eBPF? Yes. Probably. But honestly, you could also catch it just by doing better testing, and that would be my advice. Having better software engineering hygiene. And that's the lesson CrowdStrike has already learned."
CrowdStrike CEO George Kurtz said at the Goldman Sachs Communacopia and Technology Conference earlier this month that a freak incident caused the July calamity.
"In this particular case," Kurtz said, "we had a configuration change, which is like there's no code, it's just a config that the sensor consumes. And we went through a validation process and we validated all those. They actually worked. The problem is we had 21 of them and the sensor understood 20. And that's the simple explanation of what happened.
"What have we changed in terms of the process? Well, we now run the configuration changes through not only the validation but all the various code QA processes we have and then deploy that in a phased rollout manner, as well as giving customers the choice on how they want to deploy that content."
Speaking to us ahead of this week's New York ObservabilityCON, during which Grafana Labs will announce enhancements to its Explore apps and Adaptive features, Wilkie also has thoughts on two other contemporary themes: cloud repatriation and funding open source development.
Having users run in the cloud is central to Grafana's mission. Wilkie says the company continues to see the use of its cloud growing – both in terms of user count and revenue – but is repatriation happening? "I would agree with the sentiment," he concedes.
"It feels like there's been a shift in the market in the last year or two, like post-zero percent interest rates, where people are more critically looking at cloud economics and realizing that a lot of SaaS and Infrastructure-as-a-Service is just not viable from a cost perspective."
In a recent submission to the UK's Competition and Markets Authority, cloud giant AWS warned that it was facing stiff competition from the very on-premises infrastructure it dismissed as obsolete not so many years ago.
According to Wilkie, Grafana Labs' solution is to make its cloud more attractive. It has an on-premises version, but features such as adaptive metrics and logs are only available in the cloud. Wilkie says customers find it more cost-effective to use Grafana Labs' cloud for many applications than to roll their own. Well, he would, we guess.
Which brings us to how Grafana Labs remains a viable business and how it decides which services to make open source and which to keep proprietary.
Wilkie explains: "We call it the 'sniff test.' If a feature is going to be generally usable by a very large group of people, we will make it open source; if it only appeals to a small group of enterprises or large organizations, then we'll consider keeping it as a commercial differentiation."
He provides an example: "Grafana has 200-plus data sources, where you can connect Grafana to pretty much anywhere, and 170-ish are open source. Thirty of them are commercial integrations that we sell as part of Grafana Enterprise.
"A good example of a commercial integration would be with Datadog. One of our most popular enterprise data sources is our Datadog one. If you're paying Datadog to store your metrics and you want to visualize them in Grafana, you can pay us some money as well! It seems like a fair exchange of value."
Wilkie also cites Grafana's open source projects. A customer can build solutions with them, but, echoing comments made to El Reg by Kelsey Hightower, Grafana would be more than happy to sell them a managed service, requiring a credit card to get rolling in minutes. ®
Editor's note: This article was updated to add more detail about the BPF-based Linux kernel bug that CrowdStrike ran into prior to the global meltdown on Windows.