What it takes to keep an enterprise 'Frankenkernel' alive
The skilled handiwork of stitching bits from several kernels into one – and keeping it secure at the same time
devconf.cz Maintaining the kernel of an enterprise distro is not only hard work, it also involves conflicting goals.
A talk by Red Hat Principal Kernel Engineer Jiří Benc at this year's DevConf.cz event covered some of the inherent contradictions in keeping an enterprise distro's kernel on its feet. Or at least on somebody's – or something's – feet, as its title hinted: "CentOS Frankenkernel: Append Your Limb."
He focused on the kernel of CentOS Stream, which in time will be the kernel of the next point-release of RHEL 9 – at the time of writing, that will be RHEL 9.3. But, like the other versions of RHEL 9, it will use kernel 5.14, released way back on August 29, 2021. How do they achieve this?
The goals of any kernel update are simple: stability, obviously. No regressions, and that also means no performance regressions. No API changes, and no internal ABI changes either: in fact, no changes in behaviour. But, at the same time, customers want new features, and support for new hardware, including new drivers; they want updates, at a minimum any outstanding security updates. All without breaking whatever they are currently using, because that's what they are paying for.
This is a big ask, and the result must inevitably be a compromise. The team tries to deliver no functional regressions, and to limit performance regressions to the important stuff. To make no backwards-incompatible uAPI (userspace API) changes, and to avoid kernel ABI changes for important stuff. The problem is that people want new features… and new or updated drivers.
So, what the team are working on is a Frankenstein's monster, sewn together from different codebases. Although the base kernel is still version 5.14, it is full of backports from upstream. It has the XFS filesystem code from kernel 6.0, the USB subsystem – complete with drivers – and BPF subsystem from kernel 6.2, the wireless stack and all drivers from kernel 6.3, and the Multipath TCP (MPTCP) code from kernel 6.4 – which at the time of the talk hadn't even been released upstream yet. (It was released last weekend.)
It works because of a lot of testing and a very cautious release process. Of course, the developer themselves tests it, but it also undergoes continuous-integration testing thanks to tools from the CKI project, as well as network-stack testing using the LNST tools. Then, it undergoes preverification, meaning that a human – someone other than the author – manually checks the change. Only then is the change merged into the CentOS kernel tree, after which it undergoes integration testing: checks against another 150 or so work-in-progress changes. Then, once it's passed all those, it undergoes normal QA testing with the rest of the OS.
The results can be seen on the CentOS Stream GitLab – Benc was keen to stress that this all happens in public, and it's all documented. Indeed, anyone can open a request for such a change, by filing a bug on Bugzilla, or opening a Jira issue, according to a prescribed format: Product… Version… Component… Subcomponent… Benefit… Tests. Similarly, there's also a very strict format for merge requests (which are GitLab's equivalent of GitHub pull requests), and for commit messages – and it must be followed exactly, because the messages are parsed by machines as well as by humans.
So long as the format is followed precisely, then the automation kicks in. It adds lots of labels, checks for subsequent fixes and patches from upstream, tags various people who must inspect and check the change, and more. All the discussion is handled in the comments on the MR itself on GitLab – except for dependencies, such as drivers, because GitLab can't currently handle that.
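To give a flavour of why the format must be followed exactly, here is a minimal Python sketch of machine-parsing such a template. The field names come from the talk; the simple "Field: value" layout, the function name, and the example values are our assumptions for illustration – the real template and tooling are defined in the CentOS Stream documentation, not here.

```python
import re

# Field names mentioned in the talk; the "Field: value" layout is an
# assumption for illustration, not the actual CentOS Stream template.
FIELDS = ["Product", "Version", "Component", "Subcomponent", "Benefit", "Tests"]

def parse_request(text):
    """Parse a request into a dict of field -> value.

    Raises ValueError if any required field is missing -- mirroring the
    strictness Benc described: machines reject anything malformed.
    """
    parsed = {}
    for field in FIELDS:
        match = re.search(rf"^{field}:\s*(.+)$", text, re.MULTILINE)
        if match is None:
            raise ValueError(f"missing required field: {field}")
        parsed[field] = match.group(1).strip()
    return parsed

# Hypothetical example request, with made-up values.
example = """\
Product: CentOS Stream
Version: 9
Component: kernel
Subcomponent: net
Benefit: brings MPTCP fixes from upstream
Tests: covered by LNST network tests
"""

print(parse_request(example)["Component"])  # -> kernel
```

The point of such rigidity is that a missing or misspelled field fails loudly, so the automation never has to guess what a human meant.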
If you listen very carefully to the YouTube stream of the talk, the first question was from the Reg FOSS desk, asking whether this didn't overlap with the work of the long-term-support releases from the upstream kernel developers. Benc told us that he feels Red Hat's level of testing and quality control exceeds that of the upstream LTS kernels, and that they don't deliver the level of stability that an enterprise distro needs.
That was quite surprising to us, but this is an undeniably impressive amount of work and level of attention to detail. In the light of the continuing furore that has followed Red Hat's withdrawal of the RHEL source code from publication, the talk emphasized the sheer amount of work that goes into maintaining a distro, complete with a single version of the kernel, for a life cycle of a whole decade. RHEL 9.10, for instance, is not planned to go out of support until 2032.
This is the work that Red Hat wants to get paid for, and the reason that it is still trying to find ways to exclude the downstream rebuilds – as it has been doing for a dozen years. ®