How a botched kernel patch broke Ubuntu – and why it may happen again

Panic! at the distro


If you spent the early days of June fighting kernel panics in Ubuntu 20.04, you were not alone – and we now know why.

A problem with an Ubuntu-specific Linux kernel patch early last month rendered many systems running Docker on that flavor of the operating system unusable, and it probably won't be the last time.

The whole debacle can be traced back to a bad distro-specific kernel update for Ubuntu 20.04 – Canonical's long-term support (LTS) release – that started rolling out on or about June 8. Within hours of the patch hitting systems, bug reports began pouring in.

The source of the trouble was quickly isolated to Ubuntu systems running Docker with the hardware-enablement (HWE) stack enabled. As the name suggests, HWE adds support for newer hardware by shipping updated kernels – and Ubuntu routinely pushes out new kernels via these HWE updates. While switching this on is usually a manual process for server systems, it's a standard feature on many Ubuntu images available in the cloud. Accordingly, several users reported that VM images on AWS, GCP, Azure, and Oracle were affected. HWE is also usually enabled by default for new desktop installs.

The bug itself triggered a kernel panic any time a Docker container was started. Some users even reported the update resulted in a bootloop, and the only cure was to roll back to a previous working kernel during startup. This is presumably because their Docker containers were set to start with the rest of the system, causing a vicious cycle in which Ubuntu boots, the Docker containers start, the system kernel panics … rinse, repeat.

To make matters worse, Ubuntu's unattended-upgrades service, which is responsible for keeping systems patched and usually free of issues, made this particular kernel update more difficult to avoid.

The crash stemmed from a patch intended to fix an issue with /proc/self/map_files on overlayfs and shiftfs, two file systems used in container environments. Ubuntu released a revised kernel addressing the issue a few days later. The impact of this botched patch is hard to gauge, but Ubuntu 20.04 remains a popular choice for production environments thanks to its relatively long support life.

Ironically, the five years of support that makes LTS releases so popular was also partially to blame, according to an analysis this week by Jordan Webb, shared via LWN.

A crucial point in this saga is that, up until 21.04, Ubuntu included another container-related file system, aufs, in its Linux kernels; this file system code was never merged into the mainline kernel and was maintained out-of-tree. When Ubuntu's developers came to backport the shiftfs-related patch to Ubuntu 20.04, part of the patch code was dropped because it depended on aufs, which wasn't present during the backporting process – but aufs was in fact in the 5.13 kernel used by Ubuntu 20.04 HWE.

Owing to that, and to changes in how overlayfs worked internally, the kernel would try to release a reference to a data structure whose memory had already been freed, triggering a panic. This would happen any time a Docker container spun up. According to Webb, this clash was caught almost immediately and fixed in Ubuntu's 5.15 kernel source. But for reasons that aren't clear, the 5.13 kernel in Ubuntu 20.04 HWE was overlooked and would continue to crash.

As Webb put it:

When Ubuntu's developers ported the shiftfs-related patches from their 5.8 kernel branch to their 5.13 and 5.15 kernels, the patch that corrected the problem with map_files and shiftfs was left out, because it depended on AUFS, which had been dropped from Ubuntu's kernel. When those kernels were backported to Ubuntu 20.04, where AUFS continues to be supported, the missing patch was noticed, and it was applied to Ubuntu's 5.13 and 5.15 trees as well.

Unfortunately, the internals of overlayfs changed over time in a way that eventually caused the patch to be incorrect. As a result, when a file on an overlayfs is mapped into memory, the function added by the patch attempts to release a reference to a struct file using fput(), but the structure had already been freed due to an earlier fput() call. That causes the kernel to panic.

On Ubuntu 21.10, where 5.13 is the default kernel, this didn't cause any problems. Since AUFS is not enabled, the #ifdef block around the code introduced by the patch prevented it from being compiled into the kernel. The problem occurred when 5.13 and 5.15 were rebuilt for Ubuntu 20.04. Since an HWE kernel needs to support all of the features that are supported by the kernel it is replacing, AUFS was enabled in these builds, and the code containing the extraneous fput() was compiled in.
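The mechanism Webb describes can be sketched as a toy user-space model. This is not kernel code: the File class, the UseAfterFree exception, the map_overlay_file() function, and the aufs_enabled flag are all hypothetical names made up for illustration. The flag stands in for the #ifdef block, and the explicit check in fput() stands in for the panic; a real kernel performs no such check before touching the freed memory.

```python
class UseAfterFree(Exception):
    """Stands in for the kernel panic in this toy model."""


class File:
    """Toy stand-in for the kernel's reference-counted struct file."""

    def __init__(self):
        self.refcount = 1   # one reference held by the mapping path
        self.freed = False


def fput(f):
    """Drop one reference; free the object when the count reaches zero."""
    if f.freed:
        # Assumption: the real kernel has no such guard; it would
        # operate on freed memory instead of raising a clean error.
        raise UseAfterFree("fput() on an already-freed struct file")
    f.refcount -= 1
    if f.refcount == 0:
        f.freed = True


def map_overlay_file(aufs_enabled):
    """Hypothetical mmap path for a file on overlayfs."""
    f = File()
    fput(f)               # the earlier release that already frees the struct
    if aufs_enabled:      # models the #ifdef around the patched-in code
        fput(f)           # the extraneous second release
    return f


# Build without AUFS (as on Ubuntu 21.10): the extra fput() is compiled out.
clean = map_overlay_file(aufs_enabled=False)

# Build with AUFS (as in the 20.04 HWE kernels): the second fput() fires.
try:
    map_overlay_file(aufs_enabled=True)
    panicked = False
except UseAfterFree:
    panicked = True

print("freed cleanly without AUFS:", clean.freed)
print("panic with AUFS enabled:", panicked)
```

The same source produces a working build or a crashing one depending only on a configuration switch, which is why testing on 21.10 would not have exposed the fault.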

This particular issue has since been resolved, and anyone who's only now returned to find their VMs or server deployments bootlooping should roll back to an earlier kernel and update their systems.

Unfortunately, gremlins like these may be hard to avoid given the lifespan of Canonical's LTS releases, which has led to developers juggling multiple branches of the kernel simultaneously.

"Maintaining an out-of-tree kernel patch for any length of time is an arduous task," Webb wrote, adding that the situation is unlikely to get any easier for the Ubuntu kernel devs and may actually become more difficult before long. ®
