Devs of bcachefs try to get filesystem into Linux again

Maturity and merging: Manageable for bcachefs?


The lead developer of the bcachefs filesystem is gunning to get it accepted into the Linux kernel… again.

The story of bcachefs is a long-running one, and this isn't the first time, nor even the first time this year, that the project has attempted this. The filesystem has been around for a while; The Reg first reported on it in 2015. But it looks like it's getting closer to its goal.

Filesystems are serious stuff, and getting them right takes time. As of November 2021, bcachefs has gained snapshot support. With the latest update, the on-disk structures have changed. This means that when you mount a volume, the driver will update the format, so you can't go back to an older version. This is the sort of issue that would hinder integration into the mainline kernel.

Bcachefs grew out of the existing bcache module, which allows you to use a fast drive (probably an SSD) as a cache for a slower drive, such as a RAID volume. It's more complex than it sounds, which, as ever, shows up if it goes wrong.

Lead developer Kent Overstreet described the inspiration for the new filesystem: "I and the other people working on bcache realized that what we were working on was, almost by accident, a good chunk of the functionality of a full blown filesystem – and there was a really clean and elegant design to be had there if we took it and ran with it."

The plan is that bcachefs will rival the features of OpenZFS, while being GPL2-licensed and so able to become part of the Linux kernel without any legal issues.

There's nothing technically wrong with OpenZFS, but because it's not GPL code, it can't be included in the kernel. This has side-effects: for instance, ZFS' Adaptive Replacement Cache or ARC is allocated separately from the kernel's pagecache. This makes it appear that ZFS is using a lot of memory.
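To make the memory-accounting point concrete, here is a minimal Python sketch that pulls the ARC size and the kernel's page cache size out of two separate text reports and shows them side by side. It assumes the usual Linux layouts: an arcstats-style table of `name type data` rows, as OpenZFS on Linux typically exposes under `/proc/spl/kstat/zfs/arcstats`, and `/proc/meminfo`-style `Key: value kB` lines. The function names are ours, purely for illustration.

```python
def arc_size_bytes(arcstats_text):
    """Return the ARC 'size' value from arcstats-style text, or None if absent."""
    for line in arcstats_text.splitlines():
        fields = line.split()
        # Data rows are assumed to look like: <name> <type> <value>
        if len(fields) == 3 and fields[0] == "size":
            return int(fields[2])
    return None


def meminfo_kib(meminfo_text, key):
    """Return one /proc/meminfo-style value in KiB (e.g. key='Cached'), or None."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return int(rest.split()[0])
    return None


def cache_summary(arcstats_text, meminfo_text):
    """Report both caches in bytes; the two are accounted for separately."""
    arc = arc_size_bytes(arcstats_text) or 0
    cached_kib = meminfo_kib(meminfo_text, "Cached") or 0
    return {"arc_bytes": arc, "pagecache_bytes": cached_kib * 1024}
```

On a machine running OpenZFS, you would feed it the contents of those two files. The ARC figure does not show up in the `Cached` line, which is why tools that only know about the page cache can make ZFS look hungrier than it is.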

Bcachefs aims to offer a broadly similar feature set to ZFS: logical volume management, aggregation of multiple physical drives into larger, redundant volumes, and the ability to make copy-on-write snapshots. Snapshots are an important feature, because they make it easy to roll back operating system updates.
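For readers who haven't met copy-on-write snapshots before, the idea can be sketched in a few lines of Python. This is a toy model, not how bcachefs, Btrfs or ZFS actually store data: the `ToyCowVolume` class and its methods are invented for illustration, and real filesystems use on-disk trees so that a write copies only a small path rather than the whole block map.

```python
class ToyCowVolume:
    """Toy copy-on-write volume: writes never modify existing state in place."""

    def __init__(self):
        self._blocks = {}      # block number -> immutable bytes
        self._snapshots = {}   # snapshot name -> frozen block map

    def write(self, block_no, data):
        # Build a new map that shares every untouched block with the old one.
        new_blocks = dict(self._blocks)
        new_blocks[block_no] = bytes(data)
        self._blocks = new_blocks

    def read(self, block_no):
        return self._blocks.get(block_no)

    def snapshot(self, name):
        # A snapshot is just a reference to the current map; no data is copied.
        self._snapshots[name] = self._blocks

    def rollback(self, name):
        # Rolling back swaps the live map for the saved one.
        self._blocks = self._snapshots[name]


vol = ToyCowVolume()
vol.write(0, b"working kernel")
vol.snapshot("pre-update")
vol.write(0, b"broken kernel")   # the snapshot still sees the old block
vol.rollback("pre-update")
```

Because taking a snapshot costs almost nothing, a tool can afford to take one before every system update, which is what makes rollback practical.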

This is currently possible using both Btrfs and the existing Linux LVM tools, and SUSE's snapper tool puts it to good use. Snapper is how SUSE MicroOS implements rollbacks, for instance, and it makes the openSUSE Tumbleweed rolling-release distro much safer. If an update goes wrong, you can just boot off the last good snapshot and remove the offending update.

So what bcachefs offers isn't new. The significance is that the current tools are quite complex. Btrfs can do this – and RAID – on its own, but it's not without issues.

There are also some problems around determining how much free space is available.

Repairing damaged volumes can be tricky; the man page for btrfs-check contains a worrying warning, and the FAQ points to a post with data-recovery instructions. Call us nervous Nellies if you wish, but that's not reassuring.

The issues are serious enough that although Fedora has embraced Btrfs, it has been booted out of RHEL. Because RHEL doesn't currently have a filesystem with snapshots and rollback, Red Hat has had to develop an additional tool called OSTree to provide Git-like software installation and rollback.

The Linux LVM system is capable, but its functionality overlaps both that of the kernel's built-in mdraid at a lower level, and that of Btrfs RAID support at a higher level.

There used to be a richer LVM toolset called the Enterprise Volume Management System, but the kernel team favoured the simpler LVM2 tools, so the EVMS developers gave up.

Both LVM and mdraid volumes need to be formatted with a filesystem, and if you use Btrfs, it can be confusing to ascertain which RAID tools are the ones you need. Worse still, there is always the risk that someone somewhere didn't know what they were doing and may have combined more than one of them. For instance, if you use LUKS for disk encryption, that normally runs on top of LVM. The layering can get complicated.

There are other ramifications. As an example: the author ran openSUSE Tumbleweed for several years, and although its Snapper tool uses Btrfs snapshots, it doesn't understand Btrfs' special tooling for measuring free disk space. More than once, a simple system upgrade filled the computer's root partition, causing irreparable Btrfs corruption and requiring a reinstall. Reinstalling is fairly trivial if your data is in a separate /home partition. The openSUSE installer used to set one up by default, and the partitioning guide still recommends it, but the installer no longer does so.

There are workarounds, of course. You need a substantially bigger root partition for a Btrfs machine with Snapper than with, say, ext4, and such things shouldn't affect properly-provisioned servers.
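A cautious admin might also check for headroom before kicking off a big upgrade. The following Python sketch uses the standard library's `shutil.disk_usage`; the `has_headroom` helper and the 20 per cent threshold are our own assumptions, not anything Snapper or openSUSE provides. Bear in mind that on Btrfs the generic free-space figure is only an estimate, for the reasons described above.

```python
import shutil


def has_headroom(path, min_free_fraction=0.20):
    """Return True if at least min_free_fraction of the filesystem is reported free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat that as no headroom.
        return False
    return usage.free / usage.total >= min_free_fraction


if __name__ == "__main__":
    # Example: check the root filesystem before starting an update.
    if not has_headroom("/"):
        print("Root filesystem is getting full; prune old snapshots first.")
```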

Currently, only OpenZFS cuts straight through all these layers of complexity. The ambition is that bcachefs will too, while also being license-compatible. More choice, and more competition to spur development. Those sound like very desirable outcomes for all concerned. ®


Biting the hand that feeds IT © 1998–2022