A path out of bloat: A Linux built for VMs
What Linux distros could learn from the inventor of the hypervisor
FOSDEM 2024: How far can you cut down Linux if you know it will never run on bare metal? Further than any distro vendor we know of has tried to go.
This article is the fourth based on the Reg FOSS desk's talk at FOSDEM 2024. The first part talked about the problem of software bloat, the second about the history of Unix, and the third about what the inventors of Unix did next: Plan 9.
In the previous section of the talk, I covered why Plan 9 wasn't called Unix any more: its very different design made it incompatible with Unix. I went on to suggest how its modern descendants, such as 9front, could be made compatible, not just with Unix in general, but with Linux in particular: not by bloating them with compatibility layers or emulators, but by using microVMs.
At the end of the talk, I went into a little more detail about one way this could work. That's what I'm going to expand upon here, but please note: this is by way of a postscript. The core proposal of the talk is covered in the previous article. What I want to look at here is what could be done with Linux to make it work better in such a role, quite independent of any discussion of host OS design or anything like that.
One aspect of this part that I hope might interest readers is that it's about hacking on Linux distros, not kernel or hypervisor programming or any hardcore stuff. This part is much more open to experimentation by anyone who has customized their own Linux distro, or built one from scratch.
The design principle that unifies Unix, Linux, the BSDs, and indeed Plan 9 is the use of the filesystem as the basic mechanism not only for storage, but also for communication between programs and subsystems. Plan 9 takes this further than Unix does, so that should be the focus here. What we are proposing is a different sort of microVM from current ones such as Amazon's Firecracker.
Current microVMs are part of the modern microservices model for developing web applications, but using VMs as a sort of compatibility bridge, to let one OS run apps from another, is a different use case. The components of a microservice architecture talk to one another over the network, using web protocols. That's not what you want for multiple apps running on a single machine, or even on a local cluster.
Inspiration: an OS built to run in a VM
Instead, we suggest a different conceptual model: the one that IBM used when it invented hypervisors in the mid-1960s. The hot new thing then was the idea of interactive computing: people working at terminals, rather than submitting decks of punched cards. MIT built a whole new OS to do this: Multics, now mostly remembered as the inspiration for Unix. IBM came up with a different approach, one that built atop its existing investment in mainframe computers.
IBM's answer was to give each user at a terminal their own virtual machine, running a private instance of an end-user OS for that person's interactive session. Rather than rewrite its big, complex, batch-oriented mainframe OSes to make them interactive, IBM just made the mainframe time-slice between several smaller instances of a specialised OS called the Conversational Monitor System, each running in its own self-contained session. Although CMS was originally designed to run on bare metal, the version shipped as part of IBM CP/CMS was dedicated to running inside a VM.
As a thought experiment, let's now consider what a Linux system would look like if it were designed with this in mind. It will only ever be a guest, running under a parent OS. (To make life easier, we can restrict any specific edition to a single host hypervisor.)
Headless diskless Linux
A lot of issues ordinary distros face just… disappear. It doesn't need an installer, because a VM image is just a file. It doesn't need an initrd, because we know the host hardware in advance: it's virtual, so it's always identical. It doesn't need to boot from disk, because it won't have disks: it will never drive any real hardware, so it has no real disks of its own. That also means no disk filesystem is needed.
Much of this can be done with existing tools. For example, even back in the 1980s, it was standard practice for many Unix machines, such as Sun boxes, to mount /home over NFS. This is still possible with Linux today, although it's less common now. You can even mount the root directory over NFS, to have a VM with no disks of its own.
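For the NFS-root case, the guest kernel gets its instructions from the boot command line. Here is a minimal sketch in Python that assembles one, assuming a kernel built with NFS-root support and IP autoconfiguration (CONFIG_ROOT_NFS and CONFIG_IP_PNP with its DHCP option) plus a guest network interface. The `nfsroot_cmdline` helper, the server address and the export path are hypothetical placeholders.

```python
def nfsroot_cmdline(server_ip: str, export_path: str, nfs_opts: str = "vers=3") -> str:
    """Build a kernel command line that mounts the root filesystem over NFS.

    Parameter names follow the kernel's nfsroot documentation; the server
    address, export path and NFS options are hypothetical placeholders.
    """
    params = [
        "root=/dev/nfs",                                  # root lives on NFS, not a block device
        f"nfsroot={server_ip}:{export_path},{nfs_opts}",  # which server and export to mount
        "ip=dhcp",                                        # bring up the NIC before mounting root
        "rw",                                             # mount the root read-write
    ]
    return " ".join(params)


if __name__ == "__main__":
    # 10.0.2.2 is the host's address under QEMU's user-mode networking.
    print(nfsroot_cmdline("10.0.2.2", "/srv/vmroot"))
```

Note that this route still needs a virtual NIC and a TCP/IP stack in the guest, which is part of why the 9p approach is more attractive for a truly minimal guest.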
Plan 9 goes deeper than NFS: even its local, on-disk filesystems are accessed over a file protocol called 9p. 9p is already supported in the Linux kernel, in the form of v9fs. In Plan 9, 9p is a core part of the kernel, whereas in Linux it is just one way to mount remote filesystems over the network, but the point is that it's already available. Under QEMU, guest VMs can access directories on the host over v9fs, and this includes keeping the guest's entire root directory on a file share over 9p. The QEMU documentation describes how to install Debian this way.
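To give a flavour of how simple 9p is on the wire, here is a sketch in Python that builds the first message a 9p client sends, Tversion, which negotiates the maximum message size and the protocol dialect. It assumes the 9P2000 framing (a little-endian 4-byte size field that counts itself, a 1-byte message type, a 2-byte tag) and the "9P2000.L" dialect string used by Linux's v9fs; the helper names are our own.

```python
import struct

TVERSION = 100   # message type number for Tversion in 9P2000
NOTAG = 0xFFFF   # Tversion is always sent with the special NOTAG tag


def pack_string(s: str) -> bytes:
    """9p strings are a 2-byte little-endian length followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data


def tversion(msize: int, version: str = "9P2000.L") -> bytes:
    """Build a Tversion message: size[4] type[1] tag[2] msize[4] version[s]."""
    body = struct.pack("<BHI", TVERSION, NOTAG, msize) + pack_string(version)
    # The size field counts its own four bytes as well as the body.
    return struct.pack("<I", 4 + len(body)) + body


def parse_header(msg: bytes) -> tuple[int, int, int]:
    """Decode the size, type and tag header common to every 9p message."""
    return struct.unpack_from("<IBH", msg, 0)


if __name__ == "__main__":
    msg = tversion(8192)
    print(len(msg), parse_header(msg))
```

For an 8KiB msize this yields a 21-byte message: 7 bytes of header, 4 bytes of msize, and the 8-byte version string with its 2-byte length prefix.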
No virtual hard disks means no disk formats and no block storage at all: no need for ext4, Btrfs, ZFS, or any other disk filesystem in the kernel. If you know the exact VM config your OS will run on when you build the OS, you can compile in just the drivers needed for that VM and nothing else. That not only makes for a much smaller kernel, but also makes it possible to dispense with the initrd or initramfs, because no modules need to be loaded before the root filesystem can be mounted. There is some prior art in the form of this diskless VMs guide, which covers booting VMs from the host over iPXE, dispensing even with the GRUB bootloader.
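Put together, a diskless guest under QEMU needs no virtual disk, no bootloader and no initrd: QEMU can load the kernel directly and export a host directory as the root filesystem over virtio-9p. This Python sketch assembles such a command line without running it. The host paths, the `hostroot` mount tag and the 512MB memory size are assumptions; the kernel must have virtio and 9p built in rather than as modules; and the root-mount parameters follow QEMU's 9p root filesystem notes as we understand them, so check them against your kernel version.

```python
import shlex


def qemu_diskless_args(kernel: str, host_root: str, tag: str = "hostroot") -> list[str]:
    """Assemble a QEMU command line for a headless, diskless guest.

    The kernel is booted directly (no bootloader, no initrd) and its root
    filesystem is a host directory exported over 9p via virtio. Paths and
    the mount tag are hypothetical placeholders.
    """
    append = " ".join([
        f"root={tag}",                              # the 9p mount tag acts as the root "device"
        "rootfstype=9p",
        "rootflags=trans=virtio,version=9p2000.L",  # 9p over virtio, Linux dialect
        "rw",
        "console=hvc0",                             # first virtio console
    ])
    return [
        "qemu-system-x86_64",
        "-enable-kvm", "-m", "512",
        "-nodefaults", "-display", "none",          # no default VGA, serial port or NIC
        "-kernel", kernel,                          # direct kernel boot, no GRUB
        "-append", append,
        "-virtfs", f"local,path={host_root},mount_tag={tag},security_model=mapped-xattr",
        "-chardev", "stdio,id=con0",                # connect the guest console to our terminal
        "-device", "virtio-serial-pci",
        "-device", "virtconsole,chardev=con0",
    ]


if __name__ == "__main__":
    print(shlex.join(qemu_diskless_args("bzImage", "/srv/guestroot")))
```

There is no `-drive` and no `-initrd` anywhere in the result: the only devices the guest sees are the 9p share and a virtio console.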
A dedicated guest distro doesn't need device drivers, except for communicating with the host hypervisor, which can use virtio drivers. It needs no other networking and no other I/O devices. It doesn't need to support a console or framebuffer, because there won't be one: these microVMs will always be headless, and can talk to the host over virtio-console. An X11 server running on the host lets apps have full-color GUIs, and for Wayland apps there is Waypipe.
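On the guest kernel side, this trimming could be captured as a Kconfig fragment. The Python sketch below writes one out. The option names are real kernel symbols, but the selection is our assumption for a PCI-based QEMU/KVM guest (a `microvm` machine type would want CONFIG_VIRTIO_MMIO instead), and the `config_fragment` helper is hypothetical.

```python
# Options the guest needs built in (not as modules, since there is no initrd).
WANTED = [
    "CONFIG_VIRTIO_PCI",      # virtio transport for QEMU's PCI-based machines
    "CONFIG_NET",             # 9p support lives under the networking menu
    "CONFIG_NET_9P",
    "CONFIG_NET_9P_VIRTIO",   # 9p over virtio, so no TCP/IP or NIC is required
    "CONFIG_9P_FS",           # v9fs itself
    "CONFIG_VIRTIO_CONSOLE",  # provides /dev/hvc0 for a headless console
]

# Options a headless, diskless guest can leave out.
UNWANTED = [
    "CONFIG_BLOCK",           # no block layer: no disks, no disk filesystems
    "CONFIG_BLK_DEV_INITRD",  # no initrd or initramfs
    "CONFIG_VT",              # no virtual terminals
    "CONFIG_FB",              # no framebuffer
    "CONFIG_DRM",             # no graphics drivers
]


def config_fragment(wanted: list[str], unwanted: list[str]) -> str:
    """Render options in the same syntax the kernel uses in .config files."""
    lines = [f"{opt}=y" for opt in wanted]
    lines += [f"# {opt} is not set" for opt in unwanted]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(config_fragment(WANTED, UNWANTED), end="")
```

Such a fragment could be merged onto a base configuration with the kernel's scripts/kconfig/merge_config.sh and then normalised with `make olddefconfig`. It is worth checking the final .config, because Kconfig dependencies can silently re-enable or drop options. Turning off CONFIG_BLOCK alone removes ext4, Btrfs and every other block-based filesystem in one go.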
Take the previous generation of software and encapsulate it
Why?
Well, this isn't just about Plan 9. It could help anyone running Linux inside virtual infrastructure. Aside from Android, and maybe a few million Chromebooks, Linux running on the bare metal is a tiny niche now. The vast majority of Linux servers are running on some kind of hypervisor, even if that's provided by another Linux distro.
(Which they increasingly are. Broadcom's moves since acquiring VMware, including ending the free edition of ESXi, put us in mind of Citrix's confessed missteps with XenServer, which spawned XCP-ng, a fork that soon thrived. Linux on Linux is on the rise.)
The point of this effort is that a distro built to target a specific hypervisor can be tiny and very simple. There's no need to generalize it: you can have one build for KVM, one for Xen, one for VMX, whichever the developer is using. There could even be builds for Hyper-V or WSL2: if that's their kink, that's OK.
On top of Plan 9, headless, diskless microVMs could bring Linux applications to Plan 9 with no need for emulation. If the VM keeps everything directly in the parent OS's filesystem, with no virtual disks, console-mode Linux binaries could communicate with Plan 9 binaries via files, just like any other program.
Even if in the end nobody is interested in building a next-generation OS on top of Plan 9, the same style of microVM could bring Linux app compatibility to other next-gen OSes, freeing them of the burden of backwards compatibility. Move to a new base, but keep the critical apps running until they can be replaced.
Linux is mature now. So, for that matter, are the main BSDs, and even Windows and macOS. They are not changing that radically any more, and they are steadily dropping support for older hardware… but the sheer quantity of code involved hinders real innovation.
I think we need to consider where we might go next. In a previous FOSDEM talk, I offered a far more radical proposal for a next-gen OS using next-gen hardware. That hardware didn't sell and got cancelled, its uptake hindered by legacy OS design. Other kinds of non-volatile RAM may yet arrive to replace it, and indeed, that talk might yet arrive in a Reg version. ®