The Linux cloud swap that spells trouble for Microsoft and VMware

Containers just wanna be hypervisors


Just occasionally, you get it right. Six years ago, I called containers "every sysadmin's dream," and look at them now. Even the Linux Foundation's annual bash has been renamed from "LinuxCon + CloudOpen + Embedded Linux Conference" to "LinuxCon + ContainerCon".

Why? Virtualization has been enterprise IT's favourite toy for more than a decade, and the rise of "cloud computing" has promoted it even further. When something gets that big, everyone jumps on board and starts looking for an edge – and containers are much more efficient than whole-system virtualization, so there are savings to be made and performance gains to be won. The price is that admins have to learn new security and management skills and tools.

But an important recent trend is one I didn't expect: these two very different technologies beginning to merge.

Traditional virtualization is a special kind of emulation: you emulate a system on itself. Mainframes have had it for about 40 years, but everyone thought it was impossible on x86. All the "type 1" and "type 2 hypervisor" stuff is marketing guff – VMware came up with a near-native-speed PC emulator for the PC. It's how everything from KVM to Hyper-V works. Software emulates a whole PC, from the BIOS to the disks and NICs, so you can run one OS under another.
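
To make that concrete, here is roughly what whole-system virtualization looks like from the command line, using QEMU (the userspace half of KVM) as a stand-in – the file names are placeholders. The software stands up a complete fake PC: firmware, disk controller, NIC and all, and boots a guest OS inside it.

```
# Boot a whole emulated PC: virtual firmware, virtual disk, virtual NIC.
# Without -enable-kvm, every guest instruction runs through software emulation.
qemu-system-x86_64 \
    -m 2048 -smp 2 \
    -drive file=guest.qcow2,format=qcow2 \
    -netdev user,id=net0 -device e1000,netdev=net0 \
    -cdrom ubuntu-server.iso
```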

It's conceptually simple. The hard part was making it fast. VMware's big innovation was running most of the guest's code natively, and finding a way to trap just the "ring 0" kernel-mode code and run only that through its software x86 CPU emulation. Later, others worked out how to do the same, and then Intel and AMD extended their chips to hardware-accelerate running ring-0 code under another OS – in effect inserting a "ring -1" underneath.
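
You can see whether a processor has that hardware assist by looking for the relevant CPU feature flags – vmx on Intel, svm on AMD. A quick check from a Linux shell:

```
# Hardware virtualization support shows up as CPU feature flags:
# "vmx" for Intel VT-x, "svm" for AMD-V. No output means software-only emulation.
grep -Eo 'vmx|svm' /proc/cpuinfo | sort -u
```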

But it's still very inefficient. Yes, there are hacks to allow RAM over-commit, sparse disk allocation and so on, but overdo it and performance suffers badly. The sysadmin has to partition stuff up manually and the VMs take ages to boot, limiting rapid scaling.
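
Sparse disk allocation is a good example of those hacks. A guest disk can claim to be tens of gigabytes while occupying almost nothing on the host – until the guests start filling their disks up and the over-commit bill comes due. A quick illustration with QEMU's disk tool (file name is a placeholder):

```
# Create a 40GB thin-provisioned disk image...
qemu-img create -f qcow2 guest.qcow2 40G

# ...which occupies only a few hundred kilobytes until the guest writes to it.
qemu-img info guest.qcow2    # compare "virtual size" with "disk size"
du -h guest.qcow2
```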

In computing terms, this is stone-age stuff. The whole point of half a century of R&D on dynamic memory management and multitasking operating systems was to avoid having to do this stuff manually. VMs squander all that.

Yes, it's improved, there are good management tools and so on, but all PC OSes were designed around the assumption that they run on their own dedicated hardware. Virtualization is still a kludge – but just one so very handy that everyone uses it.

That's why containers are much more efficient: they provide isolation without emulation. Normal PC OSes are divided into two parts: the kernel and drivers in ring 0, and all the ordinary unprivileged code – the "GNU" part of GNU/Linux – and your apps, in ring 3.

With containers, a single kernel runs multiple separate, walled-off userlands (the ring 3 stuff). Each thinks it's the only thing on the machine. But the kernel keeps total control of all the processes in all the containers.

There's no emulation, no separate memory spaces or virtual disks. A single kernel juggles multiple processes in one memory space, as it was designed to do. It doesn't matter if a container holds one process or a thousand. To the kernel, they're just ordinary programs – they load and can be paused, duplicated, killed or restarted in milliseconds.
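
That "ordinary programs under one kernel" point is easy to demonstrate. The kernel features underneath container runtimes are namespaces and cgroups, and you can poke at namespaces directly with util-linux's unshare – a rough sketch of the isolation mechanism, not a real container:

```
# Start a shell in its own PID and mount namespaces (needs root).
# Inside, the shell believes it is PID 1 and sees only its own processes --
# but to the host kernel it is just another process, visible in the host's ps.
sudo unshare --fork --pid --mount-proc bash
ps aux    # run this inside: only bash and ps are visible
```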

But there's only one kernel, so you can only run Linux containers on Linux. And because every container shares that one kernel, if an app in a container needs a kernel update, every container gets it and the whole machine must be rebooted.

These are fundamentally different approaches. So how can two such different technologies be merged, or the distinction between them be narrowed?

The hypervisor that isn't a hypervisor

Canonical has come up with something like a combination – although it admittedly has limitations. Its LXD "containervisor" runs system containers – ones holding a complete Linux distro from the init system upwards. The "container machines" share nothing but the kernel, so they can contain different versions of Ubuntu to the host – or even completely different distros.

LXD uses Btrfs or ZFS to provide snapshotting and copy-on-write, permitting rapid live migration between hosts. Host devices – disk drives, network interfaces, almost anything – can be dedicated to particular containers, and limits on RAM, disk, processor and I/O usage can be set and changed dynamically. You can change how many CPU cores a container sees on the fly, or pin containers to particular cores.
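
In practice that looks something like this – the image and container names are illustrative, and the limits.* keys are LXD's standard configuration knobs:

```
# Launch a system container from an Ubuntu image, then resize it on the fly.
lxc launch ubuntu:16.04 web
lxc config set web limits.cpu 2        # or "1,3" to pin to specific cores
lxc config set web limits.memory 2GB
lxc snapshot web before-upgrade        # cheap, thanks to copy-on-write storage
```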

Despite some of the marketing folks' claims, it's not a full hypervisor. You can't run non-Linux containers. Indeed you can only run distros that will work on the kernel of the host's version of Ubuntu. You can't even freely migrate containers between hosts running different Ubuntu versions. Also, any global restrictions in the Linux kernel – such as number of network connections or IP addresses – apply to all the containers on the host put together.
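
Those kernel-wide limits are ordinary sysctl tunables, and since there is only one kernel, there is only one of each. For instance (which tunables actually bite will depend on the workload):

```
# One kernel means one connection-tracking table, shared by every container on the host
# (assuming the nf_conntrack module is loaded).
sysctl net.netfilter.nf_conntrack_max
```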

However, it does offer most of the functionality of Xen or KVM-style Linux-on-Linux virtualization but with considerably greater efficiency, meaning lower overheads in both resources and licence costs. Importantly for Canonical, it allows Ubuntu Server to run software that's only certified or supported on more established enterprise distros, such as RHEL or SLES.

LXD is a pure container system that looks as much as possible like a full hypervisor, without actually being one.

... and containers that aren't really containers

What's the flipside of trying to make containers look like VMs? A hypervisor trying very hard to make VMs look like containers, complete with endorsement from an unexpected source.

When IBM invented hypervisors back in the 1960s, it created two different flavours of mainframe OS – one kind designed to host other systems in VMs, and another, radically different kind designed solely to run inside them.

Some time ago, Intel modified Linux into something akin to a mainframe-style system: a dedicated guest OS, plus a special hypervisor designed to run only that OS. The pairing of a hypervisor that will only run one specific Linux kernel, plus a kernel that can only run under that hypervisor, allowed Intel to dispense with a lot of baggage on both sides. The VMs aren't PC-compatible. There's no BIOS or boot process – just copy the kernel into RAM and execute it. No need to emulate a display or other IO – everything, including the root filesystem, is accessed over a simple, fast, purely virtual network connection. Guest control is over ssh, just like with containers.
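
You can get a feel for the approach with QEMU's direct-kernel-boot options – not Intel's actual stack, just an illustration of skipping the BIOS and boot process entirely (file names and kernel arguments are placeholders):

```
# No firmware, no bootloader: load a kernel image straight into the VM's RAM
# and jump to it, talking to the outside world over a serial console.
qemu-system-x86_64 -enable-kvm -m 512 \
    -kernel vmlinuz -initrd initrd.img \
    -append "console=ttyS0" \
    -nographic
```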

The result is a tiny, simple hypervisor and tiny VMs, which start in a fraction of a second and require a fraction of the storage of conventional ones, with almost no emulation involved. In other words, much like containers.

Intel announced this under the slightly misleading banner of "Clear Containers" some years ago. It didn't take the world by storm, but support is slowly growing. First, CoreOS added Clear Containers support to its rkt container runtime. Later, Microsoft added it to Azure. Now, though, Docker supports it, which might speed adoption.
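
The Docker side works through its pluggable runtime mechanism: register an extra OCI runtime in the daemon configuration, then pick it per container. Roughly – the runtime name and path come from Intel's packaging of the day, so treat them as illustrative:

```
# /etc/docker/daemon.json -- register an extra OCI runtime alongside the default runc:
#   { "runtimes": { "cc-runtime": { "path": "/usr/bin/cc-runtime" } } }
#
# Then choose it per container; this "container" is really a lightweight VM:
docker run --runtime=cc-runtime -it ubuntu bash
```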

Summary? Now both Docker and CoreOS rkt containers can be started in actual VMs, for additional isolation and security – whereas a Linux distro vendor is offering a container system that aims to look and work like a hypervisor. These are strange times. Perhaps the only common element is that it's bad news for both VMware and Microsoft. ®
