A chunk of code added to the Linux kernel to help inter-container communication turned out to mess up checksum handling on Ethernet networks.
The bug was in veth (Virtual Ethernet), the kernel's virtual Ethernet driver.
As the patch description notes, the coding error allowed corrupt packets to be passed to a veth device for delivery to the application. Vijay Pandurangan, engineering site lead at Twitter's New York City offices, writes that applications at Twitter were receiving corrupt data “when network hardware was corrupting packets” (for example, when a hardware device was failing).
He notes: “Packets that arrive from real hardware devices have ip_summed == CHECKSUM_UNNECESSARY if the hardware verified the checksums, or CHECKSUM_NONE if the packet is bad or it was unable to verify it. The current version of veth will replace CHECKSUM_NONE with CHECKSUM_UNNECESSARY” – and that let bad packets through.
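The behaviour in that quote can be modelled in a few lines. What follows is a minimal Python sketch, not kernel code: the two constant names come from the quote, while the enum, the function names and the "leave the flag alone" comparison path are assumptions made purely for illustration.

```python
from enum import Enum


class IpSummed(Enum):
    # Names mirror the kernel constants quoted above; the values here
    # are only labels for this model.
    CHECKSUM_NONE = 0          # not verified: the stack still has to check it
    CHECKSUM_UNNECESSARY = 1   # already verified: the stack may skip the check


def buggy_veth_forward(ip_summed):
    """Model of the described bug: CHECKSUM_NONE is overwritten."""
    if ip_summed is IpSummed.CHECKSUM_NONE:
        return IpSummed.CHECKSUM_UNNECESSARY
    return ip_summed


def unmodified_forward(ip_summed):
    """Hypothetical comparison path that passes the flag through untouched."""
    return ip_summed


def stack_will_verify(ip_summed):
    """The receiving stack only checks packets not already marked as verified."""
    return ip_summed is IpSummed.CHECKSUM_NONE


if __name__ == "__main__":
    arriving = IpSummed.CHECKSUM_NONE  # e.g. hardware could not verify the packet
    print("buggy path verifies:", stack_will_verify(buggy_veth_forward(arriving)))
    print("untouched path verifies:", stack_will_verify(unmodified_forward(arriving)))
```

In this model a packet that arrives unverified is relabelled as verified on the buggy path, so nothing downstream ever looks at its checksum.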
It was a pretty esoteric bug to trace and fix, as Pandurangan explains on Medium. “One weekend in November, a group of Twitter engineers responsible for a wide variety of services got paged. Each affected application showed ‘impossible’ errors, like weird characters appearing in strings, or missing required fields.”
That's hard enough to trace in a highly distributed system, but he goes on to say that the bad packets could live on in caches and disk logs “long after the original corruption”.
With the source eventually nailed down to the rack level, new hardware fixed the problem, but Twitter's “ton” of engineers still needed to know why TCP checksums weren't trapping the bad packets.
After a variety of theories and tests, he writes, someone noticed that there was an important difference between the test environments and the live systems: “while our tests were on a normal Linux system, most services at Twitter run on Mesos, which uses Linux containers to isolate different applications.”
That's where veth came in: all application packets are passed through virtual Ethernet devices. That gave them a means to reproduce the error and find out what was wrong.
“In order to construct a container with a virtual ethernet device, one must (1) create a container, (2) create a veth, (3) bind one end of the veth to the container, (4) assign an IP address to the veth, (5) set up routing, usually using Linux Traffic Control, so that packets can get in and out of the container,” Pandurangan writes.
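Those five steps map roughly onto a handful of `ip` commands. The sketch below is only an approximation under stated assumptions: a network namespace stands in for a full container, a plain default route replaces the Traffic Control setup Pandurangan mentions, and the function name, interface names and address are all hypothetical. It only builds the command lines; running them would need root on a Linux host.

```python
def veth_setup_commands(namespace, host_if, peer_if, peer_addr):
    """Return ip(8) command lines for the five steps, in order.

    Nothing is executed here; the caller decides whether to run them.
    """
    in_ns = ["ip", "netns", "exec", namespace]
    return [
        # (1) create the "container" (here just a network namespace)
        ["ip", "netns", "add", namespace],
        # (2) create the veth pair
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # (3) move one end of the pair into the namespace
        ["ip", "link", "set", peer_if, "netns", namespace],
        # (4) assign an address to the end inside the namespace
        in_ns + ["ip", "addr", "add", peer_addr, "dev", peer_if],
        # bring both ends up so traffic can flow
        ["ip", "link", "set", host_if, "up"],
        in_ns + ["ip", "link", "set", peer_if, "up"],
        # (5) simplified routing: a default route instead of Traffic Control
        in_ns + ["ip", "route", "add", "default", "dev", peer_if],
    ]


if __name__ == "__main__":
    # Hypothetical names and address, printed rather than executed.
    for argv in veth_setup_commands("demo", "veth0", "veth1", "10.0.0.2/24"):
        print(" ".join(argv))
```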
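Reviewing the code, Pandurangan concluded that the bug was intended to be a feature. If two containers are passing packets on the same machine, there's no need for checksums designed to detect (for example) a dodgy port on an Ethernet switch. The trouble is that veth applied the same shortcut to packets that had arrived from real hardware, where the checksum still matters.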
Since the error was introduced in December 2010, he suggests that a lot of sysadmins have lost sleep and hair over unexpected crashes.
The patch is now merged into kernels back as far as 3.14 and arrived in popular distributions during January 2016. ®