Back in September, The Register's networking desk chatted to a company called Teclo about the limitations of TCP performance in the Linux stack.
That work, described here, included moving TCP/IP processing off to user-space to avoid the complex processing that the kernel has accumulated over the years.
It's no surprise, then, to learn of other high-performance efforts addressing the same issue: both the BBC in its video streaming farms; and CloudFlare, which needs to deal with frequent packet flood attacks.
The Beeb's work is described by research technologist Stuart Grace here. The broadcaster explains that its high-definition video streams have to push out 340,000 packets per second into 4 Gbps ultra-high definition streams.
With just 3 µs per packet of processing time, the post says, using the kernel stack simply wasn't an option.
Using the network sockets API, the post explains, involves a lot of handling of the packet, as “each data packet passes through several layers of software inside the operating system, as the packet's route on the network is determined and the network headers are generated. Along the way, the data is copied from the application's buffers to the socket buffer, and then from the socket buffer to the device driver's buffers.”
The Beeb boffins started by getting out of the kernel and into userspace, which let them write what they call a “zero-copy kernel bypass interface, where the application and the network hardware device driver share a common set of memory buffers”.
The post continues that when the application creates a group of packets and their network headers, it does so directly in those shared buffers.
“Then using a single function call, the whole group is handed over to the control of the device driver which transmits them directly on to the network”.
Networking away from the kernel: the BBC's architecture
The broadcaster hasn't said when or if it will publish the work.
CloudFlare: We wrote a new syntax, you won't believe what happened next
CloudFlare's approach is similar – a userspace kernel bypass – but with wrinkles specific to its circumstances.
CloudFlare's problem is not just the quantity of packets, but the need to distinguish attack packets from user data. Regular readers of The Register will already know that the provider suffers regular attacks.
As Gilberto Bertin writes: “During packet floods we offload selected network flows (belonging to a flood) to a user space application. This application filters the packets at very high speed. Most of the packets are dropped, as they belong to a flood. The small number of "valid" packets are injected back to the kernel and handled in the same way as usual traffic.”
To get what it wanted, Bertin says, the company settled on writing modifications to the Netmap Project.
Instead of using Netmap's default functionality, which bumps all received packet queues away from the kernel, “We want to keep most of the RX queues back in the kernel mode, and enable Netmap mode only on selected RX queues. We call this functionality: "single RX queue mode".”
Like all the best hacks, it's small and simple. As Bertin writes, most of what CloudFlare needed was the ability to split queues: “the only difference is the nm_open() call, which uses the new syntax netmap:ifname~queue_number.”
The patch code has been submitted to the project and is available on GitHub. ®