Alibaba Cloud reveals network telemetry tool that helped cut number of engineers needed by 86%
Zoonet employs 'elegant generalization of ping and traceroute' among other tricks
Exclusive Alibaba Cloud has detailed the telemetry tool it uses to look out for glitches in customers' virtual networks, and revealed it’s reduced the number of personnel dedicated to troubleshooting by 86 percent since developing the system.
Detailed in a paper poetically entitled "Proactive Telemetry in Large-Scale Multi-Tenant Cloud Overlay Networks" published in the most recent issue of IEEE/ACM Transactions on Networking, the tool is named Zoonet and is described as "a proactive virtual network telemetry system for multi-tenant clouds."
The paper – penned by 12 Alibaba Cloud staff and three researchers from the Institute for Network Sciences and Cyberspace and Tsinghua University’s Beijing National Research Center for Information Science and Technology – opens with the observation that telemetry systems for physical networks are well understood, but that virtual networks have not had as much attention.
Alibaba Cloud, as is the nature of such operations, runs rather a lot of virtual networks and so set out to learn how best to manage them to ensure tenants didn't experience glitches or outages.
Zoonet is the result of those efforts. The paper reveals it "has been deployed in Alibaba Cloud for over two years, covering tens of cloud regions, hundreds of thousands of servers." The authors admit that Alibaba has become "increasingly reliant on Zoonet as it reduces 86 percent of the personnel engaged in troubleshooting."
The tool comprises the following elements:
- A data plane that uses host agent and arp-ping to protect tenants' privacy and defines an elegant generalization of ping and traceroute, which can work on heterogeneous middleboxes;
- A control plane that conducts update batch processing and substantial probing path pruning to lessen the overhead;
- An analysis plane that reduces noise and aggregates alerts based on temporal and spatial correlation and conducts the hop-by-hop telemetry mode to locate failures.
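The hop-by-hop mode mentioned in the last item can be pictured with a minimal sketch. The paper's actual algorithm is not described here, so everything below is an assumption for illustration: the hypothetical `locate_failure` function, the `probe` callback, and the hop names are invented, and the logic simply walks a known path and reports the first hop that stops answering.

```python
from typing import Callable, Optional, Sequence


def locate_failure(path: Sequence[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first hop on the path that fails its probe, or None.

    `path` is an ordered list of hop names (hypothetical), and `probe`
    is any callable that returns True when a hop answers.
    """
    for hop in path:
        if not probe(hop):
            return hop  # first unresponsive hop is the likely fault location
    return None  # every hop answered, so no failure was located


# Illustrative path between two VMs; the names are made up.
example_path = ["vm-a", "vswitch-1", "gateway-1", "vswitch-2", "vm-b"]
healthy_hops = {"vm-a", "vswitch-1"}
suspect = locate_failure(example_path, lambda hop: hop in healthy_hops)
```

In this toy run `suspect` is the gateway, because it is the first hop outside the healthy set. A real system would also have to cope with hops that drop probes by design.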
The tool works by sending "probe packets" to detect virtual networks and gather desired telemetry data.
Doing so involves a "Zoonet agent" – described as "a process deployed together with the tenants' VMs and the VM hypervisor on the server."
The agent is bound to a specific CPU core, so it is isolated from tenants' workloads and the hypervisor.
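Binding a process to one core can be sketched on Linux with Python's standard `os.sched_setaffinity` call. This is not Zoonet's code: the helper name `pin_to_core` and the choice of the highest permitted core are assumptions made for the example, and pinning alone does not keep other workloads off that core, which would need separate configuration.

```python
import os
from typing import Optional


def pin_to_core(core: Optional[int] = None) -> Optional[int]:
    """Pin the current process to a single CPU core (Linux only).

    Returns the core used, or None on platforms without affinity support.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows: affinity calls are unavailable
    allowed = os.sched_getaffinity(0)  # cores this process may run on
    if core is None:
        core = max(allowed)  # arbitrary choice for the sketch
    if core not in allowed:
        raise ValueError(f"core {core} is not in the permitted set {sorted(allowed)}")
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return core


pinned = pin_to_core()
```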
Alibaba Cloud sometimes uses Zoonet to run "massively concurrent probing tasks" and concedes that, when it does, "CPU utilization will spike intensively due to the aggregated probing traffic, which may affect other running processes as well as the telemetry accuracy."
Probing tasks are therefore scattered at random time intervals to avoid CPU workload spikes.
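A rough sketch of that kind of scattering follows, assuming a uniform random offset inside a fixed window. The paper's actual scheduling policy is not given here, so the function names, the 60-second window, and the task count are all hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, Optional


def scatter_probe_times(task_ids: Iterable[Hashable], window_seconds: float,
                        seed: Optional[int] = None) -> Dict[Hashable, float]:
    """Give each probing task a random start offset within the window."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    return {task: rng.uniform(0.0, window_seconds) for task in task_ids}


def peak_concurrency(offsets: Dict[Hashable, float], bucket_seconds: float = 1.0) -> int:
    """Return the largest number of tasks starting in any one time bucket."""
    buckets: Dict[int, int] = {}
    for offset in offsets.values():
        key = int(offset // bucket_seconds)
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values()) if buckets else 0


offsets = scatter_probe_times(range(1000), window_seconds=60.0, seed=1)
peak = peak_concurrency(offsets)
```

If all 1,000 tasks started together the peak would be 1,000. Spread across a minute, the busiest one-second bucket holds only a small fraction of them.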
Zoonet also establishes a telemetry point at the network boundary between clouds and internet service providers, to measure the impact of traffic between the public internet and cloudy VMs.
Cloudy networks that span multiple datacenters and touch the public internet can be very complex – creating the potential for Zoonet to detect lots of anomalies. To avoid alert fatigue for its on-call engineers, Alibaba Cloud chose to aggregate warnings "according to temporal and spatial correlations." Engineers receive representative alerts for virtual network troubleshooting, rather than a firehose of warnings.
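One way to picture aggregation by temporal and spatial correlation is to group alerts that share a location and arrive close together in time, then surface one representative per group. This is only a sketch under those assumptions: the `Alert` fields, the `location` key, and the 60-second window are invented for the example and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    location: str     # hypothetical spatial key, e.g. a gateway or host name
    message: str


def aggregate_alerts(alerts: Iterable[Alert],
                     window_seconds: float = 60.0) -> List[Tuple[Alert, int]]:
    """Return (representative alert, group size) pairs.

    Alerts are grouped when they share a location and each one arrives
    within `window_seconds` of the previous alert in the same group.
    """
    by_location = defaultdict(list)
    for alert in alerts:
        by_location[alert.location].append(alert)

    groups = []
    for items in by_location.values():
        items.sort(key=lambda a: a.timestamp)
        current = [items[0]]
        for alert in items[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)   # still part of the same burst
            else:
                groups.append(current)  # gap too long: close the burst
                current = [alert]
        groups.append(current)
    # The earliest alert in each group stands in for the whole burst.
    return [(group[0], len(group)) for group in groups]


raw_alerts = [
    Alert(0.0, "gw-1", "packet loss"),
    Alert(10.0, "gw-1", "packet loss"),
    Alert(500.0, "gw-1", "packet loss"),
    Alert(5.0, "gw-2", "high latency"),
]
summary = aggregate_alerts(raw_alerts)
```

Here four raw alerts collapse into three representative ones, with the two closely spaced `gw-1` alerts merged into a single entry.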
The paper explains that during development, Zoonet revealed bugs, congestion, virtual routing anomalies, and "last-mile" anomalies in Alibaba Cloud's networks. Some of those problems, the paper asserts, could not have been found with telemetry tools created to monitor physical networks.
Zoonet's control plane sometimes came under "great pressure" because of bursts of virtual network updates. Alibaba Cloud figured out that was due to the frequent allocation and release of spot instances – a feature of hyperscale clouds that made Zoonet look buggy.
Sorting that out by adding service differentiation to Zoonet is on Alibaba Cloud's to-do list, along with improving last-mile coverage and making the tool work with third-party devices.
The paper concludes by stating: "Zoonet has been deployed in production for over two years and helps the cloud vendor detect, pinpoint, and mitigate a variety of network anomalies."
But there's no indication that Alibaba Cloud has open sourced Zoonet or ever will, despite its obvious attractiveness to other managers of virtual servers. With the labor savings it's made, the Chinese cloud probably wants to preserve the competitive advantage this tool confers. ®