Huawei Cloud built a network monitor so sensitive it spotted the impact of a single faulty chip
Focus on physical ports helped spot issues across 100,000 switches and a million servers
SIGCOMM 2024: Huawei Cloud has developed a network monitoring tool that, when used in production on three of its own regions, was able to observe more of its infrastructure than existing tools, and revealed issues that previously evaded human efforts.
The tool is called RD-Probe and was detailed in a paper [PDF] presented on Tuesday at the SIGCOMM 2024 conference in Sydney.
The paper explains that network monitoring is vital but hard to achieve at hyperscale. The authors – some from Huawei and others from the School of Computer Science at Peking University – cite AWS research [PDF] that states the Amazonian cloud has 10^87 intra-region link-path combinations and 10^176 inter-region link-path combinations. The paper also reveals that Huawei Cloud's datacenter networks comprise over 100,000 switches and a million servers. The sheer number of devices and paths – in a virtualized environment that uses randomness for load balancing – makes it very hard to gather enough data about what's going on at Layer 2.
RD-Probe is Huawei Cloud's attempt to solve that problem. The tool's developers decided to monitor each physical Layer 2 port, as doing so means they can observe the runtime status of switch fabrics. Considering only Layer 3, the authors write, would mean some ports would not be monitored.
Monitoring physical ports also helps to achieve more coverage than is possible when observing virtual networks – which, by their very nature, abstract some of the resources used to run them. That abstraction is a problem because, without comprehensive coverage, network monitoring tools have blind spots that mean issues are missed.
The paper notes that RD-Probe "seamlessly integrates with the existing monitoring architecture" and "only modifies the task generation and data processing modules."
The tool generates probes in two phases: first randomly, then deterministically. This two-phase scheme is also aimed at achieving the required monitoring coverage.
A dedicated 16-node cluster – in which each server runs an unnamed eight-core 2.80GHz CPU with 64GB of memory – generates the probes. Data generated by probes is processed by a streaming 48-node cluster in which each machine employs a 16-core 2.80GHz CPU with 32GB memory.
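To make the two-phase idea concrete, here is a minimal Python sketch of how random probing followed by deterministic probing can close coverage gaps. It is an illustration only and not RD-Probe's actual algorithm, which the paper describes in far more detail. The toy topology, the port names, and the functions `random_phase`, `deterministic_phase` and `coverage` are all hypothetical, and the greedy selection in the second phase is an assumption made for the example.

```python
import random

# Hypothetical toy topology: each candidate probe path is the tuple of
# physical ports it traverses. Names are invented for illustration.
PATHS = [
    ("leaf1:p1", "spine1:p1", "leaf2:p1"),
    ("leaf1:p2", "spine2:p1", "leaf2:p2"),
    ("leaf1:p3", "spine3:p1", "leaf2:p3"),
]
ALL_PORTS = {port for path in PATHS for port in path}


def random_phase(paths, num_probes, rng):
    """Phase 1 (assumed): probe randomly chosen paths, record ports touched."""
    covered = set()
    for _ in range(num_probes):
        covered.update(rng.choice(paths))
    return covered


def deterministic_phase(paths, all_ports, covered):
    """Phase 2 (assumed): greedily add paths that hit still-uncovered ports."""
    covered = set(covered)
    remaining = set(all_ports) - covered
    chosen = []
    while remaining:
        # Pick the path that covers the most ports not yet observed.
        best = max(paths, key=lambda p: len(remaining & set(p)))
        gain = remaining & set(best)
        if not gain:
            break  # leftover ports are not reachable by any known path
        chosen.append(best)
        covered |= gain
        remaining -= gain
    return chosen, covered


def coverage(covered, all_ports):
    """Fraction of physical ports observed by at least one probe."""
    return len(covered & all_ports) / len(all_ports)


if __name__ == "__main__":
    rng = random.Random(0)
    after_random = random_phase(PATHS, 2, rng)
    extra_paths, after_both = deterministic_phase(PATHS, ALL_PORTS, after_random)
    print(f"after random phase: {coverage(after_random, ALL_PORTS):.0%}")
    print(f"after deterministic phase: {coverage(after_both, ALL_PORTS):.0%}")
```

In this toy setup the random phase may leave some ports unprobed, while the deterministic phase targets exactly the ports that are still missing, which is the general reason for combining the two approaches.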
Within a month of using RD-Probe, Huawei Cloud found "many previously unnoticed issues."
Thankfully most "only caused fail-slow symptoms or intermittent packet drops" and they were spotted before users perceived degraded service. This made Huawei happy, as the paper's authors rated the issues "hard to locate via manual inspection."
Faults detected by RD-Probe and missed by other tools included:
- A faulty chip in the line processing unit of a core switch used in an object storage service, which caused dropped incoming packets and could not report the issue to the control plane;
- Flawed load balancing that caused traffic to go only through the local port instead of stack cables;
- Use of incorrect values for some BGP routes, which led traffic onto a slow path.
Huawei's researchers are pleased with RD-Probe because it improved the cloud's network monitoring coverage from 80.9 percent of resources to 99.5 percent, and "unearthed several previously unnoticed issues while tolerating numerous faults."
The concern plans to implement it in more cloud regions soon.
But the paper's authors also point out that RD-Probe does not consider North-South traffic, and can't filter out server-side failures. Fixing those issues remains on Huawei's to-do list. ®