Alibaba Cloud claims K8s service meshes can require more resources than the apps they run
Built its own replacement – Canal Mesh – that it says leaves Google's Istio and Ambient eating dust
SIGCOMM 2024 Alibaba Cloud has claimed its home-grown service mesh for Kubernetes – Canal Mesh – significantly outperforms Google's Istio and other rival tools.
The Chinese cloud leader revealed the existence of Canal Mesh at last week's Association for Computing Machinery SIGCOMM conference in Sydney, Australia, in a presentation and paper [PDF]. The prez opened with an explanation of how microservices rely on service meshes to connect Kubernetes pods, how those meshes rely on a proxy "sidecar" to handle and mediate network communication between microservices, and to collect telemetry on traffic, so that applications don't need their own networking plumbing.
But in Alibaba Cloud's estimation, sidecars "cause numerous problems, including intrusion into the user pod, excessive resource occupation, significant overhead in managing many sidecars, and performance degradation caused by passing traffic through the sidecar."
The Chinese cloud considered Istio's impact on a customer that ran a Kubernetes cluster comprising 500 nodes and 15,000 pods, and found the mesh consumed 1,500 cores and 5,000 gigabytes of memory – ten percent of the cluster's hardware resources.
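To put those figures in per-pod terms, here is a rough back-of-the-envelope sketch. It assumes one sidecar per pod and that the ten percent share applies equally to CPU and memory; neither assumption is confirmed by the paper, and the function and key names are illustrative only.

```python
# Figures quoted above for the customer cluster.
NODES = 500
PODS = 15_000
MESH_CORES = 1_500
MESH_MEMORY_GB = 5_000
MESH_SHARE_PERCENT = 10  # "ten percent of hardware resources"


def mesh_overhead() -> dict:
    """Derive per-pod and implied cluster-wide numbers from the quoted totals."""
    # Assumption: exactly one sidecar per pod, so totals divide evenly by pod count.
    cores_per_pod = MESH_CORES / PODS
    memory_gb_per_pod = MESH_MEMORY_GB / PODS

    # Assumption: the ten percent share holds for both CPU and memory.
    cluster_cores = MESH_CORES * 100 / MESH_SHARE_PERCENT
    cluster_memory_gb = MESH_MEMORY_GB * 100 / MESH_SHARE_PERCENT

    return {
        "cores_per_pod": cores_per_pod,
        "memory_gb_per_pod": memory_gb_per_pod,
        "implied_cluster_cores": cluster_cores,
        "implied_cluster_memory_gb": cluster_memory_gb,
        "implied_cores_per_node": cluster_cores / NODES,
        "implied_memory_gb_per_node": cluster_memory_gb / NODES,
    }


for name, value in mesh_overhead().items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions each sidecar accounts for about a tenth of a core and a third of a gigabyte, which sounds small until it is multiplied by 15,000 pods.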
In other scenarios, Alibaba Cloud claimed the sidecar's CPU and memory requirements "grow even higher than that of the app."
That's bonkers, and clearly untenable. And in 2022 Google did something about it by introducing Ambient Mesh – an Istio data plane mode that offered Istio users the chance to park sidecars.
Alibaba Cloud's paper notes that Ambient Mesh improved performance and reduced demands on resources – but still required some proxies to reside within the user cluster.
The Chinese cloud felt complete decoupling of service mesh from user clusters would be more effective – and built Canal Mesh to prove it.
The paper claims Alibaba Cloud succeeded, handsomely, and produced the following results:
- Throughput 12.3x and 2.3x higher than Istio and Ambient, with latency 1.7x and 1.3x lower;
- CPU consumption 12x~19x and 4.6x~7.2x lower than Istio and Ambient;
- Configuration completion time for creating hundreds of pods 1.5x~2.1x and 1.2x~1.5x smaller than Istio and Ambient;
- Southbound bandwidth occupation 9.8x and 4.6x lower than Istio and Ambient.
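For readers who prefer percentages, the "Nx lower" claims above can be converted into savings. This minimal sketch assumes "Nx lower" means the baseline divided by N, which is the usual reading but is not spelled out here; the helper name and labels are hypothetical.

```python
def reduction_percent(factor: float) -> float:
    """Percentage saved when a quantity becomes `factor` times lower.

    Assumption: "Nx lower" means new_value = baseline / N.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1 - 1 / factor) * 100


# "Nx lower" factors quoted in the list above (range endpoints shown separately).
claims = {
    "CPU vs Istio (low end)": 12.0,
    "CPU vs Istio (high end)": 19.0,
    "CPU vs Ambient (low end)": 4.6,
    "CPU vs Ambient (high end)": 7.2,
    "Southbound bandwidth vs Istio": 9.8,
    "Southbound bandwidth vs Ambient": 4.6,
}

for label, factor in claims.items():
    print(f"{label}: {factor}x lower = {reduction_percent(factor):.1f}% saved")
```

On that reading, a 12x reduction in CPU consumption is a saving of roughly 92 percent, and even the smallest factor listed, 4.6x, is a saving of about 78 percent.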
Alibaba Cloud achieved those numbers with an architecture that sees proxies moved out of the user cluster – albeit with a minimal on-node proxy retained to handle some security and observability chores.
eBPF-based kernel bypass and remote mTLS acceleration are also employed. The paper describes how Alibaba Cloud uses its hyperscale smarts to place proxies across its pools of resources.
The paper and presentation state that Canal Mesh has run at Alibaba Cloud for a year – without quite confirming it is in production. Both also omit a link to, or even a mention of, code for you to peruse or implement – but the presentation includes contacts at Alibaba Cloud for those who have questions.
If Alibaba Cloud intends to keep Canal Mesh to itself, it may operate more efficiently than its rivals. The Chinese cloud does compete with the likes of AWS, Google, and Azure in some markets, but The Register understands most of Alibaba's clients outside China have roots or connections in the Middle Kingdom that make them more comfortable with the outfit than buyers based in other countries tend to be. ®