Microsoft has let the world in on one of its key Azure management tools: a simulator designed to help prevent nearly 70 per cent of the bugs that cause network downtime.
The simulator, called CrystalNet, is a design tool Microsoft Research created for its admins to help avoid downtime during routine maintenance and upgrades.
Redmond describes CrystalNet as a “cloud-scale, high-fidelity network emulator” which “runs real network device firmwares in a network of containers and virtual machines, loaded with production configurations”.
As the authors explain in the paper [PDF] presented to the Association for Computing Machinery's (ACM's) Symposium on Operating Systems Principles 2017, they also wanted to deal with various types of failure modes, including human errors “which are responsible for a non-negligible six per cent of the outages in our network”.
The table below details two years' worth of outage data Microsoft collected while creating CrystalNet.
|Root cause|Share of outages (per cent)|Examples|
|---|---|---|
|Software bugs|36|Bugs in routers, middleboxes or management tools|
|Configuration bugs|27|Wrong ACL policies, traffic black holes, route leaks|
|Human errors|6|Mis-typing, unexpected design flaws|
|Hardware failures|29|ASIC driver failures, silent packet drops, fibre cuts, power failures|
CrystalNet doesn't cover hardware failures, the last item in the table, but that still leaves it with coverage of 69 per cent of what goes wrong in Azure: software bugs, configuration bugs and human errors combined.
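The 69 per cent figure can be checked against the table: it matches the sum of every category except hardware failures. A quick sketch in Python, with the numbers copied from the table and the variable names chosen here purely for illustration:

```python
# Outage shares (per cent), copied from the table above.
outage_share = {
    "Software bugs": 36,
    "Configuration bugs": 27,
    "Human errors": 6,
    "Hardware failures": 29,
}

# Assumption for this check: emulation can help with every category
# except hardware failures.
covered = [cause for cause in outage_share if cause != "Hardware failures"]
coverage = sum(outage_share[cause] for cause in covered)
total = sum(outage_share.values())

print(f"Covered by emulation: {coverage} per cent")
print(f"Sum of all listed categories: {total} per cent")
```

The listed categories add up to 98 per cent rather than 100; the table as reproduced here does not say where the remainder went.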
The problem statement is daunting: the aim with CrystalNet is to create a simulation and validation environment for a cloud-scale network that can cope with bugs in routing software and with interoperability failures arising because two vendors implement the same protocol differently. Software bugs of this kind account for as much as 36 per cent of outages.
The system takes advantage of the move by nearly all vendors to offer their device software in virtualised form. That let the researchers take the control plane of the Azure network and replicate it in a container environment.
“CrystalNet runs real network device firmwares in virtualized sandboxes … We inter-connect the device sandboxes with virtual links to mimic the real topology. It loads real configurations into the emulated devices, and injects real routing states into the emulated network.”
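To make that description concrete, here is a toy model of the idea in Python: one sandbox per device, one virtual link per physical link, and each device's configuration loaded into its sandbox. It is a sketch under stated assumptions, not CrystalNet's code. `Sandbox`, `build_emulation`, the device names and the firmware labels are all invented for illustration, no real containers are started, and the injection of routing state is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Stand-in for a container or VM running one device's firmware."""
    device: str
    firmware: str
    config: str = ""
    peers: set = field(default_factory=set)


def build_emulation(devices, links, configs):
    """Mirror a topology as sandboxes joined by virtual links."""
    sandboxes = {name: Sandbox(name, fw) for name, fw in devices.items()}
    for a, b in links:
        # A virtual link is modelled here as a symmetric peer entry.
        sandboxes[a].peers.add(b)
        sandboxes[b].peers.add(a)
    for name, text in configs.items():
        # Load the (hypothetical) production config into the sandbox.
        sandboxes[name].config = text
    return sandboxes


# Hypothetical three-device topology; names and images are illustrative.
devices = {"tor1": "vendor-a-os", "leaf1": "vendor-b-os", "spine1": "vendor-a-os"}
links = [("tor1", "leaf1"), ("leaf1", "spine1")]
configs = {name: f"hostname {name}" for name in devices}

emulated = build_emulation(devices, links, configs)
print(sorted(emulated["leaf1"].peers))
```

Mixing two firmware labels in the example is deliberate: running different vendors' software side by side is what would let an emulator surface interoperability problems.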
For users, the software includes APIs to configure, create, and delete simulations; run tests; and observe the network state.
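The shape of such a workflow might look something like the Python sketch below. Every name in it (`EmulationSession`, `configure`, `create`, `observe`, `run_test`, `delete`) is hypothetical and chosen to mirror the verbs in the sentence above; none of it is taken from CrystalNet's actual interface, and the "network state" is a placeholder rather than real routing data.

```python
class EmulationSession:
    """Hypothetical wrapper; method names are illustrative only."""

    def __init__(self):
        self.topology = None
        self.running = False
        self.routes = {}

    def configure(self, topology):
        """Record which devices the emulation should contain."""
        self.topology = list(topology)

    def create(self):
        """Start the emulation (here: just initialise placeholder state)."""
        if self.topology is None:
            raise RuntimeError("configure() must be called before create()")
        self.running = True
        # Placeholder state: each device initially knows only itself.
        self.routes = {device: {device} for device in self.topology}

    def observe(self, device):
        """Return the placeholder routing state for one device."""
        return sorted(self.routes[device])

    def run_test(self, check):
        """Apply a caller-supplied check to every device's state."""
        return all(check(d, self.observe(d)) for d in self.topology)

    def delete(self):
        """Tear the emulation down."""
        self.running = False
        self.routes.clear()


session = EmulationSession()
session.configure(["tor1", "leaf1"])
session.create()
ok = session.run_test(lambda device, routes: device in routes)
print("checks passed:", ok)
session.delete()
```

Passing the check in as a function keeps the session generic: an operator could rehearse a change in the emulation, run the same checks before and after, and only then touch production.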
Redmond itself is happy with CrystalNet in its own hands, saying it helped migrate regional Azure networks to a standardised architecture “with zero user-impacting incidents”, even though production traffic continued throughout the migration.
The paper was written by Microsoft Azure and Microsoft Research staff Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Jiaxin Cao, Sri Tallapragada, Nuno Lopes, Andrey Rybalchenko, Guohan Lu, and Lihua Yuan. ®