Hyperscale data centres win between their ears, not on the racks
Operating at scale is easy. Changing culture to accept and cope with failure is harder
Organisations that hope to improve their own data centre operations by adopting the techniques used by hyperscale operators like Google or Facebook need to consider the stuff between their ears, not just the stuff on their racks, because changing data centre culture is more powerful than changing equipment.
That was the gist of a session delivered by Gartner analysts Joe Skorupa and Evan Zeng at the firm's IT Infrastructure, Operations & Data Centre Summit in Sydney on Tuesday.
“Operating at scale is not the trick,” Skorupa said. “The issue is that hyperscalers understand how to deal with risk.”
The pair argued that the culture of corporate data centres and the incentives offered to their staff militate against innovation. “In the enterprise we measure and pay people on mean time between failure,” Skorupa said. “The whole operating principle is to avoid risk at all cost.” Data centre teams therefore run a mile at the prospect of anything that might risk an outage and end up incapable of innovation as a result.
Hyperscalers, by contrast, accept that there will be failures and “are better at identifying and managing risk, and recovering from failure.”
Skorupa therefore said that organisations hoping to learn from hyperscalers “can't think the way you used to think and can't measure and reward the way you used to do.” So forget about just buying some Open Compute kit and then living the good life.
He and Zeng have therefore cooked up new metrics by which to measure on-premises data centre teams, namely:
- Mean time to respond for a new service
- Mean time to discover a failure
- Mean time to repair
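The analysts did not spell out how these metrics should be calculated, so what follows is only a sketch of one plausible reading, written in Python. The `Incident` and `ServiceRequest` records and their field names are hypothetical. It assumes a team logs when a failure started, when it was noticed and when it was fixed, plus when a new service was requested and delivered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record: field names are assumptions, not from the talk.
    started: datetime    # when the failure actually began
    detected: datetime   # when the team noticed it
    repaired: datetime   # when service was restored


@dataclass
class ServiceRequest:
    # Hypothetical record of a request for a new service.
    requested: datetime
    delivered: datetime


def mean_timedelta(deltas):
    """Average a sequence of timedeltas; return None if there are none."""
    seconds = [d.total_seconds() for d in deltas]
    if not seconds:
        return None
    return timedelta(seconds=mean(seconds))


def mean_time_to_discover(incidents):
    # Assumed definition: gap between a failure starting and being noticed.
    return mean_timedelta(i.detected - i.started for i in incidents)


def mean_time_to_repair(incidents):
    # Assumed definition: gap between detection and restoration.
    return mean_timedelta(i.repaired - i.detected for i in incidents)


def mean_time_to_respond(requests):
    # Assumed definition: gap between a request and its delivery.
    return mean_timedelta(r.delivered - r.requested for r in requests)
```

Measuring repair from detection rather than from the start of the failure is a choice made here to keep the two incident metrics separate; a team could just as reasonably measure it from the moment the failure began.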
The pair also advocate learning how to understand the impact of failure, or the “blast radius of an outage.” Doing so means data centre teams can think differently about the kind of changes they are willing to entertain.
“They won't do 100 changes at once because the blast radius is big,” Zeng said. “But they will learn to design the data centre for resilience and for frequent changes that have small blast radii.”
Which is not to say that kit doesn't matter. The pair advocated standardisation wherever possible. Skorupa also singled out Dell's disaggregated switches, which can run any of five network operating systems, as the kind of hyperscale-driven innovation that on-premises operators would do well to consider. Chef, Puppet and Ansible were name-checked for their usefulness in automation, which the pair pronounced essential “if there is any chance of doing something twice.”
But the pair also declared that “the biggest opportunity is changing how people think and react”. ®