New Relic: How observability reduces downtime

Efficient troubleshooting will cut your mean-time-to-resolution

Sponsored

A call in the middle of the night is something every member of an IT operations team dreads. Regardless of the hour, downtime demands that they spring into action, scrambling to resolve the outage before it seriously impacts a company’s revenues and reputation.

“I’ve witnessed this myself and it can be really painful, especially when a team has no real idea of what’s causing the downtime,” says Stijn Polfliet, Principal TechOps Strategy Consultant at New Relic. Time may be wasted chasing false diagnoses up blind alleys, he says. Worse still, frantic conversations and panicked guesswork can quickly descend into finger-pointing, as every layer in a complex application environment - and the team responsible for it - gets blamed.

While the clock keeps ticking, the costs mount up. A 2014 Gartner study estimates the average cost of downtime at $5,600 per minute. A more recent report, from the Ponemon Institute in 2016, raises that average to nearly $9,000 per minute. These averages, of course, vary according to company size, reach and volume of online business.

Gartner appears to have evolved its thinking about downtime costs since 2014. In an April 2019 research note, the analyst firm reveals that it received more than 25 client enquiries a month between January 2017 and February 2019 “on this topic, often asking if there are any specific or general numbers that are based on their industry.”

Report authors Mark Jaggers and David Gregory argue: “I&O (infrastructure and operations) leaders are often on a misguided mission to find mythical cost-of-downtime numbers, which leads to a lack of credibility and, ultimately, a denial of necessary funding. Focus instead on impacts that matter to the business.”

They caution against using generic, broad-brush ‘cost of downtime’ metrics, which are built on five myths: “They use a constant time frame, don’t consider seasonality, use measurements that are not impactful to business leaders, use a linear impact for any duration and use soft costs to enlarge impacts.”

The upshot is that “their lack of a solid foundation in meaningful metrics that the business values often results in an immediate dismissal of relevance by business leaders and a lack of funding for needed investments”.

While you figure out how this applies to your organisation, we can all agree that it is better to avoid downtime in the first place.

At US-based ‘big box’ retailer Costco, for example, website outages over Thanksgiving and Black Friday in 2019 may have led to revenue losses amounting to some $11 million, according to some estimates. In the Uptime Institute’s 2019 survey of data centre owners and operators, one in ten respondents report that their most recent significant outage cost them more than $1 million.

There are often less obvious costs to consider, too, says Polfliet. Constant firefighting is stressful for team members, potentially leading to staff attrition and reduced productivity among those who remain. “And when staff are pulled away from strategic, innovative work to investigate and resolve faults, there’s an opportunity cost for the business as well,” he says.

In other words, the business could find itself losing out to more fleet-footed competitors and missing out on new market opportunities.

Faster resolution limits that damage. Following the deployment of New Relic at US-based online learning platform Chegg, Steve Evans, the company’s VP of engineering services, saw MTTR (mean time to resolution) shrink in 2018 from 197 minutes to 33 minutes. In 2019, Evans’ team did even better, reducing MTTR to just 24 minutes.
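The MTTR figures quoted above are simple arithmetic: the average time from detecting an incident to resolving it, compared year on year. The following is a minimal sketch of that calculation. The function names and the two sample incidents are hypothetical and are not drawn from Chegg's data; only the 197, 33 and 24 minute figures come from the article.

```python
from datetime import datetime, timedelta


def mttr_minutes(incidents):
    """Mean time to resolution, in minutes, for (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total.total_seconds() / 60 / len(incidents)


def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


# Hypothetical incident log: one 30-minute and one 18-minute incident.
sample_incidents = [
    (datetime(2019, 1, 5, 22, 0), datetime(2019, 1, 5, 22, 30)),
    (datetime(2019, 3, 9, 21, 0), datetime(2019, 3, 9, 21, 18)),
]
sample_mttr = mttr_minutes(sample_incidents)   # 24.0 minutes

# Reductions implied by the figures quoted in the article.
drop_2018 = reduction_pct(197, 33)   # 83.2 per cent
drop_2019 = reduction_pct(197, 24)   # 87.8 per cent
```

On those numbers, the 2018 improvement is a reduction of roughly 83 per cent, and the 2019 figure takes it to nearly 88 per cent of the original 197 minutes.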

Similarly, it took just one incident of downtime for South Africa-based BetTech Gaming to reassess its approach to monitoring - particularly since it happened on a Saturday night, a peak time for the company. “Although we recovered quickly, that one time was enough for us,” says head of architecture, Ian Barnes.

The problem occurred, he says, because BetTech’s in-house monitoring system failed to help his team anticipate downtime. According to Barnes, “Our internal monitoring system was maintenance-heavy and it was difficult to configure the alerts we needed.”

Performance issues were mostly related to the sheer number of users at peak times, so Barnes and his team used New Relic to set careful thresholds for what constituted a slow transaction. The resulting alert notifications also helped with capacity planning, to make sure the infrastructure could cope with future peak events.
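To illustrate what a slow-transaction threshold looks like in practice, here is a minimal sketch in plain Python. It is not New Relic's API and not BetTech's configuration: the 500 ms threshold, the 10 per cent alert ratio, the transaction names and every identifier are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    name: str
    duration_ms: float


SLOW_THRESHOLD_MS = 500.0  # hypothetical: anything slower counts as "slow"
ALERT_RATIO = 0.10         # hypothetical: alert once 10% of transactions are slow


def slow_ratio(transactions, threshold_ms=SLOW_THRESHOLD_MS):
    """Fraction of transactions whose duration exceeds the threshold."""
    if not transactions:
        return 0.0
    slow = sum(1 for t in transactions if t.duration_ms > threshold_ms)
    return slow / len(transactions)


def should_alert(transactions, threshold_ms=SLOW_THRESHOLD_MS, alert_ratio=ALERT_RATIO):
    """True when the share of slow transactions reaches the alert ratio."""
    return slow_ratio(transactions, threshold_ms) >= alert_ratio


# Hypothetical sample from a peak period: one of four transactions is slow.
peak_sample = [
    Transaction("place_bet", 120.0),
    Transaction("place_bet", 900.0),
    Transaction("login", 80.0),
    Transaction("login", 95.0),
]
peak_ratio = slow_ratio(peak_sample)    # 0.25
peak_alert = should_alert(peak_sample)  # True
```

Tracking the slow ratio over successive peak periods is one way such alerts could feed capacity planning: a ratio that creeps upward as user numbers grow suggests the infrastructure is approaching its limit.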

Better observability

What’s often lacking in situations where IT teams struggle to get digital services back up and running, says Polfliet, is observability. In other words, they don’t have a single console they can turn to that brings together performance information from every layer of the stack and correlates it to see how a fault in one layer might affect another.
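As a rough illustration of what correlating events across layers means, the sketch below gathers events from several layers into a single timeline around an incident. It is a simplified stand-in, not how any particular observability product works: the event format, the five-minute window and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(events, incident_time, window=timedelta(minutes=5)):
    """Return events from any layer in the window before the incident, oldest first."""
    start = incident_time - window
    hits = [e for e in events if start <= e["time"] <= incident_time]
    return sorted(hits, key=lambda e: e["time"])


# Hypothetical events, as they might arrive from separate per-layer tools.
events = [
    {"layer": "database", "time": datetime(2020, 1, 4, 22, 0), "msg": "connections exhausted"},
    {"layer": "batch", "time": datetime(2020, 1, 4, 20, 0), "msg": "nightly job finished"},
    {"layer": "frontend", "time": datetime(2020, 1, 4, 21, 58), "msg": "request rate spike"},
    {"layer": "api", "time": datetime(2020, 1, 4, 21, 59), "msg": "latency rising"},
]

timeline = correlate(events, incident_time=datetime(2020, 1, 4, 22, 0))
layers_in_order = [e["layer"] for e in timeline]  # ["frontend", "api", "database"]
```

Seen in one ordered timeline, the sample suggests the fault started in the frontend and worked its way down to the database, which is the kind of cross-layer view a single console is meant to provide.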

“Instead, they’re just wasting time, because they’re having to switch between a bunch of different tools, each focusing on a different layer of infrastructure or software, and from there, they have to manually try to correlate and analyse different events to understand what’s happening. It’s a huge pitfall,” he says. It also has a disastrous effect on a team’s ‘Mean Time to Resolution’ (MTTR) performance.

“When something goes wrong, and you're experiencing downtime, you have to be able to troubleshoot much, much faster than that. And observability will help you do that,” he says.

That’s been the experience at Chegg, a US-based online learning platform for college students, which provides them with on-demand access to digital study aids and subject-specific tutoring, as well as low-cost textbook rentals and information on internship opportunities.

In January 2018, an outage occurred involving a frontend page that was issuing too many API calls to a back-end system, which in turn brought down a database. Given the complex IT environment that Chegg runs, such problems typically called for a great deal of collaboration among IT operations staff. The company has more than 500 services in production, running on hundreds of hosts in AWS, with about 80 per cent of the compute workload containerised with Docker, via Amazon Elastic Container Service (ECS). “It’s very rare that we have an incident contained within a single team at Chegg,” says Steve Evans, the company’s VP of engineering services.

But with those 500 services instrumented using New Relic, his team could get to work quickly on fixing problems. “Having New Relic means a frontend engineer can start troubleshooting an incident and slide all the way through to the data layer. It’s that whole end-to-end visibility that is key to reducing the time it takes to detect and resolve incidents,” he says.

Planning ahead

As BetTech’s example demonstrates, planning ahead is the best form of defence against downtime. Artificial intelligence can also be a big help here, says Polfliet. Monitoring tools increasingly rely on machine learning to analyse and correlate performance information from throughout the stack, giving IT operations staff the information they need to get to work on fixes faster. This AI-enabled approach is often referred to as AIOps.

“So, what’s happening here, with these tools, is that they’ll tell you, ‘This is what we see happening right now and it could lead to downtime. All of these things are correlated, and this is what we believe to be the root cause.’ From there, the team is better placed to take the best decisions, before the downtime even occurs,” he says.

Another approach he sees “more and more” customers adopting is chaos engineering. This is the practice of carrying out controlled experiments, with the goal of uncovering weaknesses in the system that could lead to downtime in future.

“It’s about simulating a day where problems happen and seeing how your systems respond to various stresses,” he says. A useful way to think of this, he adds, is “breaking things on purpose, but in a safe, controlled environment”.
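A controlled experiment of this kind can be sketched in a few lines. The example below is a toy, not a production chaos tool: it deliberately makes a fraction of calls to a hypothetical pricing backend fail and checks that the caller degrades to a cached value instead of crashing. All names, prices and failure rates are assumptions for illustration.

```python
import random


class FlakyDependency:
    """Wraps a callable and makes a fraction of calls fail on purpose."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_price(item):
    """Hypothetical backend call."""
    return {"book": 20.0}[item]


def price_with_fallback(dependency, item, cached):
    """Serve the live price, falling back to a cached one if the backend fails."""
    try:
        return dependency(item)
    except ConnectionError:
        return cached[item]


def run_experiment(failure_rate, calls=100):
    """Return (requests served, requests served from cache) under injected faults."""
    dependency = FlakyDependency(fetch_price, failure_rate)
    cached = {"book": 19.5}
    results = [price_with_fallback(dependency, "book", cached) for _ in range(calls)]
    degraded = sum(1 for price in results if price == cached["book"])
    return len(results), degraded


total_outage = run_experiment(failure_rate=1.0)  # (100, 100): every call falls back
healthy = run_experiment(failure_rate=0.0)       # (100, 0): no faults injected
```

The point of the exercise is the first number: even with the backend failing on every call, all 100 requests are still served. If the fallback were missing, the experiment would surface that weakness in a test run instead of during a real outage.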

In a world where digital services are often the key interaction point between a business and its customers, and customer expectations of seamless experiences have never been higher, the costs of downtime continue to grow. It’s enough to give any IT team sleepless nights - and, as Polfliet points out, can be a massive distraction from more important work.

“By having observability, by having multiple data sources in one platform, you’re improving overall transparency and accountability. And it’s that transparency and accountability that not only gets stuff fixed quicker, but also inspires confidence in IT teams to push ahead on innovation, and ultimately, create even better experiences for customers.”

Sponsored by New Relic

Biting the hand that feeds IT © 1998–2022