How New Relic takes the guesswork out of IT fixes

Observability is everything

Sponsored: When you know that a glitch in your company’s systems could ruin a child’s birthday, that’s pressure. For the IT team at Metapack, it’s a day-to-day reality, making fast and effective troubleshooting a top priority.

The UK-based company works directly with leading retailers such as John Lewis, Tesco and Marks & Spencer, providing them with a platform to connect with delivery providers to ensure that their customers’ e-commerce orders arrive on time. Today, Metapack oversees the delivery of some 1.2 million parcels daily - but during peak periods, such as the run-up to Christmas, this can grow to between three million and five million parcels per day.

In a busy environment like that, identifying faults and fixing them needs to happen at breakneck speed. Taking too long could scupper delivery schedules and seriously undermine Metapack’s relationships with retailers. After all, it’s retailers who’ll be in the firing line from customers if a birthday present fails to appear on time. In turn, the retailers will likely take Metapack to task. In short, reducing ‘mean time to resolution’ (MTTR) is vital for the firm.

Metapack is succeeding here through ‘observability’, an approach to real-time performance monitoring that focuses on delivering faster detection, diagnosis and resolution of issues. In Metapack’s case, this is based on tools from New Relic.

“I liken it to moving from driving in the fog, where you have to go slowly because you can only see to the end of your car, to driving in perfect conditions, where you can speed up and still see every small bump in the road,” says Lukasz Ciechanowicz, head of technical operations at Metapack.

Observability defined

Metapack is far from alone in its desire to navigate a smoother path to faster system fixes. ‘Observability’ is a term quickly gaining traction in a world where delivering compelling digital experiences for customers means more operational complexity for IT teams, says Stijn Polfliet, principal technical evangelist at New Relic.

“These teams are increasingly dealing with complex distributed environments, often running across multiple clouds and a large number of containers and microservices. As the complexity grows, many IT teams start to struggle with troubleshooting,” he says. Simply put, there’s just a lot more that can go wrong - and hence a lot more technology that requires careful oversight.

At the same time, he adds, IT teams are adopting DevOps practices, with a view to delivering a more regular stream of small, incremental updates to software. While these updates are frequently designed to improve the customer experience of a digital service, constant changes to an environment can open up new risks by introducing new bugs and glitches.

Against this backdrop of complexity, many IT teams find that their legacy monitoring tools, which tend to focus on specific silos within the application or infrastructure environment, simply don’t provide the ‘big picture’ view that they need, according to Polfliet. What they’re looking for is an approach that brings together performance information from across the entire IT stack and presents it in a consolidated view.

And it’s not just a case of identifying that something’s gone wrong, he says. “Observability takes things a step further than simple monitoring, because today’s IT teams want more than an alert. They want to understand why something’s gone wrong and they want to be able to act on it.”

Customer expectations

At the heart of observability is a relatively straightforward goal: a seamless digital experience for customers. Implemented well, observability can enable IT teams to detect and fix any problems before customers are even aware of them. As Werner Vogels, CTO of Amazon Web Services, put it in his 2019 keynote presentation at the company’s AWS Summit in New York: “Observability is everything.”

After all, modern customers have pretty high expectations of the digital services they use. Internet giants such as Amazon, Uber and Netflix have transformed notions of what speed and convenience should look like and set new standards for slick experiences. Whether customers are shopping or gaming online, booking a ride, or bingeing on TV shows and movies, slow response times and failed transactions are a major turn-off.

The stakes couldn’t be higher. In a recent survey conducted by analyst firm The 451 Group, consumers were asked how likely they’d be to switch brands or providers due to poor application or service performance. Almost four out of five (79 per cent) said they would be somewhat or very likely to switch. “For ecommerce businesses or others that rely on an app or service for sales, each customer that switches due to slow or buggy performance represents lost revenue potentially long into the future,” say the report’s authors.

Downtime just makes a bad situation worse, says Polfliet, pointing to a 2014 Gartner study that estimates the average cost of downtime at $5,600 per minute. A more recent report, from the Ponemon Institute in 2016, raises that average to nearly $9,000 per minute. These averages, naturally, vary wildly according to company size, reach and volume of online business. At US-based ‘big box’ retailer Costco, for example, website outages over Thanksgiving and Black Friday in 2019 may have led to $11 million in revenue losses, according to some estimates.

“And it’s not just the impact of lost revenues that companies need to consider,” he adds. “When there’s downtime, or you have issues that are slowing your systems down, your engineers are spending a lot of their time getting things fixed. They’re firefighting. That has a direct effect on productivity and the ability of a company to innovate.”

Journey to observability

So how can companies move beyond simple monitoring to observability? It’s a journey that Cellulant, the Africa-based operator of digital payments platform Tingg, has already made. Today, Tingg serves one in 10 Africans - but keeping up with growing demand hasn’t always been a smooth ride.

During an outage in April 2018, “we couldn’t tell whether it was an application or a database issue, so we had multiple teams running around in circles,” says the company’s group head of technology operations George Murage. That outage prompted a move of its core systems to AWS and the implementation of New Relic.

Says Murage: “One of the first things we noticed about New Relic was the consolidated view it provided. For the first time ever, we could see what was happening across our environment, and identify the code that was generating so many errors. This was unprecedented.” Increased observability has enabled Cellulant to reduce its MTTR by as much as 50 per cent, he reckons.

Similarly, at Metapack, Ciechanowicz says the ability to drill down into alerts to make accurate diagnoses has “drastically reduced” his team’s time to detect and fix issues. He estimates an 80 per cent time saving from moving away from point solutions to the consolidated view provided by the New Relic platform. “Alerts are now so detailed that we can direct them straight to the team best placed to fix them the fastest,” he says.

Basically, New Relic is taking a lot of the guesswork out of fixes at these companies, says Polfliet: “When real-time information flows from the application, server and related infrastructure, all into one place, an IT team gets the hard facts it needs to make an accurate assessment of the situation.” That same information is just as valuable in quickly verifying that fixes have the desired results, further reducing MTTR.

An AI boost

Today, observability is getting an extra boost from the inclusion of artificial intelligence (AI) and machine learning capabilities embedded in these toolsets. According to Polfliet, these automate the process of analysing data generated by an IT stack and, in turn, enable tools to predict issues before they occur, determine their root causes and drive automation to fix them. The term coined by Gartner for these AI-driven capabilities is ‘AIOps’ - or Artificial Intelligence for IT Operations.

“AIOps is a really interesting technology in this context of reducing time to fix issues,” says Polfliet, explaining that it’s a way for machines to take the strain of analysing masses of data captured from a modern IT environment and make important connections, in order to guide a speedy, accurate response from humans.

“This is an important step forward in IT teams being able to work in a proactive way, rather than a reactive way. Instead of getting a customer telling you they’re having problems adding an item to their basket or checking out, you get a big red flag from AIOps, telling you that these might become problems soon if you don’t attend to an issue quickly.”

On top of this, he continues, an AIOps-enabled suite of tools can help teams prioritise the issues that matter most, by correlating related incidents and providing background information and context on them. Alerts can be automatically routed to the individuals or teams best equipped to respond, and in some cases, issues can be automatically remediated by the toolset itself. All these can have a dramatic impact when it comes to shrinking MTTR.
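To make the correlate-and-route idea concrete, here is a minimal sketch in Python. It is a plain rule-based stand-in, not machine learning and not New Relic's API: the `Alert` type, the `OWNERS` table, the team names and the five-minute window are all hypothetical, chosen only for illustration. It groups alerts from the same service that arrive close together into one incident, then hands each incident to whichever team is assumed to own that service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # name of the service that raised the alert
    message: str
    timestamp: float  # seconds since some reference point


# Hypothetical ownership table: which team looks after which service.
OWNERS = {"checkout": "payments-team", "basket": "storefront-team"}
DEFAULT_OWNER = "ops-on-call"  # assumed fallback when no owner is listed


def correlate(alerts, window_seconds=300.0):
    """Group alerts for the same service into incidents.

    An alert joins the service's most recent incident if it arrives
    within window_seconds of that incident's last alert; otherwise it
    starts a new incident.
    """
    incidents = []
    latest = {}  # service -> index of its most recent incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = latest.get(alert.service)
        if idx is not None and alert.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            latest[alert.service] = len(incidents) - 1
    return incidents


def route(incidents):
    """Assign each incident to the team that owns its service."""
    queues = defaultdict(list)
    for incident in incidents:
        owner = OWNERS.get(incident[0].service, DEFAULT_OWNER)
        queues[owner].append(incident)
    return dict(queues)


# Usage example with made-up alerts.
alerts = [
    Alert("checkout", "HTTP 5xx rate rising", 0),
    Alert("checkout", "latency above threshold", 120),
    Alert("basket", "add-to-basket failures", 60),
    Alert("checkout", "HTTP 5xx rate rising", 1000),
    Alert("search", "query timeouts", 10),
]
queues = route(correlate(alerts))
for team, team_incidents in queues.items():
    print(team, [len(incident) for incident in team_incidents])
```

In this toy run, the first two checkout alerts collapse into a single incident, so the team sees one item with context rather than two separate pages. A real AIOps tool would be expected to learn such relationships from data rather than rely on a fixed window.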

For many companies, this vision of AIOps may seem like a somewhat distant destination, but greater observability shouldn’t. Companies that move ahead on this journey, in fact, report rapid benefits. At Metapack, where systems availability has risen from 99.4 per cent to 99.96 per cent, Ciechanowicz likens observability to a sophisticated sat-nav system, “which tells us all about upcoming traffic jams, roadwork and speed traps, and helps us navigate the best path around them.”
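The jump from 99.4 to 99.96 per cent is easier to appreciate as hours of downtime. The short Python sketch below does the arithmetic, assuming the availability figures apply to a system expected to run around the clock for a 365-day year. The article does not state the measurement period, so that assumption, along with the function name, is ours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes 24x7 operation, ignores leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


before = downtime_hours_per_year(99.4)
after = downtime_hours_per_year(99.96)

print(f"99.4%  availability: {before:.1f} hours of downtime per year")
print(f"99.96% availability: {after:.1f} hours of downtime per year")
```

Under these assumptions the figures work out to roughly 52.6 hours a year at 99.4 per cent and about 3.5 hours a year at 99.96 per cent.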

Sponsored by New Relic


Biting the hand that feeds IT © 1998–2020