Bridging the observability gap

Trace the journey through all those microservices in the background

Sponsored In modern IT, visibility is everything. IT admins and Site Reliability Engineers (SREs) survive on their ability to see what's happening in their systems. Unfortunately, as systems have become more sophisticated, it has become harder to see what they're doing. That's why the industry is promoting observability as the evolution of existing concepts like monitoring and metrics. Vendors are stepping up with tools to address a growing visibility gap.

What is observability?

IT departments have been monitoring infrastructure and applications for decades, so isn't 'Observability' just a marketing term for what's already best practice? Not according to Dhiraj Goklani, area vice president of IT & DevOps, APAC at Splunk, which recently launched its Splunk Observability Cloud.

"We had a massive shift to cloud infrastructure and containerized applications, because organizations are significantly accelerating their digital and direct-to-consumer initiatives," he explains.

Once upon a time, observers could peer into IT operations easily enough by running programs that directly monitored their servers' operations. Applications were monolithic, so you would point monitoring software at them and log the results. Things got a lot more difficult when companies began abstracting everything and making it more distributed. When they started running everything in the cloud, they relied on cloud service providers' monitoring services. When they spread operations between multiple cloud companies, along with their own on-premises solutions, things became more disjointed.

Composable applications made things more difficult still. For years, companies had struggled to atomize their applications, constructing them from smaller, more manageable pieces. Microservices and the containers that run them finally brought that practice to the mainstream, giving development teams modular applications with pieces that they could update individually. The downside of this approach was that those smaller pieces, running on more abstracted cloud infrastructure, became more difficult to monitor.

Developers now accessed those services via APIs when bolting them together to create new applications. Those interfaces also needed monitoring as part of the overall end-to-end journey. Legacy monitoring tools are often siloed, designed to examine certain parts of the stack or infrastructure domains. They cannot cope with complex, disjointed workflows. That's the gap that observability tools fill by marrying applications and infrastructure into a single end-to-end view at all levels of the stack.

Taking note of traces

That joined-up view demands better telemetry, Goklani explains. Splunk cut its teeth developing tools that took the analysis of machine-generated logs to the next level, enabling IT admins & SREs to derive sense from reams of operational data. Observability combines those logs with the metrics that summarize performance and availability. These forms of operational data are well understood, but he says there's a third type that's critical in modern composable cloud-based infrastructures: traces.

Traces document the interactions between the thousands - possibly millions - of microservices that work together to fulfil an application request. These small pieces of code are usually duplicated for a mixture of scale-out functionality and resilience.

When a user of an application built on microservices logs into a web app, changes their account configuration, talks to a support technician in a chat window, searches for products and makes a feature comparison, they touch thousands of individual pieces of software. Checking their cart and then checking out with a payment touches more. That journey is difficult to track in an atomized microservices infrastructure where lots of individual pieces of code work together, often across infrastructure owned by different entities.

"I call traces footprints in the sands of time," Goklani quips. "They trace my journey through all those microservices in the background. If my transaction is failing, anyone in the global services team can figure out why that happened, for me specifically. I might have a slow connection here in my network, or it could be something at their back end that caused it."
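To make the idea concrete, here is a minimal sketch in Python of how a trace can be pictured: a set of timed spans, one per service call, each pointing at the call that triggered it. The Span class, the service names and the timings are hypothetical illustrations, not the data model used by Splunk or OpenTelemetry. Following the failing spans down from the root shows where a transaction actually broke.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed unit of work in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: float
    error: bool = False


def path_to_root(spans, span_id):
    """Return the service names from the root span down to the given span."""
    by_id = {s.span_id: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.service)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


def first_failure(spans):
    """Return the deepest failing span: an errored span with no errored child."""
    errored_parents = {s.parent_id for s in spans if s.error}
    for s in spans:
        if s.error and s.span_id not in errored_parents:
            return s
    return None


# Illustrative checkout request; services and timings are made up.
trace = [
    Span("a", None, "web-frontend", 480.0, error=True),
    Span("b", "a", "cart", 35.0),
    Span("c", "a", "checkout", 420.0, error=True),
    Span("d", "c", "payment-gateway", 400.0, error=True),
    Span("e", "c", "inventory", 12.0),
]

culprit = first_failure(trace)
print(culprit.service, path_to_root(trace, culprit.span_id))
```

In this made-up example the error surfaces at the web front end, but the span data shows that it originated in the payment service, which is the kind of question Goklani says traces are meant to answer.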


GigaOm's Cloud Observability Radar, which analyses 14 competitors in the observability space, highlights the OpenTelemetry project as a key initiative in this space. This project, to which Splunk is a primary contributor, is an open source observability framework managed by the Cloud Native Computing Foundation. It merged two existing projects, OpenTracing and OpenCensus, which offered standard APIs for gathering traces and for application behaviour metrics, respectively.

The GigaOm report also calls out Splunk as the only outperformer in the observability arena. "Splunk has emerged as one of the leaders in the observability space with strategic acquisitions and targeted organic solution development."

Splunk started out in 2003 with an offering to distil vast amounts of machine-generated data from technology infrastructures. Since then, it has grown its product portfolio to cover aspects of IT ranging from security through to IoT. These days it monitors a wide range of infrastructure and application performance data while also offering IT admins & SREs a deep dive into system logs for forensic investigations.

GigaOm ranked Splunk highest among the vendors it assessed, citing a wide set of integrations and a rich set of back-end functions. "Splunk ingests full-fidelity data from all sources (logs, metrics, and traces) across the full stack," the industry deep-dive said. "It also provides massive scalability, sophisticated in-stream analytics, and native OpenTelemetry support."

Observability cloud

Splunk Observability Cloud brings together several components from the company's existing portfolio under a single interface that it says makes it easier for IT admins and SREs to build those end-to-end views. These cover infrastructure and application performance monitoring, along with log observations. It also includes the Splunk Real User Monitoring product, which captures and measures user activity using browser interactions with back-end resources to get an accurate picture of real-world user experiences. The full product also includes Splunk On-Call, which escalates any emergent problems to the right members of the incident response team.

Splunk Observability Cloud also features a new product, Splunk Synthetic Monitoring, which enables IT admins and SREs to script interactions to test the performance of different interaction types. This complements its other tools with the ability to monitor common interactions with critical applications at set intervals, surfacing any problems quickly. Goklani also highlights its ability to test API interactions.
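The general technique of scripting an interaction and replaying it at set intervals can be sketched in a few lines of Python. This is an illustration of the concept only, not Splunk Synthetic Monitoring's API: the run_synthetic_check function, its parameters and the fake_login_and_cart probe are all hypothetical, and a real check would drive a browser or call an API where the stand-in probe sits.

```python
import time
from typing import Callable, List, Tuple


def run_synthetic_check(
    probe: Callable[[], None],
    runs: int,
    interval_s: float,
    slow_threshold_s: float,
) -> List[Tuple[str, float]]:
    """Call probe() `runs` times, `interval_s` apart, and classify each run."""
    results = []
    for i in range(runs):
        start = time.perf_counter()
        try:
            probe()
            elapsed = time.perf_counter() - start
            status = "slow" if elapsed > slow_threshold_s else "ok"
        except Exception:
            # Any exception from the scripted interaction counts as a failure.
            elapsed = time.perf_counter() - start
            status = "failed"
        results.append((status, elapsed))
        if i < runs - 1:
            time.sleep(interval_s)
    return results


# Hypothetical scripted interaction standing in for "log in, then load the cart".
# It fails on its second call to simulate an intermittent back-end problem.
calls = {"n": 0}


def fake_login_and_cart():
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated back-end failure")


results = run_synthetic_check(
    fake_login_and_cart, runs=3, interval_s=0.01, slow_threshold_s=1.0
)
print([status for status, _ in results])
```

Because the check runs on a schedule rather than waiting for a real user to hit the problem, the simulated failure shows up in the results as soon as the affected run completes.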

Splunk Observability Cloud is one of three such suites from Splunk. Its siblings are the IT and Security Cloud products. They may target different use cases, but they have one thing in common: a new pricing model. Splunk traditionally priced its services based on the amount of data that customers ingested. The suite-based approach switches this out for what it calls entity-based pricing. This charges for the products based on different kinds of infrastructural unit. This could be an IP address, or an individual user. The company defines those units based on which of its suites the customer is using.

It's possible to buy all of the products separately and enjoy automatic integrations as they discover each other, says Goklani, but the advantage of unifying them all under a single interface is that it becomes easier to track and exchange workflows.

"In the past, IT and DevOps teams took a swivel chair approach, going between tools, correlating their information, and writing custom solutions," he explains, adding that this was the only way to get any kind of unified view. "Now we have provided a unified interface to consume all that data."

A single interface helps DevOps and IT teams to maintain service availability, he says, protecting those processes that businesses simply can't allow to go down. It also helps IT admins and SREs maintain the performance of services strung across different applications and infrastructure by surfacing problems quickly and allowing different team members to analyze and resolve them using the same tool.

How observability supports DevOps

These days, the IT admins & SREs who spot and manage those problems are just as likely to be the people who developed the code. Goklani explains that collecting tools into an observability suite supports DevOps disciplines that put the software engineers at the centre of the software service lifecycle. Now that developers own the underlying cloud-based infrastructure along with the code itself, they need a way to get full visibility into its operation and make it reliable, performant and secure. "Observability helps to monitor the DevOps lifecycle," Goklani says.

Better observability translates into real-world results. Lenovo began using Splunk Observability Cloud to manage its e-commerce business. It implemented the tool to gather data from its e-commerce systems and identify emerging problems. It cut the time that it takes to recover from a system failure from half an hour to five minutes.

The suite also came in handy when coping with an unexpected mid-pandemic surge in demand. Lenovo expected its traffic to surge on Black Friday 2020 after it introduced pricing incentives and gaming product give-aways, but it didn't expect the traffic to balloon by 300% over the prior year. Splunk Observability Cloud helped the company to maintain 100% uptime, executives said.

The change in Splunk's pricing and the collation of the products into a single Splunk Observability Cloud go hand in hand, and represent a sea-change for a company that built its revenues pricing by the byte. It is Splunk's bid to have customers do more with its products, pushing it further into their environments with a portfolio that it wants to be ubiquitous.

Sponsored by Splunk
