Looking for agility AND stability? You're going to need observability too

How to establish the true state of your systems in a distributed world

Sponsored Feature Computing is supposed to be binary. One or zero. On or off. But the reality, we know, is much more nuanced.

Establishing the real "state" of your system has always been a challenge. Operators and administrators traditionally relied on monitoring or logging to establish how their systems were performing. Or perhaps more accurately, had been performing up until the point they went down.

At least when applications were monolithic, and infrastructure largely on-prem, problems were relatively contained, even if finding and resolving them might mean examining each line of code or hardware component.

Things today are vastly more complicated. Distributed applications and infrastructure comprise hundreds, even thousands, of components and microservices. The underlying infrastructure can span on-prem and cloud environments. And all of this is operating in real time – because that is what customers and internal users expect, after all.

So, traditional logging and monitoring tools that alert you when a problem has become critical are no longer good enough. What we need is to ensure problems are spotted, and ideally resolved, before they impact users. What we need is what many in the IT industry now refer to as observability.

As AWS senior technical product manager Nitin Chandra explains, there are varying definitions of observability, but "the one we emphasize most is the ability to be able to know the state of your system in depth at any given point of time." The aim is to be able to query the system, know what components are in play and their current state, and whether they are performing as expected.

As Chandra explains, this spans three key aspects. The first is operational data from an infrastructure perspective. This means having visibility into whatever you're using, whether infrastructure, containers, Kubernetes or Docker, for example, and wherever it happens to be hosted, whether on or off premises, in public, private or hybrid cloud environments.

The second is the applications and services that run on that infrastructure. "Are they performing as expected? Because there's often a correlation and interdependence between infrastructure and services."

But the ultimate measure, Chandra points out, is the end user. "How good is the customer experience for people who are actually using your software system?"

Customer experience might seem a rather fuzzy concept. But consider the impact of degraded customer experience. A study by Oxford Economics puts the average cost of an hour of downtime to US organizations at $136,000. So, waiting until a problem manifests itself is an expensive approach. The same research shows that just 14 per cent of respondents achieve four nines (99.99 per cent) availability or better, which equates to roughly 53 minutes or less of downtime per year.
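The relationship between an availability target and a downtime budget is simple arithmetic, and a short sketch makes it concrete. The function names below are illustrative only, and applying the study's $136,000 average hourly cost to any single organization is an assumption made purely for the sake of the example.

```python
# Convert an availability percentage into a yearly downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum downtime, in minutes per year, allowed by an availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


def downtime_cost_per_year(availability_pct: float,
                           cost_per_hour: float = 136_000) -> float:
    """Yearly cost of using the whole downtime budget.

    The default hourly cost is the average quoted in the article; treating it
    as representative of one organization is an assumption.
    """
    return downtime_minutes_per_year(availability_pct) / 60 * cost_per_hour


for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("four nines", 99.99)]:
    minutes = downtime_minutes_per_year(pct)
    cost = downtime_cost_per_year(pct)
    print(f"{label} ({pct}%): {minutes:,.1f} min/year, about ${cost:,.0f}/year")
```

Each additional nine cuts the downtime budget, and the cost attached to it, by a factor of ten.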

That's before considering the ongoing draining effect sub-optimal systems have on day-to-day efficiency within an organization. As companies invest in digital transformation, DevOps and cyber security, their systems become more complex and tracking errors within them becomes harder. Oxford Economics states, "True observability provides the detailed data that turns small improvements into big gains."

What are the unknown unknowns?

As Chandra explains, monitoring helps when you know what problems to expect. "But with all the complexity that has come in with third party dependencies and multiple dependencies in your own components, it's important to also be able to investigate unknown unknowns."

That is why distributed tracing, which allows the tracking of a request's path through the entire system, collecting and reporting back data as it goes, has become so essential to achieving observability in today's distributed architectures.
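To illustrate the idea, here is a minimal sketch of trace propagation using only the Python standard library. The `Tracer` class and the three "services" are hypothetical stand-ins, not part of any AWS or OpenTelemetry API; a real deployment would use an instrumentation library and export spans to a backend rather than keep them in memory.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique to this unit of work
    parent_id: Optional[str]   # links a span to its caller
    name: str
    start: float
    end: float = 0.0


@dataclass
class Tracer:
    """Hypothetical in-memory collector used only for this sketch."""
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, trace_id: Optional[str] = None,
             parent_id: Optional[str] = None) -> Iterator[Span]:
        current = Span(
            trace_id=trace_id or uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent_id,
            name=name,
            start=time.perf_counter(),
        )
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self.spans.append(current)


tracer = Tracer()


def inventory_service(trace_id: str, parent_id: str) -> None:
    with tracer.span("inventory.reserve", trace_id, parent_id):
        time.sleep(0.001)  # stand-in for real work


def payment_service(trace_id: str, parent_id: str) -> None:
    with tracer.span("payment.charge", trace_id, parent_id):
        time.sleep(0.001)  # stand-in for real work


def checkout_frontend() -> Span:
    # The root span creates the trace ID; downstream calls carry it along.
    with tracer.span("checkout.request") as root:
        inventory_service(root.trace_id, root.span_id)
        payment_service(root.trace_id, root.span_id)
    return root


root_span = checkout_frontend()
for s in sorted(tracer.spans, key=lambda s: s.start):
    duration_ms = (s.end - s.start) * 1000
    print(f"{s.name:<20} trace={s.trace_id[:8]} parent={s.parent_id} {duration_ms:.2f} ms")
```

Because every span carries the same trace ID and a pointer to its parent, the collected records can be reassembled into the full path a request took, however many services it touched.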

This has coincided with the evolution of agent frameworks and open standards that allow the collection and aggregation of data to produce a consolidated picture of what is really going on.

Chandra also points to the evolution of the OpenTelemetry standard. "That helped a lot in tying all of this together in well defined schemas that promoted collaboration and interoperability."

When it comes to the data itself, high cardinality is often described as an essential premise for observability. This simply means a particular data point can have lots of values. Such data can then be combined with other high cardinality data points, including data from different systems, to reveal trends and patterns, says Chandra.
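A small sketch shows what cardinality means in practice. The event records and field names below are made up for illustration; the point is only that a field such as a request ID takes far more distinct values than a field such as a status code.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Made-up request events; every field name and value here is illustrative.
events: List[Dict[str, Any]] = [
    {"request_id": "r1", "user_id": "u1", "region": "eu-west-1", "status": 200},
    {"request_id": "r2", "user_id": "u2", "region": "eu-west-1", "status": 200},
    {"request_id": "r3", "user_id": "u1", "region": "us-east-1", "status": 500},
    {"request_id": "r4", "user_id": "u3", "region": "us-east-1", "status": 200},
    {"request_id": "r5", "user_id": "u4", "region": "us-east-1", "status": 500},
]


def cardinality(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count the distinct values seen for each field."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def errors_by(records: List[Dict[str, Any]], *fields: str) -> Dict[tuple, int]:
    """Count failed requests (status >= 500) grouped by a combination of fields."""
    counts: Dict[tuple, int] = defaultdict(int)
    for record in records:
        if record["status"] >= 500:
            counts[tuple(record[f] for f in fields)] += 1
    return dict(counts)


print(cardinality(events))
print(errors_by(events, "region", "user_id"))
```

Grouping by a combination of fields is where high-cardinality data earns its keep: it narrows a vague "errors are up" down to a specific region, user or request.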

Together with machine learning, this has set the scene for a shift in focus away from alerts, with their attendant danger of false positives waking up ops people in the middle of the night and causing alert fatigue, and towards anomaly detection.

"Another way AI is able to contribute is to try and learn from the system behavior and establish what the baselines look like," adds Chandra. For example, he continues, once the ideal behavior has been established, "even if there is not a big outage, if you are, over time, gradually getting towards a non-ideal behavior or the trends are different, then you can be advised on that."

And if a component is not working, it becomes easier to get to the root cause. "Because you can automatically impute from the anomaly chain what could be the source of where the anomalies happened."
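The baseline idea Chandra describes can be sketched with a simple statistical stand-in. The example below uses a trailing-window z-score, which is an assumption chosen for brevity and not the method any AWS service is stated to use; the latency figures are invented.

```python
from statistics import mean, stdev
from typing import List, Tuple


def find_anomalies(values: List[float], window: int = 5,
                   threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Flag points that deviate sharply from a trailing baseline.

    The baseline for each point is the mean and standard deviation of the
    previous `window` points; a point is flagged when it lies more than
    `threshold` standard deviations from that mean.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # a perfectly flat baseline gives no scale to compare against
        z_score = (values[i] - mu) / sigma
        if abs(z_score) > threshold:
            anomalies.append((i, values[i]))
    return anomalies


# Invented response times in milliseconds, with one obvious spike.
latencies_ms = [120, 118, 123, 121, 119, 122, 120, 310, 121, 119]
print(find_anomalies(latencies_ms))
```

Unlike a fixed alert threshold, the baseline here moves with the data, so what counts as unusual is learned from recent behavior. A production system would also need to handle seasonality and keep flagged points from distorting later baselines.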

Having this level of observability clearly makes it easier for operators not just to find the root cause of problems, but to intercept and remedy them as they emerge and avoid any noticeable effect on customer experience. For modern organizations, customer experience underpins success in general. It's unsurprising then that there is what Chandra describes as "the emerging discipline of applied observability, where you're able to relate it to the business outcomes."

The premise is straightforward. "You may have some business objectives in mind that are dependent on your operational excellence," says Chandra. For example, the organization might have an objective around sales, or the number of checkouts, and "tie those KPIs and objectives back to your operational state and see the dependence between them."

The organization can then infer whether a slow checkout page has led to a decrease in the number of checkouts that people were able to complete on a mobile app or the website.
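One simple way to examine that kind of dependence is to correlate an operational metric with a business KPI over the same time periods. The sketch below computes a Pearson correlation by hand; the hourly figures are invented, and the choice of Pearson correlation is an assumption for illustration and not a method attributed to Chandra or AWS.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Invented hourly samples: checkout page load time and completed checkouts.
page_load_seconds = [1.1, 1.3, 1.2, 2.8, 3.5, 1.4]
checkouts_per_hour = [540, 525, 530, 390, 310, 515]

r = pearson(page_load_seconds, checkouts_per_hour)
print(f"correlation between page load time and checkouts: {r:.2f}")
```

A strongly negative value suggests checkouts fall as the page slows down. Correlation alone does not prove the slow page is the cause, but it tells an analyst where to look first.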

With observability and automation, operators or analysts can take a step back and look at how service level objectives tie in with business level objectives, rather than simply spending time checking whether systems are working or not.

Tying it all together

Putting all this in place is, of course, hardly trivial. As Chandra says, "It's perhaps easiest and most cost effective to build it with cloud services like AWS, which allow people to use a lot of out of the box components. The ease of managing it [means] that they can now focus on higher level concepts."

In Amazon's case, its native CloudWatch technology collects information, including metrics and logs, across AWS services. AWS also offers open-source platforms as managed services, such as Amazon Managed Service for Prometheus for managing metric data at scale.

Amazon OpenSearch Service expands the integration and potential for analysis, with tools such as OpenSearch Dashboards, as well as a data ingestion component, Data Prepper, and an observability plugin. OpenSearch is available as an open-source project in its own right, as well as an AWS managed service, and a serverless version has also recently been launched.

Amazon recently added OpenSearch Ingestion, a managed service based on Data Prepper, which serves as an observability pipeline by taking over intermediate processing and enrichment of data to make analysis simpler and deliver better performance.

Chandra notes that managing observability involves a trade-off between scaling up the amount of data processed, to produce ever finer-grained insights, and the costs and overhead of managing and storing that data. Striking that balance matters because, at a typical large customer, a front-end application could be using over a thousand microservices and serving tens of millions of customers per day.

So, combining such tools with low-cost storage such as Amazon S3 will clearly pay dividends. Says Chandra: "You can imagine the amount of requests that are generated, and each request will have probably, at least about 1,000 to 2,000 lines of logs for each transaction."
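A back-of-the-envelope calculation gives a feel for the scale. The sketch below takes the 1,000 to 2,000 log lines per transaction from Chandra's quote; the figures of ten million transactions a day and 200 bytes per log line are assumptions made here for illustration and do not come from AWS.

```python
def daily_log_volume_bytes(transactions_per_day: int,
                           lines_per_transaction: int,
                           bytes_per_line: int) -> int:
    """Raw, uncompressed log volume generated in one day."""
    return transactions_per_day * lines_per_transaction * bytes_per_line


TRANSACTIONS_PER_DAY = 10_000_000  # assumption: one transaction per customer per day
BYTES_PER_LINE = 200               # assumption: average size of a log line
TERABYTE = 10 ** 12

low = daily_log_volume_bytes(TRANSACTIONS_PER_DAY, 1_000, BYTES_PER_LINE)
high = daily_log_volume_bytes(TRANSACTIONS_PER_DAY, 2_000, BYTES_PER_LINE)

print(f"estimated raw log volume: {low / TERABYTE:.1f} to {high / TERABYTE:.1f} TB per day")
```

Even under these modest assumptions a single application produces terabytes of raw logs a day, which is why the cost of storing and processing the data features so heavily in the trade-off.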

Chandra estimates that CloudWatch as a whole manages in the region of six exabytes of data a month, and the Amazon OpenSearch Service regularly supports individual customers ingesting petabytes of data. This volume of data represents massive potential for generating rich insight, with the right tools and context. And it seems inevitable that AI will play an ever-bigger role going forward, one which extends way beyond straightforward analysis and anomaly detection.

"Today we rely on the domain knowledge of the SRE to know exactly where to look and what information to look for," says Chandra. "What generative AI would make easier is to have a conversational interface for someone to get information without necessarily knowing in-depth which screen or dashboard to look at."

After all, why shouldn't system operators and infrastructure engineers benefit from a better customer experience too?

Sponsored by AWS.
