Observability is a trait of software systems that provide deep visibility into their internal operations. Possessing good observability facilitates faster resolution of problems by helping operations teams identify the cause of issues.

The simplest definition of "observable" software is a system that lets you deduce its internal state by watching the outputs it produces. If your system can't provide these outputs, it won't be fully observable.

Consider a software platform that appears to be running more slowly than normal. At first glance, you've got insufficient information to work out what's causing the slowdown. But if the system emitted performance metrics for each stage of its execution, you could immediately pinpoint the component with a problem. The system's observability has now been enhanced.
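As a rough illustration of per-stage performance metrics, the sketch below times each stage of a request and reports the slowest one. The stage names, the `timed_stage` helper, and the in-memory `stage_durations` store are all hypothetical, and the `time.sleep` calls stand in for real work. A production system would normally send these measurements to a metrics backend rather than keep them in a dictionary.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store: stage name -> list of durations in seconds.
stage_durations = {}


@contextmanager
def timed_stage(name):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_durations.setdefault(name, []).append(elapsed)


def slowest_stage():
    """Return the name of the stage with the highest total recorded time."""
    return max(stage_durations, key=lambda name: sum(stage_durations[name]))


def handle_request():
    # The sleeps are placeholders for real parsing, querying and rendering.
    with timed_stage("parse"):
        time.sleep(0.001)
    with timed_stage("database"):
        time.sleep(0.05)
    with timed_stage("render"):
        time.sleep(0.001)


handle_request()
print(slowest_stage())
```

With these placeholder timings, the database stage dominates, which is the kind of answer a slowdown investigation needs first.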

Isn't Observability The Same As Monitoring?

Observability is not the same as monitoring, although the two concepts are close relations. Good monitoring practices contribute towards an observable system. They don't provide a guarantee of observability. Conversely, a system could be reasonably observable without a fully-fledged monitoring stack.

Monitoring in DevOps terms typically refers to the use of several predefined metrics to identify whether a system is performing within expectations. These metrics are usually tied to resource utilization (CPU usage, network throughput) but may also surface basic data about your system's operations (such as the number of requests causing a 500 error code).

Observability goes a little deeper and requires more nuanced instrumentation. Unlike monitoring, it's coupled to your system and its characteristics, rather than the surrounding environment.

A monitored system tells you that the 500 error count is elevated and users are having issues. An observed system reports that your authentication microservice is timing out, so user sessions aren't being restored and your gateway is issuing a 500 as a last resort.

How Do Systems Become Observable?

Let's break down the differences between the two examples shown above. In a traditional approach, you'd deploy to your server and set up monitoring, perhaps using your cloud provider's metrics alerts. If an issue was detected, you could go and inspect the server logs for issues.

This model is already observable to a degree. The modern usage of "observability" conveys a little more though. Server error logs typically provide the final outcome but not the states that caused it to occur. For a system to be truly observable, you should be able to determine the sequence of internal states that led to a particular output, without having to spend too much time manually collecting the information.

There are three primary "pillars" of observability, of which good monitoring is one. Paying attention to all three pillars should result in an observable system that is an effective aid in diagnosing problems.

Metrics and Monitoring

An observable system should provide continuous measurements for predefined metrics. The best metrics are ones that present actionable information relevant to your application and its performance, not necessarily generic CPU and memory charts.
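One way to picture an application-level metric is a per-service error rate, as opposed to a host-level CPU chart. The sketch below is only an assumption about how such a counter might look: `record_request`, `error_rate`, and the "auth" service name are invented for illustration, and the counts live in memory instead of a real metrics system.

```python
from collections import Counter

# Hypothetical counter keyed by (service, outcome).
requests_total = Counter()


def record_request(service, succeeded):
    """Count one request for a service as either 'ok' or 'error'."""
    outcome = "ok" if succeeded else "error"
    requests_total[(service, outcome)] += 1


def error_rate(service):
    """Fraction of recorded requests for a service that ended in an error."""
    ok = requests_total[(service, "ok")]
    errors = requests_total[(service, "error")]
    total = ok + errors
    return errors / total if total else 0.0


# Four sample requests to the hypothetical "auth" service, one of which fails.
for succeeded in (True, True, False, True):
    record_request("auth", succeeded)

print(error_rate("auth"))  # 0.25
```

A figure like this is actionable in a way raw resource usage often isn't: it names the service that is failing and how often.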

Logging

The second pillar of observability is logging. This means something more structured than writing a line to a file whenever an error occurs. Logging should be integrated throughout your system so that each event gets recorded to a centralized logging service. The logs themselves should follow a standardized structure, so log viewing tools can automatically index and format them.
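A minimal sketch of structured logging, using only Python's standard `logging` and `json` modules, might look like the following. The `JsonFormatter` class, the `service` and `request_id` field names, and the sample message are assumptions made for illustration. The in-memory stream stands in for whatever ships logs to a central service.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in ("service", "request_id"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


# An in-memory stream stands in for a shipper to a central log service.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("auth")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.warning(
    "session restore timed out",
    extra={"service": "auth", "request_id": "req-123"},
)

# Because every line is valid JSON, tools can parse and index it directly.
entry = json.loads(stream.getvalue())
print(entry["level"], entry["request_id"])
```

Because each entry carries the same named fields, a log viewer can filter by service or request without relying on fragile text matching.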

Tracing

The final pillar is tracing. A trace records the path a particular request takes through the program, including each operation it triggers and how long that operation took. This gives you the information you need to reconstruct the sequence of events that led to a problem. Tracing is especially important in distributed systems, where a single request might hit a dozen or more microservices. Relying on service logs alone is unrealistic, as each service's log can't show what happened to the request after it moved on. A request-level trace is more effective in pinpointing problems.
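To make the idea of a request-level trace concrete, here is a hand-rolled sketch in which every unit of work is a "span" that shares one trace ID with its parent. It is not the API of any real tracing library: the `span` helper, the `finished_spans` list, and the service names are all hypothetical. Everything also runs in a single process here, whereas a real distributed system would need to pass the trace ID between services, typically in request headers.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# The span currently in progress, if any.
_current_span = ContextVar("current_span", default=None)

# Stand-in for an exporter that would send spans to a tracing backend.
finished_spans = []


@contextmanager
def span(name):
    """Open a span that inherits its trace ID from the enclosing span."""
    parent = _current_span.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.perf_counter(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration"] = time.perf_counter() - record["start"]
        _current_span.reset(token)
        finished_spans.append(record)


# One request passing through a gateway and two hypothetical services.
with span("gateway") as root:
    with span("auth.restore_session"):
        pass
    with span("orders.list"):
        pass

for s in finished_spans:
    print(s["name"], s["trace_id"] == root["trace_id"])
```

Since every span carries the same trace ID and a pointer to its parent, the full journey of the request can be rebuilt afterwards, including which step was slow or failed.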

You can start to make a system more observable by ensuring you've got coverage of all three pillars. Remember that "observability" isn't one specific thing: it's a quality of the system as a whole, not a single feature you can switch on. Chances are your system is already "observable" through basic metrics and error logs, but it may still have low "observability" if you can't readily determine the root cause of errors.

Does Observability Stop Errors?

It's worth noting that observability isn't meant to eliminate bugs and errors. Instead, it's actually an acceptance that problems can and will occur. Rather than assuming your system is infallible, observability encourages you to plan for the unthinkable. If you faced an outage, would you have the tools you needed to find the cause?

To use a motoring analogy, it's the difference between a check engine light and the manufacturer's diagnostics software. As undesirable and unlikely as it may be, on-the-road malfunctions do happen. Most people with no specialist equipment see a generic warning light. A dedicated motorist or technician will have the tools to read off the cause of that light.

Now let's return to the cloud. A screen of red metrics won't help much during an outage. Similar to how a vehicle mechanic can read off diagnostics, your system needs to be more deeply observable so you can quickly establish what's wrong without looking at the nuts and bolts. It's important to plan for disasters so you don't get caught out.

Observability is Continuous

Good observability requires ongoing maintenance. You'll need to evaluate your instrumentation as you add new services. Otherwise, you may unwittingly create voids in your logs and traces into which requests disappear.

You can identify gaps in your observability implementation by questioning the system and checking you can get the answers you need. You should think about the information you'd need to start addressing a problem. Would you be able to access it during an outage, without a prolonged manual intervention?

A truly observable system should be able to use one of the three pillars to answer questions presented by the other two. Why is memory usage in the danger zone? Why was an error recorded in the authentication service's logs? In both these cases, the other two pillars should be your first port of call to find the answer.

Summary

Observability is a term borrowed from control theory that can sometimes seem vague and opaque. In practice, the modern use of the term refers to something fairly simple: the combination of monitoring, logging, and tracing to help you infer the internal state of a system from its outputs.

Good observability is vital to distributed architectures where functionality is spread across microservices. An unobservable system becomes a black hole that sucks in requests but provides nothing back. This will compromise your ability to respond to issues and could lead to users reporting problems before you're aware of them.

Conversely, an observable system helps you stay ahead of error reports. Time to resolution is minimized as the system will already be waiting with the information you require.