What is Observability?
Overview of observability, its importance, key telemetry types (metrics, logs, traces), basic architecture, relevant tools, and references for further reading.
Table of contents
- Table of contents
- Description
- Telemetry
- Why implement observability?
- Basic architecture
- References
- Some tools
Description
Observability is the capability to understand a system’s internal workings by examining its external behaviors and outputs, it involves collecting and analyzing telemetry data (logs, metrics, and traces) from diverse sources.
Telemetry
Telemetry refers to data emitted from a system and its behavior. The data can come in the form of traces, metrics, and logs.
Metrics
A metric is a measurement of a system’s state. Metrics are generally collected at regular intervals and are used to understand the system’s behavior over time.
Logs
Logs are records of events that have occurred in a system. Logs are typically used to understand the system’s behavior at a specific moment in time.
Traces
Traces are records of the path a request takes through a system. Traces are generally used to understand how the system behaves when handling a request.
Why implement observability?
- Enables to evaluate, monitor, and improve the performance of distributed IT systems.
- Provides visibility across multiple data sources (logs, metrics, traces).
- Leads to faster, higher-quality software delivery.
- Synthesizes data from all IT layers (hardware, software, cloud infrastructure, containers, microservices, endpoints, etc.), supporting both real-time and historical analysis.
Basic architecture
architecture-beta
group system(disk)[SYSTEM]
group server(server)[SYSTEM or SERVER]
service collect(disk)[Collector] in system
service db_timeseries(database)[DB Timeseries] in server
service db_log(database)[DB Logs] in server
service visualizer(database)[Visualizer] in server
collect:R -- L:db_timeseries
collect:R -- L:db_log
db_timeseries:R -- L:visualizer
db_log:R -- L:visualizer
A very basic observability architecture consists of three key components:
- Collector: Responsible for gathering telemetry data from the system. The choice of collector(s) depends on the types of telemetry you need (such as metrics, logs, or traces) and the available system resources.
- Database: Stores the collected data. In observability stacks, it is common to use specialized databases based on the type of telemetry being stored (e.g., time-series databases for metrics, log databases for logs). The database can be hosted on the monitored system itself or on a remote server.
- Visualizer: Provides the ability to query the database and display the telemetry data in a user-friendly format. Visualization tools can also run locally or connect to remote servers.
References
Some tools
- Data collectors:
- collectd: For embedded systems.
- OTel
- Grafana Alloy
- cAdvisor: Analyzes resource usage and performance characteristics of running containers.
- omlibdbi (rsyslog): Provides native support, on Linux systems, for log collection and export to different databases.
- Specialized DBs:
- TimescaleDB: A time-series database for high-performance real-time analytics packaged as a Postgres extension.
- Jaeger: A time-series database.
- Prometheus: A time-series database.
- Grafana Tempo: Tracing backend.
- Grafana Loki: Log aggregation system.
- Visualizers: