# Observability vs Monitoring: What's the Difference?
“Observability” has become a buzzword. Is it just monitoring rebranded for marketing purposes? No—there’s a real distinction that matters.
## Monitoring: Predefined Questions
Monitoring answers questions you know to ask:
- Is the server up?
- What’s the CPU usage?
- How many errors occurred?
You define dashboards, alerts, and thresholds in advance.
```yaml
# Prometheus alert rule
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
  for: 5m
  annotations:
    summary: "Error rate above 10%"
```
This works until you encounter an unknown failure mode.
## Observability: Arbitrary Questions
Observability lets you ask questions you didn’t anticipate:
- Why are requests from this specific customer slow?
- What changed between yesterday’s deployment and today’s?
- What’s different about requests that fail vs succeed?
You don’t predefine the questions—you explore.
## The Three Pillars

### 1. Metrics
Aggregated numerical measurements over time:
```
http_requests_total{method="GET", path="/api/users", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.250
```
**Good for:** alerting, dashboards, capacity planning
**Limitation:** pre-aggregated; individual request detail is lost
### 2. Logs
Event records with context:
```json
{
  "timestamp": "2020-01-29T10:30:00Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "service": "user-api",
  "trace_id": "abc123",
  "user_id": "user-456"
}
```
**Good for:** debugging specific events, audit trails
**Limitation:** high cardinality makes them expensive at scale
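Because the fields are structured, logs like the one above can be filtered in a few lines. A stdlib-only sketch, with made-up log lines:

```python
import json

# A few structured log lines, as they might appear in a log file
raw_lines = [
    '{"level": "ERROR", "message": "Database connection failed", "trace_id": "abc123"}',
    '{"level": "INFO", "message": "Request handled", "trace_id": "def456"}',
    '{"level": "ERROR", "message": "Timeout talking to auth", "trace_id": "abc123"}',
]

# Parse each line, then keep only the events from one trace
events = [json.loads(line) for line in raw_lines]
trace_events = [e for e in events if e["trace_id"] == "abc123"]

for e in trace_events:
    print(e["level"], e["message"])
```

With plain-text logs, the same filter would be a fragile regex; with JSON it is a dictionary lookup.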
### 3. Traces
Request flow across services:
```
[Trace: abc123]
  → user-api (10ms)
    → auth-service (5ms)
    → database (100ms)   ← Slow!
    → notification-service (8ms)
```
**Good for:** understanding distributed systems, finding bottlenecks
**Limitation:** sampling is needed at scale
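Once span durations are recorded, the bottleneck-finding step shown above reduces to a comparison. A toy sketch, with plain tuples standing in for real span objects:

```python
# A trace as a flat list of spans: (service name, duration in ms).
# This is a simplified stand-in for real trace data, not a tracing API.
trace = [
    ("user-api", 10),
    ("auth-service", 5),
    ("database", 100),
    ("notification-service", 8),
]

# The bottleneck is the span with the largest duration
bottleneck = max(trace, key=lambda span: span[1])
print(bottleneck)  # ('database', 100)
```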
## Observability = Correlation
The power is connecting these:
```
Alert fires (metrics)
  → Find related trace
  → See database was slow
  → Find log showing connection pool exhausted
```
Without correlation, you’re context-switching between tools.
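That join is just a lookup on a shared `trace_id`. A toy sketch, with in-memory dictionaries standing in for the real trace and log stores:

```python
# Toy signal stores keyed so they can be joined on trace_id.
# In practice this join spans Prometheus, a trace backend, and a log store.
slow_traces = {"abc123": {"service": "database", "duration_ms": 100}}
logs = [
    {"trace_id": "abc123", "message": "connection pool exhausted"},
    {"trace_id": "zzz999", "message": "request handled"},
]

# An alert points at a slow trace; the trace_id pulls up the explaining log
for trace_id, span in slow_traces.items():
    related = [l["message"] for l in logs if l["trace_id"] == trace_id]
    print(trace_id, span["service"], related)
```

The shared ID is what turns three separate tools into one investigation.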
## High Cardinality: The Key Difference
Traditional monitoring aggregates away details:
```
# Metric: average latency across all users
http_latency_avg = 50ms
```
Observability preserves dimensions:
```
# Trace: this specific user, this specific request
user=premium_customer, endpoint=/expensive-operation, latency=5000ms
```
That 50ms average hides the premium customer with 5-second requests.
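A quick arithmetic check of that claim, with hypothetical latencies:

```python
from statistics import mean

# 99 fast requests plus one premium customer's 5-second request
latencies_ms = [50] * 99 + [5000]

print(mean(latencies_ms))  # 99.5 -- looks healthy on a dashboard
print(max(latencies_ms))   # 5000 -- the story the average hides
```

The aggregate barely moves, while one customer's experience is a hundred times worse.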
## Implementation

### Logging

```python
import structlog

logger = structlog.get_logger()

def process_order(order_id, user_id):
    logger.info(
        "processing_order",
        order_id=order_id,
        user_id=user_id,
        trace_id=get_trace_id(),  # supplied by your tracing integration
    )
```
Structured logs enable querying:
```sql
SELECT * FROM logs
WHERE user_id = 'user-456'
  AND trace_id = 'abc123';
```
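To make the query concrete, here is the same lookup against an in-memory SQLite table standing in for a real log store (the table layout is assumed for illustration):

```python
import sqlite3

# In-memory stand-in for a log store that accepts SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id TEXT, trace_id TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("user-456", "abc123", "Database connection failed"),
        ("user-456", "def789", "Request handled"),
        ("user-999", "abc123", "Unrelated event"),
    ],
)

# The two structured fields narrow three rows down to the one that matters
rows = conn.execute(
    "SELECT message FROM logs WHERE user_id = ? AND trace_id = ?",
    ("user-456", "abc123"),
).fetchall()
print(rows)  # [('Database connection failed',)]
```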
### Distributed Tracing
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("user_id", user_id)
    span.set_attribute("amount", amount)
    result = payment_gateway.charge(amount)
    span.set_attribute("result", result.status)
```
### Metrics with Labels
```python
from prometheus_client import Counter, Histogram

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
```
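To see what the labels buy you, here is a toy labeled counter in plain Python (a sketch of the idea, not the `prometheus_client` internals):

```python
from collections import Counter

# One count per (method, endpoint, status) label combination
requests = Counter()

def record(method, endpoint, status):
    requests[(method, endpoint, status)] += 1

record("GET", "/api/users", "200")
record("GET", "/api/users", "200")
record("GET", "/api/users", "500")

# Per-label slices stay queryable instead of collapsing into one total
errors = sum(c for (m, e, s), c in requests.items() if s.startswith("5"))
print(errors)                  # 1
print(sum(requests.values()))  # 3
```

With the real library you would call `request_count.labels('GET', '/api/users', '500').inc()` and let Prometheus do the slicing at query time.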
## Tool Stack

### Collection

- **Metrics:** Prometheus, StatsD
- **Logs:** Fluentd, Logstash, Vector
- **Traces:** Jaeger, Zipkin, OpenTelemetry
### Storage

- **Metrics:** Prometheus, InfluxDB, Cortex
- **Logs:** Elasticsearch, Loki
- **Traces:** Jaeger, Tempo
### Visualization

- **Grafana:** dashboards for all three
- **Kibana:** log exploration
- **Jaeger UI:** trace visualization
### All-in-One

- **Datadog:** commercial, full-featured
- **Honeycomb:** observability-focused
- **New Relic:** APM plus observability
- **Elastic Stack:** open-source option
## OpenTelemetry: The Standard
OpenTelemetry (OTel) unifies instrumentation:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Set up once, at application startup
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter())
)

# Use everywhere
tracer = trace.get_tracer(__name__)
```
One API, export to any backend.
## Practical Starting Points

- **Add trace IDs to logs:** correlation without full tracing
- **Structure your logs:** JSON, not plain text
- **Add labels to metrics:** not just totals, but by endpoint, status, etc.
- **Start sampling:** you don't need 100% of traces
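Head-based sampling can be as simple as hashing the trace ID to a number and keeping a fixed fraction. A hand-rolled sketch of the idea; in practice the OpenTelemetry SDK's samplers (e.g. `TraceIdRatioBased`) do this for you:

```python
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 10% of traces

def keep_trace(trace_id: str) -> bool:
    # Hash the trace ID to a fraction in [0, 1). Deterministic, so every
    # service makes the same keep/drop decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < SAMPLE_RATE

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
print(kept)  # roughly 1,000 of 10,000
```

Deciding per trace ID, rather than per span, keeps every kept trace complete across services.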
## Anti-Patterns

### Logging Everything

```python
# Don't log every line
logger.debug(f"Starting loop iteration {i}")  # Expensive at scale
```
### No Correlation

```python
# Useless without trace context
logger.error("Database error occurred")
```
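A correlated version carries the trace ID on every record. One minimal way is the standard library's `LoggerAdapter` (the trace ID is hard-coded here for illustration):

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s trace_id=%(trace_id)s")
logger = logging.getLogger("user-api")

# The adapter attaches the trace ID to every record it emits
log = logging.LoggerAdapter(logger, {"trace_id": "abc123"})
log.error("Database error occurred")
# ERROR Database error occurred trace_id=abc123
```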
### Dashboard Overload
50 dashboards nobody looks at. Start with 3-5 key views.
## Final Thoughts
Observability is a capability, not a product. It’s the ability to understand your system from its outputs.
Start by connecting your existing signals. Add trace IDs to logs. Correlate metrics with traces. The tools matter less than the practice.
Don’t just know that something broke. Understand why.