Observability vs Monitoring: What's the Difference?

“Observability” has become a buzzword. Is it just monitoring rebranded for marketing purposes? No—there’s a real distinction that matters.

Monitoring: Predefined Questions

Monitoring answers questions you already know to ask: Is the error rate above 10%? Is disk usage climbing? Is the service responding?

You define dashboards, alerts, and thresholds in advance.

# Prometheus alert rule
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
  for: 5m
  annotations:
    summary: "Error rate above 10%"

This works until you encounter an unknown failure mode.

Observability: Arbitrary Questions

Observability lets you ask questions you didn't anticipate: Why is this one user seeing 5-second responses? What changed after the last deploy? Which dependency is dragging down this endpoint?

You don't predefine the questions; you explore the data you already collected.
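One way to picture the difference: instead of predefined aggregates, you keep raw events with all their dimensions and filter them on demand. A toy sketch (the event fields and values here are invented for illustration):

```python
# A toy event store: each entry keeps full dimensions instead of pre-aggregating.
events = [
    {"user": "user-456", "endpoint": "/api/users", "status": 500, "latency_ms": 5000},
    {"user": "user-123", "endpoint": "/api/users", "status": 200, "latency_ms": 40},
    {"user": "user-456", "endpoint": "/api/orders", "status": 200, "latency_ms": 35},
]

def explore(predicate):
    """Ask an arbitrary question by filtering raw events."""
    return [e for e in events if predicate(e)]

# A question nobody predefined: which requests from this user were slow?
slow_for_user = explore(lambda e: e["user"] == "user-456" and e["latency_ms"] > 1000)
```

A monitoring dashboard built last month cannot answer that question; a store of raw, dimensional events can.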

The Three Pillars

1. Metrics

Aggregated numerical measurements over time:

http_requests_total{method="GET", path="/api/users", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.250

Good for: Alerting, dashboards, capacity planning
Limitation: Pre-aggregated; loses individual request detail

2. Logs

Event records with context:

{
  "timestamp": "2020-01-29T10:30:00Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "service": "user-api",
  "trace_id": "abc123",
  "user_id": "user-456"
}

Good for: Debugging specific events, audit trails
Limitation: High cardinality; expensive at scale
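Emitting records like the one above takes very little machinery. A minimal sketch using only the standard library (the field names mirror the example; a real setup would use a structured-logging library):

```python
import json
import logging
import sys

# One JSON object per line: trivially machine-parseable.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("user-api")

def log_event(level, message, **context):
    """Serialize the event plus arbitrary context as a single JSON line."""
    record = {"level": level, "message": message, **context}
    line = json.dumps(record)
    logger.info(line)
    return line

entry = log_event("ERROR", "Database connection failed",
                  service="user-api", trace_id="abc123", user_id="user-456")
```

Because every field is a key, not a substring buried in prose, the log store can index and filter on any of them.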

3. Traces

Request flow across services:

[Trace: abc123]
  → user-api (10ms)
    → auth-service (5ms)
    → database (100ms) ← Slow!
  → notification-service (8ms)

Good for: Understanding distributed systems, finding bottlenecks
Limitation: Sampling needed at scale
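The core mechanic of a span is just timing a block of work and recording its parent. A toy stand-in for a real tracer (the service names and sleeps simulate the trace above):

```python
import time
from contextlib import contextmanager

# Collected spans: (name, parent, duration_ms), appended as each span closes.
spans = []

@contextmanager
def span(name, parent=None):
    """Toy span: time a block of work and record it with its parent."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, parent, (time.perf_counter() - start) * 1000))

with span("user-api"):
    with span("auth-service", parent="user-api"):
        time.sleep(0.005)  # fast hop
    with span("database", parent="user-api"):
        time.sleep(0.02)   # the slow hop

# Find the slowest downstream call made by user-api.
slowest_hop = max((s for s in spans if s[1] == "user-api"), key=lambda s: s[2])
```

Real tracers add trace IDs, context propagation across process boundaries, and export, but the span-as-timed-scope idea is the same.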

Observability = Correlation

The power is connecting these:

Alert fires (metrics)
  → Find related trace
    → See database was slow
      → Find log showing connection pool exhausted

Without correlation, you’re context-switching between tools.
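The investigation above is essentially a join on trace_id. As a toy illustration, assume the three signal stores are already keyed so they can be joined (all data here is made up):

```python
# Toy signal stores, joinable on trace_id.
traces = {"abc123": [("user-api", 10), ("auth-service", 5), ("database", 100)]}
logs = [
    {"trace_id": "abc123", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "def456", "level": "INFO", "message": "request ok"},
]

def investigate(trace_id):
    """From an alert's trace, find the slowest span and any related error logs."""
    slowest = max(traces[trace_id], key=lambda s: s[1])
    errors = [l for l in logs if l["trace_id"] == trace_id and l["level"] == "ERROR"]
    return slowest, errors

slowest, errors = investigate("abc123")
```

The whole chain, alert to slow span to root-cause log, becomes one query instead of three tools and a lot of copy-pasting.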

High Cardinality: The Key Difference

Traditional monitoring aggregates away details:

# Metric: Average latency across all users
http_latency_avg = 50ms

Observability preserves dimensions:

# Trace: This specific user, this specific request
user=premium_customer, endpoint=/expensive-operation, latency=5000ms

That 50ms average hides the premium customer with 5-second requests.
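The arithmetic is easy to verify. A sketch with made-up numbers: 99 ordinary requests and one slow premium-customer request, first averaged, then grouped by user:

```python
# 99 ordinary requests and one slow premium-customer request (invented data).
requests = [{"user": "regular", "latency_ms": 50}] * 99 \
         + [{"user": "premium_customer", "latency_ms": 5000}]

latencies = [r["latency_ms"] for r in requests]
mean_ms = sum(latencies) / len(latencies)  # 99.5 ms: the outlier barely registers

# Preserve the user dimension and the outlier becomes findable.
by_user = {}
for r in requests:
    by_user.setdefault(r["user"], []).append(r["latency_ms"])
slowest_user = max(by_user, key=lambda u: max(by_user[u]))  # "premium_customer"
```

Aggregate first and the signal is gone; keep the dimension and a one-line group-by surfaces it.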

Implementation

Logging

import structlog

logger = structlog.get_logger()

def process_order(order_id, user_id):
    logger.info(
        "processing_order",
        order_id=order_id,
        user_id=user_id,
        trace_id=get_trace_id()
    )

Structured logs enable querying:

SELECT * FROM logs 
WHERE user_id = 'user-456' 
AND trace_id = 'abc123'

Distributed Tracing

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("user_id", user_id)
    span.set_attribute("amount", amount)
    result = payment_gateway.charge(amount)
    span.set_attribute("result", result.status)

Metrics with Labels

from prometheus_client import Counter, Histogram

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
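Each distinct combination of label values becomes its own time series, which is where cardinality (and storage cost) comes from. A dict-based sketch of the mental model, not the prometheus_client internals:

```python
from collections import defaultdict

# A labeled counter is conceptually a map from a label tuple to a number.
http_requests_total = defaultdict(int)

def inc(method, endpoint, status):
    http_requests_total[(method, endpoint, status)] += 1

inc("GET", "/api/users", "200")
inc("GET", "/api/users", "200")
inc("GET", "/api/users", "500")

# Two distinct label combinations seen so far: two time series.
series_count = len(http_requests_total)
```

This is why an unbounded label like user_id is dangerous in metrics: every new value spawns a new series.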

Tool Stack

Collection

OpenTelemetry SDKs and the OpenTelemetry Collector; Fluent Bit or Fluentd for logs.

Storage

Prometheus for metrics; Loki or Elasticsearch for logs; Jaeger or Tempo for traces.

Visualization

Grafana (covers all three signals) or Kibana.

All-in-One

Hosted platforms such as Datadog, New Relic, or Honeycomb.

OpenTelemetry: The Standard

OpenTelemetry (OTel) unifies instrumentation:

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup once
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter())
)

# Use everywhere
tracer = trace.get_tracer(__name__)

One API, export to any backend.

Practical Starting Points

  1. Add trace IDs to logs: Correlation without full tracing
  2. Structure your logs: JSON, not plain text
  3. Add labels to metrics: Not just totals, but by endpoint, status, etc.
  4. Start sampling: You don’t need 100% of traces
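For point 4, a common approach is deterministic head-based sampling keyed on the trace ID, so every service makes the same keep-or-drop decision and traces stay whole. A sketch (the rate and hashing scheme are illustrative choices):

```python
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 10% of traces

def should_sample(trace_id: str) -> bool:
    """Deterministic decision: the same trace_id always hashes to the same
    bucket, so every service keeps or drops a given trace consistently."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

kept = sum(should_sample(f"trace-{i}") for i in range(1000))  # roughly 100
```

Tail-based sampling (decide after the trace completes, so you keep all the errors and slow ones) is more powerful but needs buffering infrastructure; head-based is the simple starting point.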

Anti-Patterns

Logging Everything

# Don't log every line
logger.debug(f"Starting loop iteration {i}")  # Expensive

No Correlation

# Useless without trace context
logger.error("Database error occurred")

Dashboard Overload

50 dashboards nobody looks at. Start with 3-5 key views.

Final Thoughts

Observability is a capability, not a product. It’s the ability to understand your system from its outputs.

Start by connecting your existing signals. Add trace IDs to logs. Correlate metrics with traces. The tools matter less than the practice.


Don’t just know that something broke. Understand why.
