# Observability vs Monitoring: What's the Difference?
“Observability” has become a buzzword. Is it just monitoring rebranded for marketing purposes? No—there’s a real distinction that matters.
## Monitoring: Predefined Questions
Monitoring answers questions you know to ask:
- Is the server up?
- What’s the CPU usage?
- How many errors occurred?
You define dashboards, alerts, and thresholds in advance.
```yaml
# Prometheus alert rule
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
  for: 5m
  annotations:
    summary: "Error rate above 10%"
```
This works until you encounter an unknown failure mode.
## Observability: Arbitrary Questions
Observability lets you ask questions you didn’t anticipate:
- Why are requests from this specific customer slow?
- What changed between yesterday’s deployment and today’s?
- What’s different about requests that fail vs succeed?
You don’t predefine the questions—you explore.
## The Three Pillars

### 1. Metrics
Aggregated numerical measurements over time:
```
http_requests_total{method="GET", path="/api/users", status="200"} 1234
http_request_duration_seconds{quantile="0.99"} 0.250
```
**Good for:** alerting, dashboards, capacity planning
**Limitation:** pre-aggregated; individual request detail is lost
### 2. Logs
Event records with context:
```json
{
  "timestamp": "2020-01-29T10:30:00Z",
  "level": "ERROR",
  "message": "Database connection failed",
  "service": "user-api",
  "trace_id": "abc123",
  "user_id": "user-456"
}
```
**Good for:** debugging specific events, audit trails
**Limitation:** high cardinality makes them expensive at scale
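Because the fields are structured, logs like the one above can be filtered in a few lines. A stdlib-only sketch, with made-up log lines:

```python
import json

# A few structured log lines, as they might appear in a log file
raw_lines = [
    '{"level": "ERROR", "message": "Database connection failed", "trace_id": "abc123"}',
    '{"level": "INFO", "message": "Request handled", "trace_id": "def456"}',
    '{"level": "ERROR", "message": "Timeout talking to auth", "trace_id": "abc123"}',
]

# Parse each line, then keep only the events from one trace
events = [json.loads(line) for line in raw_lines]
trace_events = [e for e in events if e["trace_id"] == "abc123"]

for e in trace_events:
    print(e["level"], e["message"])
```

With plain-text logs, the same filter would be a fragile regex; with JSON it is a dictionary lookup.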
### 3. Traces
Request flow across services:
```
[Trace: abc123]
  → user-api (10ms)
    → auth-service (5ms)
    → database (100ms)   ← Slow!
    → notification-service (8ms)
```
**Good for:** understanding distributed systems, finding bottlenecks
**Limitation:** sampling is needed at scale
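Once span durations are recorded, the bottleneck-finding step shown above reduces to a comparison. A toy sketch, with plain tuples standing in for real span objects:

```python
# A trace as a flat list of spans: (service name, duration in ms).
# This is a simplified stand-in for real trace data, not a tracing API.
trace = [
    ("user-api", 10),
    ("auth-service", 5),
    ("database", 100),
    ("notification-service", 8),
]

# The bottleneck is the span with the largest duration
bottleneck = max(trace, key=lambda span: span[1])
print(bottleneck)  # ('database', 100)
```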
## Observability = Correlation
The power is connecting these:
```
Alert fires (metrics)
  → Find related trace
  → See database was slow
  → Find log showing connection pool exhausted
```
Without correlation, you’re context-switching between tools.
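That join is just a lookup on a shared `trace_id`. A toy sketch, with in-memory dictionaries standing in for the real trace and log stores:

```python
# Toy signal stores keyed so they can be joined on trace_id.
# In practice this join spans Prometheus, a trace backend, and a log store.
slow_traces = {"abc123": {"service": "database", "duration_ms": 100}}
logs = [
    {"trace_id": "abc123", "message": "connection pool exhausted"},
    {"trace_id": "zzz999", "message": "request handled"},
]

# An alert points at a slow trace; the trace_id pulls up the explaining log
for trace_id, span in slow_traces.items():
    related = [l["message"] for l in logs if l["trace_id"] == trace_id]
    print(trace_id, span["service"], related)
```

The shared ID is what turns three separate tools into one investigation.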
## High Cardinality: The Key Difference
Traditional monitoring aggregates away details:
```
# Metric: average latency across all users
http_latency_avg = 50ms
```
Observability preserves dimensions:
```
# Trace: this specific user, this specific request
user=premium_customer, endpoint=/expensive-operation, latency=5000ms
```
That 50ms average hides the premium customer with 5-second requests.
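A quick arithmetic check of that claim, with hypothetical latencies:

```python
from statistics import mean

# 99 fast requests plus one premium customer's 5-second request
latencies_ms = [50] * 99 + [5000]

print(mean(latencies_ms))  # 99.5 -- looks healthy on a dashboard
print(max(latencies_ms))   # 5000 -- the story the average hides
```

The aggregate barely moves, while one customer's experience is a hundred times worse.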
## Implementation

### Logging

```python
import structlog

logger = structlog.get_logger()

def process_order(order_id, user_id):
    logger.info(
        "processing_order",
        order_id=order_id,
        user_id=user_id,
        trace_id=get_trace_id(),  # supplied by your tracing integration
    )
```
Structured logs enable querying:
```sql
SELECT * FROM logs
WHERE user_id = 'user-456'
  AND trace_id = 'abc123';
```
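To make the query concrete, here is the same lookup against an in-memory SQLite table standing in for a real log store (the table layout is assumed for illustration):

```python
import sqlite3

# In-memory stand-in for a log store that accepts SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user_id TEXT, trace_id TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [
        ("user-456", "abc123", "Database connection failed"),
        ("user-456", "def789", "Request handled"),
        ("user-999", "abc123", "Unrelated event"),
    ],
)

# The two structured fields narrow three rows down to the one that matters
rows = conn.execute(
    "SELECT message FROM logs WHERE user_id = ? AND trace_id = ?",
    ("user-456", "abc123"),
).fetchall()
print(rows)  # [('Database connection failed',)]
```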
### Distributed Tracing
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("user_id", user_id)
    span.set_attribute("amount", amount)
    result = payment_gateway.charge(amount)
    span.set_attribute("result", result.status)
```
### Metrics with Labels
```python
from prometheus_client import Counter, Histogram

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
```
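To see what the labels buy you, here is a toy labeled counter in plain Python (a sketch of the idea, not the `prometheus_client` internals):

```python
from collections import Counter

# One count per (method, endpoint, status) label combination
requests = Counter()

def record(method, endpoint, status):
    requests[(method, endpoint, status)] += 1

record("GET", "/api/users", "200")
record("GET", "/api/users", "200")
record("GET", "/api/users", "500")

# Per-label slices stay queryable instead of collapsing into one total
errors = sum(c for (m, e, s), c in requests.items() if s.startswith("5"))
print(errors)                  # 1
print(sum(requests.values()))  # 3
```

With the real library you would call `request_count.labels('GET', '/api/users', '500').inc()` and let Prometheus do the slicing at query time.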
## Tool Stack

### Collection

- **Metrics:** Prometheus, StatsD
- **Logs:** Fluentd, Logstash, Vector
- **Traces:** Jaeger, Zipkin, OpenTelemetry
### Storage

- **Metrics:** Prometheus, InfluxDB, Cortex
- **Logs:** Elasticsearch, Loki
- **Traces:** Jaeger, Tempo
### Visualization

- **Grafana:** dashboards for all three
- **Kibana:** log exploration
- **Jaeger UI:** trace visualization
### All-in-One

- **Datadog:** commercial, full-featured
- **Honeycomb:** observability-focused
- **New Relic:** APM plus observability
- **Elastic Stack:** open-source option
## OpenTelemetry: The Standard
OpenTelemetry (OTel) unifies instrumentation:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Set up once, at application startup
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(JaegerExporter())
)

# Use everywhere
tracer = trace.get_tracer(__name__)
```
One API, export to any backend.
## Practical Starting Points

- **Add trace IDs to logs:** correlation without full tracing
- **Structure your logs:** JSON, not plain text
- **Add labels to metrics:** not just totals, but by endpoint, status, etc.
- **Start sampling:** you don't need 100% of traces
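Head-based sampling can be as simple as hashing the trace ID to a number and keeping a fixed fraction. A hand-rolled sketch of the idea; in practice the OpenTelemetry SDK's samplers (e.g. `TraceIdRatioBased`) do this for you:

```python
import hashlib

SAMPLE_RATE = 0.1  # keep roughly 10% of traces

def keep_trace(trace_id: str) -> bool:
    # Hash the trace ID to a fraction in [0, 1). Deterministic, so every
    # service makes the same keep/drop decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < SAMPLE_RATE

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
print(kept)  # roughly 1,000 of 10,000
```

Deciding per trace ID, rather than per span, keeps every kept trace complete across services.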
## Anti-Patterns

### Logging Everything

```python
# Don't log every line
logger.debug(f"Starting loop iteration {i}")  # Expensive at scale
```
### No Correlation

```python
# Useless without trace context
logger.error("Database error occurred")
```
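A correlated version carries the trace ID on every record. One minimal way is the standard library's `LoggerAdapter` (the trace ID is hard-coded here for illustration):

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s trace_id=%(trace_id)s")
logger = logging.getLogger("user-api")

# The adapter attaches the trace ID to every record it emits
log = logging.LoggerAdapter(logger, {"trace_id": "abc123"})
log.error("Database error occurred")
# ERROR Database error occurred trace_id=abc123
```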
### Dashboard Overload
50 dashboards nobody looks at. Start with 3-5 key views.
## Final Thoughts
Observability is a capability, not a product. It’s the ability to understand your system from its outputs.
Start by connecting your existing signals. Add trace IDs to logs. Correlate metrics with traces. The tools matter less than the practice.
Don’t just know that something broke. Understand why.