Intro to Prometheus and Grafana for Monitoring
Nagios and Zabbix served us well, but modern infrastructure demands modern monitoring. Prometheus and Grafana have emerged as the de facto standard for metrics collection and visualization.
Why Prometheus?
Prometheus is a time-series database designed for monitoring. Key features:
- Pull-based: Prometheus scrapes metrics from targets
- Dimensional data: Labels enable powerful queries
- PromQL: Flexible query language
- Alerting: Built-in alert rules and Alertmanager integration
- Service discovery: Auto-discovers targets in K8s, Consul, etc.
Quick Start
Running Prometheus
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']
```
Exposing Metrics from Your App
Python example with prometheus_client:
```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Instrument your code
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return users

# Expose /metrics endpoint
start_http_server(8000)
```
Metric Types
Counter
Cumulative values that only go up (they reset to zero when the process restarts):
```python
errors_total = Counter('errors_total', 'Total errors', ['type'])
errors_total.labels(type='database').inc()
```
Gauge
Values that can go up or down:
```python
active_connections = Gauge('active_connections', 'Active DB connections')
active_connections.set(42)
active_connections.inc()
active_connections.dec()
```
Histogram
Distribution of values:
```python
request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
request_duration.observe(0.42)
```
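To build intuition for what a Histogram actually exports, here is a plain-Python sketch (no client library, just an illustration) of how cumulative bucket counting works; the bucket bounds mirror the defaults above:

```python
import math

# Prometheus histogram buckets are cumulative: an observation is counted
# in every bucket whose upper bound is >= the value, plus the implicit
# +Inf bucket that holds the total count.
BUCKETS = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, math.inf]

def observe(counts, value):
    for upper in BUCKETS:
        if value <= upper:
            counts[upper] += 1

counts = {b: 0 for b in BUCKETS}
for latency in [0.03, 0.2, 0.42, 7.0]:
    observe(counts, latency)

print(counts[0.05])      # 1 observation took <= 50ms
print(counts[0.5])       # 3 observations took <= 500ms
print(counts[math.inf])  # 4 observations total
```

This cumulative shape is exactly what `histogram_quantile()` consumes on the query side.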
Summary
Like Histogram, but calculates quantiles client-side:
```python
request_size = Summary('request_size_bytes', 'Request size')
request_size.observe(1024)
```
PromQL Basics
Simple Queries
```promql
# Current value of a metric
up

# Filter by label
http_requests_total{status="500"}

# Multiple labels
http_requests_total{method="POST", endpoint="/api/users"}
```
Range Vectors
```promql
# Last 5 minutes of data
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])
```
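One detail worth understanding about `rate()`: when a counter's value drops, Prometheus assumes the process restarted and the counter began again from zero. A simplified plain-Python sketch of the idea (real `rate()` also extrapolates to the window boundaries, which this omits):

```python
def increase(samples):
    """Total increase across ordered counter samples, reset-aware."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # On a reset (value dropped), assume the counter restarted at zero,
        # so the whole current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total

def per_second_rate(samples, window_seconds):
    return increase(samples) / window_seconds

samples = [100, 150, 180, 20, 60]      # the counter reset after 180
print(increase(samples))               # 140.0 (50 + 30 + 20 + 40)
print(per_second_rate(samples, 300))   # ~0.47 per second over a 5m window
```

This is why you should always wrap counters in `rate()` rather than graphing them raw.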
Aggregation
```promql
# Sum across all instances
sum(http_requests_total)

# Group by label
sum by (status) (http_requests_total)

# Top 5 endpoints by request count
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
```
Common Patterns
```promql
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Availability (uptime)
avg_over_time(up[24h]) * 100
```
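Under the hood, `histogram_quantile()` works from cumulative bucket counts: it finds the bucket where the target rank falls and linearly interpolates inside it. A simplified plain-Python sketch (the bucket data here is made up for illustration):

```python
def histogram_quantile(phi, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = phi * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if upper == float('inf'):
                # Prometheus falls back to the highest finite bound here.
                return prev_bound
            # Linear interpolation within the target bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

# 100 requests: 50 under 100ms, 90 under 500ms, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.778
```

The interpolation explains why quantile accuracy depends on how well your bucket bounds bracket the real latencies.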
Grafana: Visualization
Setup
```yaml
# Add under services: in docker-compose.yml
grafana:
  image: grafana/grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin
```
Adding Prometheus Data Source
- Go to Configuration → Data Sources
- Add Prometheus
- URL: http://prometheus:9090
- Save & Test
Creating Dashboards
Start with pre-built dashboards:
- Import dashboard ID 1860 for Node Exporter
- Import dashboard ID 3662 for Prometheus stats
Then customize:
- Add Panel
- Select Prometheus data source
- Enter PromQL query
- Choose visualization (Graph, Gauge, Table, etc.)
Dashboard Best Practices
- RED Method: Rate, Errors, Duration for services
- USE Method: Utilization, Saturation, Errors for resources
- Variables: Use template variables for flexibility
- Annotations: Mark deployments and incidents
Alerting
Prometheus Alert Rules
```yaml
# alerts.yml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
```
Alertmanager
Routes alerts to Slack, PagerDuty, email:
```yaml
# alertmanager.yml
route:
  receiver: 'slack'
  group_by: ['alertname']
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/...'
```
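Sub-routes let you escalate by label. A hedged sketch (the `severity` label and the PagerDuty receiver are illustrative; newer Alertmanager versions prefer `matchers:` over `match:`):

```yaml
route:
  receiver: 'slack'
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-key>'
```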
Common Exporters
- Node Exporter: System metrics (CPU, memory, disk)
- cAdvisor: Container metrics
- PostgreSQL Exporter: Database metrics
- Nginx Exporter: Web server metrics
- Blackbox Exporter: Probe endpoints (HTTP, TCP, DNS)
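Node Exporter is usually the first one to wire up. A hedged docker-compose sketch using the default image and port:

```yaml
# Add under services: in docker-compose.yml
node-exporter:
  image: prom/node-exporter
  ports:
    - "9100:9100"
```

Then add a scrape job in prometheus.yml pointing at `node-exporter:9100`, and the dashboard import mentioned earlier (ID 1860) will light up.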
Best Practices
- Label wisely: high-cardinality labels (e.g. user IDs, request IDs) explode the number of series and, with it, storage
- Use recording rules: Pre-compute expensive queries
- Set up retention: Default is 15 days, adjust for your needs
- Monitor the monitor: Alert if Prometheus is down
- Document dashboards: Future you will thank you
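A recording rule pre-computes an expensive expression into a new series on a schedule. A minimal sketch (the group name and metric name follow the `level:metric:operations` convention but are illustrative):

```yaml
# recording_rules.yml -- reference it under rule_files: in prometheus.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query `job:http_requests:rate5m` directly instead of re-evaluating the full expression on every refresh.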
Final Thoughts
Prometheus + Grafana is the modern standard for good reason. It’s powerful, flexible, and has excellent Kubernetes integration.
Start with basic system metrics, then instrument your application code. Good observability is an investment that pays dividends when things go wrong.
You can’t improve what you don’t measure.