Intro to Prometheus and Grafana for Monitoring


Nagios and Zabbix served us well, but modern infrastructure demands modern monitoring. Prometheus and Grafana have emerged as the de facto standard for metrics collection and visualization.

Why Prometheus?

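Prometheus is a time-series database designed for monitoring. It scrapes metrics from your services over HTTP, stores them as labeled time series, and lets you query them with PromQL.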

Quick Start

Running Prometheus

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']

Exposing Metrics from Your App

Python example with prometheus_client:

from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

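# Placeholder data so the example runs; replace with your real lookup
users = [{'id': 1, 'name': 'alice'}]

# Instrument your code: the decorator times every call to get_users()
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return users

# Expose /metrics endpoint on port 8000.
# The server runs in a background thread, so keep the process alive
# (for example, inside your app's main loop).
start_http_server(8000)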
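
To make the /metrics endpoint less of a black box, here is a minimal standard-library sketch of the text format Prometheus scrapes. The `render_counter` helper is hypothetical and simplified (no label escaping, counters only); the real client library also handles ordering, escaping, and extra series for you.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in a simplified Prometheus text format.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    Hypothetical helper for illustration; no escaping is performed.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Series without labels are written without braces
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {float(value)}")
    return "\n".join(lines) + "\n"


samples = {
    (("method", "GET"), ("endpoint", "/api/users"), ("status", "200")): 3,
}
text = render_counter("http_requests_total", "Total HTTP requests", samples)
print(text)
```

Each unique combination of label values becomes its own line, and therefore its own time series.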

Metric Types

Counter

Cumulative values that only increase:

from prometheus_client import Counter

errors_total = Counter('errors_total', 'Total errors', ['type'])
errors_total.labels(type='database').inc()

Gauge

Values that can go up or down:

from prometheus_client import Gauge

active_connections = Gauge('active_connections', 'Active DB connections')
active_connections.set(42)
active_connections.inc()
active_connections.dec()

Histogram

Distribution of values:

from prometheus_client import Histogram

request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
request_duration.observe(0.42)
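
Histogram buckets are cumulative: an observation is counted in every bucket whose upper bound is at or above the value. The sketch below illustrates that with the standard library only. `CumulativeBuckets` is a hypothetical class for illustration, not part of prometheus_client.

```python
import math


class CumulativeBuckets:
    """Toy model of a histogram's cumulative bucket counters."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket always catches every observation
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)
        self.total = 0.0   # analogous to the _sum series
        self.count = 0     # analogous to the _count series

    def observe(self, value):
        self.total += value
        self.count += 1
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.counts[i] += 1  # every bucket with le >= value

    def snapshot(self):
        return list(zip(self.upper_bounds, self.counts))


hist = CumulativeBuckets([.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10])
hist.observe(0.42)
for bound, count in hist.snapshot():
    print(f"le={bound}: {count}")
```

This is why bucket boundaries matter: anything between 0.25 and 0.5 lands in the same bucket, so pick bounds close to the latencies you care about.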

Summary

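Like Histogram, but designed to calculate quantiles client-side instead of in Prometheus at query time. Note that the Python client's Summary only exposes a count and a sum; it does not compute quantiles:

from prometheus_client import Summary

request_size = Summary('request_size_bytes', 'Request size')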
request_size.observe(1024)

PromQL Basics

Simple Queries

# Current value of a metric
up

# Filter by label
http_requests_total{status="500"}

# Multiple labels
http_requests_total{method="POST", endpoint="/api/users"}

Range Vectors

# Last 5 minutes of data
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])
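
To build intuition for what rate() does with a counter, here is a simplified sketch. It sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. `simple_rate` is a hypothetical helper, and the real rate() also extrapolates to the edges of the window, so the numbers will not match Prometheus exactly.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr < prev:
            increase += curr  # counter reset: assume it restarted from 0
        else:
            increase += curr - prev
    return increase / elapsed


# The counter resets between t=60 and t=120 (for example, an app restart)
samples = [(0, 100), (60, 160), (120, 10), (180, 70)]
print(simple_rate(samples))
```

The reset handling is the reason to always apply rate() to counters rather than graphing their raw values.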

Aggregation

# Sum across all instances
sum(http_requests_total)

# Group by label
sum by (status) (http_requests_total)

# Top 5 endpoints by request rate
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
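
If the grouping in sum by (...) feels abstract, it behaves much like a group-by over the label sets. A rough Python equivalent, using made-up series values and a hypothetical `sum_by` helper:

```python
from collections import defaultdict


def sum_by(series, label):
    """Group series by one label and sum their values.

    `series` is a list of (labels_dict, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in series:
        # Series missing the label are grouped under the empty string
        totals[labels.get(label, "")] += value
    return dict(totals)


# Made-up sample values for illustration
series = [
    ({"method": "GET", "endpoint": "/api/users", "status": "200"}, 120.0),
    ({"method": "POST", "endpoint": "/api/users", "status": "200"}, 30.0),
    ({"method": "GET", "endpoint": "/api/orders", "status": "500"}, 4.0),
]
print(sum_by(series, "status"))
```

All labels other than the one you group by are dropped from the result, which is also what happens in PromQL.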

Common Patterns

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

# 95th percentile latency (per label set; wrap the rate in sum by (le) (...) to aggregate)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Availability (uptime)
avg_over_time(up[24h]) * 100
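
The percentile from histogram_quantile is an estimate, not an exact value. As a sketch of the idea: find the bucket that contains the target rank, then interpolate linearly inside it. `estimate_quantile` below is a hypothetical, simplified version that assumes 0 < q <= 1 and a lower bound of 0 for the first bucket; the bucket counts are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by bound,
    ending with the +Inf bucket. Assumes 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the highest finite bound
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts: 90 of 100 requests finished within 0.5s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, buckets))
```

Because the result is interpolated, its accuracy is limited by the bucket width around the percentile you query.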

Grafana: Visualization

Setup

# Add under services: in docker-compose.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      # Fine for local testing; change this anywhere else
      - GF_SECURITY_ADMIN_PASSWORD=admin

Adding Prometheus Data Source

  1. Go to Configuration → Data Sources (Connections → Data sources in newer Grafana versions)
  2. Add Prometheus
  3. URL: http://prometheus:9090
  4. Save & Test

Creating Dashboards

Start with a pre-built dashboard from the Grafana community dashboard library, then customize:

  1. Add Panel
  2. Select Prometheus data source
  3. Enter PromQL query
  4. Choose visualization (Graph, Gauge, Table, etc.)

Dashboard Best Practices

Alerting

Prometheus Alert Rules

# alerts.yml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) 
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
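
The for: 5m clause is easy to misread. The alert does not fire the moment the expression is true; it sits in a pending state until the expression has been true continuously for the whole duration. A small simulation of that behavior, using a hypothetical `alert_states` helper and a one-minute evaluation interval chosen for illustration:

```python
def alert_states(evaluations, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    `evaluations` is a list of booleans (expression true or false),
    one per evaluation, spaced `interval_s` seconds apart.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(evaluations):
        now = i * interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_s
        states.append("firing" if held_long_enough else "pending")
    return states


# Expression stays true for six evaluations, one minute apart, with for: 5m
print(alert_states([True] * 6, interval_s=60, for_s=300))
```

A short blip that clears within five minutes never leaves the pending state, which is what keeps flapping metrics from paging you.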

Alertmanager

Alertmanager routes alerts to receivers such as Slack, PagerDuty, or email:

# alertmanager.yml
route:
  receiver: 'slack'
  group_by: ['alertname']
  
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/...'

Common Exporters

Best Practices

  1. Label wisely: High cardinality labels (user IDs) will explode storage
  2. Use recording rules: Pre-compute expensive queries
  3. Set up retention: Default is 15 days; adjust it with the --storage.tsdb.retention.time flag
  4. Monitor the monitor: Alert if Prometheus is down
  5. Document dashboards: Future you will thank you
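
To put a number on the cardinality warning in point 1: the worst-case series count for a metric is the product of the distinct values of each label. A back-of-the-envelope sketch with made-up label values and a hypothetical `series_count` helper:

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric.

    `label_values` maps each label name to its distinct values.
    """
    return prod(len(values) for values in label_values.values())


# Made-up label values for illustration
labels = {
    "method": ["GET", "POST", "PUT", "DELETE", "PATCH"],
    "endpoint": [f"/api/v{i}" for i in range(20)],
    "status": ["200", "201", "204", "301", "400", "401", "403", "404", "500", "503"],
}
print(series_count(labels))

# Adding a user_id label with 10,000 users multiplies everything
labels_with_user = {**labels, "user_id": [str(i) for i in range(10_000)]}
print(series_count(labels_with_user))
```

Not every combination shows up in practice, but the multiplication shows why a single unbounded label is enough to overwhelm storage.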

Final Thoughts

Prometheus + Grafana is the modern standard for good reason. It’s powerful, flexible, and has excellent Kubernetes integration.

Start with basic system metrics, then instrument your application code. Good observability is an investment that pays dividends when things go wrong.


You can’t improve what you don’t measure.
