Intro to Prometheus and Grafana for Monitoring
Nagios and Zabbix served us well, but modern infrastructure demands modern monitoring. Prometheus and Grafana have emerged as the de facto standard for metrics collection and visualization.
Why Prometheus?
Prometheus is a time-series database designed for monitoring. Key features:
- Pull-based: Prometheus scrapes metrics from targets
- Dimensional data: Labels enable powerful queries
- PromQL: Flexible query language
- Alerting: Built-in alert rules and Alertmanager integration
- Service discovery: Auto-discovers targets in K8s, Consul, etc.
Quick Start
Running Prometheus
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']
```
Exposing Metrics from Your App
Python example with prometheus_client:
```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Instrument your code
@REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time()
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return users

# Expose /metrics endpoint
start_http_server(8000)
```
Metric Types
Counter
Cumulative values that only go up (they reset to zero when the process restarts):
```python
errors_total = Counter('errors_total', 'Total errors', ['type'])
errors_total.labels(type='database').inc()
```
Gauge
Values that can go up or down:
```python
active_connections = Gauge('active_connections', 'Active DB connections')
active_connections.set(42)
active_connections.inc()
active_connections.dec()
```
Histogram
Distribution of values:
```python
request_duration = Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
request_duration.observe(0.42)
```
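To build intuition for what a Histogram actually exports, here is a plain-Python sketch (no client library, just an illustration) of how cumulative bucket counting works; the bucket bounds mirror the defaults above:

```python
import math

# Prometheus histogram buckets are cumulative: an observation is counted
# in every bucket whose upper bound is >= the value, plus the implicit
# +Inf bucket that holds the total count.
BUCKETS = [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, math.inf]

def observe(counts, value):
    for upper in BUCKETS:
        if value <= upper:
            counts[upper] += 1

counts = {b: 0 for b in BUCKETS}
for latency in [0.03, 0.2, 0.42, 7.0]:
    observe(counts, latency)

print(counts[0.05])      # 1 observation took <= 50ms
print(counts[0.5])       # 3 observations took <= 500ms
print(counts[math.inf])  # 4 observations total
```

This cumulative shape is exactly what `histogram_quantile()` consumes on the query side.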
Summary
Like Histogram, but calculates quantiles client-side:
```python
request_size = Summary('request_size_bytes', 'Request size')
request_size.observe(1024)
```
PromQL Basics
Simple Queries
```promql
# Current value of a metric
up

# Filter by label
http_requests_total{status="500"}

# Multiple labels
http_requests_total{method="POST", endpoint="/api/users"}
```
Range Vectors
```promql
# Last 5 minutes of data
http_requests_total[5m]

# Rate of increase per second
rate(http_requests_total[5m])
```
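One detail worth understanding about `rate()`: when a counter's value drops, Prometheus assumes the process restarted and the counter began again from zero. A simplified plain-Python sketch of the idea (real `rate()` also extrapolates to the window boundaries, which this omits):

```python
def increase(samples):
    """Total increase across ordered counter samples, reset-aware."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # On a reset (value dropped), assume the counter restarted at zero,
        # so the whole current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total

def per_second_rate(samples, window_seconds):
    return increase(samples) / window_seconds

samples = [100, 150, 180, 20, 60]      # the counter reset after 180
print(increase(samples))               # 140.0 (50 + 30 + 20 + 40)
print(per_second_rate(samples, 300))   # ~0.47 per second over a 5m window
```

This is why you should always wrap counters in `rate()` rather than graphing them raw.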
Aggregation
```promql
# Sum across all instances
sum(http_requests_total)

# Group by label
sum by (status) (http_requests_total)

# Top 5 endpoints by request count
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
```
Common Patterns
```promql
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Availability (uptime)
avg_over_time(up[24h]) * 100
```
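Under the hood, `histogram_quantile()` works from cumulative bucket counts: it finds the bucket where the target rank falls and linearly interpolates inside it. A simplified plain-Python sketch (the bucket data here is made up for illustration):

```python
def histogram_quantile(phi, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = phi * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if upper == float('inf'):
                # Prometheus falls back to the highest finite bound here.
                return prev_bound
            # Linear interpolation within the target bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

# 100 requests: 50 under 100ms, 90 under 500ms, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.778
```

The interpolation explains why quantile accuracy depends on how well your bucket bounds bracket the real latencies.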
Grafana: Visualization
Setup
```yaml
# Add under services: in docker-compose.yml
grafana:
  image: grafana/grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin
```
Adding Prometheus Data Source
- Go to Configuration → Data Sources
- Add Prometheus
- URL: http://prometheus:9090
- Save & Test
Creating Dashboards
Start with pre-built dashboards:
- Import dashboard ID 1860 for Node Exporter
- Import dashboard ID 3662 for Prometheus stats
Then customize:
- Add Panel
- Select Prometheus data source
- Enter PromQL query
- Choose visualization (Graph, Gauge, Table, etc.)
Dashboard Best Practices
- RED Method: Rate, Errors, Duration for services
- USE Method: Utilization, Saturation, Errors for resources
- Variables: Use template variables for flexibility
- Annotations: Mark deployments and incidents
Alerting
Prometheus Alert Rules
```yaml
# alerts.yml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
```
Alertmanager
Routes alerts to Slack, PagerDuty, email:
```yaml
# alertmanager.yml
route:
  receiver: 'slack'
  group_by: ['alertname']
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/...'
```
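Sub-routes let you escalate by label. A hedged sketch (the `severity` label and the PagerDuty receiver are illustrative; newer Alertmanager versions prefer `matchers:` over `match:`):

```yaml
route:
  receiver: 'slack'
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-key>'
```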
Common Exporters
- Node Exporter: System metrics (CPU, memory, disk)
- cAdvisor: Container metrics
- PostgreSQL Exporter: Database metrics
- Nginx Exporter: Web server metrics
- Blackbox Exporter: Probe endpoints (HTTP, TCP, DNS)
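Node Exporter is usually the first one to wire up. A hedged docker-compose sketch using the default image and port:

```yaml
# Add under services: in docker-compose.yml
node-exporter:
  image: prom/node-exporter
  ports:
    - "9100:9100"
```

Then add a scrape job in prometheus.yml pointing at `node-exporter:9100`, and the dashboard import mentioned earlier (ID 1860) will light up.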
Best Practices
- Label wisely: high-cardinality labels (e.g. user IDs, request IDs) explode the number of series and, with it, storage
- Use recording rules: Pre-compute expensive queries
- Set up retention: Default is 15 days, adjust for your needs
- Monitor the monitor: Alert if Prometheus is down
- Document dashboards: Future you will thank you
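A recording rule pre-computes an expensive expression into a new series on a schedule. A minimal sketch (the group name and metric name follow the `level:metric:operations` convention but are illustrative):

```yaml
# recording_rules.yml -- reference it under rule_files: in prometheus.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query `job:http_requests:rate5m` directly instead of re-evaluating the full expression on every refresh.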
Final Thoughts
Prometheus + Grafana is the modern standard for good reason. It’s powerful, flexible, and has excellent Kubernetes integration.
Start with basic system metrics, then instrument your application code. Good observability is an investment that pays dividends when things go wrong.
You can’t improve what you don’t measure.