---
import GuideLayout from '../layouts/GuideLayout.astro';

const base = import.meta.env.BASE_URL.replace(/\/$/, '');

const howToJsonLd = {
  '@context': 'https://schema.org',
  '@type': 'HowTo',
  name: 'How to configure Prometheus and AlertManager for production alerting',
  description: 'Set up Prometheus alert rules, configure AlertManager routing and receivers, use recording rules to reduce load, and troubleshoot alert delivery delays.',
  step: [
    {
      '@type': 'HowToStep',
      name: 'Configure Prometheus scrape and evaluation intervals',
      text: 'In prometheus.yml, set scrape_interval and evaluation_interval (e.g. 20s). Point rule_files at your alerts/*.yml directory.',
    },
    {
      '@type': 'HowToStep',
      name: 'Write alert rules',
      text: 'Create YAML rule files with alert name, expr (PromQL), for duration, severity label, and summary/description annotations.',
    },
    {
      '@type': 'HowToStep',
      name: 'Configure AlertManager routing',
      text: 'In alertmanager.yml, define a route tree with group_wait, group_interval, repeat_interval, and child routes that match severity labels to specific receivers.',
    },
    {
      '@type': 'HowToStep',
      name: 'Set up receivers (Slack, PagerDuty, webhook)',
      text: 'Add receiver blocks for each notification channel. For Slack, provide api_url, channel, and a message template. Use continue: true if multiple receivers should handle the same alert.',
    },
    {
      '@type': 'HowToStep',
      name: 'Add recording rules for expensive queries',
      text: 'Wrap high-cardinality or frequently evaluated expressions in recording rules. Reference the recorded metric in your alert expressions to reduce Prometheus CPU usage.',
    },
  ],
};
---

If you notice a delay between an event and the first notification, read this post: Understanding the delays on alerting.

Prometheus configuration

Prometheus reads alert rules from YAML files and evaluates them on every evaluation_interval cycle. Keep scrape_interval and evaluation_interval aligned: evaluating rules much more often than targets are scraped re-checks the same samples, while evaluating much less often adds detection delay.

{`# prometheus.yml

global:
  scrape_interval: 20s

  # A short evaluation_interval will check alerting rules very often.
  # It can be costly if you run Prometheus with 100+ alerts.
  evaluation_interval: 20s

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  # ...`}

{`# alerts/example-redis.yml

groups:

- name: ExampleRedisGroup
  rules:
  - alert: ExampleRedisDown
    expr: redis_up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Redis instance down (instance {{ $labels.instance }})
      description: "Redis is unreachable\\n  VALUE = {{ $value }}\\n  LABELS = {{ $labels }}"

  - alert: ExampleRedisHighMemory
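    # The (... > 0) filter skips instances with no maxmemory limit (max_bytes = 0),
    # which would otherwise divide by zero and keep the alert firing.
    expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) > 0.9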
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Redis memory usage above 90% (instance {{ $labels.instance }})
      description: "Redis memory usage is {{ $value | humanizePercentage }}\\n  LABELS = {{ $labels }}"`}
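
Prometheus also has to be told where AlertManager is listening, otherwise firing alerts never leave the server. A minimal sketch of that block, assuming an AlertManager instance reachable at the placeholder address alertmanager:9093 (9093 is AlertManager's default port; the host name is hypothetical):

```yaml
# prometheus.yml (fragment)

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Placeholder address: replace with your AlertManager host:port.
            - 'alertmanager:9093'
```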

AlertManager configuration

AlertManager receives alerts from Prometheus, deduplicates and groups them, then routes them to the right receiver. The three key timing parameters control when notifications are sent:

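group_wait: how long to wait after the first alert of a new group arrives, so that related alerts can be batched into the initial notification
group_interval: how long to wait before notifying about changes to a group that has already been notified, such as newly firing or resolved alerts
repeat_interval: how long to wait before re-sending a notification for alerts that are still firing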

{`# alertmanager.yml

route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h
  receiver: "slack"

  routes:
    # warnings and criticals → Slack
    - receiver: "slack"
      matchers:
        - severity =~ "critical|warning"
      continue: true

    # criticals also → PagerDuty
    - receiver: "pagerduty"
      matchers:
        - severity = "critical"

receivers:
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        send_resolved: true
        channel: '#monitoring'
        title: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }} {{ .CommonLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: ''
        send_resolved: true`}
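
Grouping itself is driven by group_by, which the configuration above leaves unset. As an illustration only (the label choice is an assumption, not part of the original setup), the following fragment batches alerts that share an alertname and instance into one notification:

```yaml
# alertmanager.yml (fragment)

route:
  receiver: "slack"
  # Alerts sharing these label values land in the same group
  # and are delivered together in one notification.
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h
```

Grouping by alertname also keeps {{ .CommonLabels.alertname }} in the Slack title populated, since CommonLabels only contains labels that every alert in the group shares.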

Inhibition rules

Inhibition mutes notifications for lower-priority alerts while a higher-priority alert is already firing for the same target. A common pattern: suppress warning alerts while a critical alert is active on the same instance.

{`# alertmanager.yml

inhibit_rules:
  # Suppress a warning when a critical with the same alertname is firing on the same instance
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - instance

  # Suppress all alerts for a node when NodeDown is firing
  - source_matchers:
      - alertname = "NodeDown"
    target_matchers:
      - job = "node"
    equal:
      - instance`}
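
Because the first rule lists alertname under equal, a critical alert only mutes a warning that carries the same alert name. If the intent is for any critical on an instance to mute every warning on that instance, a broader variant could look like the sketch below. Treat it as an assumption about your labelling: it relies on both alerts carrying an instance label.

```yaml
# alertmanager.yml (fragment)

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    # Only instance has to match, so alert names may differ.
    # Caveat: if neither alert has an instance label, the values
    # count as equal (both empty) and the rule still applies.
    equal:
      - instance
```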

Reduce Prometheus server load

For expensive or frequently evaluated PromQL queries, use recording rules to precompute results. Alert rules and dashboards then reference the lightweight recorded metric instead of re-evaluating the full expression.

{`groups:

  # 1. Define the recording rule
  - name: recordings
    rules:
    - record: job:rabbitmq_queue_messages_delivered_total:rate5m
      expr: rate(rabbitmq_queue_messages_delivered_total[5m])

  # 2. Reference it in alert rules
  - name: alerts
    rules:
    - alert: RabbitmqLowMessageDelivery
      expr: sum(job:rabbitmq_queue_messages_delivered_total:rate5m) < 10
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Low message delivery rate in RabbitMQ
        description: "Delivery rate is {{ $value | humanize }} msg/s\\n  LABELS = {{ $labels }}"`}
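
Rule groups can also be evaluated on their own schedule. As a sketch, assuming a one-minute cadence is acceptable for this metric, a group-level interval overrides the global evaluation_interval for the rules inside it:

```yaml
groups:
  - name: recordings
    # Overrides the global evaluation_interval for this group only.
    interval: 1m
    rules:
      - record: job:rabbitmq_queue_messages_delivered_total:rate5m
        expr: rate(rabbitmq_queue_messages_delivered_total[5m])
```

The trade-off is freshness: the recorded series, and any alert built on it, can lag by up to that interval.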

Troubleshooting alert delays

The total time from an event occurring to a notification being sent is the sum of several independent delays. Work through them in order:

Scrape delay: up to scrape_interval (20s) before the metric is collected
Evaluation delay: up to evaluation_interval (20s) before the rule is next evaluated and the alert becomes pending
Pending duration: the condition must hold for the whole for: window (5m here) before the alert state changes to firing
Group wait: AlertManager waits group_wait (10s) for other alerts to batch

In the worst case with for: 5m: 20s + 20s + 5m + 10s = 5m50s, or roughly 6 minutes from event to notification. For time-sensitive alerts, shorten evaluation_interval and the for: duration, but watch for false positives from transient spikes.
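
On the AlertManager side, timing parameters can be overridden per route, so only the urgent path gives up batching. A sketch, assuming the pagerduty receiver defined earlier and purely illustrative durations:

```yaml
# alertmanager.yml (fragment)

route:
  receiver: "slack"
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 4h

  routes:
    - receiver: "pagerduty"
      matchers:
        - severity = "critical"
      # Child routes inherit timing from the parent unless overridden.
      group_wait: 0s        # notify as soon as the alert arrives
      repeat_interval: 1h   # re-page more often than the parent's 4h
```

With these values the critical path drops the 10s group_wait from the total, while warnings keep the batching behaviour of the parent route.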
