🚨 Collection of Prometheus alerting rules
Samuel Berthe c37ef8f50c
fix: review and fix 74 database & broker alert rules (#504)
* fix: review and fix 74 database & broker alert rules

Comprehensive review of all database and broker alerts covering 16 services.

Typos & descriptions (8 fixes):
- PGBouncer: "a a server" → "a server"
- RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ",
  "unactive" → "inactive"
- Cassandra: write failure said "Read failures", "bad hacker" →
  "authentication failures"
- Solr: replication errors said "failed updates"
- Meilisearch: "index is empty" said "instance is down"

Duplicates removed (5 fixes):
- PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
- ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
- NATS: 2 rules with low thresholds duplicated better rules

Broken queries (20 fixes):
- Patroni: patroni_master → patroni_primary (renamed in v3)
- MongoDB: rate() on gauge → direct ratio for connection queries
- MongoDB: removed WiredTiger-incompatible virtual memory rule
- Cassandra instaclustr: avg() on counter → rate()[5m]
- Cassandra criteo: increase() on JMX rate metric → direct threshold
- ClickHouse: increase() on gauge → direct threshold
- NATS: rate() on gauge → direct comparison, removed 4 config-value rules
- SQL Server: increase() on gauge → direct threshold
- Pulsar: moved comparison outside sum() (4 rules)
- Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]
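
Several of the fixes above come down to matching the PromQL function to the metric type. A minimal sketch of the three patterns follows, assuming hypothetical metric names (`example_open_connections`, `example_errors_total`, `example_backlog`) that are not taken from the actual rules:

```yaml
groups:
  - name: example-query-patterns
    rules:
      # Gauge: compare the current value directly.
      # rate() and increase() are only meaningful on counters.
      - alert: ExampleTooManyConnections
        expr: example_open_connections > 1000
        for: 2m
        labels:
          severity: warning

      # Counter: use rate() or increase() over a range window
      # instead of averaging or thresholding the raw value.
      - alert: ExampleErrorsIncreasing
        expr: increase(example_errors_total[1h]) > 10
        labels:
          severity: warning

      # Aggregation: keep the comparison outside sum(), so the
      # threshold applies to the total, not to each input series.
      - alert: ExampleBacklogHigh
        expr: sum by (cluster) (example_backlog) > 10
        for: 5m
        labels:
          severity: warning
```

Thresholds, windows, and label names here are illustrative only.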

Severity adjustments (7 fixes):
- Redis: backup threshold 24h → 48h, rejected connections → warning > 5
- RabbitMQ: no consumer for: 5m with comment
- Elasticsearch: unassigned shards added for: 2m
- CouchDB: process restarted critical → info
- Kafka: consumer group lag → warning, threshold 10000, better description
- Hadoop: HBase heap low critical → warning

Missing for duration (18 fixes):
- Added for: 1m to service-down alerts across MySQL, PostgreSQL,
  SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch,
  Cassandra, Zookeeper with restart-tolerance comments
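
As an illustration of the restart-tolerance pattern, a service-down rule might look like the sketch below. The job label value and alert name are hypothetical; `up` is the standard scrape-health metric that Prometheus records for every target:

```yaml
groups:
  - name: example-service-down
    rules:
      - alert: ExampleServiceDown
        # Hypothetical job name; the real rules use exporter-specific metrics.
        expr: up{job="example-db"} == 0
        # Tolerate brief restarts: fire only if the target stays down
        # for a full minute.
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Example service down (instance {{ $labels.instance }})
```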

Division by zero guards (9 fixes):
- Added denominator > 0 guards to ratio queries in PostgreSQL,
  RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS
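
A guard of this kind can be sketched as follows, assuming two hypothetical gauges, `example_connections_used` and `example_connections_max`. Filtering the denominator with `> 0` drops series whose limit is zero, so the division never produces `+Inf` and the alert cannot fire on it:

```yaml
groups:
  - name: example-ratio-guard
    rules:
      - alert: ExampleConnectionsNearLimit
        expr: |
          (
            sum by (instance) (example_connections_used)
            /
            (sum by (instance) (example_connections_max) > 0)
          ) > 0.8
        for: 2m
        labels:
          severity: warning
```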

Query design improvements (5 fixes):
- Cassandra: removed unnecessary sum() and redundant avg_over_time()
- ClickHouse: ZooKeeper avg() → per-instance check
- PostgreSQL: sum() → sum by (instance) for SSL and locks
- PGBouncer: 30s range window → 2m

Hardcoded labels (2 fixes):
- ClickHouse: added comment about job="clickhouse"
- Cassandra criteo: removed hardcoded service="cas"

* fix: address PR review comments

- Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
- Elasticsearch query latency: add division-by-zero guard
- Redis backup: "backuped" → "backed up"
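
The `sum by()` fix can be illustrated with a hypothetical counter, `example_connection_timeouts_total`. The `by` modifier belongs to aggregation operators such as `sum`, so `rate()` has to be wrapped rather than modified:

```yaml
groups:
  - name: example-rate-aggregation
    rules:
      - alert: ExampleConnectionTimeouts
        # Invalid: rate(example_connection_timeouts_total[5m]) by (instance)
        # Valid: aggregate the per-series rates with sum by ().
        expr: sum by (instance) (rate(example_connection_timeouts_total[5m])) > 5
        for: 2m
        labels:
          severity: warning
```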
2026-03-16 01:27:18 +01:00
.github adding claude.md 2026-03-15 19:59:01 +01:00
_data fix: review and fix 74 database & broker alert rules (#504) 2026-03-16 01:27:18 +01:00
_layouts fix: fix favicon path 2026-03-15 23:54:05 +01:00
assets Website: Support dark mode (#501) 2026-03-01 22:54:42 +01:00
dist Publish 2026-03-15 18:47:04 +00:00
.gitignore Various updates and quality of life changes (#405) 2025-01-28 06:06:47 +01:00
.travis.yml 💄 awesome-lint 2019-02-11 22:09:50 +01:00
_config.yml chore: move from "https://awesome-prometheus-alerts.grep.to" to "https://samber.github.io/awesome-prometheus-alerts/" 2023-04-23 23:32:26 +02:00
alertmanager.md Update alertmanager.md 2024-10-06 17:31:23 +02:00
blackbox-exporter.md Remove Screeb 2025-08-29 15:20:21 +02:00
CLAUDE.md adding claude.md 2026-03-15 19:59:01 +01:00
CONTRIBUTING.md Update CONTRIBUTING.md 2025-11-13 16:24:49 +01:00
docker-compose.yml feat(ui): adding copy buttons 2019-10-26 16:41:11 +02:00
Gemfile build(deps): bump webrick from 1.7.0 to 1.8.2 (#435) 2024-09-27 22:24:21 +02:00
Gemfile.lock Website: Support dark mode (#501) 2026-03-01 22:54:42 +01:00
index.md Update index.md (#353) 2023-05-03 01:13:46 +02:00
LICENSE Changing license 2019-02-11 21:05:55 +01:00
package.json 💄 awesome-lint 2019-02-11 22:09:50 +01:00
README.md adding claude.md 2026-03-15 19:59:01 +01:00
rules.md fix: corrects download URL for rules files (#494) 2026-01-30 01:40:38 +01:00
sleep-peacefully.md Update sleep-peacefully.md (#487) 2025-12-08 15:19:11 +01:00

👋 Awesome Prometheus Alerts

Most alerting rules are common to every Prometheus setup. We need a place to find them all. 🤘 🚨 📊

Collection available here: https://samber.github.io/awesome-prometheus-alerts

Contents

🚨 Rules

Basic resource monitoring

Databases and brokers

Reverse proxies and load balancers

Runtimes

Orchestrators

Network, security and storage

Other

🤝 Contributing

Contributions from the community (you!) are most welcome!

There are many ways to contribute: writing code, alerting rules, documentation, reporting issues, discussing better error tracking...

Instructions are in CONTRIBUTING.md.

🏋️ Improvements

  • Create an alert rule builder in Jekyll for custom alerts (severity, thresholds, instances...)
  • Add resolution suggestions to rule descriptions, for faster incident resolution (#85).

💫 Show your support

Give a ⭐️ if this project helped you!

📝 License

Licensed under the Creative Commons 4.0 License; see the LICENSE file for more details.