mirror of
https://github.com/samber/awesome-prometheus-alerts.git
synced 2026-06-21 00:47:18 +08:00
🚨 Collection of Prometheus alerting rules
alert, alerting, alerting-rules, alertmanager, awesome, collection, exporter, grafana, monitoring, prometheus, prometheus-alerting-rules, promql, query, rule, supervision
* fix: review and fix 74 database & broker alert rules

  Comprehensive review of all database and broker alerts covering 16 services.

  Typos & descriptions (8 fixes):
  - PGBouncer: "a a server" → "a server"
  - RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ", "unactive" → "inactive"
  - Cassandra: write failure said "Read failures", "bad hacker" → "authentication failures"
  - Solr: replication errors said "failed updates"
  - Meilisearch: "index is empty" said "instance is down"

  Duplicates removed (5 fixes):
  - PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
  - ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
  - NATS: 2 rules with low thresholds duplicated better rules

  Broken queries (20 fixes):
  - Patroni: patroni_master → patroni_primary (renamed in v3)
  - MongoDB: rate() on gauge → direct ratio for connection queries
  - MongoDB: removed WiredTiger-incompatible virtual memory rule
  - Cassandra instaclustr: avg() on counter → rate()[5m]
  - Cassandra criteo: increase() on JMX rate metric → direct threshold
  - ClickHouse: increase() on gauge → direct threshold
  - NATS: rate() on gauge → direct comparison, removed 4 config-value rules
  - SQL Server: increase() on gauge → direct threshold
  - Pulsar: moved comparison outside sum() (4 rules)
  - Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]

  Severity adjustments (7 fixes):
  - Redis: backup threshold 24h → 48h, rejected connections → warning > 5
  - RabbitMQ: no consumer for: 5m with comment
  - Elasticsearch: unassigned shards added for: 2m
  - CouchDB: process restarted critical → info
  - Kafka: consumer group lag → warning, threshold 10000, better description
  - Hadoop: HBase heap low critical → warning

  Missing for duration (18 fixes):
  - Added for: 1m to service-down alerts across MySQL, PostgreSQL, SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch, Cassandra, Zookeeper with restart-tolerance comments

  Division by zero guards (9 fixes):
  - Added denominator > 0 guards to ratio queries in PostgreSQL, RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS

  Query design improvements (5 fixes):
  - Cassandra: removed unnecessary sum() and redundant avg_over_time()
  - ClickHouse: ZooKeeper avg() → per-instance check
  - PostgreSQL: sum() → sum by (instance) for SSL and locks
  - PGBouncer: 30s range window → 2m

  Hardcoded labels (2 fixes):
  - ClickHouse: added comment about job="clickhouse"
  - Cassandra criteo: removed hardcoded service="cas"

* fix: address PR review comments
  - Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
  - Elasticsearch query latency: add division-by-zero guard
  - Redis backup: "backuped" → "backed up"

| | | |
|---|---|---|
| .github | ||
| _data | ||
| _layouts | ||
| assets | ||
| dist | ||
| .gitignore | ||
| .travis.yml | ||
| _config.yml | ||
| alertmanager.md | ||
| blackbox-exporter.md | ||
| CLAUDE.md | ||
| CONTRIBUTING.md | ||
| docker-compose.yml | ||
| Gemfile | ||
| Gemfile.lock | ||
| index.md | ||
| LICENSE | ||
| package.json | ||
| README.md | ||
| rules.md | ||
| sleep-peacefully.md | ||
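
Several fixes in the commit message above follow recurring PromQL patterns: a `for:` duration on service-down alerts, a denominator `> 0` guard on ratio queries, a direct ratio on gauges instead of `rate()`, and `rate()` wrapped in `sum by()`. The rule file below is a minimal sketch of those patterns, not the repository's actual rules. Every metric name starting with `example_`, the `job="example"` label, and all thresholds are hypothetical placeholders chosen for illustration.

```yaml
groups:
  - name: example-patterns
    rules:
      # Service-down alert with a "for" duration, so that a short
      # restart does not fire the alert immediately.
      - alert: ExampleServiceDown
        expr: up{job="example"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Example service down (instance {{ $labels.instance }})

      # Ratio of two counters with a division-by-zero guard: the "and"
      # clause drops series whose denominator rate is zero.
      - alert: ExampleHighErrorRatio
        expr: |
          (
            rate(example_requests_failed_total[5m])
            /
            rate(example_requests_total[5m])
          ) > 0.05
          and
          rate(example_requests_total[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Example high error ratio (instance {{ $labels.instance }})

      # Gauges are compared directly; rate() and increase() are only
      # meaningful on counters.
      - alert: ExampleTooManyConnections
        expr: example_connections_current / example_connections_max > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Example connection usage above 80% (instance {{ $labels.instance }})

      # Aggregation wraps rate(); the comparison sits outside sum().
      - alert: ExampleConnectionTimeouts
        expr: sum by (instance) (rate(example_connection_timeouts_total[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Example connection timeouts (instance {{ $labels.instance }})
```

A file like this can be checked for syntax with `promtool check rules <file>` before it is loaded into Prometheus.
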
👋 Awesome Prometheus Alerts 
Most alerting rules are common to every Prometheus setup. We need a place to find them all. 🤘 🚨 📊
Collection available here: https://samber.github.io/awesome-prometheus-alerts
Sponsored by:
Cut Kubernetes & AI costs, boost application stability.
Better Stack lets you centralize, search, and visualize your logs.
✨ Contents
🚨 Rules
Basic resource monitoring
Databases and brokers
- MySQL
- PostgreSQL
- SQL Server
- Patroni
- PGBouncer
- Redis
- MongoDB
- RabbitMQ
- Elasticsearch
- Meilisearch
- Cassandra
- ClickHouse
- CouchDB
- ZooKeeper
- Kafka
- Pulsar
- NATS
- Solr
- Hadoop
Reverse proxies and load balancers
Runtimes
Orchestrators
Network, security and storage
Other
🤝 Contributing
Contributions from the community (you!) are most welcome!
There are many ways to contribute: writing code, alerting rules, documentation, reporting issues, discussing better error tracking...
🏋️ Improvements
- Create an alert rule builder in Jekyll for custom alerts (severity, thresholds, instances...)
- Add resolution suggestions to rule descriptions, for faster incident resolution (#85).
💫 Show your support
Give a ⭐️ if this project helped you!
📝 License
Licensed under the Creative Commons 4.0 License; see the LICENSE file for more details.
