mirror of
https://github.com/samber/awesome-prometheus-alerts.git
synced 2026-06-26 11:27:00 +08:00
Adding exporters: sidekiq, pgbouncer and thanos.
Adding rules to: prometheus, kubernetes, redis, docker and postgresql. Arranging exporters into categories. Showing number of rules. Thanks to Gitlab for opensourcing alerting rules!
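The categorized layout mentioned in the commit message can be pictured roughly as follows. This is a minimal sketch reconstructed from the `_data/rules.yml` hunks in this commit. The indentation is an assumption, since the rendered diff does not preserve it, and only one category, one service, and one rule are shown.

```yaml
# Sketch of the grouped layout in _data/rules.yml (indentation assumed).
groups:
  - name: Basic resource monitoring        # category heading, as listed in the README
    services:
      - name: Prometheus self-monitoring
        exporters:
          - rules:                          # other exporter entries also carry name and doc_url
              - name: Prometheus configuration reload failure
                description: Prometheus configuration reload error
                query: 'prometheus_config_last_reload_successful != 1'
                severity: warning
```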
parent
affacde49b
commit
0b89a764ee
4 changed files with 1221 additions and 924 deletions
41
README.md
|
|
@ -14,39 +14,60 @@ Collection available here: **[https://awesome-prometheus-alerts.grep.to](https:/
|
||||||
|
|
||||||
## 🚨 Rules
|
## 🚨 Rules
|
||||||
|
|
||||||
- [Prometheus internals](https://awesome-prometheus-alerts.grep.to/rules#prometheus-internals)
|
#### Basic resource monitoring
|
||||||
|
|
||||||
|
- [Prometheus self-monitoring](https://awesome-prometheus-alerts.grep.to/rules#prometheus-internals)
|
||||||
- [Host/Hardware](https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware)
|
- [Host/Hardware](https://awesome-prometheus-alerts.grep.to/rules#host-and-hardware)
|
||||||
- [Docker Containers](https://awesome-prometheus-alerts.grep.to/rules#docker-containers)
|
- [Docker Containers](https://awesome-prometheus-alerts.grep.to/rules#docker-containers)
|
||||||
- [RabbitMQ](https://awesome-prometheus-alerts.grep.to/rules#rabbitmq)
|
- [Blackbox](https://awesome-prometheus-alerts.grep.to/rules#blackbox)
|
||||||
|
- [Windows](https://awesome-prometheus-alerts.grep.to/rules#windows-server)
|
||||||
|
|
||||||
|
#### Databases and brokers
|
||||||
|
|
||||||
- [MySQL](https://awesome-prometheus-alerts.grep.to/rules#mysql)
|
- [MySQL](https://awesome-prometheus-alerts.grep.to/rules#mysql)
|
||||||
- [PostgreSQL](https://awesome-prometheus-alerts.grep.to/rules#postgresql)
|
- [PostgreSQL](https://awesome-prometheus-alerts.grep.to/rules#postgresql)
|
||||||
|
- [PGBouncer](https://awesome-prometheus-alerts.grep.to/rules#pgbouncer)
|
||||||
- [Redis](https://awesome-prometheus-alerts.grep.to/rules#redis)
|
- [Redis](https://awesome-prometheus-alerts.grep.to/rules#redis)
|
||||||
- [MongoDB](https://awesome-prometheus-alerts.grep.to/rules#mongodb)
|
- [MongoDB](https://awesome-prometheus-alerts.grep.to/rules#mongodb)
|
||||||
|
- [RabbitMQ](https://awesome-prometheus-alerts.grep.to/rules#rabbitmq)
|
||||||
- [Elasticsearch](https://awesome-prometheus-alerts.grep.to/rules#elasticsearch)
|
- [Elasticsearch](https://awesome-prometheus-alerts.grep.to/rules#elasticsearch)
|
||||||
- [Cassandra](https://awesome-prometheus-alerts.grep.to/rules#cassandra)
|
- [Cassandra](https://awesome-prometheus-alerts.grep.to/rules#cassandra)
|
||||||
|
- [Zookeeper](https://awesome-prometheus-alerts.grep.to/rules#zookeeper)
|
||||||
|
- [Kafka](https://awesome-prometheus-alerts.grep.to/rules#kafka)
|
||||||
|
|
||||||
|
#### Reverse proxies and load balancers
|
||||||
|
|
||||||
- [Nginx](https://awesome-prometheus-alerts.grep.to/rules#nginx)
|
- [Nginx](https://awesome-prometheus-alerts.grep.to/rules#nginx)
|
||||||
- [Apache](https://awesome-prometheus-alerts.grep.to/rules#apache)
|
- [Apache](https://awesome-prometheus-alerts.grep.to/rules#apache)
|
||||||
- [HaProxy](https://awesome-prometheus-alerts.grep.to/rules#haproxy)
|
- [HaProxy](https://awesome-prometheus-alerts.grep.to/rules#haproxy)
|
||||||
- [Traefik](https://awesome-prometheus-alerts.grep.to/rules#traefik)
|
- [Traefik](https://awesome-prometheus-alerts.grep.to/rules#traefik)
|
||||||
|
|
||||||
|
#### Runtimes
|
||||||
|
|
||||||
- [PHP-FPM](https://awesome-prometheus-alerts.grep.to/rules#php-fpm)
|
- [PHP-FPM](https://awesome-prometheus-alerts.grep.to/rules#php-fpm)
|
||||||
- [JVM](https://awesome-prometheus-alerts.grep.to/rules#jvm)
|
- [JVM](https://awesome-prometheus-alerts.grep.to/rules#jvm)
|
||||||
- [ZFS](https://awesome-prometheus-alerts.grep.to/rules#zfs)
|
- [Sidekiq](https://awesome-prometheus-alerts.grep.to/rules#sidekiq)
|
||||||
|
|
||||||
|
#### Orchestrators
|
||||||
- [Kubernetes](https://awesome-prometheus-alerts.grep.to/rules#kubernetes)
|
- [Kubernetes](https://awesome-prometheus-alerts.grep.to/rules#kubernetes)
|
||||||
- [Nomad](https://awesome-prometheus-alerts.grep.to/rules#nomad)
|
- [Nomad](https://awesome-prometheus-alerts.grep.to/rules#nomad)
|
||||||
- [Consul](https://awesome-prometheus-alerts.grep.to/rules#consul)
|
- [Consul](https://awesome-prometheus-alerts.grep.to/rules#consul)
|
||||||
- [Etcd](https://awesome-prometheus-alerts.grep.to/rules#etcd)
|
- [Etcd](https://awesome-prometheus-alerts.grep.to/rules#etcd)
|
||||||
- [Zookeeper](https://awesome-prometheus-alerts.grep.to/rules#zookeeper)
|
|
||||||
- [Kafka](https://awesome-prometheus-alerts.grep.to/rules#kafka)
|
|
||||||
- [Linkerd](https://awesome-prometheus-alerts.grep.to/rules#linkerd)
|
- [Linkerd](https://awesome-prometheus-alerts.grep.to/rules#linkerd)
|
||||||
- [Istio](https://awesome-prometheus-alerts.grep.to/rules#istio)
|
- [Istio](https://awesome-prometheus-alerts.grep.to/rules#istio)
|
||||||
- [Blackbox](https://awesome-prometheus-alerts.grep.to/rules#blackbox)
|
|
||||||
- [Windows](https://awesome-prometheus-alerts.grep.to/rules#windows-server)
|
#### Network and storage
|
||||||
- [Juniper](https://awesome-prometheus-alerts.grep.to/rules#juniper)
|
|
||||||
|
- [ZFS](https://awesome-prometheus-alerts.grep.to/rules#zfs)
|
||||||
- [OpenEBS](https://awesome-prometheus-alerts.grep.to/rules#openebs)
|
- [OpenEBS](https://awesome-prometheus-alerts.grep.to/rules#openebs)
|
||||||
- [Minio](https://awesome-prometheus-alerts.grep.to/rules#minio)
|
- [Minio](https://awesome-prometheus-alerts.grep.to/rules#minio)
|
||||||
- [Juniper](https://awesome-prometheus-alerts.grep.to/rules#juniper)
|
- [Juniper](https://awesome-prometheus-alerts.grep.to/rules#juniper)
|
||||||
- [CoreDNS](https://awesome-prometheus-alerts.grep.to/rules#coredns)
|
- [CoreDNS](https://awesome-prometheus-alerts.grep.to/rules#coredns)
|
||||||
|
|
||||||
|
#### Other
|
||||||
|
|
||||||
|
- [Thanos](https://awesome-prometheus-alerts.grep.to/rules#thanos)
|
||||||
|
|
||||||
## 🤝 Contributing
|
## 🤝 Contributing
|
||||||
|
|
||||||
Contributions from community (you!) are most welcome!
|
Contributions from community (you!) are most welcome!
|
||||||
|
|
@ -66,6 +87,10 @@ Give a ⭐️ if this project helped you!
|
||||||
|
|
||||||
[](https://www.patreon.com/samber)
|
[](https://www.patreon.com/samber)
|
||||||
|
|
||||||
|
## 👏 Thanks
|
||||||
|
|
||||||
|
Gratitude for the Gitlab operation team that provided 50+ rules. \o/
|
||||||
|
|
||||||
## 📝 License
|
## 📝 License
|
||||||
|
|
||||||
[](https://creativecommons.org/licenses/by/4.0/legalcode)
|
[](https://creativecommons.org/licenses/by/4.0/legalcode)
|
||||||
|
|
|
||||||
707
_data/rules.yml
|
|
@ -1,19 +1,25 @@
|
||||||
|
groups:
|
||||||
|
- name: Basic resource monitoring
|
||||||
services:
|
services:
|
||||||
- name: Prometheus internals
|
- name: Prometheus self-monitoring
|
||||||
exporters:
|
exporters:
|
||||||
- rules:
|
- rules:
|
||||||
- name: Prometheus configuration reload failure
|
- name: Prometheus configuration reload failure
|
||||||
description: Prometheus configuration reload error
|
description: Prometheus configuration reload error
|
||||||
query: "prometheus_config_last_reload_successful != 1"
|
query: 'prometheus_config_last_reload_successful != 1'
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Prometheus too many restarts
|
- name: Prometheus too many restarts
|
||||||
description: Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.
|
description: Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.
|
||||||
query: "changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2"
|
query: 'changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2'
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Prometheus AlertManager configuration reload failure
|
- name: Prometheus AlertManager configuration reload failure
|
||||||
description: AlertManager configuration reload error
|
description: AlertManager configuration reload error
|
||||||
query: "alertmanager_config_last_reload_successful != 1"
|
query: 'alertmanager_config_last_reload_successful != 1'
|
||||||
severity: warning
|
severity: warning
|
||||||
|
- name: Prometheus AlertManager E2E dead man snitch
|
||||||
|
description: Prometheus DeadManSnitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.
|
||||||
|
query: 'vector(1)'
|
||||||
|
severity: error
|
||||||
- name: Prometheus not connected to alertmanager
|
- name: Prometheus not connected to alertmanager
|
||||||
description: Prometheus cannot connect to the Alertmanager
|
description: Prometheus cannot connect to the Alertmanager
|
||||||
query: "prometheus_notifications_alertmanagers_discovered < 1"
|
query: "prometheus_notifications_alertmanagers_discovered < 1"
|
||||||
|
|
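The dead man snitch entry added in this hunk could be written as a Prometheus alerting rule along these lines. This is a sketch only: the group name, alert name, and annotation text are hypothetical, and the site data itself stores just `name`, `description`, `query`, and `severity`.

```yaml
# Hypothetical rule file: an alert that always fires, used as an
# end-to-end check of the Prometheus -> Alertmanager path.
groups:
  - name: prometheus-self-monitoring          # hypothetical group name
    rules:
      - alert: PrometheusDeadManSnitch        # hypothetical alert name
        expr: vector(1)                       # always returns one sample, so the alert always fires
        for: 0m
        labels:
          severity: error
        annotations:
          summary: Always-firing alert used to test the alerting pipeline
```

Because the alert never resolves, it is the absence of its notification that signals a problem. Whatever receives it needs to treat a missing notification as the failure condition.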
@ -36,14 +42,22 @@ services:
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Prometheus notifications backlog
|
- name: Prometheus notifications backlog
|
||||||
description: The Prometheus notification queue has not been empty for 10 minutes
|
description: The Prometheus notification queue has not been empty for 10 minutes
|
||||||
query: 'min_over_time(prometheus_notifications_queue_length[10m])'
|
query: 'min_over_time(prometheus_notifications_queue_length[10m]) > 0'
|
||||||
severity: warning
|
severity: warning
|
||||||
|
- name: Prometheus AlertManager notification failing
|
||||||
|
description: Alertmanager is failing to send notifications
|
||||||
|
query: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Prometheus target empty
|
||||||
|
description: Prometheus has no target in service discovery
|
||||||
|
query: 'prometheus_sd_discovered_targets == 0'
|
||||||
|
severity: error
|
||||||
- name: Prometheus target scraping slow
|
- name: Prometheus target scraping slow
|
||||||
description: Prometheus is scraping exporters slowly
|
description: Prometheus is scraping exporters slowly
|
||||||
query: 'prometheus_target_interval_length_seconds{quantile="0.9"} > 60'
|
query: 'prometheus_target_interval_length_seconds{quantile="0.9"} > 60'
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Prometheus large scrape
|
- name: Prometheus large scrape
|
||||||
description: Prometheus has many scapres that exceed the sample limit
|
description: Prometheus has many scrapes that exceed the sample limit
|
||||||
query: 'increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10'
|
query: 'increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10'
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Prometheus TSDB checkpoint creation failures
|
- name: Prometheus TSDB checkpoint creation failures
|
||||||
|
|
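The `> 0` added to the notifications backlog query in this hunk matters because an alerting expression fires for every series it returns, whatever the sample value is. Without the comparison, `min_over_time(...)` returns a sample even when the queue is empty. A sketch of the corrected rule in Prometheus rule-file form follows; the group and alert names are hypothetical.

```yaml
# Hypothetical rule file illustrating the corrected expression.
groups:
  - name: prometheus-self-monitoring            # hypothetical group name
    rules:
      - alert: PrometheusNotificationsBacklog   # hypothetical alert name
        # The comparison filters out series whose 10-minute minimum is 0,
        # so the alert fires only if the queue was never empty in that window.
        expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: The Prometheus notification queue has not been empty for 10 minutes
```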
@ -160,10 +174,14 @@ services:
|
||||||
description: 'At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap'
|
description: 'At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap'
|
||||||
query: 'node_md_disks{state="fail"} > 0'
|
query: 'node_md_disks{state="fail"} > 0'
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Kernel version deviations
|
- name: Host kernel version deviations
|
||||||
description: Different kernel versions are running
|
description: Different kernel versions are running
|
||||||
query: 'count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1'
|
query: 'count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1'
|
||||||
severity: warning
|
severity: warning
|
||||||
|
- name: Host OOM kill detected
|
||||||
|
description: OOM kill detected
|
||||||
|
query: 'increase(node_vmstat_oom_kill[30m]) > 1'
|
||||||
|
severity: warning
|
||||||
|
|
||||||
- name: Docker containers
|
- name: Docker containers
|
||||||
exporters:
|
exporters:
|
||||||
|
|
@ -190,6 +208,307 @@ services:
|
||||||
description: Container Volume IO usage is above 80%
|
description: Container Volume IO usage is above 80%
|
||||||
query: "(sum(container_fs_io_current) BY (instance, name) * 100) > 80"
|
query: "(sum(container_fs_io_current) BY (instance, name) * 100) > 80"
|
||||||
severity: warning
|
severity: warning
|
||||||
|
- name: Container high throttle rate
|
||||||
|
description: Container is being throttled
|
||||||
|
query: 'rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1'
|
||||||
|
severity: warning
|
||||||
|
|
||||||
|
- name: Blackbox
|
||||||
|
exporters:
|
||||||
|
- name: prometheus/blackbox_exporter
|
||||||
|
doc_url: https://github.com/prometheus/blackbox_exporter
|
||||||
|
rules:
|
||||||
|
- name: Blackbox probe failed
|
||||||
|
description: Probe failed
|
||||||
|
query: probe_success == 0
|
||||||
|
severity: error
|
||||||
|
- name: Blackbox slow probe
|
||||||
|
description: Blackbox probe took more than 1s to complete
|
||||||
|
query: "avg_over_time(probe_duration_seconds[1m]) > 1"
|
||||||
|
severity: warning
|
||||||
|
- name: Blackbox probe HTTP failure
|
||||||
|
description: HTTP status code is not 200-399
|
||||||
|
query: "probe_http_status_code <= 199 OR probe_http_status_code >= 400"
|
||||||
|
severity: error
|
||||||
|
- name: Blackbox SSL certificate will expire soon
|
||||||
|
description: SSL certificate expires in 30 days
|
||||||
|
query: "probe_ssl_earliest_cert_expiry - time() < 86400 * 30"
|
||||||
|
severity: warning
|
||||||
|
- name: Blackbox SSL certificate will expire soon
|
||||||
|
description: SSL certificate expires in 3 days
|
||||||
|
query: "probe_ssl_earliest_cert_expiry - time() < 86400 * 3"
|
||||||
|
severity: error
|
||||||
|
- name: Blackbox SSL certificate expired
|
||||||
|
description: SSL certificate has expired already
|
||||||
|
query: "probe_ssl_earliest_cert_expiry - time() <= 0"
|
||||||
|
severity: error
|
||||||
|
- name: Blackbox probe slow HTTP
|
||||||
|
description: HTTP request took more than 1s
|
||||||
|
query: "avg_over_time(probe_http_duration_seconds[1m]) > 1"
|
||||||
|
severity: warning
|
||||||
|
- name: Blackbox probe slow ping
|
||||||
|
description: Blackbox ping took more than 1s
|
||||||
|
query: "avg_over_time(probe_icmp_duration_seconds[1m]) > 1"
|
||||||
|
severity: warning
|
||||||
|
|
||||||
|
- name: Windows Server
|
||||||
|
exporters:
|
||||||
|
- name: martinlindhe/wmi_exporter
|
||||||
|
doc_url: https://github.com/martinlindhe/wmi_exporter
|
||||||
|
rules:
|
||||||
|
- name: Windows Server collector Error
|
||||||
|
description: "Collector {{ $labels.collector }} was not successful"
|
||||||
|
query: "wmi_exporter_collector_success == 0"
|
||||||
|
severity: error
|
||||||
|
- name: Windows Server service Status
|
||||||
|
description: Windows Service state is not OK
|
||||||
|
query: 'wmi_service_status{status="ok"} != 1'
|
||||||
|
severity: error
|
||||||
|
- name: Windows Server CPU Usage
|
||||||
|
description: CPU Usage is more than 80%
|
||||||
|
query: '100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80'
|
||||||
|
severity: warning
|
||||||
|
- name: Windows Server memory Usage
|
||||||
|
description: Memory Usage is more than 90%
|
||||||
|
query: "100 - 100 * wmi_os_physical_memory_free_bytes / wmi_cs_physical_memory_bytes > 90"
|
||||||
|
severity: warning
|
||||||
|
- name: Windows Server disk Space Usage
|
||||||
|
description: Disk Space on Drive is used more than 80%
|
||||||
|
query: "100.0 - 100 * ((wmi_logical_disk_free_bytes{} / 1024 / 1024 ) / (wmi_logical_disk_size_bytes{} / 1024 / 1024)) > 80"
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
|
||||||
|
- name: Databases and brokers
|
||||||
|
services:
|
||||||
|
- name: MySQL
|
||||||
|
exporters:
|
||||||
|
- name: prometheus/mysqld_exporter
|
||||||
|
doc_url: https://github.com/prometheus/mysqld_exporter
|
||||||
|
rules:
|
||||||
|
|
||||||
|
- name: PostgreSQL
|
||||||
|
exporters:
|
||||||
|
- name: wrouesnel/postgres_exporter
|
||||||
|
doc_url: https://github.com/wrouesnel/postgres_exporter/
|
||||||
|
rules:
|
||||||
|
- name: Postgresql down
|
||||||
|
description: Postgresql instance is down
|
||||||
|
query: "pg_up == 0"
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql restarted
|
||||||
|
description: Postgresql restarted
|
||||||
|
query: "time() - pg_postmaster_start_time_seconds < 60"
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql exporter error
|
||||||
|
description: Postgresql exporter is showing errors. A query may be buggy in query.yaml
|
||||||
|
query: 'pg_exporter_last_scrape_error > 0'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql replication lag
|
||||||
|
description: PostgreSQL replication lag is going up (> 10s)
|
||||||
|
query: 'pg_replication_lag > 10 and ON(instance) (pg_replication_is_replica == 1)'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql table not vacuumed
|
||||||
|
description: Table has not been vacuumed for 24 hours
|
||||||
|
query: "time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24"
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql table not analyzed
|
||||||
|
description: Table has not been analyzed for 24 hours
|
||||||
|
query: "time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24"
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql too many connections
|
||||||
|
description: PostgreSQL instance has too many connections
|
||||||
|
query: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > pg_settings_max_connections * 0.9'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql not enough connections
|
||||||
|
description: PostgreSQL instance should have more connections (> 5)
|
||||||
|
query: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql dead locks
|
||||||
|
description: PostgreSQL has dead-locks
|
||||||
|
query: 'rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql slow queries
|
||||||
|
description: PostgreSQL executes slow queries (> 1min)
|
||||||
|
query: 'rate(pg_slow_queries[1m]) * 60 > 10'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql high rollback rate
|
||||||
|
description: Ratio of transactions being aborted compared to committed is > 2 %
|
||||||
|
query: 'rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql commit rate low
|
||||||
|
description: Postgres seems to be processing very few transactions
|
||||||
|
query: 'rate(pg_stat_database_xact_commit[1m]) < 10'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql low XID consumption
|
||||||
|
description: Postgresql seems to be consuming transaction IDs very slowly
|
||||||
|
query: 'rate(pg_txid_current[1m]) < 5'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql low XLOG consumption
|
||||||
|
description: Postgres seems to be consuming XLOG very slowly
|
||||||
|
query: 'rate(pg_xlog_position_bytes[1m]) < 100'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql WALE replication stopped
|
||||||
|
description: WAL-E replication seems to be stopped
|
||||||
|
query: 'rate(pg_xlog_position_bytes[1m]) == 0'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql high rate statement timeout
|
||||||
|
description: Postgres transactions showing high rate of statement timeouts
|
||||||
|
query: 'rate(postgresql_errors_total{type="statement_timeout"}[5m]) > 3'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql high rate deadlock
|
||||||
|
description: Postgres detected deadlocks
|
||||||
|
query: 'rate(postgresql_errors_total{type="deadlock_detected"}[1m]) * 60 > 1'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql replication lag bytes
|
||||||
|
description: Postgres Replication lag (in bytes) is high
|
||||||
|
query: '(pg_xlog_position_bytes and pg_replication_is_replica == 0) - GROUP_RIGHT(instance) (pg_xlog_position_bytes and pg_replication_is_replica == 1) > 1e+09'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql unused replication slot
|
||||||
|
description: Unused Replication Slots
|
||||||
|
query: 'pg_replication_slots_active == 0'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql too many dead tuples
|
||||||
|
description: The number of PostgreSQL dead tuples is too large
|
||||||
|
query: '((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 unless ON(instance) (pg_replication_is_replica == 1)'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql split brain
|
||||||
|
description: Split Brain, too many primary Postgresql databases in read-write mode
|
||||||
|
query: 'count(pg_replication_is_replica == 0) != 1'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql promoted node
|
||||||
|
description: Postgresql standby server has been promoted as primary node
|
||||||
|
query: 'pg_replication_is_replica and changes(pg_replication_is_replica[1m]) > 0'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql configuration changed
|
||||||
|
description: Postgres Database configuration change has occurred
|
||||||
|
query: '{__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m'
|
||||||
|
severity: warning
|
||||||
|
- name: Postgresql SSL compression active
|
||||||
|
description: Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.
|
||||||
|
query: 'sum(pg_stat_ssl_compression) > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Postgresql too many locks acquired
|
||||||
|
description: Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.
|
||||||
|
query: '((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20'
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
- name: PGBouncer
|
||||||
|
exporters:
|
||||||
|
- name: spreaker/prometheus-pgbouncer-exporter
|
||||||
|
doc_url: https://github.com/spreaker/prometheus-pgbouncer-exporter
|
||||||
|
rules:
|
||||||
|
- name: PGBouncer active connections
|
||||||
|
description: PGBouncer pools are filling up
|
||||||
|
query: 'pgbouncer_pools_server_active_connections > 200'
|
||||||
|
severity: warning
|
||||||
|
- name: PGBouncer errors
|
||||||
|
description: PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.
|
||||||
|
query: 'increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10'
|
||||||
|
severity: warning
|
||||||
|
- name: PGBouncer max connections
|
||||||
|
description: The number of PGBouncer client connections has reached max_client_conn.
|
||||||
|
query: 'rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0'
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
- name: Redis
|
||||||
|
exporters:
|
||||||
|
- name: oliver006/redis_exporter
|
||||||
|
doc_url: https://github.com/oliver006/redis_exporter
|
||||||
|
rules:
|
||||||
|
- name: Redis down
|
||||||
|
description: Redis instance is down
|
||||||
|
query: "redis_up == 0"
|
||||||
|
severity: error
|
||||||
|
- name: Redis missing master
|
||||||
|
description: Redis cluster has no node marked as master.
|
||||||
|
query: 'count(redis_instance_info{role="master"}) == 0'
|
||||||
|
severity: error
|
||||||
|
- name: Redis too many masters
|
||||||
|
description: Redis cluster has too many nodes marked as master.
|
||||||
|
query: 'count(redis_instance_info{role="master"}) > 1'
|
||||||
|
severity: error
|
||||||
|
- name: Redis disconnected slaves
|
||||||
|
description: Redis is not replicating to all slaves. Consider reviewing the Redis replication status.
|
||||||
|
query: 'count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1'
|
||||||
|
severity: error
|
||||||
|
- name: Redis replication broken
|
||||||
|
description: Redis instance lost a slave
|
||||||
|
query: "delta(redis_connected_slaves[1m]) < 0"
|
||||||
|
severity: error
|
||||||
|
- name: Redis cluster flapping
|
||||||
|
description: Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).
|
||||||
|
query: 'changes(redis_connected_slaves[5m]) > 2'
|
||||||
|
severity: error
|
||||||
|
- name: Redis missing backup
|
||||||
|
description: Redis has not been backed up for 24 hours
|
||||||
|
query: "time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24"
|
||||||
|
severity: error
|
||||||
|
- name: Redis out of memory
|
||||||
|
description: Redis is running out of memory (> 90%)
|
||||||
|
query: "redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90"
|
||||||
|
severity: warning
|
||||||
|
- name: Redis too many connections
|
||||||
|
description: Redis instance has too many connections
|
||||||
|
query: "redis_connected_clients > 100"
|
||||||
|
severity: warning
|
||||||
|
- name: Redis not enough connections
|
||||||
|
description: Redis instance should have more connections (> 5)
|
||||||
|
query: "redis_connected_clients < 5"
|
||||||
|
severity: warning
|
||||||
|
- name: Redis rejected connections
|
||||||
|
description: Some connections to Redis have been rejected
|
||||||
|
query: "increase(redis_rejected_connections_total[1m]) > 0"
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
- name: MongoDB
|
||||||
|
exporters:
|
||||||
|
- name: dcu/mongodb_exporter
|
||||||
|
doc_url: https://github.com/dcu/mongodb_exporter
|
||||||
|
rules:
|
||||||
|
- name: MongoDB replication lag
|
||||||
|
description: Mongodb replication lag is more than 10s
|
||||||
|
query: 'avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10'
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication headroom
|
||||||
|
description: MongoDB replication headroom is <= 0
|
||||||
|
query: '(avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0'
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication Status 3
|
||||||
|
description: MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync
|
||||||
|
query: "mongodb_replset_member_state == 3"
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication Status 6
|
||||||
|
description: The state of a MongoDB replica set member, as seen from another member of the set, is not yet known
|
||||||
|
query: "mongodb_replset_member_state == 6"
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication Status 8
|
||||||
|
description: A MongoDB replica set member, as seen from another member of the set, is unreachable
|
||||||
|
query: "mongodb_replset_member_state == 8"
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication Status 9
|
||||||
|
description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads
|
||||||
|
query: "mongodb_replset_member_state == 9"
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB replication Status 10
|
||||||
|
description: MongoDB Replication set member was once in a replica set but was subsequently removed
|
||||||
|
query: "mongodb_replset_member_state == 10"
|
||||||
|
severity: error
|
||||||
|
- name: MongoDB number cursors open
|
||||||
|
description: Too many cursors opened by MongoDB for clients (> 10k)
|
||||||
|
query: 'mongodb_metrics_cursor_open{state="total_open"} > 10000'
|
||||||
|
severity: warning
|
||||||
|
- name: MongoDB cursors timeouts
|
||||||
|
description: Too many cursors are timing out
|
||||||
|
query: "increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100"
|
||||||
|
severity: warning
|
||||||
|
- name: MongoDB too many connections
|
||||||
|
description: Too many connections
|
||||||
|
query: 'mongodb_connections{state="current"} > 500'
|
||||||
|
severity: warning
|
||||||
|
- name: MongoDB virtual memory usage
|
||||||
|
description: High memory usage
|
||||||
|
query: '(sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3'
|
||||||
|
severity: warning
|
||||||
|
|
||||||
- name: RabbitMQ
|
- name: RabbitMQ
|
||||||
exporters:
|
exporters:
|
||||||
|
|
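The Blackbox SSL expiry rules added in this hunk use the same expression with two thresholds, 30 days as a warning and 3 days as an error. A sketch of that tiered pattern as a Prometheus rule file is shown below. The group and alert names are hypothetical, chosen only so the two tiers are distinguishable, since both entries in the data file share one name.

```yaml
# Hypothetical rule file: same expression, two thresholds, two severities.
groups:
  - name: blackbox                                       # hypothetical group name
    rules:
      - alert: BlackboxSslCertificateWillExpireSoon      # hypothetical alert name
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30   # under 30 days left
        labels:
          severity: warning
        annotations:
          summary: SSL certificate expires in 30 days
      - alert: BlackboxSslCertificateWillExpireVerySoon  # hypothetical alert name
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3    # under 3 days left
        labels:
          severity: error
        annotations:
          summary: SSL certificate expires in 3 days
```

With this shape, a certificate inside the 3-day window matches both rules. Whether the warning should be suppressed at that point is a routing decision left to the Alertmanager configuration.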
@ -241,143 +560,6 @@ services:
|
||||||
query: 'rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5'
|
query: 'rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5'
|
||||||
severity: warning
|
severity: warning
|
||||||
|
|
||||||
- name: MySQL
|
|
||||||
exporters:
|
|
||||||
- name: prometheus/mysqld_exporter
|
|
||||||
doc_url: https://github.com/prometheus/mysqld_exporter
|
|
||||||
rules:
|
|
||||||
|
|
||||||
- name: PostgreSQL
|
|
||||||
exporters:
|
|
||||||
- name: wrouesnel/postgres_exporter
|
|
||||||
doc_url: https://github.com/wrouesnel/postgres_exporter/
|
|
||||||
rules:
|
|
||||||
- name: Postgresql down
|
|
||||||
description: PostgreSQL instance is down
|
|
||||||
query: "pg_up == 0"
|
|
||||||
severity: error
|
|
||||||
- name: Postgresql replication lag
|
|
||||||
description: PostgreSQL replication lag is going up (> 10s)
|
|
||||||
query: "pg_replication_lag > 10"
|
|
||||||
severity: warning
|
|
||||||
comments: |
|
|
||||||
A label excluding master nodes should be added to this query,
|
|
||||||
in order to monitor lag on standby servers only.
|
|
||||||
Exporter does not guarantee a NaN value for pg_replication_log on promoted master nodes.
|
|
||||||
See https://github.com/samber/awesome-prometheus-alerts/issues/74
|
|
||||||
- name: Postgresql table not vacuumed
|
|
||||||
description: Table has not been vacuumed for 24 hours
|
|
||||||
query: "time() - pg_stat_user_tables_last_autovacuum > 60 * 60 * 24"
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql table not analyzed
|
|
||||||
description: Table has not been analyzed for 24 hours
|
|
||||||
query: "time() - pg_stat_user_tables_last_autoanalyze > 60 * 60 * 24"
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql too many connections
|
|
||||||
description: PostgreSQL instance has too many connections
|
|
||||||
query: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) > 100'
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql not enough connections
|
|
||||||
description: PostgreSQL instance should have more connections (> 5)
|
|
||||||
query: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5'
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql dead locks
|
|
||||||
description: PostgreSQL has dead-locks
|
|
||||||
query: 'rate(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 0'
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql slow queries
|
|
||||||
description: PostgreSQL executes slow queries (> 1min)
|
|
||||||
query: 'avg(rate(pg_stat_activity_max_tx_duration{datname!~"template.*"}[1m])) BY (datname) > 60'
|
|
||||||
severity: warning
|
|
||||||
- name: Postgresql high rollback rate
|
|
||||||
description: Ratio of transactions being aborted compared to committed is > 2 %
|
|
||||||
query: 'rate(pg_stat_database_xact_rollback{datname!~"template.*"}[3m]) / rate(pg_stat_database_xact_commit{datname!~"template.*"}[3m]) > 0.02'
|
|
||||||
severity: warning
|
|
||||||
|
|
||||||
- name: Redis
|
|
||||||
exporters:
|
|
||||||
- name: oliver006/redis_exporter
|
|
||||||
doc_url: https://github.com/oliver006/redis_exporter
|
|
||||||
rules:
|
|
||||||
- name: Redis down
|
|
||||||
description: Redis instance is down
|
|
||||||
query: "redis_up == 0"
|
|
||||||
severity: error
|
|
||||||
- name: Redis missing backup
|
|
||||||
description: Redis has not been backed up for 24 hours
|
|
||||||
query: "time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24"
|
|
||||||
severity: error
|
|
||||||
- name: Redis out of memory
|
|
||||||
description: Redis is running out of memory (> 90%)
|
|
||||||
query: "redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90"
|
|
||||||
severity: warning
|
|
||||||
- name: Redis replication broken
|
|
||||||
description: Redis instance lost a slave
|
|
||||||
query: "delta(redis_connected_slaves[1m]) < 0"
|
|
||||||
severity: error
|
|
||||||
- name: Redis too many connections
|
|
||||||
description: Redis instance has too many connections
|
|
||||||
query: "redis_connected_clients > 100"
|
|
||||||
severity: warning
|
|
||||||
- name: Redis not enough connections
|
|
||||||
description: Redis instance should have more connections (> 5)
|
|
||||||
query: "redis_connected_clients < 5"
|
|
||||||
severity: warning
|
|
||||||
- name: Redis rejected connections
|
|
||||||
description: Some connections to Redis have been rejected
|
|
||||||
query: "increase(redis_rejected_connections_total[1m]) > 0"
|
|
||||||
severity: error
|
|
||||||
|
|
||||||
- name: MongoDB
|
|
||||||
exporters:
|
|
||||||
- name: dcu/mongodb_exporter
|
|
||||||
doc_url: https://github.com/percona/mongodb_exporter
|
|
||||||
rules:
|
|
||||||
- name: MongoDB replication lag
|
|
||||||
description: Mongodb replication lag is more than 10s
|
|
||||||
query: 'avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10'
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication headroom
|
|
||||||
description: MongoDB replication headroom is <= 0
|
|
||||||
query: '(avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0'
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication Status 3
|
|
||||||
description: MongoDB replica set member is either performing startup self-checks or transitioning from completing a rollback or resync
|
|
||||||
query: "mongodb_replset_member_state == 3"
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication Status 6
|
|
||||||
description: The state of a MongoDB replica set member, as seen from another member of the set, is not yet known
|
|
||||||
query: "mongodb_replset_member_state == 6"
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication Status 8
|
|
||||||
description: MongoDB replica set member, as seen from another member of the set, is unreachable
|
|
||||||
query: "mongodb_replset_member_state == 8"
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication Status 9
|
|
||||||
description: MongoDB Replication set member is actively performing a rollback. Data is not available for reads
|
|
||||||
query: "mongodb_replset_member_state == 9"
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB replication Status 10
|
|
||||||
description: MongoDB Replication set member was once in a replica set but was subsequently removed
|
|
||||||
query: "mongodb_replset_member_state == 10"
|
|
||||||
severity: error
|
|
||||||
- name: MongoDB number cursors open
|
|
||||||
description: Too many cursors opened by MongoDB for clients (> 10k)
|
|
||||||
query: 'mongodb_metrics_cursor_open{state="total_open"} > 10000'
|
|
||||||
severity: warning
|
|
||||||
- name: MongoDB cursors timeouts
|
|
||||||
description: Too many cursors are timing out
|
|
||||||
query: "increase(mongodb_metrics_cursor_timed_out_total[10m]) > 100"
|
|
||||||
severity: warning
|
|
||||||
- name: MongoDB too many connections
|
|
||||||
description: Too many connections
|
|
||||||
query: 'mongodb_connections{state="current"} > 500'
|
|
||||||
severity: warning
|
|
||||||
- name: MongoDB virtual memory usage
|
|
||||||
description: High memory usage
|
|
||||||
query: '(sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3'
|
|
||||||
severity: warning
|
|
||||||
|
|
||||||
- name: Elasticsearch
|
- name: Elasticsearch
|
||||||
exporters:
|
exporters:
|
||||||
- name: justwatchcom/elasticsearch_exporter
|
- name: justwatchcom/elasticsearch_exporter
|
||||||
|
|
@ -486,6 +668,29 @@ services:
|
||||||
query: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1'
|
query: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1'
|
||||||
severity: error
|
severity: error
|
||||||
|
|
||||||
|
- name: Zookeeper
|
||||||
|
exporters:
|
||||||
|
- name: cloudflare/kafka_zookeeper_exporter
|
||||||
|
doc_url: https://github.com/cloudflare/kafka_zookeeper_exporter
|
||||||
|
rules:
|
||||||
|
|
||||||
|
- name: Kafka
|
||||||
|
exporters:
|
||||||
|
- name: danielqsj/kafka_exporter
|
||||||
|
doc_url: https://github.com/danielqsj/kafka_exporter
|
||||||
|
rules:
|
||||||
|
- name: Kafka topics replicas
|
||||||
|
description: Kafka topic in-sync partition
|
||||||
|
query: "sum(kafka_topic_partition_in_sync_replica) by (topic) < 3"
|
||||||
|
severity: error
|
||||||
|
- name: Kafka consumers group
|
||||||
|
description: Kafka consumers group
|
||||||
|
query: "sum(kafka_consumergroup_lag) by (consumergroup) > 50"
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
|
||||||
|
- name: Reverse proxies and load balancers
|
||||||
|
services:
|
||||||
- name: Nginx
|
- name: Nginx
|
||||||
exporters:
|
exporters:
|
||||||
- name: nginx-lua-prometheus
|
- name: nginx-lua-prometheus
|
||||||
|
|
@ -597,6 +802,9 @@ services:
|
||||||
query: 'sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5'
|
query: 'sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5'
|
||||||
severity: error
|
severity: error
|
||||||
|
|
||||||
|
|
||||||
|
- name: Runtimes
|
||||||
|
services:
|
||||||
- name: PHP-FPM
|
- name: PHP-FPM
|
||||||
exporters:
|
exporters:
|
||||||
- name: bakins/php-fpm-exporter
|
- name: bakins/php-fpm-exporter
|
||||||
|
|
@ -613,26 +821,41 @@ services:
|
||||||
query: 'jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8'
|
query: 'jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8'
|
||||||
severity: warning
|
severity: warning
|
||||||
|
|
||||||
- name: ZFS
|
- name: Sidekiq
|
||||||
exporters:
|
exporters:
|
||||||
- name: node-exporter
|
- name: Strech/sidekiq-prometheus-exporter
|
||||||
doc_url: https://github.com/prometheus/node_exporter
|
doc_url: https://github.com/Strech/sidekiq-prometheus-exporter
|
||||||
rules:
|
rules:
|
||||||
|
- name: Sidekiq queue size
|
||||||
|
description: Sidekiq queue {{ $labels.name }} is growing
|
||||||
|
query: 'sidekiq_queue_size{} > 100'
|
||||||
|
severity: warning
|
||||||
|
- name: Sidekiq scheduling latency too high
|
||||||
|
description: Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.
|
||||||
|
query: 'max(sidekiq_queue_latency) > 120'
|
||||||
|
severity: error
|
||||||
|
|
||||||
|
|
||||||
|
- name: Orchestrators
|
||||||
|
services:
|
||||||
- name: Kubernetes
|
- name: Kubernetes
|
||||||
exporters:
|
exporters:
|
||||||
- name: kube-state-metrics
|
- name: kube-state-metrics
|
||||||
doc_url: https://github.com/kubernetes/kube-state-metrics/tree/master/docs
|
doc_url: https://github.com/kubernetes/kube-state-metrics/tree/master/docs
|
||||||
rules:
|
rules:
|
||||||
- name: Kubernetes MemoryPressure
|
- name: Kubernetes Node ready
|
||||||
|
description: Node {{ $labels.node }} has been unready for a long time
|
||||||
|
query: 'kube_node_status_condition{condition="Ready",status="true"} == 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes memory pressure
|
||||||
description: "{{ $labels.node }} has MemoryPressure condition"
|
description: "{{ $labels.node }} has MemoryPressure condition"
|
||||||
query: 'kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'
|
query: 'kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'
|
||||||
severity: error
|
severity: error
|
||||||
- name: Kubernetes DiskPressure
|
- name: Kubernetes disk pressure
|
||||||
description: "{{ $labels.node }} has DiskPressure condition"
|
description: "{{ $labels.node }} has DiskPressure condition"
|
||||||
query: 'kube_node_status_condition{condition="DiskPressure",status="true"} == 1'
|
query: 'kube_node_status_condition{condition="DiskPressure",status="true"} == 1'
|
||||||
severity: error
|
severity: error
|
||||||
- name: Kubernetes OutOfDisk
|
- name: Kubernetes out of disk
|
||||||
description: "{{ $labels.node }} has OutOfDisk condition"
|
description: "{{ $labels.node }} has OutOfDisk condition"
|
||||||
query: 'kube_node_status_condition{condition="OutOfDisk",status="true"} == 1'
|
query: 'kube_node_status_condition{condition="OutOfDisk",status="true"} == 1'
|
||||||
severity: error
|
severity: error
|
||||||
|
|
@ -643,7 +866,7 @@ services:
|
||||||
- name: Kubernetes CronJob suspended
|
- name: Kubernetes CronJob suspended
|
||||||
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended"
|
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended"
|
||||||
query: "kube_cronjob_spec_suspend != 0"
|
query: "kube_cronjob_spec_suspend != 0"
|
||||||
severity: info
|
severity: warning
|
||||||
- name: Kubernetes PersistentVolumeClaim pending
|
- name: Kubernetes PersistentVolumeClaim pending
|
||||||
description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending"
|
description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending"
|
||||||
query: 'kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1'
|
query: 'kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1'
|
||||||
|
|
@ -654,12 +877,92 @@ services:
|
||||||
severity: warning
|
severity: warning
|
||||||
- name: Kubernetes Volume full in four days
|
- name: Kubernetes Volume full in four days
|
||||||
description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available."
|
description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available."
|
||||||
query: "100 * (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 15 and predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0"
|
query: 'predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes PersistentVolume error
|
||||||
|
description: "Persistent volume is in bad state"
|
||||||
|
query: 'kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0'
|
||||||
severity: error
|
severity: error
|
||||||
- name: Kubernetes StatefulSet down
|
- name: Kubernetes StatefulSet down
|
||||||
description: A StatefulSet went down
|
description: A StatefulSet went down
|
||||||
query: "(kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1"
|
query: "(kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1"
|
||||||
severity: error
|
severity: error
|
||||||
|
- name: Kubernetes HPA scaling ability
|
||||||
|
description: Pod is unable to scale
|
||||||
|
query: 'kube_hpa_status_condition{condition="false", status="AbleToScale"} == 1'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes HPA metric availability
|
||||||
|
description: HPA is not able to collect metrics
|
||||||
|
query: 'kube_hpa_status_condition{condition="false", status="ScalingActive"} == 1'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes HPA scale capability
|
||||||
|
description: The maximum number of desired Pods has been hit
|
||||||
|
query: 'kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes Pod not healthy
|
||||||
|
description: Pod has been in a non-ready state for longer than an hour.
|
||||||
|
query: 'min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes pod crash looping
|
||||||
|
description: Pod {{ $labels.pod }} is crash looping
|
||||||
|
query: 'rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes ReplicaSet mismatch
|
||||||
|
description: ReplicaSet replicas mismatch
|
||||||
|
query: 'kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes Deployment replicas mismatch
|
||||||
|
description: Deployment Replicas mismatch
|
||||||
|
query: 'kube_deployment_spec_replicas != kube_deployment_status_replicas_available'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes StatefulSet replicas mismatch
|
||||||
|
description: A StatefulSet has not matched the expected number of replicas for longer than 15 minutes.
|
||||||
|
query: 'kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes Deployment generation mismatch
|
||||||
|
description: A Deployment has failed but has not been rolled back.
|
||||||
|
query: 'kube_deployment_status_observed_generation != kube_deployment_metadata_generation'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes StatefulSet generation mismatch
|
||||||
|
description: A StatefulSet has failed but has not been rolled back.
|
||||||
|
query: 'kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes StatefulSet update not rolled out
|
||||||
|
description: StatefulSet update has not been rolled out.
|
||||||
|
query: 'max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes DaemonSet rollout stuck
|
||||||
|
description: Some Pods of DaemonSet are not scheduled or not ready
|
||||||
|
query: 'kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes DaemonSet misscheduled
|
||||||
|
description: Some DaemonSet Pods are running where they are not supposed to run
|
||||||
|
query: 'kube_daemonset_status_number_misscheduled > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes CronJob too long
|
||||||
|
description: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
|
||||||
|
query: 'time() - kube_cronjob_next_schedule_time > 3600'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes job completion
|
||||||
|
description: Kubernetes Job failed to complete
|
||||||
|
query: 'kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes API server errors
|
||||||
|
description: Kubernetes API server is experiencing high error rate
|
||||||
|
query: 'sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes API client errors
|
||||||
|
description: Kubernetes API client is experiencing high error rate
|
||||||
|
query: '(sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1'
|
||||||
|
severity: error
|
||||||
|
- name: Kubernetes client certificate expires next week
|
||||||
|
description: A client certificate used to authenticate to the apiserver is expiring next week.
|
||||||
|
query: 'apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60'
|
||||||
|
severity: warning
|
||||||
|
- name: Kubernetes client certificate expires soon
|
||||||
|
description: A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.
|
||||||
|
query: 'apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60'
|
||||||
|
severity: error
|
||||||
|
|
||||||
- name: Nomad
|
- name: Nomad
|
||||||
exporters:
|
exporters:
|
||||||
|
|
@ -740,26 +1043,6 @@ services:
|
||||||
query: "histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25"
|
query: "histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25"
|
||||||
severity: warning
|
severity: warning
|
||||||
|
|
||||||
- name: Zookeeper
|
|
||||||
exporters:
|
|
||||||
- name: cloudflare/kafka_zookeeper_exporter
|
|
||||||
doc_url: https://github.com/cloudflare/kafka_zookeeper_exporter
|
|
||||||
rules:
|
|
||||||
|
|
||||||
- name: Kafka
|
|
||||||
exporters:
|
|
||||||
- name: danielqsj/kafka_exporter
|
|
||||||
doc_url: https://github.com/danielqsj/kafka_exporter
|
|
||||||
rules:
|
|
||||||
- name: Kafka topics replicas
|
|
||||||
description: Kafka topic in-sync partition
|
|
||||||
query: "sum(kafka_topic_partition_in_sync_replica) by (topic) < 3"
|
|
||||||
severity: error
|
|
||||||
- name: Kafka consumers group
|
|
||||||
description: Kafka consumers group
|
|
||||||
query: "sum(kafka_consumergroup_lag) by (consumergroup) > 50"
|
|
||||||
severity: error
|
|
||||||
|
|
||||||
- name: Linkerd
|
- name: Linkerd
|
||||||
exporters:
|
exporters:
|
||||||
- rules:
|
- rules:
|
||||||
|
|
@ -768,65 +1051,14 @@ services:
|
||||||
exporters:
|
exporters:
|
||||||
- rules:
|
- rules:
|
||||||
|
|
||||||
- name: Blackbox
|
|
||||||
exporters:
|
|
||||||
- name: prometheus/blackbox_exporter
|
|
||||||
doc_url: https://github.com/prometheus/blackbox_exporter
|
|
||||||
rules:
|
|
||||||
- name: Blackbox probe failed
|
|
||||||
description: Probe failed
|
|
||||||
query: probe_success == 0
|
|
||||||
severity: error
|
|
||||||
- name: Blackbox slow probe
|
|
||||||
description: Blackbox probe took more than 1s to complete
|
|
||||||
query: "avg_over_time(probe_duration_seconds[1m]) > 1"
|
|
||||||
severity: warning
|
|
||||||
- name: Blackbox HTTP Status Code
|
|
||||||
description: HTTP status code is not 200-399
|
|
||||||
query: "probe_http_status_code <= 199 OR probe_http_status_code >= 400"
|
|
||||||
severity: error
|
|
||||||
- name: Blackbox SSL certificate will expire soon
|
|
||||||
description: SSL certificate expires in 30 days
|
|
||||||
query: "probe_ssl_earliest_cert_expiry - time() < 86400 * 30"
|
|
||||||
severity: warning
|
|
||||||
- name: Blackbox SSL certificate expired
|
|
||||||
description: SSL certificate has expired already
|
|
||||||
query: "probe_ssl_earliest_cert_expiry - time() <= 0"
|
|
||||||
severity: error
|
|
||||||
- name: Blackbox HTTP slow requests
|
|
||||||
description: HTTP request took more than 1s
|
|
||||||
query: "avg_over_time(probe_http_duration_seconds[1m]) > 1"
|
|
||||||
severity: warning
|
|
||||||
- name: Blackbox slow ping
|
|
||||||
description: Blackbox ping took more than 1s
|
|
||||||
query: "avg_over_time(probe_icmp_duration_seconds[1m]) > 1"
|
|
||||||
severity: warning
|
|
||||||
|
|
||||||
- name: Windows Server
|
- name: Network and storage
|
||||||
|
services:
|
||||||
|
- name: ZFS
|
||||||
exporters:
|
exporters:
|
||||||
- name: martinlindhe/wmi_exporter
|
- name: node-exporter
|
||||||
doc_url: https://github.com/martinlindhe/wmi_exporter
|
doc_url: https://github.com/prometheus/node_exporter
|
||||||
rules:
|
rules:
|
||||||
- name: Windows Server collector Error
|
|
||||||
description: "Collector {{ $labels.collector }} was not successful"
|
|
||||||
query: "wmi_exporter_collector_success == 0"
|
|
||||||
severity: error
|
|
||||||
- name: Windows Server service Status
|
|
||||||
description: Windows Service state is not OK
|
|
||||||
query: 'wmi_service_status{status="ok"} != 1'
|
|
||||||
severity: error
|
|
||||||
- name: Windows Server CPU Usage
|
|
||||||
description: CPU Usage is more than 80%
|
|
||||||
query: '100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80'
|
|
||||||
severity: warning
|
|
||||||
- name: Windows Server memory Usage
|
|
||||||
description: Memory Usage is more than 90%
|
|
||||||
query: "100*(wmi_os_physical_memory_free_bytes) / wmi_cs_physical_memory_bytes > 90"
|
|
||||||
severity: warning
|
|
||||||
- name: Windows Server disk Space Usage
|
|
||||||
description: Disk Space on Drive is used more than 80%
|
|
||||||
query: "100.0 - 100 * ((wmi_logical_disk_free_bytes{} / 1024 / 1024 ) / (wmi_logical_disk_size_bytes{} / 1024 / 1024)) > 80"
|
|
||||||
severity: error
|
|
||||||
|
|
||||||
- name: OpenEBS
|
- name: OpenEBS
|
||||||
exporters:
|
exporters:
|
||||||
|
|
@ -876,3 +1108,22 @@ services:
|
||||||
description: Number of CoreDNS panics encountered
|
description: Number of CoreDNS panics encountered
|
||||||
query: "increase(coredns_panic_count_total[10m]) > 0"
|
query: "increase(coredns_panic_count_total[10m]) > 0"
|
||||||
severity: error
|
severity: error
|
||||||
|
|
||||||
|
|
||||||
|
- name: Other
|
||||||
|
services:
|
||||||
|
- name: Thanos
|
||||||
|
exporters:
|
||||||
|
- rules:
|
||||||
|
- name: Thanos compaction halted
|
||||||
|
description: Thanos compaction has failed to run and is now halted.
|
||||||
|
query: 'thanos_compactor_halted == 1'
|
||||||
|
severity: error
|
||||||
|
- name: Thanos compact bucket operation failure
|
||||||
|
description: Thanos compaction has failing storage operations
|
||||||
|
query: 'rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0'
|
||||||
|
severity: error
|
||||||
|
- name: Thanos compact not run
|
||||||
|
description: Thanos compaction has not run in 24 hours.
|
||||||
|
query: '(time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60'
|
||||||
|
severity: error
|
||||||
|
|
|
||||||
18
index.md
|
|
@ -24,7 +24,20 @@
|
||||||
</h2>
|
</h2>
|
||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
{% for service in site.data.rules.services %}
|
{% for group in site.data.rules.groups %}
|
||||||
|
<li style="margin-top: 30px;">
|
||||||
|
{% assign nbrRules = 0 %}
|
||||||
|
{% for service in group.services %}
|
||||||
|
{% for exporter in service.exporters %}
|
||||||
|
{% for rule in exporter.rules %}
|
||||||
|
{% assign nbrRules = nbrRules | plus: 1 %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
<h3>{{ group.name }} <small style="margin-left: 20px;">({{ nbrRules }} rules)</small></h3>
|
||||||
|
<ul>
|
||||||
|
{% for service in group.services %}
|
||||||
<li>
|
<li>
|
||||||
<a href="/rules#{{ service.name | replace: " ", "-" | downcase }}">
|
<a href="/rules#{{ service.name | replace: " ", "-" | downcase }}">
|
||||||
{{ service.name }}
|
{{ service.name }}
|
||||||
|
|
@ -32,3 +45,6 @@
|
||||||
</li>
|
</li>
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
</ul>
|
</ul>
|
||||||
|
</li>
|
||||||
|
{% endfor %}
|
||||||
|
</ul>
|
||||||
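The nested Liquid loops added to `index.md` count every rule under a group by walking services, then exporters, then rules. The same traversal can be sketched in Python as below. This is only an illustration: the `count_rules` helper and the `sample_groups` data are hypothetical, shaped like the `groups` / `services` / `exporters` / `rules` layout this commit introduces, and are not part of the repository.

```python
from typing import Any


def count_rules(group: dict[str, Any]) -> int:
    """Count the rules of every exporter of every service in one group."""
    return sum(
        # An empty `rules:` key in YAML loads as None, so fall back to [].
        len(exporter.get("rules") or [])
        for service in group.get("services") or []
        for exporter in service.get("exporters") or []
    )


# Hypothetical sample data (assumption): two groups mirroring the diff above.
sample_groups = [
    {
        "name": "Runtimes",
        "services": [
            {
                "name": "Sidekiq",
                "exporters": [
                    {
                        "name": "Strech/sidekiq-prometheus-exporter",
                        "rules": [
                            {"name": "Sidekiq queue size"},
                            {"name": "Sidekiq scheduling latency too high"},
                        ],
                    }
                ],
            }
        ],
    },
    {
        "name": "Other",
        "services": [
            {
                "name": "Thanos",
                "exporters": [
                    {
                        "rules": [
                            {"name": "Thanos compaction halted"},
                            {"name": "Thanos compact bucket operation failure"},
                            {"name": "Thanos compact not run"},
                        ]
                    }
                ],
            }
        ],
    },
]

# One total per group, as shown next to each group heading on the index page.
totals = {group["name"]: count_rules(group) for group in sample_groups}
print(totals)
```

Treating a missing or empty `rules` key as zero rules keeps services such as Zookeeper, which declare an exporter without any rules yet, from breaking the count.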
15
rules.md
|
|
@ -19,8 +19,11 @@
|
||||||
<br>
|
<br>
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||
|
<h1></h1>
|
||||||
|
|
||||||
<ul>
|
<ul>
|
||||||
{% for service in site.data.rules.services %}
|
{% for group in site.data.rules.groups %}
|
||||||
|
{% for service in group.services %}
|
||||||
{% assign serviceIndex = forloop.index %}
|
{% assign serviceIndex = forloop.index %}
|
||||||
{% for exporter in service.exporters %}
|
{% for exporter in service.exporters %}
|
||||||
{% assign nbrRules = exporter.rules | size %}
|
{% assign nbrRules = exporter.rules | size %}
|
||||||
|
|
@ -28,8 +31,7 @@
|
||||||
<h2 id="{{ service.name | replace: " ", "-" | downcase }}">
|
<h2 id="{{ service.name | replace: " ", "-" | downcase }}">
|
||||||
{{ serviceIndex }}.
|
{{ serviceIndex }}.
|
||||||
{{ service.name }}
|
{{ service.name }}
|
||||||
{% if exporter.name %}
|
{% if exporter.name %}:
|
||||||
:
|
|
||||||
{% if exporter.doc_url %}
|
{% if exporter.doc_url %}
|
||||||
<a href="{{ exporter.doc_url }}">
|
<a href="{{ exporter.doc_url }}">
|
||||||
{{ exporter.name }}
|
{{ exporter.name }}
|
||||||
|
|
@ -40,6 +42,9 @@
|
||||||
{% endif %}
|
{% endif %}
|
||||||
|
|
||||||
{% if nbrRules > 0 %}
|
{% if nbrRules > 0 %}
|
||||||
|
<small style="font-size: 60%; vertical-align: middle; margin-left: 10px;">
|
||||||
|
({{ nbrRules }} rules)
|
||||||
|
</small>
|
||||||
<span class="clipboard-multiple" data-clipboard-target-id="service-{{ serviceIndex }}">[copy all]</span>
|
<span class="clipboard-multiple" data-clipboard-target-id="service-{{ serviceIndex }}">[copy all]</span>
|
||||||
{% endif %}
|
{% endif %}
|
||||||
</h2>
|
</h2>
|
||||||
|
|
@ -70,8 +75,7 @@
|
||||||
|
|
||||||
{% highlight yaml %}
|
{% highlight yaml %}
|
||||||
{% for comment in comments %}# {{ comment | strip }}
|
{% for comment in comments %}# {{ comment | strip }}
|
||||||
{% endfor %}
|
{% endfor %}- alert: {{ ruleNameCamelcase | remove: ' ' }}
|
||||||
- alert: {{ ruleNameCamelcase | remove: ' ' }}
|
|
||||||
expr: {{ rule.query }}
|
expr: {{ rule.query }}
|
||||||
for: 5m
|
for: 5m
|
||||||
labels:
|
labels:
|
||||||
|
|
@ -93,4 +97,5 @@
|
||||||
</li>
|
</li>
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
</ul>
|
</ul>
|
||||||
|
|
|
||||||
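The `rules.md` template turns each rule entry into a Prometheus alerting rule with `alert`, `expr` and `for: 5m` fields. A rough Python sketch of that rendering follows. It rests on assumptions: the diff does not show how `ruleNameCamelcase` is computed, so the hypothetical `alert_name` helper simply capitalizes each word and drops the spaces, and the `labels` section is left out because its content is not visible in the hunk.

```python
def alert_name(rule_name: str) -> str:
    """Capitalize the first letter of each word and remove the spaces.

    Assumption: this approximates `ruleNameCamelcase | remove: ' '`.
    """
    return "".join(word[:1].upper() + word[1:] for word in rule_name.split())


def render_rule(rule: dict, hold: str = "5m") -> str:
    """Render only the fields visible in the template: alert, expr and for."""
    lines = [
        f"- alert: {alert_name(rule['name'])}",
        f"  expr: {rule['query']}",
        f"  for: {hold}",
    ]
    return "\n".join(lines)


# Example entry copied from the Thanos rules added in this commit.
print(render_rule({"name": "Thanos compaction halted",
                   "query": "thanos_compactor_halted == 1"}))
```

Only the first letter of each word is changed, so names that already contain capitals, such as `Kubernetes CronJob suspended`, keep their inner casing.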