awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

Author	SHA1	Message	Date
Samuel Berthe	7ee16641ac	feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520 ) * feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) * fix: grammar in WireGuard rule comment	2026-03-16 14:20:17 +01:00
Samuel Berthe	f974552ef1	Feat/jaeger alerting rules (#521 ) * Add .worktrees/ to .gitignore * feat: add Jaeger alerting rules (8 rules from official jaeger-mixin) Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops, sampling update failures, throttling update failures, and query request failures. All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin * fix: rename Jaeger agent RPC alert to Jaeger client RPC The jaeger_client_jaeger_rpc_http_requests metric is client-side, not agent-side. Rename alert to match the actual metric source.	2026-03-16 14:09:03 +01:00
Samuel Berthe	8b443be6d2	feat: add systemd_exporter alerting rules (7 rules) (#522 ) * feat: add systemd_exporter alerting rules (7 rules) Add new Systemd service under Basic resource monitoring with rules for: - Unit failed/inactive state detection - Service crash loop detection - Task limit exhaustion - Socket refused/high connections - Timer missed trigger * fix: narrow systemd unit inactive query to reduce noise Add type="service" and name filter to the inactive unit alert to avoid false positives from legitimately inactive units.	2026-03-16 14:07:14 +01:00
Samuel Berthe	30bbedbc79	feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519 ) * feat: add Cloud providers alerting rules (33 rules across 4 exporters) New "Cloud providers" category with rules for: - AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda - Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness - DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents - Azure (5 rules): API errors, rate limits, collection performance * fix: address PR review - move Cloud providers before Other, fix service name - Move "Cloud providers" group before "Other" in rules.yml for consistent ordering - Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid awkward /-/ in generated anchors and dist/rules/ paths - Fix README anchor link to match the new service name	2026-03-16 14:06:59 +01:00
Samuel Berthe	fd3bfb02c0	Some fix (#516 ) * fix: use proper zero-traffic guard in Envoy ratio alerts (#511) Replace `+ 1` denominator hack with `and ... > 0` filter in upstream timeout rate and upstream 5xx error rate queries for mathematical correctness and repo consistency. * feat: add alerting rules for prometheus/memcached_exporter * fix: add division-by-zero guards and improve quoting in memcached rules (#512) - Add `and memcached_max_connections > 0` to connection limit queries - Add `and memcached_limit_bytes > 0` to memory usage query - Switch hit-rate query to single quotes for cleaner PromQL readability * fix: fix SNMP interface down query and add job scoping (#507) - Fix ifOperStatus query to use vector matching instead of label filter since ifAdminStatus is a separate metric in snmp_exporter output - Add job=~"snmp." matcher to interface error rate, bandwidth usage, and interface down rules to prevent matching non-SNMP series Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 04:50:01 +01:00
Samuel Berthe	97aae5dabf	feat: add GitLab alerting rules (28 rules across 3 exporters) (#518 ) Add new GitLab service under "Other" category with 3 exporters: - Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs, database connection pool, CI/CD pipelines, Ruby process health - Workhorse (3 rules): HTTP error rate, latency, in-flight requests - Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency, CPU throttling, auth failures, circuit breaker All metrics verified against gitlabhq/gitlabhq source code. Several rules derived from GitLab Omnibus default alerting rules.	2026-03-16 04:48:52 +01:00
Samuel Berthe	e6cdcdb9e5	feat: add Apache Flink and Apache Spark alerting rules Add 20 new alerting rules under the Runtimes category: - Apache Flink (12 rules): job status, TaskManager registration, slot availability, restarts, checkpoints, backpressure, heap memory, GC, and record processing - Apache Spark (8 rules): worker health, waiting apps, memory/cores exhaustion, executor GC, task failures, and disk spill	2026-03-16 04:46:00 +01:00
Samuel Berthe	88e2c19017	feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517 ) * feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) * fix: correct Keycloak metrics-spi metric names and query grouping	2026-03-16 04:40:15 +01:00
Samuel Berthe	20651aa10d	feat: add OpenStack alerting rules (openstack-exporter) (#515 ) * feat: add OpenStack alerting rules (openstack-exporter) Add 20 alerting rules for openstack-exporter/openstack-exporter covering Nova, Neutron, Cinder, Octavia, and Placement services. * docs: add OpenStack to README services list * fix: align OpenStack load balancer alert name with operating_status semantics The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values, not ACTIVE. Rename alert to "not online" and use the label in the description for clarity.	2026-03-16 03:43:51 +01:00
Samuel Berthe	bf7b902881	feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514 ) * feat: add process-exporter alerting rules (ncabatoff/process-exporter) * docs: add Process to README services list * fix: address PR review feedback for process-exporter rules - Rename service from "Process" to "Process Exporter" for clarity - Fix grammar: "file descriptors usage" → "file descriptor usage" - Clarify CPU alert description as core-equivalent percentage - Rename "high disk IO" to "high disk write IO" for accuracy	2026-03-16 03:31:18 +01:00
Samuel Berthe	2b239736cf	feat: add alerting rules for prometheus/memcached_exporter (#512 )	2026-03-16 03:25:38 +01:00
Samuel Berthe	281142567c	fix: use proper zero-traffic guard in Envoy ratio alerts (#511 ) (#513 ) Replace `+ 1` denominator hack with `and ... > 0` filter in upstream timeout rate and upstream 5xx error rate queries for mathematical correctness and repo consistency.	2026-03-16 03:25:27 +01:00
Samuel Berthe	f97f692596	feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509 ) Add 9 alerting rules for Proxmox VE covering node/guest status, CPU, memory, storage, backup coverage, replication, and cluster quorum.	2026-03-16 03:12:06 +01:00
Samuel Berthe	be7a2e4d5d	feat: add IPMI exporter alerting rules (#510 ) * feat: add IPMI exporter alerting rules Add 17 alerting rules for prometheus-community/ipmi_exporter covering temperature, fan, voltage, current, power sensors, chassis status, and system event log monitoring. * docs: add IPMI to README service list * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 03:10:10 +01:00
Samuel Berthe	c064d2264e	feat: add Envoy proxy alerting rules using built-in metrics (#511 ) Add 19 alerting rules for Envoy proxy under "Reverse proxies and load balancers" using native metrics from /stats/prometheus endpoint. Covers: server health, HTTP error rates (downstream/upstream), connection saturation, cluster membership, health checks, outlier detection, SSL/TLS certificate expiry, circuit breakers, and request timeouts.	2026-03-16 03:03:57 +01:00
Samuel Berthe	89e703d763	feat: add alerting rules for cloudflare/ebpf_exporter (#508 ) * feat: add alerting rules for cloudflare/ebpf_exporter * docs: add eBPF to README service list	2026-03-16 02:56:04 +01:00
Samuel Berthe	3db9281508	feat: add SNMP exporter alerting rules (#507 ) Add 7 alerting rules for prometheus/snmp_exporter covering device availability, interface status, error rates, bandwidth utilization, and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.	2026-03-16 02:34:34 +01:00
Samuel Berthe	c37ef8f50c	fix: review and fix 74 database & broker alert rules (#504 ) * fix: review and fix 74 database & broker alert rules Comprehensive review of all database and broker alerts covering 16 services. Typos & descriptions (8 fixes): - PGBouncer: "a a server" → "a server" - RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ", "unactive" → "inactive" - Cassandra: write failure said "Read failures", "bad hacker" → "authentication failures" - Solr: replication errors said "failed updates" - Meilisearch: "index is empty" said "instance is down" Duplicates removed (5 fixes): - PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total) - ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule - NATS: 2 rules with low thresholds duplicated better rules Broken queries (20 fixes): - Patroni: patroni_master → patroni_primary (renamed in v3) - MongoDB: rate() on gauge → direct ratio for connection queries - MongoDB: removed WiredTiger-incompatible virtual memory rule - Cassandra instaclustr: avg() on counter → rate()[5m] - Cassandra criteo: increase() on JMX rate metric → direct threshold - ClickHouse: increase() on gauge → direct threshold - NATS: rate() on gauge → direct comparison, removed 4 config-value rules - SQL Server: increase() on gauge → direct threshold - Pulsar: moved comparison outside sum() (4 rules) - Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h] Severity adjustments (7 fixes): - Redis: backup threshold 24h → 48h, rejected connections → warning > 5 - RabbitMQ: no consumer for: 5m with comment - Elasticsearch: unassigned shards added for: 2m - CouchDB: process restarted critical → info - Kafka: consumer group lag → warning, threshold 10000, better description - Hadoop: HBase heap low critical → warning Missing for duration (18 fixes): - Added for: 1m to service-down alerts across MySQL, PostgreSQL, SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch, Cassandra, Zookeeper with restart-tolerance comments Division by zero guards (9 fixes): - Added denominator > 0 guards to ratio queries in PostgreSQL, RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS Query design improvements (5 fixes): - Cassandra: removed unnecessary sum() and redundant avg_over_time() - ClickHouse: ZooKeeper avg() → per-instance check - PostgreSQL: sum() → sum by (instance) for SSL and locks - PGBouncer: 30s range window → 2m Hardcoded labels (2 fixes): - ClickHouse: added comment about job="clickhouse" - Cassandra criteo: removed hardcoded service="cas" * fix: address PR review comments - Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL) - Elasticsearch query latency: add division-by-zero guard - Redis backup: "backuped" → "backed up"	2026-03-16 01:27:18 +01:00
Samuel Berthe	080a792777	data: adding python/ruby/golang (#502 ) * data: adding python/ruby/golang * fix: address review feedback on runtime alerts - JVM non-heap: guard against unbounded metaspace (max_bytes = -1) - JVM old gen GC: note regex only matches CMS/G1/Parallel collectors - JVM/Python file descriptors: note process_* metrics are generic - Go memory usage: fix description (sys_bytes is runtime memory, not host) - Go goroutine spike: use deriv() instead of rate() on gauge - Go GC CPU fraction: note deprecation since Go 1.20 - Go GC duration: clarify quantile="1" is max, not p99 - Python uncollectable: use increase() on counter instead of raw threshold - Add threshold comments for workload-dependent defaults	2026-03-15 19:46:39 +01:00
Samuel Berthe	9ae17eca97	Fix broken and misleading alert rules (#503 ) - Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos) - Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*) - Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical - Simplify PostgreSQL config change query (giant regex -> negative matcher) - Downgrade PostgreSQL SSL compression severity from critical to warning - Fix misleading "Host unusual disk read rate" name and description	2026-03-15 18:08:06 +01:00
Marcin Morawski	eeebb90e6f	Add systemd service name to HostSystemdServiceCrashed summary (#499 ) * Add systemd service name to HostSystemdServiceCrashed summary * Modify systemd service crash rule description Updated the description for the systemd service crash rule to include the service name. --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-03-01 20:15:17 +01:00
dxrayz	e60601fdcd	tune Targets Missing rules (#497 ) * tune Targets Missing rules * reworked query logic * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-02-21 19:40:10 +01:00
Per Lundberg	51aea96ba7	Adjust OOM kill detected rule (#495 ) * Adjust OOM kill detected rule When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire. * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-01-30 12:15:27 +01:00
Samuel Berthe	d400e3e64d	feat(k8s): cronjob rule (#491 )	2026-01-07 13:57:42 +01:00
Simon Matic Langford	f810ff531d	Node exporter rules to preserve instance labels (#488 ) * Jenkins node offline for clause (#2) * Convert cpu alert expressions to without() rather than on() * Remove on() expression from network throughput alerts as labels fully match --------- Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>	2026-01-06 16:24:18 +01:00
Simon Matic Langford	79f2858037	Improve Jenkins node alerts to better handle servers with multiple nodes (#484 )	2025-11-17 14:56:04 +01:00
Arve Knudsen	d58bc324ad	Add OpenTelemetry Collector monitoring alerts (#480 ) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>	2025-11-05 17:08:26 +01:00
andrii.k	9edef74e73	update kafka alerts (#478 )	2025-10-13 14:24:37 +02:00
Riccardo Cannella	7832e01082	haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475 ) * haproxy: align v1 and v2 max current session alerts * fix: remove non-existing label --------- Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>	2025-09-15 15:03:44 +02:00
Samuel Berthe	237e89babc	Update query for unused replication slot rule	2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh	a2c31358d1	Add couchdb alerts (#472 ) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting * add : couchdb roles config in rules.yml * add : couchdb alerts in rules directory --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh	7bced89d2d	add : additional essential clickhouse alerts (#471 ) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-08-28 23:06:31 +02:00
Samuel Berthe	554850df41	Update rules.yml	2025-06-25 13:32:16 +02:00
Samuel Berthe	748524d580	Update rules.yml	2025-06-17 19:15:52 +02:00
Samuel Berthe	a5a3c2cd92	fix: HostHighCpuUsage (#466 ) closes #457	2025-06-17 17:07:05 +02:00
Samuel Berthe	4b1b8242cb	Update rules.yml	2025-05-21 23:04:12 +02:00
andrii.k	e0e3cdda1d	update istio 4xx alert description (#463 )	2025-05-08 19:49:18 +02:00
Carsten Thiel	79f45a5146	Adding rules for checking FluxCD (#458 )	2025-05-03 22:52:26 +02:00
samber	9f5c641bdd	Publish	2025-04-23 08:31:10 +00:00
samber	aca1bdf1fb	Publish	2025-04-23 08:28:06 +00:00
Samuel Berthe	4666830538	Update rules.yml	2025-04-23 10:18:08 +02:00
Roger	b3d25fafcf	feature/kubestate exporter check if node is scheduling disabeld (#462 ) * feature/kubestate-exporter-check-if-node-is-scheduling-disabeld * commented added * typo in expr * move code to right file --------- Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com> Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-04-23 09:58:29 +02:00
Samuel Berthe	3b440fec7b	Remove buggy HostRequiresReboot rule Closing #459	2025-04-17 17:26:00 +02:00
Samuel Berthe	8b730ef059	Update rules.yml	2025-03-27 17:23:19 +01:00
Motte	69c8208e3c	Added PostgresqlReplicationLagHigh rule (#456 ) * Added PostgresqlReplicationLagHigh rule * Update PostgreSQL replication lag alert settings --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-03-27 14:42:22 +01:00
Pigueiras	97a31f34e5	Fix queries in elasticsearch latency alerts (#455 ) The `elasticsearch_indices_search_fetch_total`, `elasticsearch_indices_search_fetch_time_seconds`, `elasticsearch_indices_indexing_index_time_seconds_total` and `elasticsearch_indices_indexing_index_total` metrics are counters. Dividing these metrics doesn't make sense because a spike in numerator would cause the alert to persist, even if subsequent fetch/index operations are normal. Adding `increase` changes the query to check if operations took, on average, more than X over a 1-minute interval, which was likely the original intent of this alert.	2025-03-26 22:15:24 +01:00
Samuel Berthe	2127c4ce90	Update rules.yml	2025-02-20 16:17:39 +01:00
Roman	c189984d0f	fix node-exporter.yaml missing parentheses (#452 )	2025-02-20 15:05:48 +01:00
Samuel Berthe	6838196343	fix: remove duplicated rule	2025-02-19 15:25:29 +01:00
dzaczek	11a78f0f06	Update google-cadvisor.yml (#382 ) * Update google-cadvisor.yml Expression Explanation: The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires. Alert Details: - Alert Name: ContainerHighLowChangeCpuUsage - Trigger Condition: Absolute change in CPU usage exceeding 25% - Alert Severity: Informational (info) * Add alert rule for high CPU usage change * Change alert severity from warning to info --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-02-16 23:46:53 +01:00


471 commits