Commit graph

467 commits

Author SHA1 Message Date
Samuel Berthe
fd3bfb02c0
Some fix (#516)
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)

Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.

* feat: add alerting rules for prometheus/memcached_exporter

* fix: add division-by-zero guards and improve quoting in memcached rules (#512)

- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability

* fix: fix SNMP interface down query and add job scoping (#507)

- Fix ifOperStatus query to use vector matching instead of label filter
  since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
  and interface down rules to prevent matching non-SNMP series

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 04:50:01 +01:00
Samuel Berthe
97aae5dabf
feat: add GitLab alerting rules (28 rules across 3 exporters) (#518)
Add new GitLab service under "Other" category with 3 exporters:
- Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs,
  database connection pool, CI/CD pipelines, Ruby process health
- Workhorse (3 rules): HTTP error rate, latency, in-flight requests
- Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency,
  CPU throttling, auth failures, circuit breaker

All metrics verified against gitlabhq/gitlabhq source code.
Several rules derived from GitLab Omnibus default alerting rules.
2026-03-16 04:48:52 +01:00
Samuel Berthe
e6cdcdb9e5 feat: add Apache Flink and Apache Spark alerting rules
Add 20 new alerting rules under the Runtimes category:
- Apache Flink (12 rules): job status, TaskManager registration, slot
  availability, restarts, checkpoints, backpressure, heap memory, GC,
  and record processing
- Apache Spark (8 rules): worker health, waiting apps, memory/cores
  exhaustion, executor GC, task failures, and disk spill
2026-03-16 04:46:00 +01:00
Samuel Berthe
88e2c19017
feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517)
* feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi)

* fix: correct Keycloak metrics-spi metric names and query grouping
2026-03-16 04:40:15 +01:00
Samuel Berthe
20651aa10d
feat: add OpenStack alerting rules (openstack-exporter) (#515)
* feat: add OpenStack alerting rules (openstack-exporter)

Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.

* docs: add OpenStack to README services list

* fix: align OpenStack load balancer alert name with operating_status semantics

The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
2026-03-16 03:43:51 +01:00
Samuel Berthe
bf7b902881
feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514)
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)

* docs: add Process to README services list

* fix: address PR review feedback for process-exporter rules

- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
2026-03-16 03:31:18 +01:00
Samuel Berthe
2b239736cf
feat: add alerting rules for prometheus/memcached_exporter (#512) 2026-03-16 03:25:38 +01:00
Samuel Berthe
281142567c
fix: use proper zero-traffic guard in Envoy ratio alerts (#511) (#513)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
2026-03-16 03:25:27 +01:00
Samuel Berthe
f97f692596
feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509)
Add 9 alerting rules for Proxmox VE covering node/guest status,
CPU, memory, storage, backup coverage, replication, and cluster quorum.
2026-03-16 03:12:06 +01:00
Samuel Berthe
be7a2e4d5d
feat: add IPMI exporter alerting rules (#510)
* feat: add IPMI exporter alerting rules

Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.

* docs: add IPMI to README service list

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 03:10:10 +01:00
Samuel Berthe
c064d2264e
feat: add Envoy proxy alerting rules using built-in metrics (#511)
Add 19 alerting rules for Envoy proxy under "Reverse proxies and load
balancers" using native metrics from /stats/prometheus endpoint.

Covers: server health, HTTP error rates (downstream/upstream), connection
saturation, cluster membership, health checks, outlier detection,
SSL/TLS certificate expiry, circuit breakers, and request timeouts.
2026-03-16 03:03:57 +01:00
Samuel Berthe
89e703d763
feat: add alerting rules for cloudflare/ebpf_exporter (#508)
* feat: add alerting rules for cloudflare/ebpf_exporter

* docs: add eBPF to README service list
2026-03-16 02:56:04 +01:00
Samuel Berthe
3db9281508
feat: add SNMP exporter alerting rules (#507)
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
2026-03-16 02:34:34 +01:00
Samuel Berthe
c37ef8f50c
fix: review and fix 74 database & broker alert rules (#504)
* fix: review and fix 74 database & broker alert rules

Comprehensive review of all database and broker alerts covering 16 services.

Typos & descriptions (8 fixes):
- PGBouncer: "a a server" → "a server"
- RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ",
  "unactive" → "inactive"
- Cassandra: write failure said "Read failures", "bad hacker" →
  "authentication failures"
- Solr: replication errors said "failed updates"
- Meilisearch: "index is empty" said "instance is down"

Duplicates removed (5 fixes):
- PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
- ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
- NATS: 2 rules with low thresholds duplicated better rules

Broken queries (20 fixes):
- Patroni: patroni_master → patroni_primary (renamed in v3)
- MongoDB: rate() on gauge → direct ratio for connection queries
- MongoDB: removed WiredTiger-incompatible virtual memory rule
- Cassandra instaclustr: avg() on counter → rate()[5m]
- Cassandra criteo: increase() on JMX rate metric → direct threshold
- ClickHouse: increase() on gauge → direct threshold
- NATS: rate() on gauge → direct comparison, removed 4 config-value rules
- SQL Server: increase() on gauge → direct threshold
- Pulsar: moved comparison outside sum() (4 rules)
- Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]

Severity adjustments (7 fixes):
- Redis: backup threshold 24h → 48h, rejected connections → warning > 5
- RabbitMQ: no consumer for: 5m with comment
- Elasticsearch: unassigned shards added for: 2m
- CouchDB: process restarted critical → info
- Kafka: consumer group lag → warning, threshold 10000, better description
- Hadoop: HBase heap low critical → warning

Missing for duration (18 fixes):
- Added for: 1m to service-down alerts across MySQL, PostgreSQL,
  SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch,
  Cassandra, Zookeeper with restart-tolerance comments

Division by zero guards (9 fixes):
- Added denominator > 0 guards to ratio queries in PostgreSQL,
  RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS

Query design improvements (5 fixes):
- Cassandra: removed unnecessary sum() and redundant avg_over_time()
- ClickHouse: ZooKeeper avg() → per-instance check
- PostgreSQL: sum() → sum by (instance) for SSL and locks
- PGBouncer: 30s range window → 2m

Hardcoded labels (2 fixes):
- ClickHouse: added comment about job="clickhouse"
- Cassandra criteo: removed hardcoded service="cas"

* fix: address PR review comments

- Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
- Elasticsearch query latency: add division-by-zero guard
- Redis backup: "backuped" → "backed up"
2026-03-16 01:27:18 +01:00
Samuel Berthe
080a792777
data: adding python/ruby/golang (#502)
* data: adding python/ruby/golang

* fix: address review feedback on runtime alerts

- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
2026-03-15 19:46:39 +01:00
Samuel Berthe
9ae17eca97
Fix broken and misleading alert rules (#503)
- Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos)
- Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*)
- Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical
- Simplify PostgreSQL config change query (giant regex -> negative matcher)
- Downgrade PostgreSQL SSL compression severity from critical to warning
- Fix misleading "Host unusual disk read rate" name and description
2026-03-15 18:08:06 +01:00
Marcin Morawski
eeebb90e6f
Add systemd service name to HostSystemdServiceCrashed summary (#499)
* Add systemd service name to HostSystemdServiceCrashed summary

* Modify systemd service crash rule description

Updated the description for the systemd service crash rule to include the service name.

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-03-01 20:15:17 +01:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
* tune Targets Missing rules

* reworked query logic

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
Per Lundberg
51aea96ba7
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

When a machine runs out of memory, it happens that the node
exporter stops responding for multiple minutes. I've adjusted
the rule now to take this into account: even if it takes 15-20
minutes before the machine becomes responsive again, the
alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-01-30 12:15:27 +01:00
Samuel Berthe
d400e3e64d
feat(k8s): cronjob rule (#491) 2026-01-07 13:57:42 +01:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
* Jenkins node offline for clause (#2)

* Convert cpu alert expressions to without() rather than on()

* Remove on() expression from network throughput alerts as labels fully match

---------

Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484) 2025-11-17 14:56:04 +01:00
Arve Knudsen
d58bc324ad
Add OpenTelemetry Collector monitoring alerts (#480)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-11-05 17:08:26 +01:00
andrii.k
9edef74e73
update kafka alerts (#478) 2025-10-13 14:24:37 +02:00
Riccardo Cannella
7832e01082
haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475)
* haproxy: align v1 and v2 max current session alerts

* fix: remove non-existing label

---------

Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>
2025-09-15 15:03:44 +02:00
Samuel Berthe
237e89babc
Update query for unused replication slot rule 2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh
a2c31358d1
Add couchdb alerts (#472)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

* add : couchdb roles config in rules.yml

* add : couchdb alerts in rules directory

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh
7bced89d2d
add : additional essential clickhouse alerts (#471)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-08-28 23:06:31 +02:00
Samuel Berthe
554850df41
Update rules.yml 2025-06-25 13:32:16 +02:00
Samuel Berthe
748524d580
Update rules.yml 2025-06-17 19:15:52 +02:00
Samuel Berthe
a5a3c2cd92
fix: HostHighCpuUsage (#466)
closes #457
2025-06-17 17:07:05 +02:00
Samuel Berthe
4b1b8242cb
Update rules.yml 2025-05-21 23:04:12 +02:00
andrii.k
e0e3cdda1d
update istio 4xx alert description (#463) 2025-05-08 19:49:18 +02:00
Carsten Thiel
79f45a5146
Adding rules for checking FluxCD (#458) 2025-05-03 22:52:26 +02:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
Samuel Berthe
4666830538
Update rules.yml 2025-04-23 10:18:08 +02:00
Roger
b3d25fafcf
feature/kubestate exporter check if node is scheduling disabeld (#462)
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld

* commented added

* typo in expr

* move code to right file


---------

Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-04-23 09:58:29 +02:00
Samuel Berthe
3b440fec7b
Remove buggy HostRequiresReboot rule
Closing #459
2025-04-17 17:26:00 +02:00
Samuel Berthe
8b730ef059
Update rules.yml 2025-03-27 17:23:19 +01:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics doesn't make sense because a spike in
numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check if operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
Samuel Berthe
2127c4ce90
Update rules.yml 2025-02-20 16:17:39 +01:00
Roman
c189984d0f
fix node-exporter.yaml missing parentheses (#452) 2025-02-20 15:05:48 +01:00
Samuel Berthe
6838196343
fix: remove duplicated rule 2025-02-19 15:25:29 +01:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
Samuel Berthe
add097c489
data: revert 5f57f09 (see #398) 2025-02-16 23:36:44 +01:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
Samuel Berthe
fb857e8b39
data: fix rules 2025-02-16 23:16:36 +01:00
Samuel Berthe
ae12871fa9
Update rules.yml 2025-02-04 16:40:21 +01:00