Commit graph

472 commits

Author SHA1 Message Date
Samuel Berthe
b58b498bbb
feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) (#523)
* feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules)

Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins.
Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more.

* fix: address PR review comments on Tempo/Mimir rules

- Fix Tempo no tenant index builders: add on() for cross-label-set and
- Fix Tempo block list rising: output percentage instead of ratio
- Fix Mimir memory map areas: multiply by 100 to match % description
- Fix all instance limit rules: multiply by 100 to match % descriptions
- Fix distributor inflight requests: add % to description
2026-03-16 14:36:50 +01:00
Samuel Berthe
7ee16641ac
feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520)
* feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter)

* fix: grammar in WireGuard rule comment
2026-03-16 14:20:17 +01:00
Samuel Berthe
f974552ef1
Feat/jaeger alerting rules (#521)
* Add .worktrees/ to .gitignore

* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)

Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin

* fix: rename Jaeger agent RPC alert to Jaeger client RPC

The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
2026-03-16 14:09:03 +01:00
Samuel Berthe
8b443be6d2
feat: add systemd_exporter alerting rules (7 rules) (#522)
* feat: add systemd_exporter alerting rules (7 rules)

Add new Systemd service under Basic resource monitoring with rules for:
- Unit failed/inactive state detection
- Service crash loop detection
- Task limit exhaustion
- Socket refused/high connections
- Timer missed trigger

* fix: narrow systemd unit inactive query to reduce noise

Add type="service" and name filter to the inactive unit alert
to avoid false positives from legitimately inactive units.
2026-03-16 14:07:14 +01:00
Samuel Berthe
30bbedbc79
feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519)
* feat: add Cloud providers alerting rules (33 rules across 4 exporters)

New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance

* fix: address PR review - move Cloud providers before Other, fix service name

- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
  awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
2026-03-16 14:06:59 +01:00
Samuel Berthe
fd3bfb02c0
Some fix (#516)
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)

Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.

* feat: add alerting rules for prometheus/memcached_exporter

* fix: add division-by-zero guards and improve quoting in memcached rules (#512)

- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability

* fix: fix SNMP interface down query and add job scoping (#507)

- Fix ifOperStatus query to use vector matching instead of label filter
  since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
  and interface down rules to prevent matching non-SNMP series

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 04:50:01 +01:00
Samuel Berthe
97aae5dabf
feat: add GitLab alerting rules (28 rules across 3 exporters) (#518)
Add new GitLab service under "Other" category with 3 exporters:
- Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs,
  database connection pool, CI/CD pipelines, Ruby process health
- Workhorse (3 rules): HTTP error rate, latency, in-flight requests
- Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency,
  CPU throttling, auth failures, circuit breaker

All metrics verified against gitlabhq/gitlabhq source code.
Several rules derived from GitLab Omnibus default alerting rules.
2026-03-16 04:48:52 +01:00
Samuel Berthe
e6cdcdb9e5
feat: add Apache Flink and Apache Spark alerting rules
Add 20 new alerting rules under the Runtimes category:
- Apache Flink (12 rules): job status, TaskManager registration, slot
  availability, restarts, checkpoints, backpressure, heap memory, GC,
  and record processing
- Apache Spark (8 rules): worker health, waiting apps, memory/cores
  exhaustion, executor GC, task failures, and disk spill
2026-03-16 04:46:00 +01:00
Samuel Berthe
88e2c19017
feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517)
* feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi)

* fix: correct Keycloak metrics-spi metric names and query grouping
2026-03-16 04:40:15 +01:00
Samuel Berthe
20651aa10d
feat: add OpenStack alerting rules (openstack-exporter) (#515)
* feat: add OpenStack alerting rules (openstack-exporter)

Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.

* docs: add OpenStack to README services list

* fix: align OpenStack load balancer alert name with operating_status semantics

The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
2026-03-16 03:43:51 +01:00
Samuel Berthe
bf7b902881
feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514)
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)

* docs: add Process to README services list

* fix: address PR review feedback for process-exporter rules

- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
2026-03-16 03:31:18 +01:00
Samuel Berthe
2b239736cf
feat: add alerting rules for prometheus/memcached_exporter (#512)
2026-03-16 03:25:38 +01:00
Samuel Berthe
281142567c
fix: use proper zero-traffic guard in Envoy ratio alerts (#511) (#513)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
2026-03-16 03:25:27 +01:00
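The difference between the two guards described in the commit above can be illustrated outside PromQL. This is a minimal Python sketch under stated assumptions: the function names and the sample rates are hypothetical, not taken from the repository. Adding 1 to the denominator avoids division by zero but understates the ratio at low traffic, while filtering on a positive denominator keeps the ratio exact and yields no result when there is no traffic.

```python
from typing import Optional


def ratio_with_plus_one(errors: float, total: float) -> float:
    """The '+ 1' hack: never divides by zero, but skews the ratio downward."""
    return errors / (total + 1)


def ratio_with_guard(errors: float, total: float) -> Optional[float]:
    """Loosely mimics an `and ... > 0` filter: no value when the denominator is zero."""
    if total <= 0:
        return None
    return errors / total


# Hypothetical low-traffic case: 1 request/s, every request failing.
print(ratio_with_plus_one(1.0, 1.0))  # 0.5 (a 100% error rate reported as 50%)
print(ratio_with_guard(1.0, 1.0))     # 1.0
print(ratio_with_guard(0.0, 0.0))     # None (no traffic, nothing to alert on)
```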
Samuel Berthe
f97f692596
feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509)
Add 9 alerting rules for Proxmox VE covering node/guest status,
CPU, memory, storage, backup coverage, replication, and cluster quorum.
2026-03-16 03:12:06 +01:00
Samuel Berthe
be7a2e4d5d
feat: add IPMI exporter alerting rules (#510)
* feat: add IPMI exporter alerting rules

Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.

* docs: add IPMI to README service list

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 03:10:10 +01:00
Samuel Berthe
c064d2264e
feat: add Envoy proxy alerting rules using built-in metrics (#511)
Add 19 alerting rules for Envoy proxy under "Reverse proxies and load
balancers" using native metrics from /stats/prometheus endpoint.

Covers: server health, HTTP error rates (downstream/upstream), connection
saturation, cluster membership, health checks, outlier detection,
SSL/TLS certificate expiry, circuit breakers, and request timeouts.
2026-03-16 03:03:57 +01:00
Samuel Berthe
89e703d763
feat: add alerting rules for cloudflare/ebpf_exporter (#508)
* feat: add alerting rules for cloudflare/ebpf_exporter

* docs: add eBPF to README service list
2026-03-16 02:56:04 +01:00
Samuel Berthe
3db9281508
feat: add SNMP exporter alerting rules (#507)
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
2026-03-16 02:34:34 +01:00
Samuel Berthe
c37ef8f50c
fix: review and fix 74 database & broker alert rules (#504)
* fix: review and fix 74 database & broker alert rules

Comprehensive review of all database and broker alerts covering 16 services.

Typos & descriptions (8 fixes):
- PGBouncer: "a a server" → "a server"
- RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ",
  "unactive" → "inactive"
- Cassandra: write failure said "Read failures", "bad hacker" →
  "authentication failures"
- Solr: replication errors said "failed updates"
- Meilisearch: "index is empty" said "instance is down"

Duplicates removed (5 fixes):
- PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
- ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
- NATS: 2 rules with low thresholds duplicated better rules

Broken queries (20 fixes):
- Patroni: patroni_master → patroni_primary (renamed in v3)
- MongoDB: rate() on gauge → direct ratio for connection queries
- MongoDB: removed WiredTiger-incompatible virtual memory rule
- Cassandra instaclustr: avg() on counter → rate()[5m]
- Cassandra criteo: increase() on JMX rate metric → direct threshold
- ClickHouse: increase() on gauge → direct threshold
- NATS: rate() on gauge → direct comparison, removed 4 config-value rules
- SQL Server: increase() on gauge → direct threshold
- Pulsar: moved comparison outside sum() (4 rules)
- Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]

Severity adjustments (7 fixes):
- Redis: backup threshold 24h → 48h, rejected connections → warning > 5
- RabbitMQ: no consumer alert given for: 5m, with comment
- Elasticsearch: unassigned shards added for: 2m
- CouchDB: process restarted critical → info
- Kafka: consumer group lag → warning, threshold 10000, better description
- Hadoop: HBase heap low critical → warning

Missing for duration (18 fixes):
- Added for: 1m to service-down alerts across MySQL, PostgreSQL,
  SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch,
  Cassandra, Zookeeper with restart-tolerance comments

Division by zero guards (9 fixes):
- Added denominator > 0 guards to ratio queries in PostgreSQL,
  RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS

Query design improvements (5 fixes):
- Cassandra: removed unnecessary sum() and redundant avg_over_time()
- ClickHouse: ZooKeeper avg() → per-instance check
- PostgreSQL: sum() → sum by (instance) for SSL and locks
- PGBouncer: 30s range window → 2m

Hardcoded labels (2 fixes):
- ClickHouse: added comment about job="clickhouse"
- Cassandra criteo: removed hardcoded service="cas"

* fix: address PR review comments

- Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
- Elasticsearch query latency: add division-by-zero guard
- Redis backup: "backuped" → "backed up"
2026-03-16 01:27:18 +01:00
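The Pulsar entry in the commit above moves a comparison outside sum(). The two placements answer different questions, which a small Python analogue can show. This is only a sketch: the helper names, the per-series values, and the threshold are hypothetical, and the real rules operate on labeled time series rather than plain lists.

```python
from typing import List, Optional


def sum_of_filtered(values: List[float], threshold: float) -> Optional[float]:
    """Analogue of sum(metric > threshold): filter each series first, then sum the survivors."""
    kept = [v for v in values if v > threshold]
    return sum(kept) if kept else None  # None stands for an empty result


def filtered_sum(values: List[float], threshold: float) -> Optional[float]:
    """Analogue of sum(metric) > threshold: sum every series, then compare the total."""
    total = sum(values)
    return total if total > threshold else None


# Hypothetical backlog per topic; no single topic exceeds 10, but the total does.
backlog = [4.0, 4.0, 4.0]
print(sum_of_filtered(backlog, 10.0))  # None (the alert would never fire)
print(filtered_sum(backlog, 10.0))     # 12.0
```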
Samuel Berthe
080a792777
data: adding python/ruby/golang (#502)
* data: adding python/ruby/golang

* fix: address review feedback on runtime alerts

- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
2026-03-15 19:46:39 +01:00
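The goroutine fix above swaps rate() for deriv() because goroutine count is a gauge. A simplified Python sketch of why that matters follows. It assumes evenly spaced samples and hypothetical helper names, and it leaves out details of the real PromQL functions such as extrapolation. Counter-style logic reads every drop as a reset to zero, so a falling gauge looks like growth, whereas a least-squares slope reports the actual trend.

```python
from typing import List


def counter_increase(samples: List[float]) -> float:
    """Counter-style increase: any drop is treated as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def least_squares_slope(samples: List[float], step: float = 1.0) -> float:
    """Slope of the best-fit line through evenly spaced samples (units per step)."""
    n = len(samples)
    xs = [i * step for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical goroutine counts that are steadily falling.
goroutines = [100.0, 80.0, 60.0, 40.0]
print(counter_increase(goroutines))     # 180.0 (a decline misread as growth)
print(least_squares_slope(goroutines))  # -20.0
```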
Samuel Berthe
9ae17eca97
Fix broken and misleading alert rules (#503)
- Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos)
- Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*)
- Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical
- Simplify PostgreSQL config change query (giant regex -> negative matcher)
- Downgrade PostgreSQL SSL compression severity from critical to warning
- Fix misleading "Host unusual disk read rate" name and description
2026-03-15 18:08:06 +01:00
Marcin Morawski
eeebb90e6f
Add systemd service name to HostSystemdServiceCrashed summary (#499)
* Add systemd service name to HostSystemdServiceCrashed summary

* Modify systemd service crash rule description

Updated the description for the systemd service crash rule to include the service name.

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-03-01 20:15:17 +01:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
* tune Targets Missing rules

* reworked query logic

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
Per Lundberg
51aea96ba7
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

When a machine runs out of memory, the node exporter can stop
responding for several minutes. I've adjusted the rule to take
this into account: even if the machine needs 15-20 minutes to
become responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-01-30 12:15:27 +01:00
Samuel Berthe
d400e3e64d
feat(k8s): cronjob rule (#491)
2026-01-07 13:57:42 +01:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
* Jenkins node offline for clause (#2)

* Convert cpu alert expressions to without() rather than on()

* Remove on() expression from network throughput alerts as labels fully match

---------

Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484)
2025-11-17 14:56:04 +01:00
Arve Knudsen
d58bc324ad
Add OpenTelemetry Collector monitoring alerts (#480)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-11-05 17:08:26 +01:00
andrii.k
9edef74e73
update kafka alerts (#478)
2025-10-13 14:24:37 +02:00
Riccardo Cannella
7832e01082
haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475)
* haproxy: align v1 and v2 max current session alerts

* fix: remove non-existing label

---------

Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>
2025-09-15 15:03:44 +02:00
Samuel Berthe
237e89babc
Update query for unused replication slot rule
2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh
a2c31358d1
Add couchdb alerts (#472)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

* add : couchdb roles config in rules.yml

* add : couchdb alerts in rules directory

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh
7bced89d2d
add : additional essential clickhouse alerts (#471)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-08-28 23:06:31 +02:00
Samuel Berthe
554850df41
Update rules.yml
2025-06-25 13:32:16 +02:00
Samuel Berthe
748524d580
Update rules.yml
2025-06-17 19:15:52 +02:00
Samuel Berthe
a5a3c2cd92
fix: HostHighCpuUsage (#466)
closes #457
2025-06-17 17:07:05 +02:00
Samuel Berthe
4b1b8242cb
Update rules.yml
2025-05-21 23:04:12 +02:00
andrii.k
e0e3cdda1d
update istio 4xx alert description (#463)
2025-05-08 19:49:18 +02:00
Carsten Thiel
79f45a5146
Adding rules for checking FluxCD (#458)
2025-05-03 22:52:26 +02:00
samber
9f5c641bdd
Publish
2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb
Publish
2025-04-23 08:28:06 +00:00
Samuel Berthe
4666830538
Update rules.yml
2025-04-23 10:18:08 +02:00
Roger
b3d25fafcf
feature/kubestate exporter check if node is scheduling disabeld (#462)
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld

* commented added

* typo in expr

* move code to right file


---------

Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-04-23 09:58:29 +02:00
Samuel Berthe
3b440fec7b
Remove buggy HostRequiresReboot rule
Closing #459
2025-04-17 17:26:00 +02:00
Samuel Berthe
8b730ef059
Update rules.yml
2025-03-27 17:23:19 +01:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics directly doesn't make sense because a spike in
the numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check whether operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
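The reasoning in the commit above can be reproduced with a few numbers. This is a minimal Python sketch, assuming hypothetical counter samples and helper names. It ignores counter resets, which the real `increase` function handles. Dividing the raw counters gives a lifetime average that stays high long after a spike, while dividing the increases over a window reflects only recent operations.

```python
from typing import List, Optional


def lifetime_average(time_total: float, ops_total: float) -> float:
    """Raw counter division: average latency since the process started."""
    return time_total / ops_total


def windowed_average(time_samples: List[float], ops_samples: List[float]) -> Optional[float]:
    """Ratio of increases over a window: average latency of recent operations only."""
    d_time = time_samples[-1] - time_samples[0]
    d_ops = ops_samples[-1] - ops_samples[0]
    if d_ops <= 0:
        return None  # no operations in the window
    return d_time / d_ops


# Hypothetical samples: a slow burst between the first two scrapes, then normal latency.
time_seconds = [10.0, 510.0, 511.0]
operations = [100.0, 200.0, 300.0]

print(lifetime_average(time_seconds[-1], operations[-1]))     # about 1.7 s/op, still elevated
print(windowed_average(time_seconds[-2:], operations[-2:]))   # 0.01 s/op in the latest window
```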
Samuel Berthe
2127c4ce90
Update rules.yml
2025-02-20 16:17:39 +01:00
Roman
c189984d0f
fix node-exporter.yaml missing parentheses (#452)
2025-02-20 15:05:48 +01:00
Samuel Berthe
6838196343
fix: remove duplicated rule
2025-02-19 15:25:29 +01:00