453 commits

Author SHA1 Message Date
Samuel Berthe
080a792777
data: adding python/ruby/golang (#502)
* data: adding python/ruby/golang

* fix: address review feedback on runtime alerts

- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
2026-03-15 19:46:39 +01:00
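The "use deriv() instead of rate() on gauge" item in commit 080a792777 can be illustrated with a small sketch. In Prometheus, `rate()` is meant for counters and treats any decrease as a counter reset, while `deriv()` fits a linear regression and is meant for gauges. The two helper functions and the sample data below are hypothetical simplifications for illustration, not the actual Prometheus implementation (which also extrapolates to the window boundaries).

```python
def counter_style_rate(samples):
    """Per-second increase that treats every drop as a counter reset,
    roughly the assumption rate() makes about its input."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a drop, assume the counter restarted from zero.
        total += (cur - prev) if cur >= prev else cur
    return total / (samples[-1][0] - samples[0][0])


def regression_slope(samples):
    """Least-squares slope of value over time, the idea behind deriv()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Hypothetical goroutine gauge scraped every 15 seconds: (timestamp, value)
goroutines = [(0, 100), (15, 120), (30, 90), (45, 95)]

print(counter_style_rate(goroutines))  # positive: the drop is misread as a reset
print(regression_slope(goroutines))    # negative: the gauge is trending down
```

With this data the counter-style calculation reports growth even though the goroutine count ends lower than it started, which is why a gauge-aware function is the safer choice for spike detection.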
Samuel Berthe
9ae17eca97
Fix broken and misleading alert rules (#503)
- Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos)
- Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*)
- Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical
- Simplify PostgreSQL config change query (giant regex -> negative matcher)
- Downgrade PostgreSQL SSL compression severity from critical to warning
- Fix misleading "Host unusual disk read rate" name and description
2026-03-15 18:08:06 +01:00
Marcin Morawski
eeebb90e6f
Add systemd service name to HostSystemdServiceCrashed summary (#499)
* Add systemd service name to HostSystemdServiceCrashed summary

* Modify systemd service crash rule description

Updated the description for the systemd service crash rule to include the service name.

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-03-01 20:15:17 +01:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
* tune Targets Missing rules

* reworked query logic

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
Per Lundberg
51aea96ba7
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

When a machine runs out of memory, it happens that the node
exporter stops responding for multiple minutes. I've adjusted
the rule now to take this into account: even if it takes 15-20
minutes before the machine becomes responsive again, the
alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-01-30 12:15:27 +01:00
Samuel Berthe
d400e3e64d
feat(k8s): cronjob rule (#491) 2026-01-07 13:57:42 +01:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
* Jenkins node offline for clause (#2)

* Convert cpu alert expressions to without() rather than on()

* Remove on() expression from network throughput alerts as labels fully match

---------

Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484) 2025-11-17 14:56:04 +01:00
Arve Knudsen
d58bc324ad
Add OpenTelemetry Collector monitoring alerts (#480)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-11-05 17:08:26 +01:00
andrii.k
9edef74e73
update kafka alerts (#478) 2025-10-13 14:24:37 +02:00
Riccardo Cannella
7832e01082
haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475)
* haproxy: align v1 and v2 max current session alerts

* fix: remove non-existing label

---------

Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>
2025-09-15 15:03:44 +02:00
Samuel Berthe
237e89babc
Update query for unused replication slot rule 2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh
a2c31358d1
Add couchdb alerts (#472)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

* add : couchdb roles config in rules.yml

* add : couchdb alerts in rules directory

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh
7bced89d2d
add : additional essential clickhouse alerts (#471)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-08-28 23:06:31 +02:00
Samuel Berthe
554850df41
Update rules.yml 2025-06-25 13:32:16 +02:00
Samuel Berthe
748524d580
Update rules.yml 2025-06-17 19:15:52 +02:00
Samuel Berthe
a5a3c2cd92
fix: HostHighCpuUsage (#466)
closes #457
2025-06-17 17:07:05 +02:00
Samuel Berthe
4b1b8242cb
Update rules.yml 2025-05-21 23:04:12 +02:00
andrii.k
e0e3cdda1d
update istio 4xx alert description (#463) 2025-05-08 19:49:18 +02:00
Carsten Thiel
79f45a5146
Adding rules for checking FluxCD (#458) 2025-05-03 22:52:26 +02:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
Samuel Berthe
4666830538
Update rules.yml 2025-04-23 10:18:08 +02:00
Roger
b3d25fafcf
feature/kubestate exporter check if node is scheduling disabeld (#462)
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld

* commented added

* typo in expr

* move code to right file


---------

Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-04-23 09:58:29 +02:00
Samuel Berthe
3b440fec7b
Remove buggy HostRequiresReboot rule
Closing #459
2025-04-17 17:26:00 +02:00
Samuel Berthe
8b730ef059
Update rules.yml 2025-03-27 17:23:19 +01:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics doesn't make sense because a spike in
numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check if operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
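The reasoning in commit 97a31f34e5 can be sketched numerically. Dividing two raw counters gives the average latency since the process started, so a single slow burst keeps the ratio elevated long after operations return to normal. Dividing the increases over a short window gives the average for that window only. The function names, threshold, and sample values below are assumptions made for illustration; Prometheus' `increase()` additionally handles counter resets and extrapolation, which this sketch ignores.

```python
def lifetime_average(time_total, count_total):
    """Ratio of raw counters: average seconds per operation since start."""
    return time_total / count_total


def windowed_average(time_samples, count_samples):
    """Ratio of increases across a window: average seconds per operation
    for the operations that happened inside that window only."""
    delta_time = time_samples[-1] - time_samples[0]
    delta_count = count_samples[-1] - count_samples[0]
    if delta_count == 0:
        return 0.0  # no operations in the window
    return delta_time / delta_count


THRESHOLD = 0.1  # hypothetical alert threshold, seconds per operation

# Hypothetical counter samples, one per minute.
# Minute 1: 100 fast ops. Minute 2: 10 very slow ops. Minutes 3-4: fast again.
count_total = [0, 100, 110, 210, 310]
time_total = [0.0, 1.0, 51.0, 52.0, 53.0]

# Raw ratio is still above the threshold two minutes after the spike.
print(lifetime_average(time_total[-1], count_total[-1]) > THRESHOLD)

# Windowed ratio fires during the spike and clears once latency recovers.
print(windowed_average(time_total[1:3], count_total[1:3]) > THRESHOLD)
print(windowed_average(time_total[3:5], count_total[3:5]) > THRESHOLD)
```

The first two checks evaluate to true and the last to false, matching the commit's point that the windowed form reflects recent behaviour rather than the whole history of the counters.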
Samuel Berthe
2127c4ce90
Update rules.yml 2025-02-20 16:17:39 +01:00
Roman
c189984d0f
fix node-exporter.yaml missing parentheses (#452) 2025-02-20 15:05:48 +01:00
Samuel Berthe
6838196343
fix: remove duplicated rule 2025-02-19 15:25:29 +01:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
Samuel Berthe
add097c489
data: revert 5f57f09 (see #398) 2025-02-16 23:36:44 +01:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
Samuel Berthe
fb857e8b39
data: fix rules 2025-02-16 23:16:36 +01:00
Samuel Berthe
ae12871fa9
Update rules.yml 2025-02-04 16:40:21 +01:00
Felix Bühler
10d00c66da
Add caddy.yml (#450) 2025-02-04 14:23:14 +01:00
Samuel Berthe
fc6b3faadc
Fix from #405 2025-01-28 06:04:10 +01:00
Samuel Berthe
d916b7c6ab
Fix from #405 2025-01-28 05:58:49 +01:00
sunlei
cbb2337438
fix: formatting errors (#448)
* fix: formatting errors

* Update query format in rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-01-12 22:01:21 +01:00
Samuel Berthe
bdcc67c04e
Update rules.yml 2024-12-16 12:17:59 +01:00
Samuel Berthe
84a3b517a8
Update rules.yml 2024-12-16 12:17:26 +01:00
Samuel Berthe
a8d7c43b30
Update rules.yml 2024-12-08 21:28:07 +01:00
Samuel Berthe
8c3d06502f
Update rules.yml 2024-12-05 23:37:28 +01:00
Martin Anderson
353ef1ed95
RabbitMQ: add too many ready messages alert (#441)
* RabbitMQ: add too many ready messages alert

* Add RabbitMQ ready messages alert rule

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2024-11-30 10:29:57 +01:00
sipr-invivo
bb75cb2c68
feat: Add rule to Kubernetes Job not starting (#436) 2024-10-28 22:24:10 +01:00
Samuel Berthe
f08e8df514
oops 2024-08-28 08:48:42 +02:00
Samuel Berthe
995ab4d27a
Update rules.yml 2024-08-28 08:46:41 +02:00
Somrat Dutta
8c0bdc2b24
feat: Add NATS and JetStream Prometheus alert rules (#430)
* feat: Add comprehensive NATS and JetStream Prometheus alert rules

- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
  - High connection count
  - High pending bytes
  - High subscriptions count
  - High routes count
  - High memory usage
  - Slow consumers
  - NATS server downtime
  - High CPU usage
  - High number of active connections
  - High JetStream store and memory usage
  - Subscription limits exceeded
  - High pending messages
  - Authentication timeouts
  - Errors in NATS (JetStream API errors)
  - JetStream consumers limit exceeded
  - Exceeding max payload size
  - Leaf node connection issues
  - Ping operations limit exceeded
  - Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.

This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.

* Update rules.yml

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* fix indentation

---------

Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>
2024-08-20 20:37:03 +02:00
Samuel Berthe
d1715de751
fix PostgresqlInvalidIndex rule 2024-08-20 18:31:18 +02:00