165 commits

Author SHA1 Message Date
samber
375a36f82a Publish 2026-03-16 01:56:27 +00:00
samber
9f6d4fd2a2 Publish 2026-03-16 01:34:59 +00:00
samber
e2af1325c6 Publish 2026-03-16 00:27:40 +00:00
samber
879436f440 Publish 2026-03-15 18:47:04 +00:00
samber
1e4e3d17bc Publish 2026-03-15 17:08:32 +00:00
samber
80400e9a56 Publish 2026-03-01 19:15:42 +00:00
samber
0693ed168e Publish 2026-02-21 18:40:35 +00:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
* tune Targets Missing rules

* reworked query logic

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
samber
dd10c7ef05 Publish 2026-01-30 11:15:52 +00:00
samber
81081bdda5 Publish 2026-01-07 12:58:08 +00:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
* Jenkins node offline for clause (#2)

* Convert cpu alert expressions to without() rather than on()

* Remove on() expression from network throughput alerts as labels fully match

---------

Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484) 2025-11-17 14:56:04 +01:00
samber
cea78d7fd6 Publish 2025-11-05 16:08:52 +00:00
samber
4acbddb21a Publish 2025-11-05 16:04:56 +00:00
Samuel Berthe
6e2db98590
feat: add support for exporter-level comments (#481) 2025-11-05 17:04:30 +01:00
samber
ae8cfb0366 Publish 2025-10-13 12:24:59 +00:00
samber
606d6fc592 Publish 2025-09-15 13:04:10 +00:00
samber
b158ebb551 Publish 2025-09-14 17:22:29 +00:00
samber
5fbce5f513 Publish 2025-09-01 13:41:06 +00:00
Sajjad hassanzadeh
a2c31358d1
Add couchdb alerts (#472)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

* add : couchdb roles config in rules.yml

* add : couchdb alerts in rules directory

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-09-01 15:40:42 +02:00
samber
3abc7144aa Publish 2025-08-28 21:07:00 +00:00
Sajjad hassanzadeh
7bced89d2d
add : additional essential clickhouse alerts (#471)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-08-28 23:06:31 +02:00
samber
b04b11ce1d Publish 2025-06-25 11:32:39 +00:00
samber
ea63d8001a Publish 2025-06-17 17:16:15 +00:00
samber
6ebe6d8a8e Publish 2025-06-17 15:07:35 +00:00
samber
a3325114ea Publish 2025-05-21 21:04:42 +00:00
samber
67cf6892a4 Publish 2025-05-20 06:21:45 +00:00
jaqxues
98d6e7db05
Alloy: Fix incorrect alert (#464) 2025-05-20 08:21:14 +02:00
samber
becbe1be3b Publish 2025-05-08 17:49:45 +00:00
samber
fd9da90c1d Publish 2025-05-03 20:52:49 +00:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
samber
198035eaf4 Publish 2025-04-23 07:58:55 +00:00
samber
a75d5124c5 Publish 2025-04-17 15:26:25 +00:00
samber
32a4bfb19b Publish 2025-03-27 16:23:49 +00:00
samber
93f9daecee Publish 2025-03-27 13:42:51 +00:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing the raw counters doesn't make sense because a spike in
the numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Wrapping them in `increase` changes the query
to check whether operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
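The reasoning in the commit message for #455 can be illustrated with a small Python sketch. It is not the rule from the repository: the sample values, the `THRESHOLD`, and the helper names `lifetime_ratio` and `window_ratio` are all invented for illustration, and the sketch ignores counter resets, which PromQL's `increase` handles.

```python
# Hypothetical cumulative counter samples, one per minute (values invented).
# time_total: cumulative seconds spent on operations.
# ops_total:  cumulative number of operations.
time_total = [10.0, 20.0, 520.0, 530.0, 540.0, 550.0]
ops_total = [100, 200, 300, 400, 500, 600]

THRESHOLD = 0.5  # hypothetical alert threshold, seconds per operation


def lifetime_ratio(i):
    """Ratio of the raw counters at sample i: average latency since start."""
    return time_total[i] / ops_total[i]


def window_ratio(i, window=1):
    """Ratio of the counter increases over the last `window` samples."""
    d_time = time_total[i] - time_total[i - window]
    d_ops = ops_total[i] - ops_total[i - window]
    return d_time / d_ops if d_ops else 0.0


# Evaluate both styles of expression at every sample after the first.
raw_alerts = [lifetime_ratio(i) > THRESHOLD for i in range(1, len(ops_total))]
windowed_alerts = [window_ratio(i) > THRESHOLD for i in range(1, len(ops_total))]

print(raw_alerts)       # stays True long after the single slow minute
print(windowed_alerts)  # True only for the slow minute itself
```

In this made-up series only the third minute is slow, yet the raw-counter ratio stays above the threshold for every later sample, while the windowed ratio drops back as soon as the operations return to normal.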
samber
7bcae33011 Publish 2025-02-20 15:18:08 +00:00
samber
9963b750ac Publish 2025-02-20 14:06:17 +00:00
samber
807db03d0d Publish 2025-02-19 14:25:58 +00:00
samber
4e49e77d29 Publish 2025-02-16 22:47:17 +00:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
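The trigger condition described in the commit message for #382 can be sketched in Python as follows. This is only a reading of the prose, not the actual expression in `google-cadvisor.yml`: the function name `cpu_change_alert` and its parameters are hypothetical, and the sketch assumes the "25%" means an absolute difference of 25 percentage points between two CPU usage readings.

```python
def cpu_change_alert(rate_now, rate_1m_ago, rate_5m_ago, threshold=25.0):
    """Return True when CPU usage (in percent) moved by more than `threshold`
    points, up or down, relative to either earlier reading.

    rate_now, rate_1m_ago, rate_5m_ago are hypothetical inputs: the CPU usage
    rate now, one minute earlier, and five minutes earlier.
    """
    short_term = abs(rate_now - rate_1m_ago)  # change against the previous minute
    long_term = abs(rate_now - rate_5m_ago)   # change against five minutes ago
    return short_term > threshold or long_term > threshold


print(cpu_change_alert(80.0, 50.0, 78.0))  # sharp rise within one minute
print(cpu_change_alert(52.0, 50.0, 48.0))  # small drift, no alert
print(cpu_change_alert(20.0, 30.0, 60.0))  # slow decline visible over five minutes
```

Taking the absolute value is what lets a single condition cover both sudden increases and sudden drops, which matches the "HighLow" in the alert name.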
samber
7889a9a29b Publish 2025-02-16 22:37:09 +00:00
samber
12b8acb1b8 Publish 2025-02-16 22:29:24 +00:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
samber
20f9a36615 Publish 2025-02-16 22:17:02 +00:00
Felix Bühler
10d00c66da
Add caddy.yml (#450) 2025-02-04 14:23:14 +01:00
guruevi
70ac7d9cae
Various updates and quality of life changes (#405)
* smartctl_exporter publishes both drive_trip and current drive temperatures. Since most of these alerts are going to be permanent, it does not make sense to wait for the alert to be active for a certain time. Temperature sensors likewise vary, so using only the last sample is not sufficient to alert on potential issues.

* Add an option to run GitHub Action manually

* Add an option to force running the action for testing purposes

* Set variables correctly

* Set variables correctly

* Publish

* Clean up some more metrics

* Publish

* Minor bug fixes

* Publish

* Removed queries that throw errors when systems are upgraded. Also fixed and simplified a few Postgres queries.

* Publish

* Refined some more queries

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* Publish

* Query fails if instance names are not unique across jobs. This fixes it.

* Publish

* Ruby is out of date

---------

Co-authored-by: samber <samber@users.noreply.github.com>
2025-01-28 06:06:47 +01:00
sunlei
cbb2337438
fix: formatting errors (#448)
* fix: formatting errors

* Update query format in rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-01-12 22:01:21 +01:00