Commit graph

137 commits

Author SHA1 Message Date
samber
67cf6892a4 Publish 2025-05-20 06:21:45 +00:00
jaqxues
98d6e7db05
Alloy: Fix incorrect alert (#464) 2025-05-20 08:21:14 +02:00
samber
becbe1be3b Publish 2025-05-08 17:49:45 +00:00
samber
fd9da90c1d Publish 2025-05-03 20:52:49 +00:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
samber
198035eaf4 Publish 2025-04-23 07:58:55 +00:00
samber
a75d5124c5 Publish 2025-04-17 15:26:25 +00:00
samber
32a4bfb19b Publish 2025-03-27 16:23:49 +00:00
samber
93f9daecee Publish 2025-03-27 13:42:51 +00:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics doesn't make sense because a spike in
numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check if operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
samber
7bcae33011 Publish 2025-02-20 15:18:08 +00:00
samber
9963b750ac Publish 2025-02-20 14:06:17 +00:00
samber
807db03d0d Publish 2025-02-19 14:25:58 +00:00
samber
4e49e77d29 Publish 2025-02-16 22:47:17 +00:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
samber
7889a9a29b Publish 2025-02-16 22:37:09 +00:00
samber
12b8acb1b8 Publish 2025-02-16 22:29:24 +00:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
samber
20f9a36615 Publish 2025-02-16 22:17:02 +00:00
Felix Bühler
10d00c66da
Add caddy.yml (#450) 2025-02-04 14:23:14 +01:00
guruevi
70ac7d9cae
Various updates and quality of life changes (#405)
* smartctl_exporter publishes both drive_trip and current drive temperatures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensors likewise vary, using the last sample is not sufficient to alert on potential issues.

* Add an option to run GitHub Action manually

* Add an option to force running the action for testing purposes

* Set variables correctly

* Set variables correctly

* Publish

* Clean up some more metrics

* Publish

* Minor bug fixes

* Publish

* Removed queries that throw errors when systems are upgraded. Also fixed and simplified a few Postgres queries.

* Publish

* Refined some more queries

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* Publish

* Query fails if instance names are not unique across jobs. This fixes it.

* Publish

* Ruby is out of date

---------

Co-authored-by: samber <samber@users.noreply.github.com>
2025-01-28 06:06:47 +01:00
sunlei
cbb2337438
fix: formatting errors (#448)
* fix: formatting errors

* Update query format in rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-01-12 22:01:21 +01:00
samber
53a369769d Publish 2024-12-16 11:19:08 +00:00
samber
4533f23b79 Publish 2024-12-16 11:17:17 +00:00
dxrayz
52d4a8c744
Update postgres-exporter.yml (#444)
Modify PostgresqlConfigurationChanged for prevent error: "many-to-many matching not allowed: matching labels must be unique on one side" in cases when you have multiple instances of postgres
2024-12-16 12:16:05 +01:00
samber
c5203e94d0 Publish 2024-12-08 20:29:15 +00:00
samber
4e38ae2087 Publish 2024-12-05 22:38:38 +00:00
samber
8a220b1b8a Publish 2024-11-30 09:31:05 +00:00
Martin Anderson
353ef1ed95
RabbitMQ: add too many ready messages alert (#441)
* RabbitMQ: add too many ready messages alert

* Add RabbitMQ ready messages alert rule

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2024-11-30 10:29:57 +01:00
samber
14949721ba Publish 2024-10-28 21:25:18 +00:00
samber
4aa45dee05 Publish 2024-08-28 06:49:52 +00:00
Somrat Dutta
8c0bdc2b24
feat: Add NATS and JetStream Prometheus alert rules (#430)
* feat: Add comprehensive NATS and JetStream Prometheus alert rules

- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
  - High connection count
  - High pending bytes
  - High subscriptions count
  - High routes count
  - High memory usage
  - Slow consumers
  - NATS server downtime
  - High CPU usage
  - High number of active connections
  - High JetStream store and memory usage
  - Subscription limits exceeded
  - High pending messages
  - Authentication timeouts
  - Errors in NATS (JetStream API errors)
  - JetStream consumers limit exceeded
  - Exceeding max payload size
  - Leaf node connection issues
  - Ping operations limit exceeded
  - Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.

This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.

* Update rules.yml

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* fix indentation

---------

Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>
2024-08-20 20:37:03 +02:00
samber
02687db33d Publish 2024-08-20 16:32:36 +00:00
samber
58ade95b8b Publish 2024-07-02 07:34:59 +00:00
Greg
9557d4b50e
feat(meilisearch): add basic set of rules (#425)
* feat(meilisearch): add basic meilisearch rules

* fix(query): use == instead of =

* fix(data): set correct name and use ==

* chore(meilisearch): remove index filter
2024-07-02 09:33:08 +02:00
samber
60c235975c Publish 2024-06-14 18:16:53 +00:00
samber
1ee046b739 Publish 2024-06-06 20:54:49 +00:00
samber
8759c50440 Publish 2024-05-23 12:45:56 +00:00
samber
7dd767c4b4 Publish 2024-05-15 06:10:06 +00:00
samber
826be5877f Publish 2024-05-14 18:44:11 +00:00
R.Sicart
262e451625
kube hpa lint and improvement (#417)
* fix: hpa alerts are using  label but the queries remove it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: hpa alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* feat: hpa scale max should not alert when min and max are the same

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

---------

Signed-off-by: R.Sicart <roger.sicart@gmail.com>
2024-05-14 20:43:00 +02:00
samber
81079a2a7e Publish 2024-05-14 18:35:54 +00:00
samber
04886da968 Publish 2024-05-13 10:10:12 +00:00
samber
613401a960 Publish 2024-05-13 09:12:01 +00:00
samber
84b0569c97 Publish 2024-05-13 08:33:30 +00:00
Ali
2547288c13
Added Clickhouse (#412)
* Added Clickhouse

* Update rules.yml

Added reasonable time periods for each query to avoid false positives and in some cased give the system a short window to try to solve the issue.
Also changed the severity level of authentication alerts from critical to info which seems more appropriate

* Modified time period for alerts embedded-exporter.yml

I made a few adjustments in time periods.
See if they seem reasonable or not

* Replication alerts time periods were adjusted

IMHO, replication alerts must be sent right away.
2024-05-13 10:32:18 +02:00
samber
515fca9c10 Publish 2024-05-05 23:33:11 +00:00
samber
5c0963558a Publish 2024-05-02 18:49:56 +00:00