Commit graph

485 commits

Author SHA1 Message Date
andrii.k
e0e3cdda1d
update istio 4xx alert description (#463) 2025-05-08 19:49:18 +02:00
Carsten Thiel
79f45a5146
Adding rules for checking FluxCD (#458) 2025-05-03 22:52:26 +02:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
Samuel Berthe
4666830538
Update rules.yml 2025-04-23 10:18:08 +02:00
Roger
b3d25fafcf
feature/kubestate exporter check if node is scheduling disabeld (#462)
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld

* commented added

* typo in expr

* move code to right file


---------

Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-04-23 09:58:29 +02:00
Samuel Berthe
3b440fec7b
Remove buggy HostRequiresReboot rule
Closing #459
2025-04-17 17:26:00 +02:00
Samuel Berthe
8b730ef059
Update rules.yml 2025-03-27 17:23:19 +01:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics doesn't make sense because a spike in the
numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check if operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
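
The reasoning in the message above can be illustrated with a small sketch. The sample values and the 0.1 s threshold are made up for illustration, and the windowed calculation is only a simplified stand-in for PromQL's `increase` (no extrapolation, no counter-reset handling):

```python
# Hypothetical cumulative samples, one per minute:
# (fetch_time_seconds_total, fetch_total). Values are invented.
samples = [
    (0.0, 0),
    (1.0, 100),   # normal minute: 100 fetches took 1 s in total
    (61.0, 200),  # spike: 100 fetches took 60 s in total
    (62.0, 300),  # back to normal
    (63.0, 400),  # still normal
]

THRESHOLD = 0.1  # seconds per fetch, an assumed alert threshold


def lifetime_ratio(i):
    """Divide the raw counters: average latency since process start."""
    time_total, count_total = samples[i]
    return time_total / count_total


def windowed_ratio(i):
    """Divide the per-minute increases: average latency in the last minute."""
    (t0, n0), (t1, n1) = samples[i - 1], samples[i]
    return (t1 - t0) / (n1 - n0)


old_alert = [lifetime_ratio(i) > THRESHOLD for i in range(1, len(samples))]
new_alert = [windowed_ratio(i) > THRESHOLD for i in range(1, len(samples))]

print(old_alert)  # keeps firing after the spike has passed
print(new_alert)  # fires only for the minute containing the spike
```

With the raw counters, a single slow minute keeps the ratio above the threshold for several minutes afterwards; with per-window increases the alert clears as soon as operations return to normal.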
Samuel Berthe
2127c4ce90
Update rules.yml 2025-02-20 16:17:39 +01:00
Roman
c189984d0f
fix node-exporter.yaml missing parentheses (#452) 2025-02-20 15:05:48 +01:00
Samuel Berthe
6838196343
fix: remove duplicated rule 2025-02-19 15:25:29 +01:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
Samuel Berthe
add097c489
data: revert 5f57f09 (see #398) 2025-02-16 23:36:44 +01:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
Samuel Berthe
fb857e8b39
data: fix rules 2025-02-16 23:16:36 +01:00
Samuel Berthe
ae12871fa9
Update rules.yml 2025-02-04 16:40:21 +01:00
Felix Bühler
10d00c66da
Add caddy.yml (#450) 2025-02-04 14:23:14 +01:00
Samuel Berthe
fc6b3faadc
Fix from #405 2025-01-28 06:04:10 +01:00
Samuel Berthe
d916b7c6ab
Fix from #405 2025-01-28 05:58:49 +01:00
sunlei
cbb2337438
fix: formatting errors (#448)
* fix: formatting errors

* Update query format in rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-01-12 22:01:21 +01:00
Samuel Berthe
bdcc67c04e
Update rules.yml 2024-12-16 12:17:59 +01:00
Samuel Berthe
84a3b517a8
Update rules.yml 2024-12-16 12:17:26 +01:00
Samuel Berthe
a8d7c43b30
Update rules.yml 2024-12-08 21:28:07 +01:00
Samuel Berthe
8c3d06502f
Update rules.yml 2024-12-05 23:37:28 +01:00
Martin Anderson
353ef1ed95
RabbitMQ: add too many ready messages alert (#441)
* RabbitMQ: add too many ready messages alert

* Add RabbitMQ ready messages alert rule

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2024-11-30 10:29:57 +01:00
sipr-invivo
bb75cb2c68
feat: Add rule to Kubernetes Job not starting (#436) 2024-10-28 22:24:10 +01:00
Samuel Berthe
f08e8df514
oops 2024-08-28 08:48:42 +02:00
Samuel Berthe
995ab4d27a
Update rules.yml 2024-08-28 08:46:41 +02:00
Somrat Dutta
8c0bdc2b24
feat: Add NATS and JetStream Prometheus alert rules (#430)
* feat: Add comprehensive NATS and JetStream Prometheus alert rules

- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
  - High connection count
  - High pending bytes
  - High subscriptions count
  - High routes count
  - High memory usage
  - Slow consumers
  - NATS server downtime
  - High CPU usage
  - High number of active connections
  - High JetStream store and memory usage
  - Subscription limits exceeded
  - High pending messages
  - Authentication timeouts
  - Errors in NATS (JetStream API errors)
  - JetStream consumers limit exceeded
  - Exceeding max payload size
  - Leaf node connection issues
  - Ping operations limit exceeded
  - Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.

This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.

* Update rules.yml

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* fix indentation

---------

Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>
2024-08-20 20:37:03 +02:00
Samuel Berthe
d1715de751
fix PostgresqlInvalidIndex rule 2024-08-20 18:31:18 +02:00
Samuel Berthe
47e74f65e0
Update rules.yml 2024-07-02 09:33:51 +02:00
Greg
9557d4b50e
feat(meilisearch): add basic set of rules (#425)
* feat(meilisearch): add basic meilisearch rules

* fix(query): use == instead of =

* fix(data): set correct name and use ==

* chore(meilisearch): remove index filter
2024-07-02 09:33:08 +02:00
Samuel Berthe
ca4fb01c6d
Update rules.yml 2024-06-14 20:15:44 +02:00
Samuel Berthe
1e4ea0b3e7
Update rules.yml 2024-06-06 22:53:29 +02:00
Samuel Berthe
9b0ac7d230
Update rules.yml 2024-05-23 14:44:45 +02:00
Samuel Berthe
1adecd9ee7
Update rules.yml 2024-05-15 08:08:58 +02:00
Enes Yalınkaya
9877561b6c
fix elasticsearch rate rules (#418)
* fix elasticsearch rate rules

* fix

* fix

* fix
2024-05-15 08:07:55 +02:00
R.Sicart
262e451625
kube hpa lint and improvement (#417)
* fix: hpa alerts are using  label but the queries remove it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: hpa alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* feat: hpa scale max should not alert when min and max are the same

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

---------

Signed-off-by: R.Sicart <roger.sicart@gmail.com>
2024-05-14 20:43:00 +02:00
R.Sicart
8460f9008e
fix: some kube api alert lint (#416)
* fix: apiserver regexp matchers are automatically fully anchored

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: apiserver errors alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: apiserver latency alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

---------

Signed-off-by: R.Sicart <roger.sicart@gmail.com>
2024-05-14 20:34:43 +02:00
Florian Schlichting
396083a2a1
Fix HaproxyBackendMaxActiveSession: look at current / limit (#413)
haproxy_backend_max_sessions is the maximum number of sessions ever encountered during the lifetime of the HAProxy process. That is, it will never go down until HAProxy is restarted, so the alert continues to fire even though the situation has cleared!

This doesn't make sense. Look at the currently active sessions instead.
2024-05-13 12:09:04 +02:00
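
A minimal sketch of why a lifetime maximum is a poor alert input, using invented session counts and an assumed 80% threshold (neither comes from the rule itself):

```python
from itertools import accumulate

# Hypothetical values, for illustration only.
session_limit = 100
current_sessions = [10, 20, 90, 30, 15]

# A "max ever seen" metric behaves like a running maximum:
# it never decreases until the process restarts.
max_sessions_seen = list(accumulate(current_sessions, max))

# Alert when usage exceeds 80% of the limit (assumed threshold).
alert_on_max = [m / session_limit > 0.8 for m in max_sessions_seen]
alert_on_current = [c / session_limit > 0.8 for c in current_sessions]

print(max_sessions_seen)
print(alert_on_max)      # stays true after the peak
print(alert_on_current)  # clears once the load drops
```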
Vijay Dharap
870bbd47d2
Fixed HPA rule to use more correct condition (#408)
* Fixed HPA rule to use more correct condition

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2024-05-13 11:10:55 +02:00
Ali
2547288c13
Added Clickhouse (#412)
* Added Clickhouse

* Update rules.yml

Added reasonable time periods for each query to avoid false positives and in some cases give the system a short window to try to solve the issue.
Also changed the severity level of authentication alerts from critical to info, which seems more appropriate.

* Modified time period for alerts embedded-exporter.yml

I made a few adjustments in time periods.
See if they seem reasonable or not

* Replication alerts time periods were adjusted

IMHO, replication alerts must be sent right away.
2024-05-13 10:32:18 +02:00
enesyalinkaya
59e6a9165d
add new alerts for elasticsearch rules.yml (#411)
This commit adds new Prometheus alert definitions to monitor indexing and query metrics in Elasticsearch clusters. These alerts are essential for detecting performance issues related to indexing and querying activities.
2024-05-06 01:32:00 +02:00
Sergey Shtoltz
aad1c4cd95
RedisOutOfConfiguredMaxmemory: checking if memory limit is set (#410) 2024-05-02 20:48:46 +02:00
Samuel Berthe
267c3e8e70
Update rules.yml 2024-04-29 22:35:43 +02:00
Rastislav Pôbiš
2494ccdf31
Added prepared statements mysqld-exporter alert (#407) 2024-03-26 16:56:15 +01:00
Samuel Berthe
1eb5c5834f
Update rules.yml 2024-03-11 23:28:06 +01:00
Samuel Berthe
90706282ad
Update rules.yml 2024-03-11 22:55:05 +01:00
Samuel Berthe
05c4716c2b
Fix KubernetesAPIserverlatency 2024-02-12 09:41:03 +01:00
Samuel Berthe
f5f6b338a3
fix: high/low cpu alert 2024-02-10 23:24:10 +01:00
Samuel Berthe
937cd35df7
💄 2024-02-10 20:04:17 +01:00
Samuel Berthe
5f57f09db0
fix(HostOutOfInodes): exclude msdosfs FS
See #398
2024-02-10 20:01:19 +01:00
Marek Červenka
4eb0e910e7
SMART monitoring (#402)
* SMART monitoring

* query regex fix

---------

Co-authored-by: Marek Cervenka <cervenka@ipex.cz>
2024-02-09 20:23:30 +01:00
Samuel Berthe
0727f2ef2e
Update rules.yml 2024-01-26 04:10:22 +01:00
josedev-union
c6ff5a59dc
feat: Add rules for Graph Node (#387)
Co-authored-by: josedev-union <josedev-union@users.noreply.github.com>
2024-01-20 20:33:26 +01:00
michaelact
7fa11bf6cc
Add simple and meaningful kube-state-metrics alert summary (#394)
* feat: add 'summary' to be overridden from rules.yml

* chore: add simple and meaningful summary for kubernetes alerts
2023-12-01 18:25:11 +01:00
Samuel Berthe
a4de5323ad
Update rules.yml 2023-11-26 02:18:16 +01:00
Samuel Berthe
76de11d71b
Update rules.yml 2023-10-24 15:03:51 +02:00
Pierre Riteau
cbf7046afa
Fix capitalisation of RabbitMQ (#392) 2023-10-13 17:09:10 +02:00
Vicky Wilson Jacob
7a8f883df6
feat: adding hadoop jmx exporter (#391)
* adding hadoop exporter

* added hadoop rules with jmx exporter

* added hadoop rules with jmx exporter

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-10-06 18:48:54 +02:00
Samuel Berthe
bacb433089
Update rules.yml 2023-09-18 20:14:57 +02:00
Samuel Berthe
053cde27e4
Update rules.yml 2023-08-22 15:51:53 +02:00
Pavel Timofeev
6b1685261d
Rework kube-state-metrics alerts (#381)
* Rework kube-state-metrics alerts:
- provide meaningful labels in summary as 'instance' label hardly makes sense in most of them
- rename some alerts to tell more accurately what the problem is
- adjust description trying to follow some kind of the message schema found in other alerts

* move changes to _data/rules.yml

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-08-20 00:39:22 +02:00
Samuel Berthe
c3d78786e8
fix ci 2023-08-15 20:27:13 +02:00
Roman Pertl
ecd92399d5
feat: adding patroni alert rules (#369) 2023-08-15 19:54:15 +02:00
fzyzcjy
13e90b3aea
Update rules.yml (#371) 2023-08-15 19:42:46 +02:00
Ted Hahn
94b9f3cfbb
Fix for Postgres max connections. Postgres does not limit connections by database, but total over the server. Additionally, alert labels didn't match across the pair. Using a min by on the right side deals with the possibility additional labels are present on your exporter. (#376) 2023-08-15 19:39:41 +02:00
Samuel Berthe
15e3131547
Update rules.yml 2023-08-15 19:36:22 +02:00
Samuel Berthe
eb3220c8d7
Update rules.yml 2023-08-15 19:34:14 +02:00
Ivan Dudin
86e3e38a99
fix typo (#377) 2023-08-07 19:43:10 +02:00
Samuel Berthe
ff76ceccde
Update rules.yml 2023-07-30 22:24:31 +02:00
Moritz
fe5f78171a
update rules.yml (#374) 2023-07-30 22:21:20 +02:00
Samuel Berthe
8c811045e5
Update rules.yml 2023-07-29 18:20:58 +02:00
Samuel Berthe
32cf16a53d
Update rules.yml 2023-07-12 14:32:43 +02:00
Samuel Berthe
1bb6c602f7
Update rules.yml 2023-07-06 13:54:31 +02:00
Samuel Berthe
5d254811b4
Update rules.yml 2023-06-27 00:28:31 +02:00
Samuel Berthe
47b7748618
Update rules.yml 2023-06-22 18:40:33 +02:00
Samuel Berthe
3d0c5fcafd
Update rules.yml 2023-06-22 18:29:21 +02:00
Samuel Berthe
600a759344
Update rules.yml 2023-06-22 15:01:06 +02:00
Samuel Berthe
ee86c2d233
Update rules.yml 2023-06-22 15:00:40 +02:00
michaelact
7e8bc1a215
Add under-utilized container alerts (#322)
* chore: add container under-utilized alerts

* chore: resolve duplicated query and description
2023-05-21 22:58:04 +02:00
Paul-Élie Testud
c36014f03e
fix(nginx): fix nginx query for histogram_percentile (#351) 2023-04-28 16:06:12 +02:00
deimosOmegaChan
b98b2a2777
fix node-exporter nodename regex expression (#349)
nodename should not depend on the prefix "hostname"
2023-04-25 10:58:52 +02:00
Samuel Berthe
9efec14d26
chore: move from "https://awesome-prometheus-alerts.grep.to" to "https://samber.github.io/awesome-prometheus-alerts/" 2023-04-23 23:32:26 +02:00
Madhu Sudhan
8b9fc8864f
refactor: node-exporter queries to include hostname as label which will be helpful for alerting (#348) 2023-04-23 22:16:08 +02:00
Mikael Lindström
8357165cfb
Update MongoDB replication lag alert to use seconds (#344)
The mongodb_rs_members_optimeDate metric is in milliseconds; the
replication lag query has been updated to reflect this.
2023-04-07 01:42:25 +02:00
Mikael Lindström
2617aa5dab
Fix MongoDB replication headroom query (#342)
The query was changed to use `mongodb_oplog_stats_start` and
`mongodb_oplog_stats_end` in #291 but these metrics do not represent
the start and end of the oplog. The original head and tail metrics are
calculated from the oplog and are consistent with the output of
`db.getReplicationInfo()`.
2023-04-03 10:01:25 +02:00
Samuel Berthe
f9b43cf3bf
Update rules.yml 2023-03-24 14:36:52 +01:00
Kratik Jain
aa2988693b
Adding more rules for Thanos Monitoring (#340)
* Adding more rules for Thanos Components Monitoring

* lint

* lint

* lint

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-03-15 18:26:24 +01:00
Samuel Berthe
59891728e4
Solves #336 2023-02-26 02:33:50 +01:00
Samuel Berthe
60cb26681f
Update rules.yml 2023-02-23 15:19:36 +01:00
Samuel Berthe
bde83bc9ee
Update rules.yml 2023-02-17 01:14:19 +01:00
alexandrumarian-portal
1e44e348ee
Hashicorp Vault cluster health (#338)
* Hashicorp Vault cluster health
2023-02-17 01:13:41 +01:00
Samuel Berthe
65a0f969be
Update rules.yml 2023-02-14 14:02:35 +01:00
Yannick Markus
7aeccf2874
Add APC UPS & ZFS exporter (#331)
* add apcupsd_exporter rules

* add zfs_exporter rules
2023-02-12 20:01:26 +01:00
Jan Gosmann
df6d71bad5
Make ElasticsearchNoNewDocuments alert more robust (#334)
Use `elasticsearch_indices_indexing_index_total` instead of
`elasticsearch_indices_docs` because `elasticsearch_indices_docs` might
not update without an index refresh [1]. Refreshes happen every second
by default, *but* only if there have been search requests within the
last 30 seconds [2]. If there are no search requests for a sufficiently
long duration, the alert based on `elasticsearch_indices_docs` will fire
mistakenly.

Apart from that, `elasticsearch_indices_docs` has the gauge metric type
(while `elasticsearch_indices_indexing_index_total` is of the counter
type) and the `increase` function is not intended to be used with
gauges. Drops in the document count would be treated as a reset to 0,
thus showing an increase by all remaining documents.

[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html#index-stats-api-path-params
[2]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
2023-01-30 17:06:40 +01:00
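
The second point in the message above (a gauge fed into `increase`) can be sketched as follows. `simplified_increase` is a hypothetical helper that mimics only the counter-reset rule, where any drop is treated as a reset to 0; it does not reproduce the extrapolation that the real PromQL function performs, and the sample values are invented:

```python
def simplified_increase(values):
    """Sum the deltas between samples, treating any drop as a counter
    reset to 0 so that the new value is counted in full."""
    total = 0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # assumed reset: counter restarted from 0
    return total


# Gauge: document count drops from 1000 to 900 after a deletion.
docs_gauge = [1000, 1000, 900, 900]

# Counter: total indexing operations, no new documents indexed.
indexing_counter = [5000, 5000, 5000, 5000]

print(simplified_increase(docs_gauge))        # looks like 900 new documents
print(simplified_increase(indexing_counter))  # correctly reports 0
```

The gauge's drop of 100 documents is misread as an increase of all 900 remaining documents, while the counter reports no new indexing activity.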
Samuel Berthe
5e84329360
Update rules.yml 2023-01-16 00:37:38 +01:00
Sören König
40478c50cc
Add under-utilized HPA alert (#330)
This alert should inform when HPAs are scaled more than half the time at their minReplicas, which is an indication of possible cost savings.
In addition, it is assumed that a minimum number of replicas should still be running for redundancy.
2023-01-16 00:36:59 +01:00