Commit graph

373 commits

Author SHA1 Message Date
Samuel Berthe
bacb433089
Update rules.yml 2023-09-18 20:14:57 +02:00
Samuel Berthe
053cde27e4
Update rules.yml 2023-08-22 15:51:53 +02:00
Pavel Timofeev
6b1685261d
Rework kube-state-metrics alerts (#381)
* Rework kube-state-metrics alerts:
- provide meaningful labels in the summary, as the 'instance' label hardly makes sense in most of them
- rename some alerts to describe the problem more accurately
- adjust descriptions to follow the message schema found in other alerts

* move changes to _data/rules.yml

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-08-20 00:39:22 +02:00
Samuel Berthe
c3d78786e8
fix ci 2023-08-15 20:27:13 +02:00
Roman Pertl
ecd92399d5
feat: adding patroni alert rules (#369) 2023-08-15 19:54:15 +02:00
fzyzcjy
13e90b3aea
Update rules.yml (#371) 2023-08-15 19:42:46 +02:00
Ted Hahn
94b9f3cfbb
Fix for Postgres max connections. Postgres does not limit connections by database, but total over the server. Additionally, alert labels didn't match across the pair. Using a min by on the right side deals with the possibility additional labels are present on your exporter. (#376) 2023-08-15 19:39:41 +02:00
Samuel Berthe
15e3131547
Update rules.yml 2023-08-15 19:36:22 +02:00
Samuel Berthe
eb3220c8d7
Update rules.yml 2023-08-15 19:34:14 +02:00
Ivan Dudin
86e3e38a99
fix typo (#377) 2023-08-07 19:43:10 +02:00
Samuel Berthe
ff76ceccde
Update rules.yml 2023-07-30 22:24:31 +02:00
Moritz
fe5f78171a
update rules.yml (#374) 2023-07-30 22:21:20 +02:00
Samuel Berthe
8c811045e5
Update rules.yml 2023-07-29 18:20:58 +02:00
Samuel Berthe
32cf16a53d
Update rules.yml 2023-07-12 14:32:43 +02:00
Samuel Berthe
1bb6c602f7
Update rules.yml 2023-07-06 13:54:31 +02:00
Samuel Berthe
5d254811b4
Update rules.yml 2023-06-27 00:28:31 +02:00
Samuel Berthe
47b7748618
Update rules.yml 2023-06-22 18:40:33 +02:00
Samuel Berthe
3d0c5fcafd
Update rules.yml 2023-06-22 18:29:21 +02:00
Samuel Berthe
600a759344
Update rules.yml 2023-06-22 15:01:06 +02:00
Samuel Berthe
ee86c2d233
Update rules.yml 2023-06-22 15:00:40 +02:00
michaelact
7e8bc1a215
Add under-utilized container alerts (#322)
* chore: add container under-utilized alerts

* chore: resolve duplicated query and description
2023-05-21 22:58:04 +02:00
Paul-Élie Testud
c36014f03e
fix(nginx): fix nginx query for histogram_percentile (#351) 2023-04-28 16:06:12 +02:00
deimosOmegaChan
b98b2a2777
fix node-exporter nodename regex expression (#349)
nodename should not depend on the prefix "hostname"
2023-04-25 10:58:52 +02:00
Samuel Berthe
9efec14d26
chore: move from "https://awesome-prometheus-alerts.grep.to" to "https://samber.github.io/awesome-prometheus-alerts/" 2023-04-23 23:32:26 +02:00
Madhu Sudhan
8b9fc8864f
refactor: node-exporter queries to include hostname as label which will be helpful for alerting (#348) 2023-04-23 22:16:08 +02:00
Mikael Lindström
8357165cfb
Update MongoDB replication lag alert to use seconds (#344)
The mongodb_rs_members_optimeDate metric is in milliseconds; the
replication lag query has been updated to reflect this.
2023-04-07 01:42:25 +02:00
Mikael Lindström
2617aa5dab
Fix MongoDB replication headroom query (#342)
The query was changed to use `mongodb_oplog_stats_start` and
`mongodb_oplog_stats_end` in #291 but these metrics do not represent
the start and end of the oplog. The original head and tail metrics are
calculated from the oplog and are consistent with the output of
`db.getReplicationInfo()`.
2023-04-03 10:01:25 +02:00
Samuel Berthe
f9b43cf3bf
Update rules.yml 2023-03-24 14:36:52 +01:00
Kratik Jain
aa2988693b
Adding more rules for Thanos Monitoring (#340)
* Adding more rules for Thanos Components Monitoring

* lint

* lint

* lint

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-03-15 18:26:24 +01:00
Samuel Berthe
59891728e4
Solves #336 2023-02-26 02:33:50 +01:00
Samuel Berthe
60cb26681f
Update rules.yml 2023-02-23 15:19:36 +01:00
Samuel Berthe
bde83bc9ee
Update rules.yml 2023-02-17 01:14:19 +01:00
alexandrumarian-portal
1e44e348ee
Hashicorp Vault cluster health (#338)
* Hashicorp Vault cluster health
2023-02-17 01:13:41 +01:00
Samuel Berthe
65a0f969be
Update rules.yml 2023-02-14 14:02:35 +01:00
Yannick Markus
7aeccf2874
Add APC UPS & ZFS exporter (#331)
* add apcupsd_exporter rules

* add zfs_exporter rules
2023-02-12 20:01:26 +01:00
Jan Gosmann
df6d71bad5
Make ElasticsearchNoNewDocuments alert more robust (#334)
Use `elasticsearch_indices_indexing_index_total` instead of
`elasticsearch_indices_docs` because `elasticsearch_indices_docs` might
not update without an index refresh [1]. Refreshes happen every second
by default, *but* only if there have been search requests within the
last 30 seconds [2]. If there are no search requests for a sufficiently
long duration, the alert based on `elasticsearch_indices_docs` will fire
mistakenly.

Apart from that, `elasticsearch_indices_docs` has the gauge metric type
(while `elasticsearch_indices_indexing_index_total` is of the counter
type) and the `increase` function is not intended to be used with
gauges. Drops in the document count would be treated as a reset to 0,
thus showing an increase by all remaining documents.

[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html#index-stats-api-path-params
[2]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
2023-01-30 17:06:40 +01:00
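
The gauge-versus-counter point in the commit message above can be illustrated with a small sketch. It is a simplified model, not Prometheus's actual implementation: the helper name `increase_with_resets` is hypothetical, and the extrapolation that `increase` applies at window boundaries is deliberately ignored. It only shows why a drop in a gauge such as a document count would be read as a reset to 0.

```python
def increase_with_resets(samples):
    """Simplified model of counter-style increase over a list of samples.

    Assumption: only reset handling is modelled; the boundary
    extrapolation done by the real `increase` function is left out.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the series grew (or stayed flat).
            total += curr - prev
        else:
            # A drop is interpreted as a counter reset to 0, so the
            # whole current value is counted as new increase.
            total += curr
    return total


# Hypothetical document counts: 20 documents added, then 30 deleted.
doc_count = [100, 120, 90]
print(increase_with_resets(doc_count))  # 110.0
print(doc_count[-1] - doc_count[0])     # -10
```

With these made-up samples the net change is -10 documents, yet the reset-aware sum reports 110, which matches the "increase by all remaining documents" described in the message.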
Samuel Berthe
5e84329360
Update rules.yml 2023-01-16 00:37:38 +01:00
Sören König
40478c50cc
Add under-utilized HPA alert (#330)
This alert should inform when HPAs are scaled more than half the time at their minReplicas, which is an indication of possible cost savings.
In addition, it is assumed that a minimum number of replicas should still be running for redundancy.
2023-01-16 00:36:59 +01:00
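
The condition described in the commit message above (an HPA sitting at its minReplicas more than half the time) could be sketched as follows. This is only an illustration under stated assumptions: the function name `mostly_at_min_replicas`, the sampled replica list, and the `threshold` parameter are hypothetical and are not taken from the actual rule in rules.yml.

```python
def mostly_at_min_replicas(current_replicas, min_replicas, threshold=0.5):
    """Return True if the HPA was at minReplicas for more than
    `threshold` of the sampled points.

    Assumption: `current_replicas` is an evenly spaced series of
    replica counts sampled over the observation window.
    """
    if not current_replicas:
        return False
    at_min = sum(1 for r in current_replicas if r == min_replicas)
    return at_min / len(current_replicas) > threshold


# Hypothetical samples: three of four points sit at minReplicas = 2.
print(mostly_at_min_replicas([2, 2, 2, 5], min_replicas=2))  # True
print(mostly_at_min_replicas([2, 5, 5, 5], min_replicas=2))  # False
```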
Samuel Berthe
160d0adcc2
Update rules.yml 2023-01-13 18:35:37 +01:00
Panos Rontogiannis
8f48bbfb25
Cert rules issues (#329)
* add comment for BlackboxSslCertificateExpired rule

* use last_over_time to make certificate rules less prone to flapping

* add lower bound thresholds on BlackboxSslCertificateWillExpireSoon rules to avoid overlap

* changed upper bound threshold for BlackboxSslCertificateWillExpireSoon to 20 days

* make BlackboxSslCertificateWillExpireSoon description clearer

* use days in certificate rules queries to improve notification values

Co-authored-by: Panos Rontogiannis <pronto@admin.grnet.gr>
2023-01-06 11:27:46 +01:00
Samuel Berthe
032eb896f5
rearrange 2022-12-06 10:37:09 +01:00
michaelact
447bb94c4d
Add under-utilized host and hardware alerts (#320)
* chore: add under-utilized alerts

* docs: add under-utilized alerts

* chore: add alert consideration times

* chore: delete generated alert rules file

* chore: not using for, instead in rule
2022-12-06 10:26:50 +01:00
Samuel Berthe
c00dd87733
fix kube rule 2022-12-04 23:12:35 +01:00
Samuel Berthe
a381fb5e22
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2022-12-04 23:12:05 +01:00
Samuel Berthe
a0c32093cb
oops 2022-12-04 23:12:00 +01:00
MatthieuFin
a5f32a0fab
fix(rule): fixing KubernetesPodNotHealthy (#215 #253) (#263) 2022-12-04 23:08:24 +01:00
michaelact
4466a07962
fix: add space for labels KubernetesJobFailed alert rule (#321)
Co-authored-by: xb4dc0d3
2022-11-30 12:28:23 +01:00
Samuel Berthe
1b25cbe568
See #323 2022-11-30 12:26:36 +01:00
Samuel Berthe
5956d28148
data: fix haproxy rule #319 2022-11-15 09:47:34 +01:00
Samuel Berthe
f484d30d66
data: fix haproxy rule #319 2022-11-11 14:46:56 +01:00