Commit graph

482 commits

Author SHA1 Message Date
Samuel Berthe
5f57f09db0
fix(HostOutOfInodes): exclude msdosfs FS
See #398
2024-02-10 20:01:19 +01:00
Marek Červenka
4eb0e910e7
SMART monitoring (#402)
* SMART monitoring

* query regex fix

---------

Co-authored-by: Marek Cervenka <cervenka@ipex.cz>
2024-02-09 20:23:30 +01:00
Samuel Berthe
0727f2ef2e
Update rules.yml 2024-01-26 04:10:22 +01:00
josedev-union
c6ff5a59dc
feat: Add rules for Graph Node (#387)
Co-authored-by: josedev-union <josedev-union@users.noreply.github.com>
2024-01-20 20:33:26 +01:00
michaelact
7fa11bf6cc
Add simple and meaningful kube-state-metrics alert summary (#394)
* feat: add 'summary' to be overridden from rules.yml

* chore: add simple and meaningful summary for kubernetes alerts
2023-12-01 18:25:11 +01:00
Samuel Berthe
a4de5323ad
Update rules.yml 2023-11-26 02:18:16 +01:00
Samuel Berthe
76de11d71b
Update rules.yml 2023-10-24 15:03:51 +02:00
Pierre Riteau
cbf7046afa
Fix capitalisation of RabbitMQ (#392) 2023-10-13 17:09:10 +02:00
Vicky Wilson Jacob
7a8f883df6
feat: adding hadoop jmx exporter (#391)
* adding hadoop exporter

* added hadoop rules with jmx exporter

* added hadoop rules with jmx exporter

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-10-06 18:48:54 +02:00
Samuel Berthe
bacb433089
Update rules.yml 2023-09-18 20:14:57 +02:00
Samuel Berthe
053cde27e4
Update rules.yml 2023-08-22 15:51:53 +02:00
Pavel Timofeev
6b1685261d
Rework kube-state-metrics alerts (#381)
* Rework kube-state-metrics alerts:
- provide meaningful labels in summary as 'instance' label hardly makes sense in most of them
- rename some alerts to tell more accurately what the problem is
- adjust description trying to follow some kind of the message schema found in other alerts

* move changes to _data/rules.yml

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-08-20 00:39:22 +02:00
Samuel Berthe
c3d78786e8
fix ci 2023-08-15 20:27:13 +02:00
Roman Pertl
ecd92399d5
feat: adding patroni alert rules (#369) 2023-08-15 19:54:15 +02:00
fzyzcjy
13e90b3aea
Update rules.yml (#371) 2023-08-15 19:42:46 +02:00
Ted Hahn
94b9f3cfbb
Fix for Postgres max connections. Postgres does not limit connections by database, but total over the server. Additionally, alert labels didn't match across the pair. Using a min by on the right side deals with the possibility additional labels are present on your exporter. (#376) 2023-08-15 19:39:41 +02:00
Samuel Berthe
15e3131547
Update rules.yml 2023-08-15 19:36:22 +02:00
Samuel Berthe
eb3220c8d7
Update rules.yml 2023-08-15 19:34:14 +02:00
Ivan Dudin
86e3e38a99
fix typo (#377) 2023-08-07 19:43:10 +02:00
Samuel Berthe
ff76ceccde
Update rules.yml 2023-07-30 22:24:31 +02:00
Moritz
fe5f78171a
update rules.yml (#374) 2023-07-30 22:21:20 +02:00
Samuel Berthe
8c811045e5
Update rules.yml 2023-07-29 18:20:58 +02:00
Samuel Berthe
32cf16a53d
Update rules.yml 2023-07-12 14:32:43 +02:00
Samuel Berthe
1bb6c602f7
Update rules.yml 2023-07-06 13:54:31 +02:00
Samuel Berthe
5d254811b4
Update rules.yml 2023-06-27 00:28:31 +02:00
Samuel Berthe
47b7748618
Update rules.yml 2023-06-22 18:40:33 +02:00
Samuel Berthe
3d0c5fcafd
Update rules.yml 2023-06-22 18:29:21 +02:00
Samuel Berthe
600a759344
Update rules.yml 2023-06-22 15:01:06 +02:00
Samuel Berthe
ee86c2d233
Update rules.yml 2023-06-22 15:00:40 +02:00
michaelact
7e8bc1a215
Add under-utilized container alerts (#322)
* chore: add container under-utilized alerts

* chore: resolve duplicated query and description
2023-05-21 22:58:04 +02:00
Paul-Élie Testud
c36014f03e
fix(nginx): fix nginx query for histogram_percentile (#351) 2023-04-28 16:06:12 +02:00
deimosOmegaChan
b98b2a2777
fix node-exporter nodename regex expression (#349)
nodename should not depend on the prefix "hostname"
2023-04-25 10:58:52 +02:00
Samuel Berthe
9efec14d26
chore: move from "https://awesome-prometheus-alerts.grep.to" to "https://samber.github.io/awesome-prometheus-alerts/" 2023-04-23 23:32:26 +02:00
Madhu Sudhan
8b9fc8864f
refactor: node-exporter queries to include hostname as label which will be helpful for alerting (#348) 2023-04-23 22:16:08 +02:00
Mikael Lindström
8357165cfb
Update MongoDB replication lag alert to use seconds (#344)
The mongodb_rs_members_optimeDate metric is in milliseconds, the
replication lag query has been updated to reflect this.
2023-04-07 01:42:25 +02:00
Mikael Lindström
2617aa5dab
Fix MongoDB replication headroom query (#342)
The query was changed to use `mongodb_oplog_stats_start` and
`mongodb_oplog_stats_end` in #291 but these metrics do not represent
the start and end of the oplog. The original head and tail metrics are
calculated from the oplog and are consistent with the output of
`db.getReplicationInfo()`.
2023-04-03 10:01:25 +02:00
Samuel Berthe
f9b43cf3bf
Update rules.yml 2023-03-24 14:36:52 +01:00
Kratik Jain
aa2988693b
Adding more rules for Thanos Monitoring (#340)
* Adding more rules for Thanos Components Monitoring

* lint

* lint

* lint

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2023-03-15 18:26:24 +01:00
Samuel Berthe
59891728e4
Solves #336 2023-02-26 02:33:50 +01:00
Samuel Berthe
60cb26681f
Update rules.yml 2023-02-23 15:19:36 +01:00
Samuel Berthe
bde83bc9ee
Update rules.yml 2023-02-17 01:14:19 +01:00
alexandrumarian-portal
1e44e348ee
Hashicorp Vault cluster health (#338)
* Hashicorp Vault cluster health
2023-02-17 01:13:41 +01:00
Samuel Berthe
65a0f969be
Update rules.yml 2023-02-14 14:02:35 +01:00
Yannick Markus
7aeccf2874
Add APC UPS & ZFS exporter (#331)
* add apcupsd_exporter rules

* add zfs_exporter rules
2023-02-12 20:01:26 +01:00
Jan Gosmann
df6d71bad5
Make ElasticsearchNoNewDocuments alert more robust (#334)
Use `elasticsearch_indices_indexing_index_total` instead of
`elasticsearch_indices_docs` because `elasticsearch_indices_docs` might
not update without an index refresh [1]. Refreshes happen every second
by default, *but* only if there have been search requests within the
last 30 seconds [2]. If there are no search requests for a sufficiently
long duration, the alert based on `elasticsearch_indices_docs` will fire
mistakenly.

Apart from that, `elasticsearch_indices_docs` has the gauge metric type
(while `elasticsearch_indices_indexing_index_total` is of the counter
type) and the `increase` function is not intended to be used with
gauges. Drops in the document count would be treated as a reset to 0,
thus showing an increase by all remaining documents.

[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html#index-stats-api-path-params
[2]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
2023-01-30 17:06:40 +01:00
Samuel Berthe
5e84329360
Update rules.yml 2023-01-16 00:37:38 +01:00
Sören König
40478c50cc
Add under-utilized HPA alert (#330)
This alert should inform when HPAs are scaled more than half the time at their minReplicas, which is an indication of possible cost savings.
In addition, it is assumed that a minimum number of replicas should still be running for redundancy.
2023-01-16 00:36:59 +01:00
Samuel Berthe
160d0adcc2
Update rules.yml 2023-01-13 18:35:37 +01:00
Panos Rontogiannis
8f48bbfb25
Cert rules issues (#329)
* add comment for BlackboxSslCertificateExpired rule

* use last_over_time to make certificate rules less prone to flapping

* add lower bound thresholds on BlackboxSslCertificateWillExpireSoon rules to avoid overlap

* changed upper bound threshold for BlackboxSslCertificateWillExpireSoon to 20 days

* make BlackboxSslCertificateWillExpireSoon description clearer

* use days in certificate rules queries to improve notification values

Co-authored-by: Panos Rontogiannis <pronto@admin.grnet.gr>
2023-01-06 11:27:46 +01:00
Samuel Berthe
032eb896f5
rearrange 2022-12-06 10:37:09 +01:00
michaelact
447bb94c4d
Add under-utilized host and hardware alerts (#320)
* chore: add under-utilized alerts

* docs: add under-utilized alerts

* chore: add alert consideration times

* chore: delete generated alert rules file

* chore: not using for, instead in rule
2022-12-06 10:26:50 +01:00
Samuel Berthe
c00dd87733
fix kube rule 2022-12-04 23:12:35 +01:00
Samuel Berthe
a381fb5e22
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2022-12-04 23:12:05 +01:00
Samuel Berthe
a0c32093cb
oops 2022-12-04 23:12:00 +01:00
MatthieuFin
a5f32a0fab
fix(rule): fixing KubernetesPodNotHealthy (#215 #253) (#263) 2022-12-04 23:08:24 +01:00
michaelact
4466a07962
fix: add space for labels KubernetesJobFailed alert rule (#321)
Co-authored-by: xb4dc0d3
2022-11-30 12:28:23 +01:00
Samuel Berthe
1b25cbe568
See #323 2022-11-30 12:26:36 +01:00
Samuel Berthe
5956d28148
data: fix haproxy rule #319 2022-11-15 09:47:34 +01:00
Samuel Berthe
f484d30d66
data: fix haproxy rule #319 2022-11-11 14:46:56 +01:00
Valery Voronov
1e46eacbe7
fix: added NodeNetworkUnavailable alerts, rm unused OOD alert (#318) 2022-10-31 15:47:27 +01:00
Nicolai Antiferov
9419e3fe7e
fix: Update elasticsearch_exporter repository (#317)
Was migrated some time ago to https://github.com/prometheus-community/elasticsearch_exporter

Fix #316
2022-10-31 10:10:46 +01:00
Samuel Berthe
cdf4551ab7
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2022-10-24 16:55:36 +02:00
Samuel Berthe
19c4223ce7
fix(minio): update queries 2022-10-24 16:54:38 +02:00
meoww-bot
98d8a7b53b
fix: check inodes space for all mountpoints (#315) 2022-10-24 13:47:12 +02:00
Samuel Berthe
6ba9eb104c
feat: adding cloudflare exporter (#310) 2022-10-03 16:57:24 +02:00
Yonah Dissen
55b049eb28
add argocd rules (#309)
* add argocd rules

* fix(argocd): move contrib into _data/rules.yml instead of dist/...

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2022-10-02 18:05:30 +02:00
meoww-bot
86d5efe399
Fix broken link (#305) 2022-08-30 09:51:07 +02:00
Samuel Berthe
40c0ff32f0
oops 2022-08-28 17:47:17 +02:00
Brett
0887515f98
Added query for node warmup before reporting it's down (#304)
Co-authored-by: Brett Yoakum <yoakum@adobe.com>
2022-08-28 16:31:15 +02:00
Samuel Berthe
b49a49c920
Update rules.yml 2022-08-16 20:17:46 +02:00
Samuel Berthe
250a71e95a
fix(postgresql): remove broken rules 2022-08-01 22:43:30 +02:00
Samuel Berthe
d8f7ecd5b4
adding zpool alert 2022-07-24 01:56:17 +02:00
Samuel Berthe
34081e4f43
fix #292 2022-07-24 00:42:21 +02:00
Samuel Berthe
9bbb65ffe1
Update rules.yml 2022-07-24 00:20:54 +02:00
Samuel Berthe
67266bbca6
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2022-07-06 12:50:02 +02:00
Samuel Berthe
95af2b4d95
fix: fix quantile query 2022-07-06 12:49:49 +02:00
Pooya
03fdabbfc5
Changed metric names to match new metric names. (#291)
* Changed alert names to match new alert names.

* Added MongodbReplicaMemberHealth to check the health of replica members, which is added in the new metrics

Co-authored-by: Pooya Dowlatabadi <pooya.dowlatabadi@arvancloud.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2022-06-27 17:29:07 +02:00
Samuel Berthe
4201302285
Update rules.yml 2022-06-23 22:29:21 +02:00
Samuel Berthe
9bbe04799f
feat: build and publish into dist/rules 2022-06-15 01:42:18 +02:00
Samuel Berthe
cbc20228e2
fix #226 2022-06-14 22:12:00 +02:00
Samuel Berthe
10b810fd6e
fix #276 2022-06-14 22:03:34 +02:00
Samuel Berthe
23876f8c6b
fix #155 2022-06-14 22:00:00 +02:00
Samuel Berthe
075d85b2d6
fix #236 2022-06-14 21:36:59 +02:00
Samuel Berthe
72a0d78638
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2022-06-14 21:29:22 +02:00
Samuel Berthe
e82b504e00
fixes #251 2022-06-14 21:29:12 +02:00
Bastien Dronneau
bac2e99aee
docs(postgresql): add auto prefix in order to match query (#288) 2022-06-14 21:19:00 +02:00
Samuel Berthe
b36ea8f45d
data: adding rule "Host CPU high iowait" 2022-06-09 02:04:45 +02:00
Samuel Berthe
0207783284
data: change postgresql exporter name 2022-06-09 01:00:35 +02:00
Samuel Berthe
3faf1332a1
fix: PrometheusAllTargetsMissing (#283) 2022-06-09 00:43:40 +02:00
Samuel Berthe
2323541f2d
data: adding mgob query 2022-06-09 00:23:17 +02:00
Samuel Berthe
08d482f314
doc: add postgresql bloat 2022-06-07 02:32:09 +02:00
Samuel Berthe
4662cd2812
doc: improve pulsar doc 2022-06-07 01:29:31 +02:00
Marcel Körtgen
074e3e6d04
Add pulsar rules (#286)
* Add pulsar rules

* Add webrick, cf.:
- https://github.com/github/pages-gem/issues/752

* Update gems (minitest / ruby 3 issue)

* Add repo info (workaround), cf.
- https://github.com/jekyll/jekyll/issues/4705
2022-06-07 01:21:10 +02:00
Samuel Berthe
4d26719d41
removed some rules 2022-04-19 00:07:31 +02:00
Samuel Berthe
97810b6537
change severity of PostgresqlConfigurationChanged to info 2022-04-18 23:37:17 +02:00
Samuel Berthe
8941f71c6c
chore(ci): adding test with promtool (#281) 2022-04-18 23:30:32 +02:00
Samuel Berthe
4d161ee0a5
feat(jenkins): add "jenkins outdated plugin" rule 2022-04-18 20:29:36 +02:00
Samuel
718b002826
fix / increases requires interval (#279) 2022-04-18 20:17:33 +02:00
Koen Dierckx
21ddd2f752
Added Alert manager job alert (#272)
Co-authored-by: DIERCKXK <koen.dierckx@vito.be>
2022-01-23 19:36:36 +01:00
armondressler
038e46743d
fixed erroneous usage of rate() function on gauges (#270)
Co-authored-by: Dressler Armon, B2B-PAP-HLT-DO-ENG <armon.dressler@swisscom.com>
2022-01-16 03:24:36 +01:00
MikeN. Paxos
78a7e61050
added jenkins alert rules for jenkins metrics plugin (#268)
* added jenkins alert rules

* Update rules.yml

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2021-12-27 12:48:07 +01:00
Samuel Berthe
fd0f2805c0
Renaming kube_hpa_* to kube_horizontalpodautoscaler_*
Fixes #266
2021-12-07 23:16:40 +01:00
Samuel Berthe
f3ef333a3e
doc: remove comment 2021-12-07 23:14:23 +01:00
Damon Vincent
a12f5263c2
Filter parent groups from Docker container alerts (#267) 2021-12-07 23:05:27 +01:00
Samuel Berthe
2ca7f5bebe
doc: more explicit description for HostClock* rules (#265) 2021-12-02 20:54:23 +01:00
Lauri Võsandi
2be7e9684c
Add HostNetworkBondDegraded (#260) 2021-12-02 20:48:11 +01:00
John Losito
1a7690a1a3
Add rule for reboot-required (#262) 2021-12-02 20:45:33 +01:00
leemos
ee3c878b06
apiserver_request_count has been turned off (#264) 2021-12-02 20:23:56 +01:00
Torsten Bøgh Köster
4e1a26cab3
Add Solr rules (#258) 2021-11-21 18:53:32 +01:00
chaoxiaodi
7a40d7f423
Update rules.yml (#252) 2021-10-27 14:00:35 +02:00
Samuel Berthe
7857afab6e
fix(rule): fixing KubernetesOutOfCapacity (#227) 2021-10-17 17:14:44 +02:00
Samuel Berthe
a978cfb5a1
doc: more explicit "ContainerAbsent" and "ContainerKilled" rules (#247) 2021-10-10 20:13:30 +02:00
Samuel Berthe
4e0d99dd09
fix(mongodb): fix query for MongodbReplicationHeadroom rule (#250) 2021-10-10 20:12:06 +02:00
kayge
2d9e4ae431
Cleaning up typos in rules.yml (#248) 2021-10-09 01:05:15 +02:00
Andre Martins
36ca52e598
adding alerts to promtail and loki (#241)
Co-authored-by: apmbktf <andre.pasqualinoto-martins@itau-unibanco.com.br>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2021-10-03 22:12:59 +02:00
Christian Zenker
7c67f02ee6
The metric is called 'thanos_compact_halted' (#243)
According to https://github.com/thanos-io/thanos/blob/main/examples/alerts/alerts.md
2021-09-21 15:48:27 +02:00
Ondřej Nový
abfae043bb
Fix typo in description (#242) 2021-09-19 23:37:51 +02:00
Samuel Berthe
a225087b06
prevent +inf max value 2021-08-19 23:45:58 +02:00
gökhan
b9222993ac
istio pilot duplicate cluster (#220) 2021-08-19 21:23:27 +02:00
Guillaume
6fcdcff5e3
Fix bad syntax for Haproxy rules (#232)
Aggregations require parentheses around expressions
2021-08-19 21:22:39 +02:00
flf2ko
a02a7e6eab
Fix "percentil" typo in Etcd rules (#234) 2021-08-19 21:21:16 +02:00
Krasimir Nedelchev
3d69117f33
Add missing parenthesis to rule (#237) 2021-08-19 21:20:11 +02:00
Igor Churmeev
3612c9cc3e
Add alerts for Hashicorp Vault (#238)
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2021-08-19 21:19:43 +02:00
Andre Martins
b47359c2fd
added alerts to cortex (#240)
* added alerts to cortex

* Update rules.yml

Co-authored-by: apmbktf <andre.pasqualinoto-martins@itau-unibanco.com.br>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2021-08-19 20:31:46 +02:00
Benjamin Dos Santos
7304d40539
fix(HostNetworkInterfaceSaturated): display network interface name in description (#239)
`$labels.interface` doesn't exist, use `$labels.device` instead
2021-08-16 16:29:12 +02:00
Gjed
c2b8178304
Loki alerts (#218)
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2021-07-04 23:59:46 +02:00
asteny
243c0280cf
Haproxy 2 embedded exporter fixes (#229) 2021-07-04 23:28:58 +02:00
Alexandros Orfanos
6a6f89bad5
Add php-fpm max-children alert (#224) 2021-06-29 12:37:54 +02:00
Alberto del Barrio
0ba7c2a47e
fix typo (#228) 2021-06-27 14:16:42 +02:00
Samuel Berthe
092d0f8bda
fix(haproxy): some queries were using wrong metric names 2021-05-01 22:48:54 +02:00
Samuel Berthe
e044fddd11
feat(data): reverse traefik exporters order 2021-05-01 22:12:12 +02:00
Samuel Berthe
af30d0f06c
fix(node_exporter): better alert description for EDAC + network errors (#204) 2021-05-01 22:01:10 +02:00
Samuel Berthe
135d4b7c1a
fix(data): for KubernetesPodNotHealthy, insert a step of subquery execution time 2021-05-01 20:30:35 +02:00
Samuel Berthe
54b1e674b2
fix(data): fix pg replication lag query 2021-05-01 19:58:42 +02:00
Moritz
335ba16032
Fix upper/lowercase of systemd (#207)
They're quite clear on how they want it to be written:

https://unix.stackexchange.com/review/suggested-edits/372414
2021-05-01 19:44:06 +02:00
Samuel Berthe
1c44cd7818
feat(data): adding k8s rule - detect container killed by oomkiller 2021-05-01 19:33:03 +02:00
Gustavo Kazuo Motizuki
18672ff0f9
Improve KubernetesOutOfCapacity alert (#211) 2021-05-01 19:27:46 +02:00
Samuel
97c48862d7
fix(haproxy) (#213) 2021-05-01 18:58:46 +02:00
Samuel Berthe
b9f09e7f93
fix(freeswitch): move to the networking section 2021-05-01 18:53:04 +02:00
Samuel
823b8edd7e
feat(freeswitch) (#214) 2021-05-01 18:45:36 +02:00
Samuel Berthe
c3ba0cf199
data: rename coredns metric 2021-03-28 00:34:56 +01:00
Samuel Berthe
b9db2c0c68
data: fix some elasticsearch rules 2021-02-26 11:31:06 +01:00
Samuel Berthe
1d0fd50033
fix(data): quickfix on cassandra, because I merged pr-196 a little bit too fast 2021-02-22 14:44:45 +01:00
ko-christ
24ae7de2f5
Fill in PrometheusRules for instaclustr/cassandra-exporter (#196) 2021-02-22 14:38:40 +01:00
Samuel Berthe
19f9316868
Merge pull request #197 from yasharne/new_minio 2021-02-22 14:09:38 +01:00
Yashar Nesabian
f166c909f1
removed old minio rules 2021-02-22 11:35:49 +03:30
Samuel Berthe
ca31cc8a71
fix(data): fix node exporter temperature alarm 2021-02-21 19:05:10 +01:00
Yashar Nesabian
def11767bf
added minio disk space usage missed condition 2021-02-16 21:33:33 +03:30
Yashar Nesabian
4c5ff1fc68
Added new minio alert rules 2021-02-16 21:06:14 +03:30
Samuel Berthe
6d7ef1cdbb
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2021-02-07 20:47:59 +01:00