Commit graph

300 commits

Author SHA1 Message Date
Samuel Berthe
d3ecfaaad3
Merge pull request #139 from xkfen/istio 2020-12-30 18:47:28 +01:00
Samuel Berthe
2f6d4921c6
fix initial istio alerts 2020-12-30 18:46:50 +01:00
Samuel Berthe
fa4325218f
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2020-12-30 17:46:58 +01:00
Samuel Berthe
ed62bdc567
alerts node_exporter: improve network and disk rules 2020-12-30 17:45:30 +01:00
Tosin Ogunrinde
0add93363f Fix JVM "JVM memory filling up" alert 2020-12-30 00:30:08 +00:00
Samuel Berthe
f686698f68
Merge pull request #166 from cityofships/fix_es
Fix Elasticsearch "No new documents" alert
2020-12-28 16:50:47 +01:00
Samuel Berthe
965fefab89
fix alert description 2020-12-28 16:40:11 +01:00
Carl Düvel
a7c5155002
Add cpu steal alert 2020-12-21 19:06:45 +01:00
Piotr Parczewski
f7d08e364b
Fix Elasticsearch "No new documents" alert.
Prometheus rate() function calculates the per-second average rate
of increase. This means the alert gets triggered whenever, during the
last 10 minutes, fewer than 1 document was ingested *per second*
(60 documents per minute).

Signed-off-by: Piotr Parczewski <piotr@stackhpc.com>
2020-12-17 15:00:01 +01:00
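
The arithmetic behind the commit message above can be sketched in Python. This is an illustrative model only, not PromQL: the helper name `per_second_rate` and the sample document counts are assumptions made up for the example, and the real `rate()` function also handles extrapolation and counter resets, which are ignored here.

```python
def per_second_rate(increase: float, window_seconds: float) -> float:
    """Average per-second rate of increase over a window (simplified model)."""
    return increase / window_seconds


WINDOW_SECONDS = 10 * 60  # a 10-minute window, expressed in seconds

# Hypothetical example: 300 documents ingested during the window.
rate = per_second_rate(300, WINDOW_SECONDS)

print(rate)       # 0.5 documents per second
print(rate < 1)   # True: a "< 1" threshold fires although ingestion continues
print(rate * 60)  # 30.0 documents per minute
```

Under this model, a threshold of 1 on the per-second value corresponds to 60 documents per minute, which matches the figure quoted in the commit message.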
Per Lundberg
f673fe72c3
Update rules.yml
Fixes a bug in the previous commit. `or` has lower precedence than `<` in PromQL, hence the need for grouping with parentheses.
2020-11-27 11:08:46 +02:00
Per Lundberg
00dd58eace
Fix Redis missing master query
The previous approach fails because of the "missing data" semantics in Prometheus. If the Redis server is down, PromQL will typically return "no data" instead of 0 for a `count()`; this is by design in Prometheus.

This suggestion, given by @slovdahl, works around this by returning a vector with a single `0` entry in this case, making the query work as intended.
2020-11-25 16:06:05 +02:00
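
The "missing data" behaviour described in the commit message above can be modelled in a few lines of Python. This is a rough sketch rather than PromQL: `count`, `or_zero`, and `fires` are hypothetical helpers, and `None` stands in for an empty query result.

```python
from typing import List, Optional


def count(samples: List[float]) -> Optional[int]:
    """Model of an aggregation: an empty input yields no result, not 0."""
    return len(samples) if samples else None


def or_zero(result: Optional[int]) -> int:
    """Model of the workaround: fall back to a single 0 when there is no data."""
    return 0 if result is None else result


def fires(result: Optional[int]) -> bool:
    """An alert on 'result < 1' can only fire if the query returned something."""
    return result is not None and result < 1


masters_when_down: List[float] = []   # server down: no series at all
masters_when_up: List[float] = [1.0]  # one master reporting

print(fires(count(masters_when_down)))           # False: no data, alert stays silent
print(fires(or_zero(count(masters_when_down))))  # True: fallback to 0 makes it fire
print(fires(or_zero(count(masters_when_up))))    # False: one master present
```

The first call shows why the earlier approach never alerted when the server was down; the second shows the effect of substituting a `0` entry.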
Samuel Berthe
2186841f29
Merge pull request #140 from yasharne/percona_mongodb 2020-11-15 18:12:20 +01:00
Vincent Fiset
6ed4358452 remove replset_oplog based alerts 2020-11-09 11:14:01 -05:00
Samuel Berthe
3ccfaa47ea
remove useless brackets 2020-11-07 18:08:02 +01:00
Samuel Berthe
9f144acb30
haproxy: fix description of request errors 2020-11-07 18:07:20 +01:00
Samuel Berthe
be20363602
rate is better than irate for alerting 2020-11-07 17:46:18 +01:00
Liudmyla Derkach
e6113ff2db feat: adding few useful rabbitmq alerts 2020-10-30 19:10:52 +02:00
Yashar Nesabian
2a2ecf8a8c change alert rules which were using avg to show more accurate value based on the replica set 2020-10-24 22:03:42 +03:30
Felix Breidenstein
1b6cd55200 Adapt rules for windows to new exporter 2020-10-20 14:52:36 +02:00
Nabil BENDAFI
e024c542ed feat(kubernetes): add Out of capacity 2020-10-16 12:15:56 +02:00
Samuel Berthe
ead7db708e
alert on containers CPU: add a comment to exclude cAdvisor 2020-10-11 21:38:48 +02:00
Samuel Berthe
50b4c499fa
rules: adding a few cassandra alerts 2020-10-11 19:55:18 +02:00
Samuel Berthe
0cf82fd3e7
Merge branch 'master' into NetworkSpeed 2020-10-11 19:39:59 +02:00
Samuel Berthe
06205cd91c
Update rules.yml 2020-10-11 19:39:17 +02:00
Samuel Berthe
89252f999f
Merge branch 'master' into master 2020-10-11 19:26:04 +02:00
Samuel Berthe
66e6581b07
Merge pull request #121 from osterik/master
check free space for all mountpoints
2020-10-11 19:22:27 +02:00
Samuel Berthe
ea7e6d6aa9
Merge pull request #125 from mcanevet/patch-1
Fix HAProxy rules
2020-10-11 18:21:41 +02:00
Samuel Berthe
8616b0241c
Merge pull request #130 from nabilbendafi/feature/traefik_rules 2020-10-11 18:10:06 +02:00
Samuel Berthe
e8572f618b
Merge pull request #133 from tux-00/master 2020-10-11 18:07:11 +02:00
Samuel Berthe
2f6b9832fa
Update rules.yml 2020-10-11 18:06:06 +02:00
Samuel Berthe
8af9ca4ba8
Merge pull request #134 from nanorobocop/fix-prometheus-job-missing-alert
Fix PrometheusJobMissing alert
2020-10-11 17:48:42 +02:00
Samuel Berthe
2e6e46da45
Merge branch 'master' into master 2020-10-11 17:42:51 +02:00
Samuel Berthe
c469d26c4d
Merge pull request #137 from Ozarklake/sql_server_rules 2020-10-11 17:37:40 +02:00
Samuel Berthe
bafcd1e922
Update rules.yml 2020-10-11 17:35:46 +02:00
Samuel Berthe
e60fc805f6
Merge pull request #138 from nirav-chotai/nchotai/fix-hpa-alerts
[PLEASE_MERGE] Fix HPA alerts
2020-10-11 17:24:13 +02:00
Samuel Berthe
45103f0a0d
Merge branch 'master' into master 2020-10-11 17:10:20 +02:00
Samuel Berthe
7a609adf18
adding comment to container OOM killer warning 2020-10-11 16:11:44 +02:00
Samuel Berthe
cf70272309
fix(container memory limit): filter by containers having max memory setting 2020-10-11 16:08:54 +02:00
Samuel Berthe
4128004475
Merge pull request #119 from fernandocarletti/patch-1
fix: container ContainerMemoryUsage alert
2020-10-11 16:06:33 +02:00
Samuel Berthe
f67162bf57
Merge pull request #148 from fsschmitt/fix/disk-latency-unit
Fix time unit on disk read/write latency rule
2020-10-11 15:49:15 +02:00
fsschmitt
4266b4d326 Fix time unit on disk read/write latency rule 2020-10-06 14:36:22 +01:00
fsschmitt
5288c9a2f5 Fix node_md_disks state from fail to failed 2020-10-06 13:33:50 +01:00
Daniel Andrzejewski
fc4797db9e small fix 2020-09-17 15:19:14 +02:00
Daniel Andrzejewski
6c5f708179 node_disk_write_time_seconds_total is in seconds, not in milliseconds. node_disk_write_time_seconds_total should be greater than 0, otherwise you get a +Inf result. 2020-09-17 15:13:42 +02:00
Yashar Nesabian
d6b39a7f3f More accurate alerts
added `mongodb instance down` alert and changed the `too many
connections` alert to fire when the connections are more than 80% of the
available connections.
removed `mongodb_replset_member_state` based alerts as I don't have
enough information on them
2020-08-09 10:35:39 +04:30
Yashar Nesabian
3ce1084f5b Added percona mongodb alert rules 2020-08-03 10:45:32 +04:30
kaifen.xie
a04eef39c0 add istio 2020-07-25 23:24:36 +08:00
Nirav Chotai
8fb5da83de
Fix HPA alerts
- Fixing KubernetesHpaMetricAvailability
- Fixing KubernetesHpaScalingAbility
2020-07-24 13:32:44 +08:00
Ozarklake
88e812c78e add sql server rules 2020-07-17 15:02:41 +08:00
Ozarklake
4e66d17d01 add sql server rules 2020-07-17 14:58:26 +08:00
Ozarklake
e009c5d8b5
Optimizing mysql slow query alert rules 2020-07-14 12:55:17 +08:00
Mansur Marvanov
05e521c0a8 Fix PrometheusJobMissing alert 2020-07-09 16:36:45 +09:00
tux
add6d9c2f3 Add official rabbitmq exporter rules 2020-06-30 15:48:42 +02:00
Nabil BENDAFI
b324c6f32f feat(traefik): add rules for Traefik v2
Fixes #7
2020-06-23 13:40:01 +02:00
Mickaël Canévet
24f7095cd5
Fix HAProxy rules 2020-05-29 10:11:54 +02:00
Ilya Kisleyko
663b0e94da
check free space for all mountpoints 2020-05-20 20:04:32 +03:00
Anton Smolkov
bbbe14f2bd
Update rules.yml
WMI memory alert had opposite meaning, triggered on 90% free instead of 90% used
2020-05-19 11:07:11 +03:00
Fernando Carletti
e6de413146
fix: container ContainerMemoryUsage alert 2020-05-18 17:38:05 -05:00
Rob Brown
5050fd64d5 Correct "device" to "interface" 2020-05-14 16:57:19 +01:00
Samuel Berthe
da1e4f6301
💄 replacing "error" severity by "critical", repo wide 2020-05-14 17:20:19 +02:00
Rob Brown
5d3e812fd7 Add HostNetworkNot1GbSpeed rule 2020-05-14 15:00:24 +01:00
Samuel Berthe
7293bca720
Merge pull request #107 from robert-will-brown/NetworkTransmitErrors 2020-05-09 21:32:40 +02:00
Samuel Berthe
b081f28f5d
Merge pull request #112 from robert-will-brown/SpeedTestExporter 2020-05-09 21:31:33 +02:00
Samuel Berthe
660312d0ea
fix OOM killer threshold 2020-05-09 21:25:13 +02:00
Samuel Berthe
6d6b41e241
Merge pull request #108 from robert-will-brown/EdacMemoryErrors 2020-05-09 21:23:01 +02:00
Rob Brown
8faa295745 Add SpeedTest stanza 2020-05-09 10:20:55 +01:00
Rob Brown
ee4e046c66 Add "> 0" at the end of NetworkTransmitErrors queries 2020-05-09 10:18:21 +01:00
Samuel Berthe
d5f6388899
renaming some mysql alerts 2020-05-09 02:11:18 +02:00
Rob Brown
5d83e393cc Add initial Speedtest Exporter rules 2020-05-08 15:25:54 +01:00
Rob Brown
8912db93bc Fix "greater than" value 2020-05-04 19:04:52 +01:00
Rob Brown
4b22c078ea Align EDAC errors with comments 2020-05-04 18:47:20 +01:00
Samuel Berthe
718cd2188c
shame on me 2020-05-04 00:10:43 +02:00
Samuel Berthe
eb8dc736a3
improve accuracy for context switching query 2020-05-04 00:05:33 +02:00
Samuel Berthe
790139211e
fix typo: postgresql replication lag 2020-05-03 23:23:21 +02:00
Samuel Berthe
648b83250a
improve accuracy "Kubernetes Pod not healthy" query 2020-05-03 18:01:25 +02:00
Ondrej Zalesky
d3d13946e6 fix "Kubernetes Pod not healthy" query 2020-04-30 22:53:25 +02:00
Rob Brown
981e82d649 Add HostEDACUncorrectableErrorsdetected and HostEDACCorrectableErrorsdetected rules 2020-04-30 13:27:30 +01:00
Rob Brown
f87e6d300d Added spacing as per standard 2020-04-30 12:39:12 +01:00
Rob Brown
c57a5e6e36 Add HostNetworkReceiveErrors and HostNetworkTransmitErrors rules 2020-04-30 12:38:23 +01:00
Samuel Berthe
951d80121f
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2020-04-06 09:13:29 +02:00
Samuel Berthe
e97023d2a4
linkerd2: adding first rule 2020-04-06 09:01:51 +02:00
Selçuk Arıbalı
c98a04784e
FIX KubernetesPodnothealthy Alert
kube-state-metrics sets the value of the current pod phase to 1, so the "Kubernetes Pod not healthy" alert is fixed accordingly.
2020-04-02 21:01:04 +03:00
Samuel Berthe
c20227b458
oops: adding one-to-one vector matching to mysql subqueries 2020-03-31 16:02:28 +02:00
Matthias Crauwels
79b5ad3b5d
removed avg grouping where possible 2020-03-31 11:42:05 +02:00
Matthias Crauwels
4860250360
added some extra MySQL checks 2020-03-30 11:24:58 +02:00
Samuel Berthe
d9286f6c39
doc: add instructions to rules yaml file 2020-03-28 15:12:21 +01:00
Samuel Berthe
2cda73aa3a
fix(kubernetes): min_over_time takes a time range as parameter 2020-03-26 16:19:26 +01:00
Samuel Berthe
329583ac36
Fix typo and make pg and mysql similar 2020-03-25 16:44:49 +01:00
luhellma
5559e0140b fix: double usage in query and alert configuration 2020-03-25 16:34:04 +01:00
luhellma
5d8f911d97 feat: Add new rules for MySQLd_exporter from prometheus 2020-03-25 11:57:29 +01:00
luhellma
a4fc086b9a fix wrong number of equal sign in query 2020-03-20 15:22:20 +01:00
luhellma
3d41e2b3ca Add rules for apache 2020-03-20 15:08:13 +01:00
Alexander Knipping
caaea2eeb7 Fix typo in DeadManSwitch alert
Rename it from snitch into switch.
2020-03-18 15:21:38 +01:00
Samuel Berthe
34e62cb327
nginx: adding latency metric 2020-03-17 22:26:46 +01:00
Samuel Berthe
07dde61116
elasticsearch: adding disk watermark alerts 2020-03-17 21:19:58 +01:00
Samuel Berthe
2ecdb636b2
oops 2020-03-17 21:08:09 +01:00
Samuel Berthe
c653b37e15
adding rules to prometheus self monitoring 2020-03-17 20:56:49 +01:00
Samuel Berthe
fc3e72041c
Merge branch 'master' of github.com:samber/awesome-prometheus-alerts 2020-03-17 19:05:57 +01:00
Samuel Berthe
5125c683c5
adding alerts for Ceph 2020-03-17 18:50:08 +01:00
Alexander Knipping
c82df5d005 Fix PrometheusRuleEvaluationSlow
Fixes the rule PrometheusRuleEvaluationSlow as it should fire if
prometheus_rule_group_last_duration_seconds takes longer than
prometheus_rule_group_interval_seconds.

prometheus_rule_group_last_duration_seconds: The duration of the last rule group evaluation.
prometheus_rule_group_interval_seconds: The interval of a rule group.
2020-03-17 15:14:40 +01:00
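
The condition described in the commit message above reduces to a single comparison, sketched here in Python. The function name and the sample durations are assumptions chosen for illustration; only the two metric names come from the commit message.

```python
def rule_evaluation_slow(last_duration_seconds: float, interval_seconds: float) -> bool:
    """True when the last rule group evaluation took longer than its interval.

    last_duration_seconds models prometheus_rule_group_last_duration_seconds,
    interval_seconds models prometheus_rule_group_interval_seconds.
    """
    return last_duration_seconds > interval_seconds


# Hypothetical rule groups, each evaluated every 60 seconds.
print(rule_evaluation_slow(12.5, 60.0))  # False: evaluation finishes in time
print(rule_evaluation_slow(75.0, 60.0))  # True: evaluation overruns the interval
```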
Samuel Berthe
5b457b0e52
adding github buttons to layout 2020-03-09 23:31:27 +01:00
Samuel Berthe
f554b72671
Add alert for kubernetes api latency 2020-03-09 21:55:17 +01:00
Samuel Berthe
0b89a764ee
Adding exporters: sidekiq, pgbouncer and thanos.
Adding rules to: prometheus, kubernetes, redis, docker and postgresql.
Arranging exporters into categories.
Showing number of rules.
Thanks to Gitlab for opensourcing alerting rules!
2020-03-09 21:18:56 +01:00
Samuel Berthe
affacde49b
adding prometheus internal alerts 2020-03-09 00:16:17 +01:00
Samuel Berthe
99e3e64252
Insert Commit Message Here 2020-03-08 22:21:30 +01:00
Samuel Berthe
77eccab0e9
some random changes on rules 2020-03-08 20:30:22 +01:00
Samuel Berthe
542adc3ca7
Adding minio rules 2020-03-08 18:55:53 +01:00
Samuel Berthe
b5469f2a59
Doc: organizing sections 2020-03-08 17:39:49 +01:00
Samuel Berthe
5bace11107
data: ensure alert name prefix 2020-03-08 17:24:39 +01:00
Samuel Berthe
953878df03
HAProxy 1.*: adding rules 2020-03-08 17:17:06 +01:00
Samuel Berthe
7dbbbb0e09
Doc: organizing lb and reverse proxy 2020-03-08 16:10:33 +01:00
Samuel Berthe
718a039313
Adding an alert for prometheus internals: rule evaluation slowing down 2020-03-08 15:08:11 +01:00
Samuel Berthe
072a435f32
Fixing @jpds queries ;) 🚀 2020-03-08 14:41:36 +01:00
Samuel Berthe
f620fe31ee
Merge pull request #36 from jpds/prom-errors
_data/rules.yml: Added Prometheus error alerts.
2020-03-08 14:29:18 +01:00
Samuel Berthe
6ba051d747
doc: adding a comment to PostgresqlReplicationLag alert 2020-03-07 19:30:58 +01:00
Samuel Berthe
05a2c9604b
Renaming some alert categories 2020-03-07 19:06:54 +01:00
Samuel Berthe
6edcdc75af
my brain is out for vacation, please forgive me 2020-03-07 18:57:09 +01:00
Samuel Berthe
b97ece8c69
Adding alerts for criteo/cassandra_exporter 2020-03-07 18:51:34 +01:00
Samuel Berthe
cde4e243ae
no quotes no cry 2020-03-07 17:59:42 +01:00
Samuel Berthe
0add8466c6
Merge pull request #82 from samber/feat-nodeexporter-raid
Added RAID alerts (node-exporter)
2020-03-07 17:51:39 +01:00
Samuel Berthe
ab477bb21e
Added RAID alerts 2020-03-07 17:50:41 +01:00
Danilo Magalhães
5bd2e03c51
Update rules.yml
Group by instance and name instead of only instance.  
Change from container_spec_memory_limit_bytes to correct max memory metric container_spec_memory_limit_bytes.
2020-02-27 11:08:09 +00:00
Samuel Berthe
a9c9629cb5 oops 2020-01-25 00:16:49 +01:00
Samuel Berthe
134264026a
Does not alert on tmpfs volume filling-up. Closing #77 2020-01-25 00:13:01 +01:00
iamdenchik
29b66f9b3e fix check free disk space 2020-01-15 12:40:19 +05:00
Mateusz Legięcki
a72feb4ff6
Fix Etcd rule: Insufficient Members 2020-01-03 12:58:25 +01:00
Mahesh Paolini-Subramanya
88b55f1dee Replace 'ip' by 'instance' in some rules
The metrics return 'instance', not 'ip'
This PR fixes the rules to use 'instance'
2019-12-27 09:18:16 -05:00
Rob Brown
ce51db2a6f Added Prometheus Not connected to alertmanager alert 2019-12-18 15:38:23 +00:00
Rob Brown
97ecdab26c Added "Disk will fill in 4 hours" alert 2019-12-18 15:32:52 +00:00
Rob Brown
58f843dbc6 Added hardware temperature alerts 2019-12-12 17:29:23 +00:00
Josef Kříž
d10e30aed0
Fixed rabbitmq cluster down rule 2019-12-02 13:12:02 +01:00
Maxime Brunet
1e2a35e058
elasticsearch: Alert for no new docs on data nodes only
We can have nodes that are not masters, but do not hold any data. For example the client/coordinating nodes set up by the `stable/elasticsearch` helm chart:
https://github.com/helm/charts/tree/master/stable/elasticsearch#client-and-coordinating-nodes

And we can also have nodes being data and master nodes simultaneously.
So I think, this alert has to look for `es_data_node="true"` to be correct.
2019-11-06 15:23:26 -05:00
Samuel Berthe
9306d8947f
PG: Alert in case of high rollback ratio (#64)
PG: Alert in case of high rollback ratio
2019-10-31 12:02:03 +01:00
Samuel Berthe
0c9a24a4e7 feat(pg): alert in case of high rollback ratio 2019-10-31 12:00:53 +01:00
Samuel Berthe
cca2872ade
typo 2019-10-31 11:47:57 +01:00
Samuel Berthe
768fac56ae
Merge pull request #62 from jdorel/patch-1
SllCertificateExpired synthax
2019-10-29 12:15:15 +01:00
Samuel Berthe
20744c3d3d
Update rules.yml 2019-10-29 12:12:43 +01:00
Jonas DOREL
80aebe84e9 Add Kubernetes alerts from kube-state-metric exporter 2019-10-29 11:59:14 +01:00
Jonas DOREL
267a064d26
SllCertificateExpired synthax
Match other alert names, without the `has` part.
2019-10-29 11:39:01 +01:00
Samuel Berthe
82cf3ac1ef adding cassandra 2019-10-26 17:48:22 +02:00
Samuel Berthe
4f9e88bad4 improving blackbox alerts 2019-10-26 17:43:18 +02:00
Samuel Berthe
dfa5446cd5 adding comments in data structure 2019-10-26 17:25:35 +02:00
Samuel Berthe
8f6c85774a
Clean data file 2019-09-25 16:36:10 +02:00
olivier beyler
e3628c5ba8 Add OpenEBS and Minio alert
Signed-off-by: olivier beyler <olivier.beyler@orange.com>
2019-09-25 16:13:44 +02:00
Samuel Berthe
1f4a1f8052
Updating Traefik -> Traefik v1.* 2019-09-25 14:23:16 +02:00
Andrey Dudin
6d9866cefb
Fix typo in query of PG DeadLocks 2019-09-25 02:42:44 +03:00
Samuel Berthe
f7f94ed81e
Fixed time interval (10min->10m) 2019-09-13 18:08:04 +02:00
timfeirg
37ef9a6f5c
free memory should include node_memory_Slab_bytes 2019-09-03 15:47:17 +08:00
Samuel Berthe
51e7231b3d fix(blackbox exporter): alert when http >= 400 instead of 300 2019-08-29 19:03:54 +02:00
Jonas Kongslund
9bd8b3698f Add CollectorError alert for WMI exporter 2019-08-22 13:52:15 +04:00