From 50b4c499faffe2ab26b6594fd1ba4df4133171cd Mon Sep 17 00:00:00 2001
From: Samuel Berthe
Date: Sun, 11 Oct 2020 19:55:18 +0200
Subject: [PATCH] rules: adding a few cassandra alerts

---
 _data/rules.yml | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/_data/rules.yml b/_data/rules.yml
index caeb699..d16033a 100644
--- a/_data/rules.yml
+++ b/_data/rules.yml
@@ -802,6 +802,30 @@ groups:
                 description: Something is going wrong with cassandra storage
                 query: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1'
                 severity: critical
+              - name: Cassandra tombstone dump
+                description: Too many tombstones scanned in queries
+                query: 'cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000'
+                severity: critical
+              - name: Cassandra client request unavailable write
+                description: Write failures have occurred because too many nodes are unavailable
+                query: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0'
+                severity: critical
+              - name: Cassandra client request unavailable read
+                description: Read failures have occurred because too many nodes are unavailable
+                query: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0'
+                severity: critical
+              - name: Cassandra client request write failure
+                description: A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
+                query: 'increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0'
+                severity: critical
+              - name: Cassandra client request read failure
+                description: A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
+                query: 'increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0'
+                severity: critical
+              - name: Cassandra cache hit rate key cache
+                description: Key cache hit rate is below 85%
+                query: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85'
+                severity: critical
       - name: Zookeeper
         exporters: