mirror of
https://github.com/samber/awesome-prometheus-alerts.git
synced 2026-06-25 02:46:59 +08:00
fix(data): fix queries and thresholds across multiple exporters
- Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace
ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add for: durations
- ZFS: improve description, remove incorrect ON() join on readonly check
- Thanos: filter gRPC errors to actual error codes only (drop NotFound, Cancelled, etc.)
- Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job, route, le)
- Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations
- OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments
- Tempo/Cortex: raise >0 thresholds to avoid transient spikes
- APC UPS: add division-by-zero guard on battery voltage ratio
- DigitalOcean: raise increase()>0 to >3
- Grafana Alloy: fix missing name: field on exporter
- Graph Node: add threshold comments
parent 72c9e922c0
commit 619c2607f3
1 changed file with 3 additions and 2 deletions
@@ -4551,7 +4551,8 @@ groups:
 for: 1m
 comments: |
 ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
-This rule fires on any non-OK state. Split into separate warning/critical rules by using ==1 and ==2 thresholds if needed.
+The official Ceph mixin splits this into separate warning (==1) and critical (==2) alerts.
+This rule fires on any non-OK state. Adjust severity or split as needed.
 - name: Ceph monitor clock skew
 description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings
 query: "abs(ceph_monitor_clock_skew_seconds) > 0.2"
@@ -4581,7 +4582,7 @@ groups:
 for: 5m
 comments: |
 Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
-ceph_health_detail can also be used for more granular OSD space alerts.
+The official mixin uses ceph_health_detail for OSD space alerts.
 - name: Ceph OSD reweighted
 description: Ceph Object Storage Daemon takes too much time to resize.
 query: "ceph_osd_weight < 1"
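
The Ceph health comment in this change mentions splitting the single non-OK rule into separate warning (==1) and critical (==2) alerts. A minimal sketch of such a split, written in the rule format this file appears to use, might look like the following. The rule names, descriptions, `for:` durations, and the `severity` field are illustrative assumptions and are not part of this commit.

```yaml
# Hypothetical split of the "any non-OK state" rule into two rules.
# ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
- name: Ceph health warning          # assumed name
  description: Ceph cluster reports HEALTH_WARN.
  query: "ceph_health_status == 1"
  severity: warning                  # assumed field and value
  for: 5m                            # assumed duration
- name: Ceph health error            # assumed name
  description: Ceph cluster reports HEALTH_ERR.
  query: "ceph_health_status == 2"
  severity: critical                 # assumed field and value
  for: 1m                            # assumed duration
```

Keeping the two states in separate rules lets each carry its own severity and `for:` duration, so a HEALTH_WARN that clears quickly does not have to page anyone.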