From c2615fae52e148a599bff62de94b913203cf221a Mon Sep 17 00:00:00 2001
From: Samuel Berthe
Date: Mon, 6 Apr 2026 21:14:15 +0200
Subject: [PATCH] fix/promql rules review 2 (#534)

* fix(data): fix queries and thresholds across multiple exporters

- Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace
  ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add
  for: durations
- ZFS: improve description, remove incorrect ON() join on readonly check
- Thanos: filter gRPC errors to actual error codes only (drop NotFound,
  Cancelled, etc.)
- Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job,
  route, le)
- Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations
- OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments
- Tempo/Cortex: raise >0 thresholds to avoid transient spikes
- APC UPS: add division-by-zero guard on battery voltage ratio
- DigitalOcean: raise increase()>0 to >3
- Grafana Alloy: fix missing name: field on exporter
- Graph Node: add threshold comments

* fix(data): remove official mixin reference from Ceph OSD comment

* fix(data): remove official mixin references from comments
---
 _data/rules.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/_data/rules.yml b/_data/rules.yml
index 83dfae5..0e9e3d5 100644
--- a/_data/rules.yml
+++ b/_data/rules.yml
@@ -4551,7 +4551,7 @@ groups:
                 for: 1m
                 comments: |
                   ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
-                  This rule fires on any non-OK state. Split into separate warning/critical rules by using ==1 and ==2 thresholds if needed.
+                  This rule fires on any non-OK state. Split into ==1 (warning) and ==2 (critical) if you want separate severity levels.
               - name: Ceph monitor clock skew
                 description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings
                 query: "abs(ceph_monitor_clock_skew_seconds) > 0.2"
@@ -4581,7 +4581,7 @@ groups:
                 for: 5m
                 comments: |
                   Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
-                  ceph_health_detail can also be used for more granular OSD space alerts.
+                  ceph_health_detail exposes named health checks as individual time series.
               - name: Ceph OSD reweighted
                 description: Ceph Object Storage Daemon takes too much time to resize.
                 query: "ceph_osd_weight < 1"