awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

History

Samuel Berthe 2258835c30 fix/promql rules review (#533 ) * fix(data): comprehensive PromQL review across all ~937 rules Query fixes: - Replace rate()/increase() with deriv()/delta() on gauge metrics exposed as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_, systemd_socket_refused_connections_total) - Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module) - Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn) → gnatsd_varz_subscriptions (server total) - Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0 - Fix RabbitMQ total connections metric: connectionsTotal → connections - Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct gauge comparison (deriv > 0 misses stable non-zero failure states) - Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count → certmanager_acme_client_request_count (renamed in v1.19+) - Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes - Fix Flink duplicate comments: field (YAML last-write-wins bug) - Add datid!="0" filter to PostgreSQL dead locks query - Fix PostgreSQL high rollback rate: restructure division-by-zero guard and move ratio calculation outside sum() - Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager memory, Hadoop HBase heap, Vault cluster health - Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/ OSD Down/PG unavailable Threshold fixes: - Replace > 0 with meaningful thresholds on rate()/increase() queries across: Alertmanager, eBPF decoder errors, systemd refused connections, Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager, OTel Collector receiver refused metrics - Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s (0.5ms was below normal operating range; 10ms is more realistic) Description fixes: - Fix MySQL slow queries: remove duplicate "mysql" word - Fix SMART device description: remove trailing stray ")" (6 rules) - Fix host disk IO description: remove duplicate "Check storage for issues." - Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute" - Fix EDAC uncorrectable errors: remove time-window claim (raw counter) - Fix Mimir store-gateway sync description: said "10 minutes" but threshold is 1800s (30 minutes) - Fix Vault description false "%" suffix on count values - Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy, Istio rules to include {{ $labels }} and {{ $value }} template vars - Downgrade Cassandra key cache hit rate: critical → warning Comments: - Add note on node_vmstat_oom_kill gauge type (delta vs increase) - Add note on systemd_socket_refused_connections_total gauge type - Add note on mysql_global_status_ gauge type (delta/deriv vs rate) - Add note on pg_txid_current requiring a custom postgres_exporter query - Add note on pg_stat_ssl_compression availability (PG 9.5-13 only) - Add note on cert-manager legacy metric name for users on v1.18 and older - Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules - Add note on NATS leaf node spurious fires when leaf nodes not configured * fix(data): PromQL type fixes, job filter cleanup, query correctness review - Replace rate()/increase() with deriv()/delta() on gauge metrics: node_vmstat_pgmajfault, cassandra_stats (criteo exporter), gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn - Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay - Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause - Fix Thanos query latency: use _count instead of _bucket for guard clause - Restore job filter in Thanos objstore guard clauses (compact + store) - Remove redundant job= filters from unique metrics: ~30 Thanos rules, kube_persistentvolume_status_phase, otelcol_process_runtime_* - Fix high-cardinality Istio latency grouping (drop source labels from by()) - Add division-by-zero guard to host context switch ratio - Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10 - Remove redundant for: 1m from HAProxy check failure rules - Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel) - Remove external mixin references from comments - Fix Tempo dropped spans metric name: add missing _total suffix - Fix Thanos bucket replicate run latency: add missing le label in by()	2026-04-06 20:38:12 +02:00
..
rules.yml	fix/promql rules review (#533 )	2026-04-06 20:38:12 +02:00

* fix(data): comprehensive PromQL review across all ~937 rules

Query fixes:
- Replace rate()/increase() with deriv()/delta() on gauge metrics exposed
  as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_*,
  systemd_socket_refused_connections_total)
- Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds
  → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module)
- Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn)
  → gnatsd_varz_subscriptions (server total)
- Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0
- Fix RabbitMQ total connections metric: connectionsTotal → connections
- Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct
  gauge comparison (deriv > 0 misses stable non-zero failure states)
- Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count
  → certmanager_acme_client_request_count (renamed in v1.19+)
- Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes
- Fix Flink duplicate comments: field (YAML last-write-wins bug)
- Add datid!="0" filter to PostgreSQL dead locks query
- Fix PostgreSQL high rollback rate: restructure division-by-zero guard
  and move ratio calculation outside sum()
- Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager
  memory, Hadoop HBase heap, Vault cluster health
- Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/
  OSD Down/PG unavailable

Threshold fixes:
- Replace > 0 with meaningful thresholds on rate()/increase() queries
  across: Alertmanager, eBPF decoder errors, systemd refused connections,
  Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed
  inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ
  unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager,
  OTel Collector receiver refused metrics
- Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s
  (0.5ms was below normal operating range; 10ms is more realistic)

Description fixes:
- Fix MySQL slow queries: remove duplicate "mysql" word
- Fix SMART device description: remove trailing stray ")" (6 rules)
- Fix host disk IO description: remove duplicate "Check storage for issues."
- Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute"
- Fix EDAC uncorrectable errors: remove time-window claim (raw counter)
- Fix Mimir store-gateway sync description: said "10 minutes" but
  threshold is 1800s (30 minutes)
- Fix Vault description false "%" suffix on count values
- Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy,
  Istio rules to include {{ $labels }} and {{ $value }} template vars
- Downgrade Cassandra key cache hit rate: critical → warning

Comments:
- Add note on node_vmstat_oom_kill gauge type (delta vs increase)
- Add note on systemd_socket_refused_connections_total gauge type
- Add note on mysql_global_status_* gauge type (delta/deriv vs rate)
- Add note on pg_txid_current requiring a custom postgres_exporter query
- Add note on pg_stat_ssl_compression availability (PG 9.5-13 only)
- Add note on cert-manager legacy metric name for users on v1.18 and older
- Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules
- Add note on NATS leaf node spurious fires when leaf nodes not configured

* fix(data): PromQL type fixes, job filter cleanup, query correctness review

- Replace rate()/increase() with deriv()/delta() on gauge metrics:
  node_vmstat_pgmajfault, cassandra_stats (criteo exporter),
  gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn
- Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay
- Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause
- Fix Thanos query latency: use _count instead of _bucket for guard clause
- Restore job filter in Thanos objstore guard clauses (compact + store)
- Remove redundant job= filters from unique metrics: ~30 Thanos rules,
  kube_persistentvolume_status_phase, otelcol_process_runtime_*
- Fix high-cardinality Istio latency grouping (drop source labels from by())
- Add division-by-zero guard to host context switch ratio
- Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10
- Remove redundant for: 1m from HAProxy check failure rules
- Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel)
- Remove external mixin references from comments
- Fix Tempo dropped spans metric name: add missing _total suffix
- Fix Thanos bucket replicate run latency: add missing le label in by()

2026-04-06 20:38:12 +02:00

rules.yml

fix/promql rules review (#533 )

2026-04-06 20:38:12 +02:00