fix(data): PromQL type fixes, job filter cleanup, query correctness review

- Replace rate()/increase() with deriv()/delta() on gauge metrics:
  node_vmstat_pgmajfault, cassandra_stats (criteo exporter),
  gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn
- Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay
- Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause
- Fix Thanos query latency: use _count instead of _bucket for guard clause
- Restore job filter in Thanos objstore guard clauses (compact + store)
- Remove redundant job= filters from unique metrics: ~30 Thanos rules,
  kube_persistentvolume_status_phase, otelcol_process_runtime_*
- Fix high-cardinality Istio latency grouping (drop source labels from by())
- Add division-by-zero guard to host context switch ratio
- Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10
- Remove redundant for: 1m from HAProxy check failure rules
- Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel)
- Remove external mixin references from comments
- Fix Tempo dropped spans metric name: add missing _total suffix
- Fix Thanos bucket replicate run latency: add missing le label in by()
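
Note on the first bullet: a minimal sketch of the counter/gauge distinction, written in the same rule format as the file under review. The metric names `example_requests_total` and `example_queue_depth` are hypothetical and do not come from this commit.

```yaml
# Hypothetical metrics, for illustration only.
- name: Example counter rule
  description: "Request rate is {{ $value }}/s."
  # rate() assumes a monotonically increasing counter and compensates for counter resets.
  query: "rate(example_requests_total[5m]) > 100"
  severity: warning
- name: Example gauge rule
  description: "Queue depth is growing by {{ $value }}/s."
  # deriv() fits a linear regression over the window and makes no reset assumption,
  # so a gauge that legitimately goes down is not misread as a counter reset.
  query: "deriv(example_queue_depth[5m]) > 10"
  severity: warning
```

delta() plays the same role for gauges that increase() plays for counters: it reports the change over the window instead of a per-second slope.
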
Samuel Berthe 2026-04-06 20:12:18 +02:00
parent ac32c98098
commit 04a8ae2fe3
GPG key ID: 64863511FFBD0E3C

@@ -148,8 +148,10 @@ groups:
for: 2m for: 2m
- name: Host memory under memory pressure - name: Host memory under memory pressure
description: "The node is under heavy memory pressure. High rate of major page faults ({{ $value }}/s)." description: "The node is under heavy memory pressure. High rate of major page faults ({{ $value }}/s)."
query: "(rate(node_vmstat_pgmajfault[5m]) > 1000)" query: "(deriv(node_vmstat_pgmajfault[5m]) > 1000)"
severity: warning severity: warning
comments: |
node_vmstat_pgmajfault is exposed as untyped/gauge by node_exporter (from /proc/vmstat), so deriv() is used instead of rate().
- name: Host Memory is underutilized - name: Host Memory is underutilized
description: "Node memory usage is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})" description: "Node memory usage is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})"
query: "min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8" query: "min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8"
@@ -238,7 +240,7 @@ groups:
for: 5m for: 5m
- name: Host context switching high - name: Host context switching high
description: Context switching is growing on the node (twice the daily average during the last 15m) description: Context switching is growing on the node (twice the daily average during the last 15m)
query: '(rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) > 2' query: '(rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) > 2 and rate(node_context_switches_total[1d]) > 0'
severity: warning severity: warning
comments: | comments: |
x2 context switches is an arbitrary number. x2 context switches is an arbitrary number.
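
Note on the hunk above: the added `and ... > 0` clause follows a common guard pattern for ratio alerts. A minimal sketch, using hypothetical metric names (`example_errors_total`, `example_requests_total`) and assuming both series carry identical label sets so that `and` matches them one-to-one:

```yaml
# Hypothetical metrics, for illustration only.
- name: Example error ratio high
  description: "Error ratio is {{ $value }}."
  # In PromQL, x/0 evaluates to +Inf (or NaN for 0/0). +Inf would satisfy "> 0.05",
  # so the trailing clause drops every series whose denominator is zero.
  query: "(rate(example_errors_total[5m]) / rate(example_requests_total[5m])) > 0.05 and rate(example_requests_total[5m]) > 0"
  severity: warning
  for: 5m
```

Comparison operators bind more tightly than `and`, so the ratio test and the guard are each evaluated first and then intersected.
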
@@ -266,7 +268,7 @@ groups:
description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining." description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining."
query: '((node_md_disks_required - ignoring(state) node_md_disks{state="active"}) > 0)' query: '((node_md_disks_required - ignoring(state) node_md_disks{state="active"}) > 0)'
comments: | comments: |
Uses ignoring(state) to handle additional labels on node_md_disks. Matches the official node-exporter mixin. Uses ignoring(state) to handle additional labels on node_md_disks.
severity: critical severity: critical
- name: Host software RAID disk failure - name: Host software RAID disk failure
description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention." description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention."
@@ -1620,7 +1622,7 @@ groups:
for: 2m for: 2m
- name: Cassandra authentication failures - name: Cassandra authentication failures
description: Increase of Cassandra authentication failures description: Increase of Cassandra authentication failures
query: 'rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5' query: 'delta(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5'
severity: warning severity: warning
for: 2m for: 2m
- name: Cassandra node down - name: Cassandra node down
@@ -1657,7 +1659,7 @@ groups:
for: 2m for: 2m
- name: Cassandra connection timeouts total (Criteo) - name: Cassandra connection timeouts total (Criteo)
description: Some connection between nodes are ending in timeout description: Some connection between nodes are ending in timeout
query: 'rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5' query: 'delta(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5'
severity: critical severity: critical
for: 2m for: 2m
- name: Cassandra storage exceptions (Criteo) - name: Cassandra storage exceptions (Criteo)
@@ -1776,12 +1778,12 @@ groups:
- name: ClickHouse rejected insert queries - name: ClickHouse rejected insert queries
description: "INSERTs rejected due to too many active data parts. Reduce insert frequency." description: "INSERTs rejected due to too many active data parts. Reduce insert frequency."
query: "increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 0" query: "increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 2"
severity: warning severity: warning
for: 1m for: 1m
- name: ClickHouse delayed insert queries - name: ClickHouse delayed insert queries
description: "INSERTs delayed due to high number of active parts." description: "INSERTs delayed due to high number of active parts."
query: "increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 0" query: "increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 10"
severity: warning severity: warning
for: 2m for: 2m
- name: ClickHouse zookeeper hardware exception - name: ClickHouse zookeeper hardware exception
@@ -2390,7 +2392,6 @@ groups:
description: Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m) description: Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m)
query: increase(haproxy_server_check_failures_total[1m]) > 2 query: increase(haproxy_server_check_failures_total[1m]) > 2
severity: warning severity: warning
for: 1m
- name: prometheus/haproxy_exporter (HAProxy < v2) - name: prometheus/haproxy_exporter (HAProxy < v2)
slug: haproxy-exporter-v1 slug: haproxy-exporter-v1
doc_url: https://github.com/prometheus/haproxy_exporter doc_url: https://github.com/prometheus/haproxy_exporter
@@ -2470,7 +2471,6 @@ groups:
description: Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m) description: Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m)
query: "increase(haproxy_server_check_failures_total[1m]) > 2" query: "increase(haproxy_server_check_failures_total[1m]) > 2"
severity: warning severity: warning
for: 1m
- name: Traefik - name: Traefik
exporters: exporters:
@@ -2696,7 +2696,7 @@ groups:
for: 1m for: 1m
- name: Istio latency 99 percentile - name: Istio latency 99 percentile
description: "Istio p99 request latency is {{ $value }}ms (threshold: 1000ms)." description: "Istio p99 request latency is {{ $value }}ms (threshold: 1000ms)."
query: "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000" query: "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, le)) > 1000"
severity: warning severity: warning
for: 1m for: 1m
- name: Istio Pilot Duplicate Entry - name: Istio Pilot Duplicate Entry
@@ -3041,7 +3041,7 @@ groups:
Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload. Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.
- name: Flink no records processed - name: Flink no records processed
description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes." description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes."
query: "rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0" query: "delta(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0"
severity: warning severity: warning
for: 5m for: 5m
comments: | comments: |
@@ -3125,6 +3125,7 @@ groups:
comments: | comments: |
When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0, When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
so this alert may not fire. Prefer application-level availability metrics if available. so this alert may not fire. Prefer application-level availability metrics if available.
Rename job="hadoop-namenode" to match the actual job name in your Prometheus scrape config.
# Alert rule for ResourceManager availability # Alert rule for ResourceManager availability
- name: Hadoop Resource Manager Down - name: Hadoop Resource Manager Down
@@ -3135,6 +3136,7 @@ groups:
comments: | comments: |
When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0, When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
so this alert may not fire. Prefer application-level availability metrics if available. so this alert may not fire. Prefer application-level availability metrics if available.
Rename job="hadoop-resourcemanager" to match the actual job name in your Prometheus scrape config.
# Alert rule for DataNode status # Alert rule for DataNode status
- name: Hadoop Data Node Out Of Service - name: Hadoop Data Node Out Of Service
@@ -3269,7 +3271,7 @@ groups:
severity: critical severity: critical
- name: Kubernetes PersistentVolume error - name: Kubernetes PersistentVolume error
description: "Persistent volume {{ $labels.persistentvolume }} is in bad state" description: "Persistent volume {{ $labels.persistentvolume }} is in bad state"
query: 'kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0' query: 'kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0'
severity: critical severity: critical
- name: Kubernetes StatefulSet down - name: Kubernetes StatefulSet down
description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down
@@ -3510,6 +3512,8 @@ groups:
query: 'up{job=~".*openstack.*"} == 0' query: 'up{job=~".*openstack.*"} == 0'
severity: critical severity: critical
for: 2m for: 2m
comments: |
Adjust the job label regex to match the actual job name in your Prometheus scrape config.
- name: OpenStack Nova agent down - name: OpenStack Nova agent down
description: "Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}" description: "Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}"
query: 'openstack_nova_agent_state{adminState="enabled"} == 0' query: 'openstack_nova_agent_state{adminState="enabled"} == 0'
@@ -3804,7 +3808,7 @@ groups:
for: 5m for: 5m
- name: GitLab CI pipeline failures increasing - name: GitLab CI pipeline failures increasing
description: "GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s)." description: "GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s)."
query: "rate(gitlab_ci_pipeline_failure_reasons[5m]) > 0.05" query: "deriv(gitlab_ci_pipeline_failure_reasons[5m]) > 0.05"
severity: warning severity: warning
for: 10m for: 10m
comments: | comments: |
@@ -3903,7 +3907,6 @@ groups:
comments: | comments: |
ResourceExhausted errors from Gitaly mean Git operations are being rejected due to ResourceExhausted errors from Gitaly mean Git operations are being rejected due to
concurrency limits. This directly impacts users trying to push, pull, or clone. concurrency limits. This directly impacts users trying to push, pull, or clone.
This alert is derived from the GitLab Omnibus default rules.
- name: GitLab Gitaly high RPC latency - name: GitLab Gitaly high RPC latency
description: "Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s)." description: "Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s)."
query: 'histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job="gitaly",grpc_type="unary"}[5m])) by (le)) > 1' query: 'histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job="gitaly",grpc_type="unary"}[5m])) by (le)) > 1'
@@ -4238,7 +4241,8 @@ groups:
query: 'up{job=~"snmp.*"} == 0' query: 'up{job=~"snmp.*"} == 0'
severity: critical severity: critical
for: 5m for: 5m
comments: From the official snmp-mixin. comments: |
Rename job=~"snmp.*" to match the actual job name in your Prometheus scrape config.
- name: SNMP interface down - name: SNMP interface down
description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up." description: "Interface {{ $labels.ifDescr }} on {{ $labels.instance }} is operationally down while administratively up."
query: '(ifOperStatus{job=~"snmp.*"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~"snmp.*"} == 1)' query: '(ifOperStatus{job=~"snmp.*"} == 2) and on(instance, job, ifIndex) (ifAdminStatus{job=~"snmp.*"} == 1)'
@@ -4378,7 +4382,7 @@ groups:
for: 5m for: 5m
- name: Cilium agent policy implementation delay - name: Cilium agent policy implementation delay
description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies." description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies."
query: "histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay[5m])) by (le, pod)) > 60" query: "histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60"
severity: warning severity: warning
for: 5m for: 5m
comments: Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity. comments: Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity.
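
Note on the hunk above: `histogram_quantile()` operates on classic histogram bucket series, which carry the `_bucket` suffix and an `le` label, and any aggregation wrapped around them has to keep `le`. A minimal sketch with a hypothetical histogram `example_request_duration_seconds`, including a `_count`-based guard of the kind the Thanos latency rules in this commit use:

```yaml
# Hypothetical metric, for illustration only.
- name: Example p99 latency high
  description: "p99 latency is {{ $value }}s."
  # le stays in the by() clause; without it histogram_quantile() has no bucket boundaries to work with.
  # The guard reads _count (one series per label set) instead of _bucket (one series per le value).
  query: "histogram_quantile(0.99, sum by (job, le) (rate(example_request_duration_seconds_bucket[5m]))) > 2 and sum by (job) (rate(example_request_duration_seconds_count[5m])) > 0"
  severity: warning
  for: 5m
```
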
@@ -4510,8 +4514,7 @@ groups:
for: 1m for: 1m
comments: | comments: |
ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR. ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
The official Ceph mixin splits this into separate warning (==1) and critical (==2) alerts. This rule fires on any non-OK state. Split into separate warning/critical rules by using ==1 and ==2 thresholds if needed.
This rule fires on any non-OK state. Adjust severity or split as needed.
- name: Ceph monitor clock skew - name: Ceph monitor clock skew
description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings description: Ceph monitor clock skew detected. Please check ntp and hardware clock settings
query: "abs(ceph_monitor_clock_skew_seconds) > 0.2" query: "abs(ceph_monitor_clock_skew_seconds) > 0.2"
@@ -4541,7 +4544,7 @@ groups:
for: 5m for: 5m
comments: | comments: |
Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%). Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
The official mixin uses ceph_health_detail for OSD space alerts. ceph_health_detail can also be used for more granular OSD space alerts.
- name: Ceph OSD reweighted - name: Ceph OSD reweighted
description: Ceph Object Storage Daemon takes too much time to resize. description: Ceph Object Storage Daemon takes too much time to resize.
query: "ceph_osd_weight < 1" query: "ceph_osd_weight < 1"
@@ -4874,12 +4877,12 @@ groups:
for: 5m for: 5m
- name: Thanos Compactor Halted - name: Thanos Compactor Halted
description: "Thanos Compact {{$labels.job}} has failed to run and now is halted." description: "Thanos Compact {{$labels.job}} has failed to run and now is halted."
query: 'thanos_compact_halted{job=~".*thanos-compact.*"} == 1' query: 'thanos_compact_halted == 1'
severity: warning severity: warning
for: 5m for: 5m
- name: Thanos Compactor High Compaction Failures - name: Thanos Compactor High Compaction Failures
description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions." description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions."
query: '(sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5) and sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) > 0' query: '(sum by (job) (rate(thanos_compact_group_compactions_failures_total[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total[5m])) * 100 > 5) and sum by (job) (rate(thanos_compact_group_compactions_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Thanos Compact Bucket High Operation Failures - name: Thanos Compact Bucket High Operation Failures
@@ -4919,12 +4922,12 @@ groups:
Filters to actual error codes only. grpc_code!="OK" would include benign codes like NotFound, AlreadyExists, and Cancelled. Filters to actual error codes only. grpc_code!="OK" would include benign codes like NotFound, AlreadyExists, and Cancelled.
- name: Thanos Query High D N S Failures - name: Thanos Query High D N S Failures
description: "Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints." description: "Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints."
query: '(sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))) * 100 > 1 and sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m])) > 0' query: '(sum by (job) (rate(thanos_query_store_apis_dns_failures_total[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total[5m]))) * 100 > 1 and sum by (job) (rate(thanos_query_store_apis_dns_lookups_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Thanos Query Instant Latency High - name: Thanos Query Instant Latency High
description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries." description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries."
query: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0)' query: '(histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query"}[5m])) > 0)'
severity: critical severity: critical
for: 10m for: 10m
- name: Thanos Query Range Latency High - name: Thanos Query Range Latency High
@@ -4952,22 +4955,22 @@ groups:
for: 10m for: 10m
- name: Thanos Receive High Replication Failures - name: Thanos Receive High Replication Failures
description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests." description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests."
query: 'thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"}))) * 100' query: 'thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error"}[5m])) / sum by (job) (rate(thanos_receive_replications_total[5m]))) > (max by (job) (floor((thanos_receive_replication_factor+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes))) * 100'
severity: warning severity: warning
for: 5m for: 5m
- name: Thanos Receive High Forward Request Failures - name: Thanos Receive High Forward Request Failures
description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests." description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests."
query: '(sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))/ sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))) * 100 > 20 and sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m])) > 0' query: '(sum by (job) (rate(thanos_receive_forward_requests_total{result="error"}[5m]))/ sum by (job) (rate(thanos_receive_forward_requests_total[5m]))) * 100 > 20 and sum by (job) (rate(thanos_receive_forward_requests_total[5m])) > 0'
severity: info severity: info
for: 5m for: 5m
- name: Thanos Receive High Hashring File Refresh Failures - name: Thanos Receive High Hashring File Refresh Failures
description: "Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed." description: "Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed."
query: '(sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0) and sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0' query: '(sum by (job) (rate(thanos_receive_hashrings_file_errors_total[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total[5m])) > 0) and sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Thanos Receive Config Reload Failure - name: Thanos Receive Config Reload Failure
description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations." description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations."
query: 'avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1' query: 'avg by (job) (thanos_receive_config_last_reload_successful) != 1'
severity: warning severity: warning
for: 5m for: 5m
- name: Thanos Receive No Upload - name: Thanos Receive No Upload
@@ -4987,7 +4990,7 @@ groups:
for: 5m for: 5m
- name: Thanos Sidecar No Connection To Started Prometheus - name: Thanos Sidecar No Connection To Started Prometheus
description: "Thanos Sidecar {{$labels.instance}} is unhealthy." description: "Thanos Sidecar {{$labels.instance}} is unhealthy."
query: 'thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0' query: 'thanos_sidecar_prometheus_up == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Store - name: Thanos Store
@@ -5000,7 +5003,7 @@ groups:
for: 5m for: 5m
- name: Thanos Store Series Gate Latency High - name: Thanos Store Series Gate Latency High
description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests." description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests."
query: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)' query: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count[5m])) > 0)'
severity: warning severity: warning
for: 10m for: 10m
- name: Thanos Store Bucket High Operation Failures - name: Thanos Store Bucket High Operation Failures
@@ -5018,12 +5021,12 @@ groups:
rules: rules:
- name: Thanos Rule Queue Is Dropping Alerts - name: Thanos Rule Queue Is Dropping Alerts
description: "Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s)." description: "Thanos Rule {{$labels.instance}} is failing to queue alerts ({{ $value | humanize }}/s)."
query: 'sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0' query: 'sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total[5m])) > 0'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Rule Sender Is Failing Alerts - name: Thanos Rule Sender Is Failing Alerts
description: "Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s)." description: "Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager ({{ $value | humanize }}/s)."
query: 'sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0' query: 'sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total[5m])) > 0'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Rule High Rule Evaluation Failures - name: Thanos Rule High Rule Evaluation Failures
@@ -5033,7 +5036,7 @@ groups:
for: 5m for: 5m
- name: Thanos Rule High Rule Evaluation Warnings - name: Thanos Rule High Rule Evaluation Warnings
description: "Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s)." description: "Thanos Rule {{$labels.instance}} has high number of evaluation warnings ({{ $value | humanize }}/s)."
query: 'sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0.05' query: 'sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total[5m])) > 0.05'
comments: | comments: |
Threshold of 0.05/s avoids firing on transient single-event spikes. Threshold of 0.05/s avoids firing on transient single-event spikes.
severity: info severity: info
@@ -5050,17 +5053,17 @@ groups:
for: 5m for: 5m
- name: Thanos Rule Config Reload Failure - name: Thanos Rule Config Reload Failure
description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration." description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration."
query: 'avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1' query: 'avg by (job, instance) (thanos_rule_config_last_reload_successful) != 1'
severity: info severity: info
for: 5m for: 5m
- name: Thanos Rule Query High D N S Failures - name: Thanos Rule Query High D N S Failures
description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints." description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints."
query: '(sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) > 0' query: '(sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Thanos Rule Alertmanager High D N S Failures - name: Thanos Rule Alertmanager High D N S Failures
description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints." description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints."
query: '(sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) > 0' query: '(sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total[5m])) * 100 > 1) and sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Thanos Rule No Evaluation For10 Intervals - name: Thanos Rule No Evaluation For10 Intervals
@@ -5070,7 +5073,7 @@ groups:
for: 5m for: 5m
- name: Thanos No Rule Evaluations - name: Thanos No Rule Evaluations
description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes." description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes."
query: 'sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0 and sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0' query: 'sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0 and sum by (job, instance) (thanos_rule_loaded_rules) > 0'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Bucket Replicate - name: Thanos Bucket Replicate
@@ -5078,12 +5081,12 @@ groups:
rules: rules:
- name: Thanos Bucket Replicate Error Rate - name: Thanos Bucket Replicate Error Rate
description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed." description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed."
query: '(sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))) * 100 >= 10 and sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m])) > 0' query: '(sum by (job) (rate(thanos_replicate_replication_runs_total{result="error"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total[5m]))) * 100 >= 10 and sum by (job) (rate(thanos_replicate_replication_runs_total[5m])) > 0'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Bucket Replicate Run Latency - name: Thanos Bucket Replicate Run Latency
description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations." description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations."
query: '(histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0)' query: '(histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket[5m]))) > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_count[5m])) > 0)'
severity: critical severity: critical
for: 5m for: 5m
- name: Thanos Component Absent - name: Thanos Component Absent
@ -5270,7 +5273,7 @@ groups:
severity: critical severity: critical
for: 24h for: 24h
comments: | comments: |
Official Tempo mixin normalizes by backend-worker count. Adjust threshold based on your compactor configuration. Threshold of 100 blocks per compactor instance. Normalize by backend-worker count if needed. Adjust based on your environment.
- name: Tempo distributor usage tracker errors - name: Tempo distributor usage tracker errors
description: "Tempo distributor usage tracker errors for {{ $labels.job }} at {{ $value | humanize }}/s (reason {{ $labels.reason }})." description: "Tempo distributor usage tracker errors for {{ $labels.job }} at {{ $value | humanize }}/s (reason {{ $labels.reason }})."
query: sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0.05 query: sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0.05
@ -5285,7 +5288,7 @@ groups:
for: 15m for: 15m
- name: Tempo metrics generator service graphs dropping spans - name: Tempo metrics generator service graphs dropping spans
description: Tempo metrics generator is dropping {{ printf "%.2f" $value }}% of spans in service graphs for {{ $labels.job }}. description: Tempo metrics generator is dropping {{ printf "%.2f" $value }}% of spans in service graphs for {{ $labels.job }}.
query: '100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 and sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0' query: '100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans_total[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 and sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0'
severity: warning severity: warning
for: 15m for: 15m
- name: Tempo metrics generator collections failing - name: Tempo metrics generator collections failing
@ -5453,7 +5456,7 @@ groups:
description: Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 30 minutes. description: Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 30 minutes.
query: (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 1800) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0 query: (time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 1800) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0
comments: | comments: |
Threshold aligned with official Mimir mixin (30 minutes). Threshold of 30 minutes. Adjust based on your sync interval.
severity: critical severity: critical
for: 5m for: 5m
- name: Mimir store gateway no synced tenants - name: Mimir store gateway no synced tenants
@ -5495,7 +5498,7 @@ groups:
description: "Mimir compactor has found {{ $value }} blocks that cannot be compacted (reason {{ $labels.reason }})." description: "Mimir compactor has found {{ $value }} blocks that cannot be compacted (reason {{ $labels.reason }})."
query: increase(cortex_compactor_blocks_marked_for_no_compaction_total[24h]) > 0 query: increase(cortex_compactor_blocks_marked_for_no_compaction_total[24h]) > 0
comments: | comments: |
Using 24h window per official mixin — compaction skips are rare events. Using a 24h window as compaction skips are rare events.
severity: warning severity: warning
for: 5m for: 5m
# Ruler # Ruler
@ -5616,6 +5619,8 @@ groups:
query: 'up{job=~".*otel.*collector.*"} == 0' query: 'up{job=~".*otel.*collector.*"} == 0'
severity: critical severity: critical
for: 1m for: 1m
comments: |
Adjust the job label regex to match the actual job name in your Prometheus scrape config.
- name: OpenTelemetry Collector receiver refused spans - name: OpenTelemetry Collector receiver refused spans
description: "OpenTelemetry Collector is refusing {{ $value | humanize }}/s spans on {{ $labels.receiver }}." description: "OpenTelemetry Collector is refusing {{ $value | humanize }}/s spans on {{ $labels.receiver }}."
query: 'rate(otelcol_receiver_refused_spans[5m]) > 0.05' query: 'rate(otelcol_receiver_refused_spans[5m]) > 0.05'
@ -5680,7 +5685,7 @@ groups:
for: 5m for: 5m
- name: OpenTelemetry Collector high memory usage - name: OpenTelemetry Collector high memory usage
description: "OpenTelemetry Collector memory usage is above 90%" description: "OpenTelemetry Collector memory usage is above 90%"
query: '(otelcol_process_runtime_heap_alloc_bytes{job=~".*otel.*collector.*"} / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes{job=~".*otel.*collector.*"}) > 0.9' query: '(otelcol_process_runtime_heap_alloc_bytes / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes) > 0.9'
severity: warning severity: warning
for: 5m for: 5m
- name: OpenTelemetry Collector OTLP receiver errors - name: OpenTelemetry Collector OTLP receiver errors