awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

Author	SHA1	Message	Date
nucocloud	4c9da9ed24	Add LiteLLM section to Other group with 3 alerting rules (#553 ) LiteLLM (https://github.com/BerriAI/litellm) is a popular LLM-gateway/proxy that exposes Prometheus metrics via its built-in callback. There were no existing alerting rules for LiteLLM in this repo, despite its growing adoption as an OpenAI/Anthropic-compatible proxy. Added 3 alerts covering the most common operational concerns: 1. LiteLLM provider spend over budget — soft-warning on cumulative 24h spend per model-name regex. Useful when LiteLLM's native `provider_budget_config` hard-cap is unavailable, disabled, or buggy (e.g. BerriAI/litellm#26701). 2. LiteLLM proxy failed requests rate high — error-rate ratio alert for downstream LLM provider availability/auth issues. 3. LiteLLM request latency p95 high — histogram-quantile alert for downstream provider response-time degradation. All 3 rules tested via `promtool check rules` (SUCCESS) and validated on a real LiteLLM v1.83.7 production deployment. Reference: https://docs.litellm.ai/docs/proxy/prometheus	2026-04-29 15:03:07 +02:00
Samuel Berthe	353133d23f	jaeger v2 otel exporter alerts (#552 ) * feat(jaeger): add v2 OTEL-based alerts and keep v1 as legacy Jaeger v2 is built on OpenTelemetry Collector and no longer exposes jaeger_agent_* / jaeger_collector_* / jaeger_client_* metrics. - Add "Embedded exporter (v2+)" with 8 rules targeting: - jaeger_storage_requests_total (error rate, unavailability, no reads) - jaeger_storage_latency_seconds_bucket (p99 latency) - http_server_request_duration_seconds_* via otelhttp (search errors, search latency, single-trace retrieval latency, service discovery errors) - Rename existing exporter to "Embedded exporter (legacy, <v2)" with slug embedded-exporter-legacy and a v1 EOL notice (Dec 31 2025) * chore: adding node version to github action	2026-04-22 00:55:36 +02:00
Samuel Berthe	c2615fae52	fix/promql rules review 2 (#534 ) * fix(data): fix queries and thresholds across multiple exporters - Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add for: durations - ZFS: improve description, remove incorrect ON() join on readonly check - Thanos: filter gRPC errors to actual error codes only (drop NotFound, Cancelled, etc.) - Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job, route, le) - Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations - OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments - Tempo/Cortex: raise >0 thresholds to avoid transient spikes - APC UPS: add division-by-zero guard on battery voltage ratio - DigitalOcean: raise increase()>0 to >3 - Grafana Alloy: fix missing name: field on exporter - Graph Node: add threshold comments * fix(data): remove official mixin reference from Ceph OSD comment * fix(data): remove official mixin references from comments	2026-04-06 21:14:15 +02:00
Samuel Berthe	2258835c30	fix/promql rules review (#533 ) * fix(data): comprehensive PromQL review across all ~937 rules Query fixes: - Replace rate()/increase() with deriv()/delta() on gauge metrics exposed as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_, systemd_socket_refused_connections_total) - Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module) - Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn) → gnatsd_varz_subscriptions (server total) - Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0 - Fix RabbitMQ total connections metric: connectionsTotal → connections - Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct gauge comparison (deriv > 0 misses stable non-zero failure states) - Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count → certmanager_acme_client_request_count (renamed in v1.19+) - Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes - Fix Flink duplicate comments: field (YAML last-write-wins bug) - Add datid!="0" filter to PostgreSQL dead locks query - Fix PostgreSQL high rollback rate: restructure division-by-zero guard and move ratio calculation outside sum() - Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager memory, Hadoop HBase heap, Vault cluster health - Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/ OSD Down/PG unavailable Threshold fixes: - Replace > 0 with meaningful thresholds on rate()/increase() queries across: Alertmanager, eBPF decoder errors, systemd refused connections, Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager, OTel Collector receiver refused metrics - Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s (0.5ms was below normal 
operating range; 10ms is more realistic) Description fixes: - Fix MySQL slow queries: remove duplicate "mysql" word - Fix SMART device description: remove trailing stray ")" (6 rules) - Fix host disk IO description: remove duplicate "Check storage for issues." - Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute" - Fix EDAC uncorrectable errors: remove time-window claim (raw counter) - Fix Mimir store-gateway sync description: said "10 minutes" but threshold is 1800s (30 minutes) - Fix Vault description false "%" suffix on count values - Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy, Istio rules to include {{ $labels }} and {{ $value }} template vars - Downgrade Cassandra key cache hit rate: critical → warning Comments: - Add note on node_vmstat_oom_kill gauge type (delta vs increase) - Add note on systemd_socket_refused_connections_total gauge type - Add note on mysql_global_status_ gauge type (delta/deriv vs rate) - Add note on pg_txid_current requiring a custom postgres_exporter query - Add note on pg_stat_ssl_compression availability (PG 9.5-13 only) - Add note on cert-manager legacy metric name for users on v1.18 and older - Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules - Add note on NATS leaf node spurious fires when leaf nodes not configured * fix(data): PromQL type fixes, job filter cleanup, query correctness review - Replace rate()/increase() with deriv()/delta() on gauge metrics: node_vmstat_pgmajfault, cassandra_stats (criteo exporter), gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn - Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay - Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause - Fix Thanos query latency: use _count instead of _bucket for guard clause - Restore job filter in Thanos objstore guard clauses (compact + store) - Remove redundant job= filters from unique metrics: ~30 Thanos rules, 
kube_persistentvolume_status_phase, otelcol_process_runtime_* - Fix high-cardinality Istio latency grouping (drop source labels from by()) - Add division-by-zero guard to host context switch ratio - Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10 - Remove redundant for: 1m from HAProxy check failure rules - Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel) - Remove external mixin references from comments - Fix Tempo dropped spans metric name: add missing _total suffix - Fix Thanos bucket replicate run latency: add missing le label in by()	2026-04-06 20:38:12 +02:00
Emil Bostijancic	7ba6b2d367	feat: add OpenSearch alerting rules (OpenSearch exporter plugin) (#532)	2026-03-31 16:39:38 +02:00
Samuel Berthe	e3a7165a65	fix(data): remove malformed summary fields, replace increase() by rate(), remove redundant avg_over_time	2026-03-18 21:40:30 +01:00
Samuel Berthe	1aafa40913	fix(data): prevent division by 0	2026-03-18 18:06:00 +01:00
Samuel Berthe	a4581ed322	fix(data): fix thresholds, comments, intervals, units... (#529)	2026-03-18 12:22:55 +01:00
Samuel Berthe	03963ef6f9	refactor(categories): change categories and move some exporters (#528)	2026-03-17 13:30:13 +01:00
Samuel Berthe	2b99cf1f76	Feat/cilium alerting rules (#526 ) * Add .worktrees/ to .gitignore * feat: add Cilium alerting rules (32 rules across agent, operator, ClusterMesh, KVStoreMesh, Hubble) * fix: use job label instead of k8s_app, switch to single-quoted YAML strings * remove Cilium agent high restart rate alert	2026-03-16 17:10:59 +01:00
Samuel Berthe	5071e01ad9	Feature/spinnaker alerts (#527 ) * Add .worktrees/ to .gitignore * feat: add Spinnaker alerting rules (12 rules) Add Prometheus alerting rules for Spinnaker built-in exporter covering Orca queue health, circuit breakers, Igor polling monitors, Gate API throttling, Clouddriver errors, and AWS rate limiting. Metric names validated against uneeq-oss/spinnaker-mixin dashboards. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 16:52:31 +01:00
Samuel Berthe	1455e0fd77	feat: add Oracle Database alerting rules (8 rules) (#525 ) Add Prometheus alerting rules for Oracle Database using iamseth/oracledb_exporter. Rules based on Grafana oracledb-mixin and exporter default metrics: - DB down, session/process limit, tablespace capacity (warning+critical), high rollbacks, active sessions, user I/O wait time.	2026-03-16 16:39:35 +01:00
Samuel Berthe	d8315eb3bc	Feature/cert manager rules (#524 ) * Add .worktrees/ to .gitignore * feat: add cert-manager alerting rules (4 rules) Add Prometheus alerting rules for cert-manager under the "Network, security and storage" category: - Cert-Manager absent (service down detection) - Certificate expiring soon (21-day threshold) - Certificate not ready (readiness check) - Hitting ACME rate limits (rate limit detection) Based on imusmanmalik/cert-manager-mixin and official cert-manager metrics documentation. * docs: add cert-manager to README	2026-03-16 15:01:07 +01:00
Samuel Berthe	b58b498bbb	feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) (#523 ) * feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins. Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more. * fix: address PR review comments on Tempo/Mimir rules - Fix Tempo no tenant index builders: add on() for cross-label-set and - Fix Tempo block list rising: output percentage instead of ratio - Fix Mimir memory map areas: multiply by 100 to match % description - Fix all instance limit rules: multiply by 100 to match % descriptions - Fix distributor inflight requests: add % to description	2026-03-16 14:36:50 +01:00
Samuel Berthe	7ee16641ac	feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520 ) * feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) * fix: grammar in WireGuard rule comment	2026-03-16 14:20:17 +01:00
Samuel Berthe	f974552ef1	Feat/jaeger alerting rules (#521 ) * Add .worktrees/ to .gitignore * feat: add Jaeger alerting rules (8 rules from official jaeger-mixin) Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops, sampling update failures, throttling update failures, and query request failures. All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin * fix: rename Jaeger agent RPC alert to Jaeger client RPC The jaeger_client_jaeger_rpc_http_requests metric is client-side, not agent-side. Rename alert to match the actual metric source.	2026-03-16 14:09:03 +01:00
Samuel Berthe	8b443be6d2	feat: add systemd_exporter alerting rules (7 rules) (#522 ) * feat: add systemd_exporter alerting rules (7 rules) Add new Systemd service under Basic resource monitoring with rules for: - Unit failed/inactive state detection - Service crash loop detection - Task limit exhaustion - Socket refused/high connections - Timer missed trigger * fix: narrow systemd unit inactive query to reduce noise Add type="service" and name filter to the inactive unit alert to avoid false positives from legitimately inactive units.	2026-03-16 14:07:14 +01:00
Samuel Berthe	30bbedbc79	feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519 ) * feat: add Cloud providers alerting rules (33 rules across 4 exporters) New "Cloud providers" category with rules for: - AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda - Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness - DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents - Azure (5 rules): API errors, rate limits, collection performance * fix: address PR review - move Cloud providers before Other, fix service name - Move "Cloud providers" group before "Other" in rules.yml for consistent ordering - Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid awkward /-/ in generated anchors and dist/rules/ paths - Fix README anchor link to match the new service name	2026-03-16 14:06:59 +01:00
Samuel Berthe	fd3bfb02c0	Some fix (#516 ) * fix: use proper zero-traffic guard in Envoy ratio alerts (#511) Replace `+ 1` denominator hack with `and ... > 0` filter in upstream timeout rate and upstream 5xx error rate queries for mathematical correctness and repo consistency. * feat: add alerting rules for prometheus/memcached_exporter * fix: add division-by-zero guards and improve quoting in memcached rules (#512) - Add `and memcached_max_connections > 0` to connection limit queries - Add `and memcached_limit_bytes > 0` to memory usage query - Switch hit-rate query to single quotes for cleaner PromQL readability * fix: fix SNMP interface down query and add job scoping (#507) - Fix ifOperStatus query to use vector matching instead of label filter since ifAdminStatus is a separate metric in snmp_exporter output - Add job=~"snmp." matcher to interface error rate, bandwidth usage, and interface down rules to prevent matching non-SNMP series Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 04:50:01 +01:00
Samuel Berthe	97aae5dabf	feat: add GitLab alerting rules (28 rules across 3 exporters) (#518 ) Add new GitLab service under "Other" category with 3 exporters: - Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs, database connection pool, CI/CD pipelines, Ruby process health - Workhorse (3 rules): HTTP error rate, latency, in-flight requests - Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency, CPU throttling, auth failures, circuit breaker All metrics verified against gitlabhq/gitlabhq source code. Several rules derived from GitLab Omnibus default alerting rules.	2026-03-16 04:48:52 +01:00
Samuel Berthe	e6cdcdb9e5	feat: add Apache Flink and Apache Spark alerting rules Add 20 new alerting rules under the Runtimes category: - Apache Flink (12 rules): job status, TaskManager registration, slot availability, restarts, checkpoints, backpressure, heap memory, GC, and record processing - Apache Spark (8 rules): worker health, waiting apps, memory/cores exhaustion, executor GC, task failures, and disk spill	2026-03-16 04:46:00 +01:00
Samuel Berthe	88e2c19017	feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517 ) * feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) * fix: correct Keycloak metrics-spi metric names and query grouping	2026-03-16 04:40:15 +01:00
Samuel Berthe	20651aa10d	feat: add OpenStack alerting rules (openstack-exporter) (#515 ) * feat: add OpenStack alerting rules (openstack-exporter) Add 20 alerting rules for openstack-exporter/openstack-exporter covering Nova, Neutron, Cinder, Octavia, and Placement services. * docs: add OpenStack to README services list * fix: align OpenStack load balancer alert name with operating_status semantics The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values, not ACTIVE. Rename alert to "not online" and use the label in the description for clarity.	2026-03-16 03:43:51 +01:00
Samuel Berthe	bf7b902881	feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514 ) * feat: add process-exporter alerting rules (ncabatoff/process-exporter) * docs: add Process to README services list * fix: address PR review feedback for process-exporter rules - Rename service from "Process" to "Process Exporter" for clarity - Fix grammar: "file descriptors usage" → "file descriptor usage" - Clarify CPU alert description as core-equivalent percentage - Rename "high disk IO" to "high disk write IO" for accuracy	2026-03-16 03:31:18 +01:00
Samuel Berthe	2b239736cf	feat: add alerting rules for prometheus/memcached_exporter (#512)	2026-03-16 03:25:38 +01:00
Samuel Berthe	281142567c	fix: use proper zero-traffic guard in Envoy ratio alerts (#511) (#513) Replace `+ 1` denominator hack with `and ... > 0` filter in upstream timeout rate and upstream 5xx error rate queries for mathematical correctness and repo consistency.	2026-03-16 03:25:27 +01:00
Samuel Berthe	f97f692596	feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509) Add 9 alerting rules for Proxmox VE covering node/guest status, CPU, memory, storage, backup coverage, replication, and cluster quorum.	2026-03-16 03:12:06 +01:00
Samuel Berthe	be7a2e4d5d	feat: add IPMI exporter alerting rules (#510 ) * feat: add IPMI exporter alerting rules Add 17 alerting rules for prometheus-community/ipmi_exporter covering temperature, fan, voltage, current, power sensors, chassis status, and system event log monitoring. * docs: add IPMI to README service list * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 03:10:10 +01:00
Samuel Berthe	c064d2264e	feat: add Envoy proxy alerting rules using built-in metrics (#511 ) Add 19 alerting rules for Envoy proxy under "Reverse proxies and load balancers" using native metrics from /stats/prometheus endpoint. Covers: server health, HTTP error rates (downstream/upstream), connection saturation, cluster membership, health checks, outlier detection, SSL/TLS certificate expiry, circuit breakers, and request timeouts.	2026-03-16 03:03:57 +01:00
Samuel Berthe	89e703d763	feat: add alerting rules for cloudflare/ebpf_exporter (#508 ) * feat: add alerting rules for cloudflare/ebpf_exporter * docs: add eBPF to README service list	2026-03-16 02:56:04 +01:00
Samuel Berthe	3db9281508	feat: add SNMP exporter alerting rules (#507) Add 7 alerting rules for prometheus/snmp_exporter covering device availability, interface status, error rates, bandwidth utilization, and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.	2026-03-16 02:34:34 +01:00
Samuel Berthe	c37ef8f50c	fix: review and fix 74 database & broker alert rules (#504 ) * fix: review and fix 74 database & broker alert rules Comprehensive review of all database and broker alerts covering 16 services. Typos & descriptions (8 fixes): - PGBouncer: "a a server" → "a server" - RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ", "unactive" → "inactive" - Cassandra: write failure said "Read failures", "bad hacker" → "authentication failures" - Solr: replication errors said "failed updates" - Meilisearch: "index is empty" said "instance is down" Duplicates removed (5 fixes): - PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total) - ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule - NATS: 2 rules with low thresholds duplicated better rules Broken queries (20 fixes): - Patroni: patroni_master → patroni_primary (renamed in v3) - MongoDB: rate() on gauge → direct ratio for connection queries - MongoDB: removed WiredTiger-incompatible virtual memory rule - Cassandra instaclustr: avg() on counter → rate()[5m] - Cassandra criteo: increase() on JMX rate metric → direct threshold - ClickHouse: increase() on gauge → direct threshold - NATS: rate() on gauge → direct comparison, removed 4 config-value rules - SQL Server: increase() on gauge → direct threshold - Pulsar: moved comparison outside sum() (4 rules) - Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h] Severity adjustments (7 fixes): - Redis: backup threshold 24h → 48h, rejected connections → warning > 5 - RabbitMQ: no consumer for: 5m with comment - Elasticsearch: unassigned shards added for: 2m - CouchDB: process restarted critical → info - Kafka: consumer group lag → warning, threshold 10000, better description - Hadoop: HBase heap low critical → warning Missing for duration (18 fixes): - Added for: 1m to service-down alerts across MySQL, PostgreSQL, SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch, Cassandra, Zookeeper 
with restart-tolerance comments Division by zero guards (9 fixes): - Added denominator > 0 guards to ratio queries in PostgreSQL, RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS Query design improvements (5 fixes): - Cassandra: removed unnecessary sum() and redundant avg_over_time() - ClickHouse: ZooKeeper avg() → per-instance check - PostgreSQL: sum() → sum by (instance) for SSL and locks - PGBouncer: 30s range window → 2m Hardcoded labels (2 fixes): - ClickHouse: added comment about job="clickhouse" - Cassandra criteo: removed hardcoded service="cas" * fix: address PR review comments - Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL) - Elasticsearch query latency: add division-by-zero guard - Redis backup: "backuped" → "backed up"	2026-03-16 01:27:18 +01:00
Samuel Berthe	080a792777	data: adding python/ruby/golang (#502 ) * data: adding python/ruby/golang * fix: address review feedback on runtime alerts - JVM non-heap: guard against unbounded metaspace (max_bytes = -1) - JVM old gen GC: note regex only matches CMS/G1/Parallel collectors - JVM/Python file descriptors: note process_* metrics are generic - Go memory usage: fix description (sys_bytes is runtime memory, not host) - Go goroutine spike: use deriv() instead of rate() on gauge - Go GC CPU fraction: note deprecation since Go 1.20 - Go GC duration: clarify quantile="1" is max, not p99 - Python uncollectable: use increase() on counter instead of raw threshold - Add threshold comments for workload-dependent defaults	2026-03-15 19:46:39 +01:00
Samuel Berthe	9ae17eca97	Fix broken and misleading alert rules (#503 ) - Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos) - Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*) - Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical - Simplify PostgreSQL config change query (giant regex -> negative matcher) - Downgrade PostgreSQL SSL compression severity from critical to warning - Fix misleading "Host unusual disk read rate" name and description	2026-03-15 18:08:06 +01:00
Marcin Morawski	eeebb90e6f	Add systemd service name to HostSystemdServiceCrashed summary (#499 ) * Add systemd service name to HostSystemdServiceCrashed summary * Modify systemd service crash rule description Updated the description for the systemd service crash rule to include the service name. --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-03-01 20:15:17 +01:00
dxrayz	e60601fdcd	tune Targets Missing rules (#497 ) * tune Targets Missing rules * reworked query logic * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-02-21 19:40:10 +01:00
Per Lundberg	51aea96ba7	Adjust OOM kill detected rule (#495 ) * Adjust OOM kill detected rule When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire. * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-01-30 12:15:27 +01:00
Samuel Berthe	d400e3e64d	feat(k8s): cronjob rule (#491)	2026-01-07 13:57:42 +01:00
Simon Matic Langford	f810ff531d	Node exporter rules to preserve instance labels (#488 ) * Jenkins node offline for clause (#2) * Convert cpu alert expressions to without() rather than on() * Remove on() expression from network throughput alerts as labels fully match --------- Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>	2026-01-06 16:24:18 +01:00
Simon Matic Langford	79f2858037	Improve Jenkins node alerts to better handle servers with multiple nodes (#484)	2025-11-17 14:56:04 +01:00
Arve Knudsen	d58bc324ad	Add OpenTelemetry Collector monitoring alerts (#480 ) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>	2025-11-05 17:08:26 +01:00
andrii.k	9edef74e73	update kafka alerts (#478)	2025-10-13 14:24:37 +02:00
Riccardo Cannella	7832e01082	haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475 ) * haproxy: align v1 and v2 max current session alerts * fix: remove non-existing label --------- Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>	2025-09-15 15:03:44 +02:00
Samuel Berthe	237e89babc	Update query for unused replication slot rule	2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh	a2c31358d1	Add couchdb alerts (#472 ) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting * add : couchdb roles config in rules.yml * add : couchdb alerts in rules directory --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh	7bced89d2d	add : additional essential clickhouse alerts (#471 ) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-08-28 23:06:31 +02:00
Samuel Berthe	554850df41	Update rules.yml	2025-06-25 13:32:16 +02:00
Samuel Berthe	748524d580	Update rules.yml	2025-06-17 19:15:52 +02:00
Samuel Berthe	a5a3c2cd92	fix: HostHighCpuUsage (#466) closes #457	2025-06-17 17:07:05 +02:00
Samuel Berthe	4b1b8242cb	Update rules.yml	2025-05-21 23:04:12 +02:00

485 commits