awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

Author	SHA1	Message	Date
Samuel Berthe	9ae17eca97	Fix broken and misleading alert rules (#503) - Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos) - Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*) - Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical - Simplify PostgreSQL config change query (giant regex -> negative matcher) - Downgrade PostgreSQL SSL compression severity from critical to warning - Fix misleading "Host unusual disk read rate" name and description	2026-03-15 18:08:06 +01:00
Marcin Morawski	eeebb90e6f	Add systemd service name to HostSystemdServiceCrashed summary (#499) * Add systemd service name to HostSystemdServiceCrashed summary * Modify systemd service crash rule description Updated the description for the systemd service crash rule to include the service name. --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-03-01 20:15:17 +01:00
dxrayz	e60601fdcd	tune Targets Missing rules (#497) * tune Targets Missing rules * reworked query logic * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-02-21 19:40:10 +01:00
Per Lundberg	51aea96ba7	Adjust OOM kill detected rule (#495) * Adjust OOM kill detected rule When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire. * Update rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2026-01-30 12:15:27 +01:00
Samuel Berthe	d400e3e64d	feat(k8s): cronjob rule (#491)	2026-01-07 13:57:42 +01:00
Simon Matic Langford	f810ff531d	Node exporter rules to preserve instance labels (#488) * Jenkins node offline for clause (#2) * Convert cpu alert expressions to without() rather than on() * Remove on() expression from network throughput alerts as labels fully match --------- Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>	2026-01-06 16:24:18 +01:00
Simon Matic Langford	79f2858037	Improve Jenkins node alerts to better handle servers with multiple nodes (#484)	2025-11-17 14:56:04 +01:00
Arve Knudsen	d58bc324ad	Add OpenTelemetry Collector monitoring alerts (#480) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>	2025-11-05 17:08:26 +01:00
andrii.k	9edef74e73	update kafka alerts (#478)	2025-10-13 14:24:37 +02:00
Riccardo Cannella	7832e01082	haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475) * haproxy: align v1 and v2 max current session alerts * fix: remove non-existing label --------- Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>	2025-09-15 15:03:44 +02:00
Samuel Berthe	237e89babc	Update query for unused replication slot rule	2025-09-14 19:22:05 +02:00
Sajjad hassanzadeh	a2c31358d1	Add couchdb alerts (#472) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting * add : couchdb roles config in rules.yml * add : couchdb alerts in rules directory --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-09-01 15:40:42 +02:00
Sajjad hassanzadeh	7bced89d2d	add : additional essential clickhouse alerts (#471) * add : additional essential clickhouse alerts * Add new ClickHouse alert rules for monitoring * linting --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-08-28 23:06:31 +02:00
Samuel Berthe	554850df41	Update rules.yml	2025-06-25 13:32:16 +02:00
Samuel Berthe	748524d580	Update rules.yml	2025-06-17 19:15:52 +02:00
Samuel Berthe	a5a3c2cd92	fix: HostHighCpuUsage (#466) closes #457	2025-06-17 17:07:05 +02:00
Samuel Berthe	4b1b8242cb	Update rules.yml	2025-05-21 23:04:12 +02:00
andrii.k	e0e3cdda1d	update istio 4xx alert description (#463)	2025-05-08 19:49:18 +02:00
Carsten Thiel	79f45a5146	Adding rules for checking FluxCD (#458)	2025-05-03 22:52:26 +02:00
samber	9f5c641bdd	Publish	2025-04-23 08:31:10 +00:00
samber	aca1bdf1fb	Publish	2025-04-23 08:28:06 +00:00
Samuel Berthe	4666830538	Update rules.yml	2025-04-23 10:18:08 +02:00
Roger	b3d25fafcf	feature/kubestate exporter check if node is scheduling disabeld (#462) * feature/kubestate-exporter-check-if-node-is-scheduling-disabeld * commented added * typo in expr * move code to right file --------- Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com> Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-04-23 09:58:29 +02:00
Samuel Berthe	3b440fec7b	Remove buggy HostRequiresReboot rule Closing #459	2025-04-17 17:26:00 +02:00
Samuel Berthe	8b730ef059	Update rules.yml	2025-03-27 17:23:19 +01:00
Motte	69c8208e3c	Added PostgresqlReplicationLagHigh rule (#456) * Added PostgresqlReplicationLagHigh rule * Update PostgreSQL replication lag alert settings --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-03-27 14:42:22 +01:00
Pigueiras	97a31f34e5	Fix queries in elasticsearch latency alerts (#455) The `elasticsearch_indices_search_fetch_total`, `elasticsearch_indices_search_fetch_time_seconds`, `elasticsearch_indices_indexing_index_time_seconds_total` and `elasticsearch_indices_indexing_index_total` metrics are counters. Dividing these metrics doesn't make sense because a spike in numerator would cause the alert to persist, even if subsequent fetch/index operations are normal. Adding `increase` changes the query to check if operations took, on average, more than X over a 1-minute interval, which was likely the original intent of this alert.	2025-03-26 22:15:24 +01:00
Samuel Berthe	2127c4ce90	Update rules.yml	2025-02-20 16:17:39 +01:00
Roman	c189984d0f	fix node-exporter.yaml missing parentheses (#452)	2025-02-20 15:05:48 +01:00
Samuel Berthe	6838196343	fix: remove duplicated rule	2025-02-19 15:25:29 +01:00
dzaczek	11a78f0f06	Update google-cadvisor.yml (#382) * Update google-cadvisor.yml Expression Explanation: The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires. Alert Details: - Alert Name: ContainerHighLowChangeCpuUsage - Trigger Condition: Absolute change in CPU usage exceeding 25% - Alert Severity: Informational (info) * Add alert rule for high CPU usage change * Change alert severity from warning to info --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-02-16 23:46:53 +01:00
Samuel Berthe	add097c489	data: revert `5f57f09` (see #398)	2025-02-16 23:36:44 +01:00
asdf1234	4a7b9b5c72	Update mysqld-exporter.yml (#442) * Update mysqld-exporter.yml add some rules * Add new MySQL monitoring rules --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-02-16 23:29:00 +01:00
Samuel Berthe	fb857e8b39	data: fix rules	2025-02-16 23:16:36 +01:00
Samuel Berthe	ae12871fa9	Update rules.yml	2025-02-04 16:40:21 +01:00
Felix Bühler	10d00c66da	Add caddy.yml (#450)	2025-02-04 14:23:14 +01:00
Samuel Berthe	fc6b3faadc	Fix from #405	2025-01-28 06:04:10 +01:00
Samuel Berthe	d916b7c6ab	Fix from #405	2025-01-28 05:58:49 +01:00
sunlei	cbb2337438	fix: formatting errors (#448) * fix: formatting errors * Update query format in rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-01-12 22:01:21 +01:00
Samuel Berthe	bdcc67c04e	Update rules.yml	2024-12-16 12:17:59 +01:00
Samuel Berthe	84a3b517a8	Update rules.yml	2024-12-16 12:17:26 +01:00
Samuel Berthe	a8d7c43b30	Update rules.yml	2024-12-08 21:28:07 +01:00
Samuel Berthe	8c3d06502f	Update rules.yml	2024-12-05 23:37:28 +01:00
Martin Anderson	353ef1ed95	RabbitMQ: add too many ready messages alert (#441) * RabbitMQ: add too many ready messages alert * Add RabbitMQ ready messages alert rule --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2024-11-30 10:29:57 +01:00
sipr-invivo	bb75cb2c68	feat: Add rule to Kubernetes Job not starting (#436)	2024-10-28 22:24:10 +01:00
Samuel Berthe	f08e8df514	oops	2024-08-28 08:48:42 +02:00
Samuel Berthe	995ab4d27a	Update rules.yml	2024-08-28 08:46:41 +02:00
Somrat Dutta	8c0bdc2b24	feat: Add NATS and JetStream Prometheus alert rules (#430) * feat: Add comprehensive NATS and JetStream Prometheus alert rules - Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics. - Included alerts for: - High connection count - High pending bytes - High subscriptions count - High routes count - High memory usage - Slow consumers - NATS server downtime - High CPU usage - High number of active connections - High JetStream store and memory usage - Subscription limits exceeded - High pending messages - Authentication timeouts - Errors in NATS (JetStream API errors) - JetStream consumers limit exceeded - Exceeding max payload size - Leaf node connection issues - Ping operations limit exceeded - Write deadline exceeded - Ensured consistency between `exporter.yml` and `rules.yml` files. - Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability. This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance. * Update rules.yml * - minor changes, rollback rules.yml - address comment changes - revert to old rules.yml as they are generated * - minor changes, rollback rules.yml - address comment changes - revert to old rules.yml as they are generated * fix indentation --------- Co-authored-by: somratdutta <duttasomratand.com> Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr> Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>	2024-08-20 20:37:03 +02:00
Samuel Berthe	d1715de751	fix PostgresqlInvalidIndex rule	2024-08-20 18:31:18 +02:00
Samuel Berthe	47e74f65e0	Update rules.yml	2024-07-02 09:33:51 +02:00

452 commits