awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 17:07:24 +08:00

Author	SHA1	Message	Date
samber	67cf6892a4	Publish	2025-05-20 06:21:45 +00:00
jaqxues	98d6e7db05	Alloy: Fix incorrect alert (#464)	2025-05-20 08:21:14 +02:00
samber	becbe1be3b	Publish	2025-05-08 17:49:45 +00:00
samber	fd9da90c1d	Publish	2025-05-03 20:52:49 +00:00
samber	9f5c641bdd	Publish	2025-04-23 08:31:10 +00:00
samber	aca1bdf1fb	Publish	2025-04-23 08:28:06 +00:00
samber	198035eaf4	Publish	2025-04-23 07:58:55 +00:00
samber	a75d5124c5	Publish	2025-04-17 15:26:25 +00:00
samber	32a4bfb19b	Publish	2025-03-27 16:23:49 +00:00
samber	93f9daecee	Publish	2025-03-27 13:42:51 +00:00
Motte	69c8208e3c	Added PostgresqlReplicationLagHigh rule (#456) * Added PostgresqlReplicationLagHigh rule * Update PostgreSQL replication lag alert settings --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-03-27 14:42:22 +01:00
Pigueiras	97a31f34e5	Fix queries in elasticsearch latency alerts (#455) The `elasticsearch_indices_search_fetch_total`, `elasticsearch_indices_search_fetch_time_seconds`, `elasticsearch_indices_indexing_index_time_seconds_total` and `elasticsearch_indices_indexing_index_total` metrics are counters. Dividing these metrics directly doesn't make sense because a spike in the numerator would cause the alert to persist, even if subsequent fetch/index operations are normal. Adding `increase` changes the query to check if operations took, on average, more than X over a 1-minute interval, which was likely the original intent of this alert.	2025-03-26 22:15:24 +01:00
samber	7bcae33011	Publish	2025-02-20 15:18:08 +00:00
samber	9963b750ac	Publish	2025-02-20 14:06:17 +00:00
samber	807db03d0d	Publish	2025-02-19 14:25:58 +00:00
samber	4e49e77d29	Publish	2025-02-16 22:47:17 +00:00
dzaczek	11a78f0f06	Update google-cadvisor.yml (#382) * Update google-cadvisor.yml Expression Explanation: The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires. Alert Details: - Alert Name: ContainerHighLowChangeCpuUsage - Trigger Condition: Absolute change in CPU usage exceeding 25% - Alert Severity: Informational (info) * Add alert rule for high CPU usage change * Change alert severity from warning to info --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-02-16 23:46:53 +01:00
samber	7889a9a29b	Publish	2025-02-16 22:37:09 +00:00
samber	12b8acb1b8	Publish	2025-02-16 22:29:24 +00:00
asdf1234	4a7b9b5c72	Update mysqld-exporter.yml (#442) * Update mysqld-exporter.yml add some rules * Add new MySQL monitoring rules --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-02-16 23:29:00 +01:00
samber	20f9a36615	Publish	2025-02-16 22:17:02 +00:00
Felix Bühler	10d00c66da	Add caddy.yml (#450)	2025-02-04 14:23:14 +01:00
guruevi	70ac7d9cae	Various updates and quality of life changes (#405 ) * smartctl_exporter publishes both drive_trip and current drive temperatures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensors likewise vary, using the last sample is not sufficient to alert on potential issues. * Add an option to run GitHub Action manually * Add an option to force running the action for testing purposes * Set variables correctly * Set variables correctly * Publish * Clean up some more metrics * Publish * Minor bug fixes * Publish * Removed queries that throw errors when systems are upgraded. Also fixed and simplified a few Postgres queries. * Publish * Refined some more queries * Publish * PostgreSQL now has optimized autovacuum behavior * Publish * PostgreSQL now has optimized autovacuum behavior * Publish * Publish * Query fails if instance names are not unique across jobs. This fixes it. * Publish * Ruby is out of date --------- Co-authored-by: samber <samber@users.noreply.github.com>	2025-01-28 06:06:47 +01:00
sunlei	cbb2337438	fix: formatting errors (#448) * fix: formatting errors * Update query format in rules.yml --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2025-01-12 22:01:21 +01:00
samber	53a369769d	Publish	2024-12-16 11:19:08 +00:00
samber	4533f23b79	Publish	2024-12-16 11:17:17 +00:00
dxrayz	52d4a8c744	Update postgres-exporter.yml (#444) Modify PostgresqlConfigurationChanged to prevent the error "many-to-many matching not allowed: matching labels must be unique on one side" when there are multiple instances of postgres	2024-12-16 12:16:05 +01:00
samber	c5203e94d0	Publish	2024-12-08 20:29:15 +00:00
samber	4e38ae2087	Publish	2024-12-05 22:38:38 +00:00
samber	8a220b1b8a	Publish	2024-11-30 09:31:05 +00:00
Martin Anderson	353ef1ed95	RabbitMQ: add too many ready messages alert (#441) * RabbitMQ: add too many ready messages alert * Add RabbitMQ ready messages alert rule --------- Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2024-11-30 10:29:57 +01:00
samber	14949721ba	Publish	2024-10-28 21:25:18 +00:00
samber	4aa45dee05	Publish	2024-08-28 06:49:52 +00:00
Somrat Dutta	8c0bdc2b24	feat: Add NATS and JetStream Prometheus alert rules (#430 ) * feat: Add comprehensive NATS and JetStream Prometheus alert rules - Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics. - Included alerts for: - High connection count - High pending bytes - High subscriptions count - High routes count - High memory usage - Slow consumers - NATS server downtime - High CPU usage - High number of active connections - High JetStream store and memory usage - Subscription limits exceeded - High pending messages - Authentication timeouts - Errors in NATS (JetStream API errors) - JetStream consumers limit exceeded - Exceeding max payload size - Leaf node connection issues - Ping operations limit exceeded - Write deadline exceeded - Ensured consistency between `exporter.yml` and `rules.yml` files. - Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability. This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance. * Update rules.yml * - minor changes, rollback rules.yml - address comment changes - revert to old rules.yml as they are generated * - minor changes, rollback rules.yml - address comment changes - revert to old rules.yml as they are generated * fix indentation --------- Co-authored-by: somratdutta <duttasomratand.com> Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr> Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>	2024-08-20 20:37:03 +02:00
samber	02687db33d	Publish	2024-08-20 16:32:36 +00:00
samber	58ade95b8b	Publish	2024-07-02 07:34:59 +00:00
Greg	9557d4b50e	feat(meilisearch): add basic set of rules (#425) * feat(meilisearch): add basic meilisearch rules * fix(query): use == instead of = * fix(data): set correct name and use == * chore(meilisearch): remove index filter	2024-07-02 09:33:08 +02:00
samber	60c235975c	Publish	2024-06-14 18:16:53 +00:00
samber	1ee046b739	Publish	2024-06-06 20:54:49 +00:00
samber	8759c50440	Publish	2024-05-23 12:45:56 +00:00
samber	7dd767c4b4	Publish	2024-05-15 06:10:06 +00:00
samber	826be5877f	Publish	2024-05-14 18:44:11 +00:00
R.Sicart	262e451625	kube hpa lint and improvement (#417) * fix: hpa alerts are using label but the queries remove it Signed-off-by: R.Sicart <roger.sicart@gmail.com> * fix: hpa alert is using label but the query removes it Signed-off-by: R.Sicart <roger.sicart@gmail.com> * feat: hpa scale max should not alert when min and max are the same Signed-off-by: R.Sicart <roger.sicart@gmail.com> --------- Signed-off-by: R.Sicart <roger.sicart@gmail.com>	2024-05-14 20:43:00 +02:00
samber	81079a2a7e	Publish	2024-05-14 18:35:54 +00:00
samber	04886da968	Publish	2024-05-13 10:10:12 +00:00
samber	613401a960	Publish	2024-05-13 09:12:01 +00:00
samber	84b0569c97	Publish	2024-05-13 08:33:30 +00:00
Ali	2547288c13	Added Clickhouse (#412) * Added Clickhouse * Update rules.yml Added reasonable time periods for each query to avoid false positives and, in some cases, give the system a short window to try to solve the issue. Also changed the severity level of authentication alerts from critical to info, which seems more appropriate * Modified time period for alerts embedded-exporter.yml I made a few adjustments in time periods. See if they seem reasonable or not * Replication alerts time periods were adjusted IMHO, replication alerts must be sent right away.	2024-05-13 10:32:18 +02:00
samber	515fca9c10	Publish	2024-05-05 23:33:11 +00:00
samber	5c0963558a	Publish	2024-05-02 18:49:56 +00:00
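
The counter reasoning in commit 97a31f34e5 (#455) can be illustrated with a small sketch. The sample values, the one-minute spacing, and the 1.0 s threshold below are all hypothetical, and the windowed ratio uses plain differences between samples, which only approximates what Prometheus `increase` computes (the real function also extrapolates to the window boundaries and handles counter resets).

```python
# Hypothetical cumulative counter samples, one per minute (not real data).
# fetch_time: total seconds spent fetching; fetch_total: total fetch operations.
fetch_time = [10.0, 20.0, 520.0, 530.0, 540.0]  # a 500 s spike lands in minute 2
fetch_total = [100, 200, 300, 400, 500]

THRESHOLD = 1.0  # hypothetical alert threshold, seconds per operation


def lifetime_ratio(i):
    """Divide the raw counters: average latency since the process started."""
    return fetch_time[i] / fetch_total[i]


def windowed_ratio(i):
    """Divide the per-minute differences: average latency in the last minute.

    This approximates increase(time[1m]) / increase(total[1m]).
    """
    delta_time = fetch_time[i] - fetch_time[i - 1]
    delta_ops = fetch_total[i] - fetch_total[i - 1]
    return delta_time / delta_ops


minutes = range(1, len(fetch_time))
firing_lifetime = [i for i in minutes if lifetime_ratio(i) > THRESHOLD]
firing_windowed = [i for i in minutes if windowed_ratio(i) > THRESHOLD]

print("raw counter division fires at minutes:", firing_lifetime)
print("windowed division fires at minutes:", firing_windowed)
```

With these numbers the raw division keeps firing after the spike has passed, because the spike stays baked into the cumulative numerator, while the windowed version fires only in the minute where the slow fetches actually happened.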


137 commits