* Update google-cadvisor.yml
Expression Explanation:
The expression computes the absolute change in container CPU usage by comparing the current usage rate (over the last 1 minute) with the rate from the previous minute. If this change exceeds 25%, the alert is triggered. It also compares the current rate with the rate from the previous 5 minutes to capture larger trends. The alert fires if either condition is met.
Alert Details:
- Alert Name: ContainerHighLowChangeCpuUsage
- Trigger Condition: Absolute change in CPU usage exceeding 25%
- Alert Severity: Informational (info)
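The trigger condition described above can be sketched in Python. This is only an illustration of the logic, not the PromQL rule itself: the function name and parameters are hypothetical, the inputs are assumed to be CPU usage rates already expressed in percent, and the same 25-point threshold is assumed for both the 1-minute and the 5-minute comparison.

```python
def cpu_change_alert(rate_now, rate_prev_minute, rate_prev_5m, threshold=25.0):
    """Return True when CPU usage moved by more than `threshold` points.

    All rates are CPU usage in percent (assumption). The absolute value makes
    the check fire on both sharp increases and sharp drops.
    """
    change_vs_prev_minute = abs(rate_now - rate_prev_minute)
    change_vs_prev_5m = abs(rate_now - rate_prev_5m)
    # Either comparison exceeding the threshold is enough to fire.
    return change_vs_prev_minute > threshold or change_vs_prev_5m > threshold


# A jump from 50% to 80% within a minute fires the alert.
print(cpu_change_alert(rate_now=80.0, rate_prev_minute=50.0, rate_prev_5m=78.0))
# A drop relative to the 5-minute rate also fires, because the change is absolute.
print(cpu_change_alert(rate_now=20.0, rate_prev_minute=22.0, rate_prev_5m=60.0))
# Small fluctuations stay quiet.
print(cpu_change_alert(rate_now=52.0, rate_prev_minute=50.0, rate_prev_5m=51.0))
```

Using the absolute difference is what lets a single rule cover both the "high" and the "low" change in the alert name.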
* Add alert rule for high CPU usage change
* Change alert severity from warning to info
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* smartctl_exporter publishes both drive_trip and current drive temperatures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensor readings vary, so using only the last sample is not sufficient to alert on potential issues.
* Add an option to run GitHub Action manually
* Add an option to force running the action for testing purposes
* Set variables correctly
* Set variables correctly
* Publish
* Clean up some more metrics
* Publish
* Minor bug fixes
* Publish
* Removed queries that throw errors when systems are upgraded. Also fixed and simplified a few Postgres queries.
* Publish
* Refined some more queries
* Publish
* PostgreSQL now has optimized autovacuum behavior
* Publish
* PostgreSQL now has optimized autovacuum behavior
* Publish
* Publish
* Query fails if instance names are not unique across jobs. This fixes it.
* Publish
* Ruby is out of date
---------
Co-authored-by: samber <samber@users.noreply.github.com>
Modify PostgresqlConfigurationChanged to prevent the error "many-to-many matching not allowed: matching labels must be unique on one side" when there are multiple PostgreSQL instances
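
The error above can be illustrated with a toy model of label matching between two sets of samples. This is a simplified sketch, not Prometheus's actual matching implementation: the function `match_samples`, the label names `job` and `instance`, and the sample values are all hypothetical, chosen to mimic two PostgreSQL instances that share an instance name across jobs.

```python
from collections import Counter


def match_samples(left, right, on):
    """Pair values from two sample lists whose `on` labels are equal.

    Each sample is a (labels_dict, value) tuple. Raises ValueError when a
    match key is duplicated on both sides, mirroring the "matching labels
    must be unique on one side" condition in a simplified way.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    left_counts = Counter(key(labels) for labels, _ in left)
    right_counts = Counter(key(labels) for labels, _ in right)
    for k, count in left_counts.items():
        if count > 1 and right_counts.get(k, 0) > 1:
            raise ValueError(f"many-to-many matching not allowed for {k}")

    return [
        (l_value, r_value)
        for l_labels, l_value in left
        for r_labels, r_value in right
        if key(l_labels) == key(r_labels)
    ]


# Two instances with the same instance name but different jobs (hypothetical).
current = [
    ({"job": "pg-a", "instance": "db:9187"}, 1.0),
    ({"job": "pg-b", "instance": "db:9187"}, 2.0),
]
previous = [
    ({"job": "pg-a", "instance": "db:9187"}, 1.0),
    ({"job": "pg-b", "instance": "db:9187"}, 3.0),
]

try:
    match_samples(current, previous, on=("instance",))
except ValueError as err:
    print("error:", err)  # the key is ambiguous on both sides

# Adding a label that distinguishes the instances makes each key unique.
print(match_samples(current, previous, on=("job", "instance")))
```

The design point is that the set of matching labels has to identify each series uniquely on at least one side, so any label that distinguishes the instances needs to be part of the match.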
* feat: Add comprehensive NATS and JetStream Prometheus alert rules
- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
- High connection count
- High pending bytes
- High subscriptions count
- High routes count
- High memory usage
- Slow consumers
- NATS server downtime
- High CPU usage
- High number of active connections
- High JetStream store and memory usage
- Subscription limits exceeded
- High pending messages
- Authentication timeouts
- Errors in NATS (JetStream API errors)
- JetStream consumers limit exceeded
- Exceeding max payload size
- Leaf node connection issues
- Ping operations limit exceeded
- Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.
This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.
* Update rules.yml
* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated
* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated
* fix indentation
---------
Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>