* Add systemd service name to HostSystemdServiceCrashed summary
* Modify systemd service crash rule description
Updated the description for the systemd service crash rule to include the service name.
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* Adjust OOM kill detected rule
When a machine runs out of memory, the node exporter can stop
responding for several minutes. The rule now takes this into
account: even if the machine needs 15-20 minutes to become
responsive again, the alert should still fire.
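
A minimal sketch of what such a rule could look like, not necessarily the exact rule in this change. It assumes the node exporter's `node_vmstat_oom_kill` counter; the alert name, the `30m` window, and the severity are assumptions chosen for illustration. The point is that the range window is longer than the scrape gap, so samples from before and after the gap both fall inside it.

```yaml
groups:
  - name: host-oom-sketch            # hypothetical group name
    rules:
      - alert: HostOomKillDetected   # assumed alert name
        # 30m is an assumed window, longer than the 15-20 minute gap
        # during which the exporter may not answer scrapes.
        expr: increase(node_vmstat_oom_kill[30m]) > 0
        for: 0m
        labels:
          severity: warning          # assumed severity
        annotations:
          summary: Host OOM kill detected (instance {{ $labels.instance }})
```

With a short window such as `[1m]`, the counter jump can fall entirely inside the gap and never be seen by `increase`.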
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* Jenkins node offline for clause (#2)
* Convert cpu alert expressions to without() rather than on()
* Remove on() expression from network throughput alerts as labels fully match
---------
Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld
* comment added
* typo in expr
* move code to the right file
---------
Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.
Dividing the raw counters doesn't make sense: they only ever
accumulate, so a single spike in the numerator keeps the ratio
elevated and the alert persists, even if subsequent fetch/index
operations are normal. Wrapping both sides in `increase` changes
the query to check whether operations took, on average, more than
X over a 1-minute interval, which was likely the original intent
of this alert.
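
The difference can be sketched as a rule fragment. This is an illustration under assumptions, not the exact rule from this change: the alert name and the threshold of `1` second (standing in for X) are hypothetical; only the metric names come from the text above.

```yaml
- alert: ElasticsearchHighFetchLatencySketch   # hypothetical alert name
  # Ratio of the two counters' growth over the last minute, i.e. the
  # average time per fetch in that minute. Dividing the raw counters
  # instead would give the average since the counters started.
  expr: |
    increase(elasticsearch_indices_search_fetch_time_seconds[1m])
      /
    increase(elasticsearch_indices_search_fetch_total[1m])
      > 1
  labels:
    severity: warning                          # assumed severity
```

When no fetches happen in the window, the division yields NaN, which never satisfies the comparison, so the alert stays quiet.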
* Update google-cadvisor.yml
Expression Explanation:
The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (over the last 1 minute) with the rate from one minute earlier. If this change exceeds 25%, the alert is triggered. It also compares the current rate with the rate from 5 minutes earlier to capture larger trends. If either condition is met, the alert fires.
Alert Details:
- Alert Name: ContainerHighLowChangeCpuUsage
- Trigger Condition: Absolute change in CPU usage exceeding 25%
- Alert Severity: Informational (info)
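
The explanation above could be written roughly as follows. This is a sketch under assumptions rather than the exact expression added to `google-cadvisor.yml`: it assumes cAdvisor's `container_cpu_usage_seconds_total` metric, and the `name!=""` filter and the `instance, name` grouping are illustrative choices.

```yaml
- alert: ContainerHighLowChangeCpuUsage
  expr: |
    (
      # change versus the rate one minute earlier
      abs(
        sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100
        -
        sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m] offset 1m)) * 100
      )
      or
      # change versus the rate five minutes earlier
      abs(
        sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100
        -
        sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m] offset 5m)) * 100
      )
    ) > 25
  labels:
    severity: info
```

`abs` makes the check symmetric, so both a sharp rise and a sharp drop in CPU usage trigger the alert.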
* Add alert rule for high CPU usage change
* Change alert severity from warning to info
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* feat: Add comprehensive NATS and JetStream Prometheus alert rules
- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
  - High connection count
  - High pending bytes
  - High subscriptions count
  - High routes count
  - High memory usage
  - Slow consumers
  - NATS server downtime
  - High CPU usage
  - High number of active connections
  - High JetStream store and memory usage
  - Subscription limits exceeded
  - High pending messages
  - Authentication timeouts
  - Errors in NATS (JetStream API errors)
  - JetStream consumers limit exceeded
  - Exceeding max payload size
  - Leaf node connection issues
  - Ping operations limit exceeded
  - Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.
This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.
* Update rules.yml
* Minor changes, roll back rules.yml
- address review comments
- revert to the old rules.yml, as the file is generated
* fix indentation
---------
Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>