Mirror of https://github.com/samber/awesome-prometheus-alerts.git
Synced 2026-06-21 00:47:18 +08:00
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

  When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
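The reasoning in this commit message can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Prometheus computes `increase()`: the 15-second scrape interval, the 15-minute outage, and the helper names `build_samples`, `simple_increase`, and `alert_fires` are all hypothetical, and the simplified increase ignores Prometheus's extrapolation and counter-reset handling. It shows why a 1-minute window can miss an OOM kill that happens while the node exporter is unresponsive, whereas a 30-minute window still spans samples from before and after the gap.

```python
from typing import List, Optional, Tuple

SCRAPE_INTERVAL = 15   # seconds between scrapes (assumed)
OUTAGE_END = 900       # exporter unresponsive for 0 < t < 900 s (assumed)

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)


def build_samples() -> List[Sample]:
    """Counter samples around an OOM kill that happens during the outage."""
    samples = []
    for t in range(-1800, 1801, SCRAPE_INTERVAL):
        if 0 < t < OUTAGE_END:
            continue  # scrape failed: no sample stored
        # The OOM kill happens inside the gap, so the counter reads 0
        # before the outage and 1 once the exporter responds again.
        value = 0.0 if t <= 0 else 1.0
        samples.append((t, value))
    return samples


def simple_increase(samples: List[Sample], eval_time: int, window: int) -> Optional[float]:
    """Simplified stand-in for increase(): last minus first sample in the window.

    Returns None when fewer than two samples fall in the window, mirroring
    the fact that a range function needs at least two points.
    """
    in_window = [v for (t, v) in samples if eval_time - window < t <= eval_time]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def alert_fires(samples: List[Sample], window: int) -> bool:
    """True if `increase > 0` holds at any evaluation time from t=0 onward."""
    for eval_time in range(0, 1801, SCRAPE_INTERVAL):
        inc = simple_increase(samples, eval_time, window)
        if inc is not None and inc > 0:
            return True
    return False


if __name__ == "__main__":
    samples = build_samples()
    print("1m window fires: ", alert_fires(samples, 60))
    print("30m window fires:", alert_fires(samples, 1800))
```

In this model the 1-minute window only ever sees samples from one side of the gap, so the computed increase is either zero or undefined. The 30-minute window contains samples from both sides once scraping resumes, so the jump in the counter is visible.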
parent 1d69457017
commit 51aea96ba7
1 changed file with 3 additions and 1 deletion
@@ -271,8 +271,10 @@ groups:
       severity: info
     - name: Host OOM kill detected
       description: OOM kill detected
-      query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+      query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
       severity: warning
+      comments: |
+        When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15–20 minutes to recover, the alert should still trigger.
     - name: Host EDAC Correctable Errors detected
       description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
       query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"