Mirror of https://github.com/samber/awesome-prometheus-alerts.git
Synced 2026-06-24 02:17:00 +08:00
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

When a machine runs out of memory, the node exporter sometimes stops responding for several minutes. I've adjusted the rule to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
parent 1d69457017
commit 51aea96ba7
1 changed file with 3 additions and 1 deletion
@@ -271,8 +271,10 @@ groups:
       severity: info
     - name: Host OOM kill detected
       description: OOM kill detected
-      query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+      query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
       severity: warning
+      comments: |
+        When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15–20 minutes to recover, the alert should still trigger.
     - name: Host EDAC Correctable Errors detected
       description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
       query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"
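For context, here is a rough sketch of how the adjusted query might look as a plain Prometheus alerting rule. The group name `host-oom-example`, the alert name `HostOomKillDetected`, the `for: 0m` setting, and the annotation wording are assumptions made for illustration; only the expression and the severity come from the change in this commit.

```yaml
# Illustrative sketch only: the group name, alert name, `for` duration and
# annotations are assumed, not taken from this commit.
groups:
  - name: host-oom-example
    rules:
      - alert: HostOomKillDetected
        # increase() needs at least two samples inside the range window.
        # With [1m], a scrape gap of several minutes leaves only samples
        # taken after the counter already moved, so the increase reads 0.
        # With [30m], the window still reaches back to a sample taken
        # before the outage, so the jump in the counter is visible.
        expr: increase(node_vmstat_oom_kill[30m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Host OOM kill detected (instance {{ $labels.instance }})
          description: OOM kill detected
```

One consequence of the wider window is that the alert can stay active for up to 30 minutes after a single OOM kill, since the counter jump remains inside the range for that long.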