Adjust OOM kill detected rule (#495)

* Adjust OOM kill detected rule

When a machine runs out of memory, the node exporter can stop
responding for several minutes. The rule now accounts for this by
widening the increase() window from 1m to 30m: even if the machine
takes 15-20 minutes to become responsive again, the alert should
still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Per Lundberg 2026-01-30 13:15:27 +02:00 committed by GitHub
parent 1d69457017
commit 51aea96ba7
GPG key ID: B5690EEEBB952194


@@ -271,8 +271,10 @@ groups:
     severity: info
   - name: Host OOM kill detected
     description: OOM kill detected
-    query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+    query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
     severity: warning
+    comments: |
+      When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15-20 minutes to recover, the alert should still trigger.
   - name: Host EDAC Correctable Errors detected
     description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
     query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"
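
To illustrate why the wider window matters, here is a rough Python simulation of the scenario described in the commit message. It is a sketch under stated assumptions, not Prometheus itself: the `increase` helper below is a simplified stand-in (last sample minus first sample in the window, no extrapolation, no counter-reset handling), and the 15-second scrape interval, 60-second evaluation interval, and 20-minute scrape gap are all hypothetical values chosen for the example.

```python
# Simplified model of the OOM scenario: the exporter misses scrapes for
# about 20 minutes, and the OOM-kill counter has gone from 0 to 1 by the
# time scraping resumes. All intervals below are assumed values.

SCRAPE_INTERVAL = 15   # seconds between scrapes (assumption)
EVAL_INTERVAL = 60     # seconds between rule evaluations (assumption)
GAP_START = 600        # last successful scrape before the gap
GAP_END = 1800         # first successful scrape after the gap


def increase(samples, now, window):
    """Simplified stand-in for PromQL increase().

    Returns last - first over the samples in (now - window, now], or None
    when fewer than two samples fall in the window. Real Prometheus also
    extrapolates and handles counter resets; this sketch does neither.
    """
    in_range = [value for ts, value in samples if now - window < ts <= now]
    if len(in_range) < 2:
        return None
    return in_range[-1] - in_range[0]


def alert_fires(samples, window, eval_times):
    """True if 'increase(...) > 0' holds at any evaluation time."""
    for now in eval_times:
        result = increase(samples, now, window)
        if result is not None and result > 0:
            return True
    return False


# Build the scraped series: counter is 0 up to the gap, 1 after it.
samples = []
for ts in range(0, 3601, SCRAPE_INTERVAL):
    if GAP_START < ts < GAP_END:
        continue  # exporter unresponsive, scrape missing
    samples.append((ts, 0 if ts <= GAP_START else 1))

eval_times = range(0, 3601, EVAL_INTERVAL)

print("1m window fires: ", alert_fires(samples, 60, eval_times))
print("30m window fires:", alert_fires(samples, 1800, eval_times))
```

In this model the 1m window never contains samples from both sides of the gap, so the jump in the counter is invisible to it, while the 30m window spans the gap and sees the increase. The trade-off, under the same assumptions, is that the expression stays true for up to 30 minutes after a single OOM kill.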