Adjust OOM kill detected rule

When a machine runs out of memory, the node exporter can stop
responding for several minutes, so an OOM kill that happens
during the outage falls outside the 1-minute window the rule
used to look at. Widen the window to 30 minutes: even if it
takes 15-20 minutes before the machine becomes responsive
again, the alert should still fire.
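
To illustrate the reasoning, here is a rough Python sketch of why the
narrow window misses the event. It is not how Prometheus computes
increase(): simple_increase is a simplified stand-in (last sample
minus first sample in the window, no extrapolation, no counter-reset
handling), and the 15-second scrape interval, the 15-minute outage
and the sample values are all made-up assumptions.

```python
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Simplified stand-in for PromQL increase(): last minus first sample
    inside (now - window, now]. Returns None with fewer than two samples."""
    in_window = [value for ts, value in samples if now - window < ts <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def alert_fires(samples: List[Sample], window: float, eval_times: Iterable[float]) -> bool:
    """True if the 'increase > 0' condition holds at any evaluation time."""
    for now in eval_times:
        inc = simple_increase(samples, now, window)
        if inc is not None and inc > 0:
            return True
    return False


# Hypothetical timeline: scrapes every 15 s, exporter silent from
# t=600 s to t=1500 s (15 minutes); the OOM kill happens during the gap,
# so the counter reads 0 before the outage and 1 after it.
SCRAPE = 15
samples: List[Sample] = [(float(t), 0.0) for t in range(0, 600, SCRAPE)]
samples += [(float(t), 1.0) for t in range(1500, 2400, SCRAPE)]
eval_times = [float(t) for t in range(0, 2400, SCRAPE)]

fires_1m = alert_fires(samples, 60, eval_times)
fires_30m = alert_fires(samples, 30 * 60, eval_times)
print(f"1m window fires: {fires_1m}, 30m window fires: {fires_30m}")
```

Under these assumptions the 1-minute window never sees a sample from
before the outage together with one from after it, while the
30-minute window does.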
Per Lundberg 2026-01-30 09:07:59 +02:00 committed by GitHub
parent 1d69457017
commit 6179475625

@@ -271,7 +271,7 @@ groups:
 severity: info
 - name: Host OOM kill detected
 description: OOM kill detected
-query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
 severity: warning
 - name: Host EDAC Correctable Errors detected
 description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'