From 51aea96ba794bbf4ff5904162db610ee5ad7013d Mon Sep 17 00:00:00 2001
From: Per Lundberg
Date: Fri, 30 Jan 2026 13:15:27 +0200
Subject: [PATCH] Adjust OOM kill detected rule (#495)

* Adjust OOM kill detected rule

When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe
---
 _data/rules.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/_data/rules.yml b/_data/rules.yml
index 715f181..7f4da58 100644
--- a/_data/rules.yml
+++ b/_data/rules.yml
@@ -271,8 +271,10 @@ groups:
                 severity: info
               - name: Host OOM kill detected
                 description: OOM kill detected
-                query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+                query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
                 severity: warning
+                comments: |
+                  When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15–20 minutes to recover, the alert should still trigger.
               - name: Host EDAC Correctable Errors detected
                 description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
                 query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"