Adjust OOM kill detected rule (#495)

* Adjust OOM kill detected rule

When a machine runs out of memory, it happens that the node exporter stops responding for multiple minutes. I've adjusted the rule now to take this into account: even if it takes 15-20 minutes before the machine becomes responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-06-21 00:47:18 +08:00 · 2026-01-30 13:15:27 +02:00 · 2026-01-30 13:15:27 +02:00 · 51aea96ba7
commit 51aea96ba7
parent 1d69457017
1 changed file with 3 additions and 1 deletion
--- a/_data/rules.yml
+++ b/_data/rules.yml
@@ -271,8 +271,10 @@ groups:
                severity: info
              - name: Host OOM kill detected
                description: OOM kill detected
-                query: "(increase(node_vmstat_oom_kill[1m]) > 0)"
+                query: "(increase(node_vmstat_oom_kill[30m]) > 0)"
                severity: warning
+                comments: |
+                  When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15–20 minutes to recover, the alert should still trigger.
              - name: Host EDAC Correctable Errors detected
                description: 'Host {{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.'
                query: "(increase(node_edac_correctable_errors_total[1m]) > 0)"
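
The reasoning behind widening the range from `[1m]` to `[30m]` can be illustrated with a small simulation. This is only a sketch under stated assumptions: `simple_increase` is a hypothetical stand-in for PromQL's `increase()` that takes the last minus the first sample in the window (no extrapolation, no counter-reset handling), and the 15-second scrape interval, the 20-minute gap, and all function names are invented for illustration. They do not come from the commit.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Simplified stand-in for PromQL increase(): last minus first sample in the window.

    Ignores extrapolation and counter resets. Returns None when the window
    holds fewer than two samples, since no increase can be computed then.
    """
    in_window = [v for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def build_samples(interval: float, gap_start: float, gap_end: float, end: float) -> List[Sample]:
    """Scrape every `interval` seconds, with no scrapes in [gap_start, gap_end).

    The counter reads 0 before the gap and 1 after it, i.e. one OOM kill
    happened while the exporter was unresponsive.
    """
    samples: List[Sample] = []
    t = 0.0
    while t <= end:
        if t < gap_start:
            samples.append((t, 0.0))
        elif t >= gap_end:
            samples.append((t, 1.0))
        t += interval
    return samples


def alert_fires(samples: List[Sample], window: float, eval_interval: float, end: float) -> bool:
    """Evaluate `increase > 0` at every evaluation tick; True if it ever holds."""
    t = 0.0
    while t <= end:
        inc = simple_increase(samples, t, window)
        if inc is not None and inc > 0:
            return True
        t += eval_interval
    return False


# Assumed timeline: 15 s scrapes, exporter silent from minute 10 to minute 30.
SAMPLES = build_samples(15.0, 600.0, 1800.0, 3600.0)
FIRES_1M = alert_fires(SAMPLES, 60.0, 15.0, 3600.0)
FIRES_30M = alert_fires(SAMPLES, 1800.0, 15.0, 3600.0)
print(FIRES_1M, FIRES_30M)
```

In this model the 1-minute window never contains both a pre-gap and a post-gap sample, so the jump in the counter is invisible to it, while the 30-minute window spans the whole gap and sees the change as soon as scraping resumes. Real Prometheus behaviour differs in detail (extrapolation, staleness handling), so treat this as an intuition aid rather than a reproduction of the rule's evaluation.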