mirror of https://github.com/samber/awesome-prometheus-alerts.git (synced 2026-06-21 00:47:18 +08:00)
feat: add Apache Flink and Apache Spark alerting rules
Add 20 new alerting rules under the Runtimes category:
- Apache Flink (12 rules): job status, TaskManager registration, slot availability, restarts, checkpoints, backpressure, heap memory, GC, and record processing
- Apache Spark (8 rules): worker health, waiting apps, memory/cores exhaustion, executor GC, task failures, and disk spill
parent 1db2c6f196
commit e6cdcdb9e5
2 changed files with 139 additions and 0 deletions
@@ -93,6 +93,8 @@ Collection available here: **[https://samber.github.io/awesome-prometheus-alerts
 - [Ruby](https://samber.github.io/awesome-prometheus-alerts/rules#ruby)
 - [Python](https://samber.github.io/awesome-prometheus-alerts/rules#python)
 - [Sidekiq](https://samber.github.io/awesome-prometheus-alerts/rules#sidekiq)
+- [Apache Flink](https://samber.github.io/awesome-prometheus-alerts/rules#apache-flink)
+- [Apache Spark](https://samber.github.io/awesome-prometheus-alerts/rules#apache-spark)

 #### Orchestrators
 - [Kubernetes](https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes)
_data/rules.yml (137 additions)
@@ -2786,6 +2786,143 @@ groups:
                 query: "max(sidekiq_queue_latency) > 60"
                 severity: critical

+      - name: Apache Flink
+        exporters:
+          - name: Built-in Prometheus reporter
+            slug: flink-prometheus-reporter
+            doc_url: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/
+            rules:
+              - name: Flink job is not running
+                description: "No Flink jobs are currently running. All jobs may have failed or been cancelled."
+                query: "flink_jobmanager_numRunningJobs == 0"
+                severity: critical
+                for: 1m
+              - name: Flink no TaskManagers registered
+                description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity."
+                query: "flink_jobmanager_numRegisteredTaskManagers == 0"
+                severity: critical
+                for: 1m
+              - name: Flink all task slots used
+                description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled."
+                query: "flink_jobmanager_taskSlotsAvailable == 0"
+                severity: warning
+                for: 5m
+                comments: |
+                  This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.
+              - name: Flink job restart increasing
+                description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes."
+                query: "increase(flink_jobmanager_job_numRestarts[5m]) > 0"
+                severity: warning
+              - name: Flink checkpoint failures
+                description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes."
+                query: "increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 0"
+                severity: warning
+              - name: Flink checkpoint duration high
+                description: "Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete."
+                query: "flink_jobmanager_job_lastCheckpointDuration > 60000"
+                severity: warning
+                for: 5m
+                comments: |
+                  Threshold is 60 seconds. Adjust based on your checkpoint interval and state size.
+              - name: Flink task backpressured
+                description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured."
+                query: "flink_taskmanager_job_task_isBackPressured == 1"
+                severity: warning
+                for: 5m
+              - name: Flink task high backpressure time
+                description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure."
+                query: "flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500"
+                severity: warning
+                for: 5m
+                comments: |
+                  Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.
+              - name: Flink TaskManager heap memory high
+                description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%."
+                query: "flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9"
+                severity: warning
+                for: 5m
+              - name: Flink JobManager heap memory high
+                description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%."
+                query: "flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9"
+                severity: warning
+                for: 5m
+              - name: Flink TaskManager GC time high
+                description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection."
+                query: "rate(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100"
+                severity: warning
+                for: 5m
+                comments: |
+                  Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.
+              - name: Flink no records processed
+                description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes."
+                query: "rate(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0"
+                severity: warning
+                for: 5m
+                comments: |
+                  Only fires for tasks that have previously received records, to avoid false positives during startup.
+
+      - name: Apache Spark
+        exporters:
+          - name: Built-in Prometheus (PrometheusServlet + PrometheusResource)
+            slug: spark-prometheus
+            doc_url: https://spark.apache.org/docs/latest/monitoring.html
+            comments: |
+              Spark exposes metrics via two built-in endpoints:
+              - PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)
+              - PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)
+              Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.
+              Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
+            rules:
+              - name: Spark no alive workers
+                description: "No Spark workers are alive. The cluster has no processing capacity."
+                query: "metrics_master_aliveWorkers_Value == 0"
+                severity: critical
+                for: 1m
+              - name: Spark too many waiting apps
+                description: "Spark has {{ $value }} applications waiting for resources."
+                query: "metrics_master_waitingApps_Value > 10"
+                severity: warning
+                for: 5m
+                comments: |
+                  Adjust the threshold based on your cluster's typical queuing behavior.
+              - name: Spark worker memory exhausted
+                description: "Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free)."
+                query: "metrics_worker_memFree_MB_Value == 0"
+                severity: warning
+                for: 2m
+              - name: Spark worker cores exhausted
+                description: "Spark worker {{ $labels.instance }} has no free cores."
+                query: "metrics_worker_coresFree_Value == 0"
+                severity: warning
+                for: 5m
+                comments: |
+                  Fires when a worker has no free cores. This may be normal under high load but can indicate capacity issues.
+              - name: Spark executor high GC time
+                description: "Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC."
+                query: "metrics_executor_totalGCTime / (metrics_executor_totalDuration > 0) > 0.1"
+                severity: warning
+                for: 5m
+                comments: |
+                  Fires when more than 10% of executor time is spent in garbage collection.
+                  This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/).
+              - name: Spark executor all tasks failing
+                description: "Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed)."
+                query: "metrics_executor_failedTasks > 0 and metrics_executor_completedTasks == 0"
+                severity: critical
+                for: 5m
+              - name: Spark executor high task failure rate
+                description: "Spark executor {{ $labels.executor_id }} has a task failure rate above 10%."
+                query: "metrics_executor_failedTasks / (metrics_executor_totalTasks > 0) > 0.1"
+                severity: warning
+                for: 5m
+              - name: Spark executor high disk spill
+                description: "Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory."
+                query: "rate(metrics_executor_diskUsed_bytes[5m]) > 0"
+                severity: warning
+                for: 5m
+                comments: |
+                  Disk spilling indicates insufficient memory for the workload.
+
   - name: Orchestrators
     services:
       - name: Kubernetes
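For orientation, an entry in `_data/rules.yml` such as "Flink job is not running" would correspond to a Prometheus alerting rule roughly like the one below. This is a minimal sketch, not the project's generated output: the group name `apache-flink`, the alert name `FlinkJobIsNotRunning`, and the `summary` annotation are assumptions made for illustration. The `expr`, `for`, severity, and description values are copied from the entry in the diff.

```yaml
groups:
  - name: apache-flink                 # hypothetical group name
    rules:
      - alert: FlinkJobIsNotRunning    # hypothetical alert name derived from the rule's `name`
        expr: flink_jobmanager_numRunningJobs == 0   # `query` from the rules.yml entry
        for: 1m                                      # `for` from the rules.yml entry
        labels:
          severity: critical                         # `severity` from the rules.yml entry
        annotations:
          summary: Flink job is not running          # assumed annotation, not in the diff
          description: "No Flink jobs are currently running. All jobs may have failed or been cancelled."
```

A file in this shape can be checked with `promtool check rules <file>` before it is loaded into Prometheus.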