Feat/jaeger alerting rules (#521)

* Add .worktrees/ to .gitignore

* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)

Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin

* fix: rename Jaeger agent RPC alert to Jaeger client RPC

The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
This commit is contained in:
Samuel Berthe 2026-03-16 14:09:03 +01:00 committed by GitHub
parent eeba1ebbaa
commit f974552ef1
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 50 additions and 1 deletions

1
.gitignore vendored
View file

@ -5,3 +5,4 @@ _site/
_data/rules.json _data/rules.json
test/rules/ test/rules/
/node_modules /node_modules
.worktrees/

View file

@ -141,6 +141,7 @@ Collection available here: **[https://samber.github.io/awesome-prometheus-alerts
- [Jenkins](https://samber.github.io/awesome-prometheus-alerts/rules#jenkins) - [Jenkins](https://samber.github.io/awesome-prometheus-alerts/rules#jenkins)
- [GitLab](https://samber.github.io/awesome-prometheus-alerts/rules#gitlab) - [GitLab](https://samber.github.io/awesome-prometheus-alerts/rules#gitlab)
- [Graph Node](https://samber.github.io/awesome-prometheus-alerts/rules#graph-node) - [Graph Node](https://samber.github.io/awesome-prometheus-alerts/rules#graph-node)
- [Jaeger](https://samber.github.io/awesome-prometheus-alerts/rules#jaeger)
## 🤝 Contributing ## 🤝 Contributing

View file

@ -4819,3 +4819,50 @@ groups:
comments: | comments: |
When the circuit breaker trips to "open" state, Git operations (push, pull, clone) will fail. When the circuit breaker trips to "open" state, Git operations (push, pull, clone) will fail.
Check Gitaly service health and logs. Check Gitaly service health and logs.
- name: Jaeger
exporters:
- name: Embedded exporter
slug: embedded-exporter
doc_url: https://www.jaegertracing.io/docs/latest/monitoring/
rules:
- name: Jaeger agent HTTP server errors
description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors."
query: '100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger client RPC request errors
description: "Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors."
query: '100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~"4xx|5xx"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger client spans dropped
description: "Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans."
query: '100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger agent spans dropped
description: "Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches."
query: '100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger collector dropping spans
description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans."
query: '100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger sampling update failing
description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates."
query: '100 * sum(rate(jaeger_sampler_queries{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger throttling update failing
description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates."
query: '100 * sum(rate(jaeger_throttler_updates{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m
- name: Jaeger query request failures
description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests."
query: '100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1'
severity: warning
for: 15m