LiteLLM (https://github.com/BerriAI/litellm) is a popular LLM-gateway/proxy
that exposes Prometheus metrics via its built-in callback. There were no
existing alerting rules for LiteLLM in this repo, despite its growing
adoption as an OpenAI/Anthropic-compatible proxy.
Added 3 alerts covering the most common operational concerns:
1. **LiteLLM provider spend over budget** — soft-warning on cumulative
24h spend per model-name regex. Useful when LiteLLM's native
`provider_budget_config` hard-cap is unavailable, disabled, or
buggy (e.g. BerriAI/litellm#26701).
2. **LiteLLM proxy failed requests rate high** — error-rate ratio
alert for downstream LLM provider availability/auth issues.
3. **LiteLLM request latency p95 high** — histogram-quantile alert
for downstream provider response-time degradation.
All 3 rules were tested with `promtool check rules` (SUCCESS) and validated
on a production LiteLLM v1.83.7 deployment.
Reference: https://docs.litellm.ai/docs/proxy/prometheus
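For illustration, the error-rate ratio alert (item 2) could take roughly the following shape. This is a sketch, not the rule shipped in this change. The metric names `litellm_proxy_failed_requests_metric_total` and `litellm_proxy_total_requests_metric_total` are assumed from the LiteLLM Prometheus docs and should be checked against the deployed version. The group name, alert name, 5m window, 10% threshold, and severity are placeholders.

```yaml
groups:
  - name: litellm-sketch  # hypothetical group name
    rules:
      - alert: LitellmProxyFailedRequestsRateHigh  # hypothetical alert name
        # Percentage of failed proxy requests over the last 5 minutes.
        # The trailing `and ... > 0` keeps the alert silent when there is no traffic.
        # Metric names are assumptions; verify them on the /metrics endpoint.
        expr: |
          sum(rate(litellm_proxy_failed_requests_metric_total[5m]))
            / sum(rate(litellm_proxy_total_requests_metric_total[5m])) * 100 > 10
          and sum(rate(litellm_proxy_total_requests_metric_total[5m])) > 0
        for: 5m
        labels:
          severity: warning
```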
Add Prometheus alerting rules for Oracle Database using iamseth/oracledb_exporter.
Rules are based on the Grafana oracledb-mixin and the exporter's default metrics:
- DB down, session/process limit, tablespace capacity (warning+critical),
high rollbacks, active sessions, user I/O wait time.
* feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules)
Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins.
Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more.
* fix: address PR review comments on Tempo/Mimir rules
- Fix Tempo no tenant index builders: add on() for cross-label-set and
- Fix Tempo block list rising: output percentage instead of ratio
- Fix Mimir memory map areas: multiply by 100 to match % description
- Fix all instance limit rules: multiply by 100 to match % descriptions
- Fix distributor inflight requests: add % to description
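The "multiply by 100" fixes follow one pattern, sketched below for a single instance limit. The metric names `cortex_ingester_memory_series` and `cortex_ingester_instance_limits` are taken to match the upstream Mimir mixin. The alert name, the 80% threshold, and the description text are assumptions for illustration only.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: MimirIngesterReachingSeriesLimit  # hypothetical alert name
  # Without `* 100` the expression yields a 0-1 ratio, so a description that
  # prints "{{ $value }}%" would show 0.85% where 85% is meant.
  expr: |
    (
      cortex_ingester_memory_series
        / ignoring (limit) cortex_ingester_instance_limits{limit="max_series"}
    ) * 100 > 80
    and ignoring (limit) cortex_ingester_instance_limits{limit="max_series"} > 0
  annotations:
    description: "Ingester {{ $labels.instance }} is at {{ $value }}% of its series limit"
```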
* Add .worktrees/ to .gitignore
* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)
Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin
* fix: rename Jaeger agent RPC alert to Jaeger client RPC
The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
* feat: add systemd_exporter alerting rules (7 rules)
Add new Systemd service under Basic resource monitoring with rules for:
- Unit failed/inactive state detection
- Service crash loop detection
- Task limit exhaustion
- Socket refused/high connections
- Timer missed trigger
* fix: narrow systemd unit inactive query to reduce noise
Add type="service" and name filter to the inactive unit alert
to avoid false positives from legitimately inactive units.
* feat: add Cloud providers alerting rules (33 rules across 4 exporters)
New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance
* fix: address PR review - move Cloud providers before Other, fix service name
- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
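A sketch of the guarded form for the upstream timeout rate. `envoy_cluster_upstream_rq_timeout` and `envoy_cluster_upstream_rq_total` are standard Envoy cluster stats. The alert name, the 5m window, and the 10% threshold are assumptions and may differ from the actual rule.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: EnvoyUpstreamTimeoutRateHigh  # hypothetical alert name
  # A `+ 1` in the denominator avoids dividing by zero but understates the
  # ratio at low request rates. Filtering on `> 0` leaves the ratio exact and
  # drops clusters that received no requests in the window.
  expr: |
    sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_timeout[5m]))
      / sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m]))
      * 100 > 10
    and sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m])) > 0
  for: 5m
```

Comparison operators bind more tightly than `and` in PromQL, so the threshold is applied to the ratio before the guard filters the result.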
* feat: add alerting rules for prometheus/memcached_exporter
* fix: add division-by-zero guards and improve quoting in memcached rules (#512)
- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability
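The connection-limit guard could look roughly like this. `memcached_current_connections` and `memcached_max_connections` are metrics exposed by prometheus/memcached_exporter. The alert name and the 80% threshold are placeholders, not the values in the rule file.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: MemcachedConnectionLimitApproaching  # hypothetical alert name
  # The `and ... > 0` guard removes instances that report a zero limit,
  # for which the percentage would be undefined.
  expr: |
    (memcached_current_connections / memcached_max_connections * 100 > 80)
    and memcached_max_connections > 0
  for: 2m
```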
* fix: fix SNMP interface down query and add job scoping (#507)
- Fix ifOperStatus query to use vector matching instead of label filter
since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
and interface down rules to prevent matching non-SNMP series
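A sketch of the vector-matching form of the interface down query. `ifOperStatus` and `ifAdminStatus` are the IF-MIB metrics exported by snmp_exporter. The alert name, the `for` duration, and the choice of `instance` and `ifIndex` as join labels are assumptions that depend on the module configuration.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: SnmpInterfaceDown  # hypothetical alert name
  # IF-MIB values: ifOperStatus 2 = down, ifAdminStatus 1 = up.
  # ifAdminStatus is a separate series, not a label on ifOperStatus, so the two
  # are joined with `and on (...)` rather than filtered with a label matcher.
  expr: |
    ifOperStatus{job=~"snmp.*"} == 2
    and on (instance, ifIndex) ifAdminStatus{job=~"snmp.*"} == 1
  for: 2m
```

Only interfaces that are administratively up but operationally down match, so ports that were shut down on purpose do not alert.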
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat: add OpenStack alerting rules (openstack-exporter)
Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.
* docs: add OpenStack to README services list
* fix: align OpenStack load balancer alert name with operating_status semantics
The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)
* docs: add Process to README services list
* fix: address PR review feedback for process-exporter rules
- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
* feat: add IPMI exporter alerting rules
Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.
* docs: add IPMI to README service list
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
* data: add Python, Ruby, and Go runtime alerting rules
* fix: address review feedback on runtime alerts
- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
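Two of the fixes above, sketched as rules. `go_goroutines` and `python_gc_objects_uncollectable_total` are default metrics of the Go and Python Prometheus client libraries. The alert names, windows, and thresholds are workload-dependent assumptions.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: GoGoroutineSpike  # hypothetical alert name
  # go_goroutines is a gauge. rate() treats every decrease as a counter reset,
  # so it misreports gauges; deriv() fits a linear regression over the window
  # and returns the per-second slope.
  expr: deriv(go_goroutines[5m]) > 10  # threshold is workload-dependent
  for: 5m
- alert: PythonGcUncollectableObjects  # hypothetical alert name
  # The metric is a cumulative counter, so a raw threshold stays exceeded for
  # the life of the process; increase() fires only on new uncollectable objects.
  expr: increase(python_gc_objects_uncollectable_total[15m]) > 0
```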
* Add systemd service name to HostSystemdServiceCrashed summary
* Modify systemd service crash rule description
Updated the description for the systemd service crash rule to include the service name.
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* Adjust OOM kill detected rule
When a machine runs out of memory, the node exporter can stop
responding for several minutes. The rule now takes this into
account: even if the machine needs 15-20 minutes to become
responsive again, the alert should still fire.
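One way to get this behaviour is to widen the lookback window so that it spans the scrape gap. The sketch below assumes a 30m window, which may not be the value used in rules.yml. `node_vmstat_oom_kill` is the node exporter counter of OOM kills, and the alert name is a placeholder.

```yaml
# Fragment of a rules file; belongs under groups[].rules.
- alert: HostOomKillDetected  # hypothetical alert name
  # A short window such as [1m] can miss the counter increase entirely when
  # scrapes fail during the outage. A window longer than the expected 15-20
  # minute gap still has samples on both sides of the jump.
  expr: increase(node_vmstat_oom_kill[30m]) > 0  # 30m is an assumption
  for: 0m
```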
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* Jenkins node offline for clause (#2)
* Convert cpu alert expressions to without() rather than on()
* Remove on() expression from network throughput alerts as labels fully match
---------
Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>