awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

History

nucocloud 4c9da9ed24 Add LiteLLM section to Other group with 3 alerting rules (#553)	2026-04-29 15:03:07 +02:00
rules.yml	Add LiteLLM section to Other group with 3 alerting rules (#553)	2026-04-29 15:03:07 +02:00

Add LiteLLM section to Other group with 3 alerting rules (#553)

LiteLLM (https://github.com/BerriAI/litellm) is a popular LLM gateway/proxy
that exposes Prometheus metrics via its built-in callback. This repo had no
alerting rules for LiteLLM, despite its growing adoption as an
OpenAI/Anthropic-compatible proxy.

Added 3 alerts covering the most common operational concerns:

1. **LiteLLM provider spend over budget**: soft warning on cumulative
   24h spend for models whose name matches a regex. Useful when
   LiteLLM's native `provider_budget_config` hard cap is unavailable,
   disabled, or buggy (e.g. BerriAI/litellm#26701).

2. **LiteLLM proxy failed requests rate high**: error-rate ratio
   alert for downstream LLM provider availability/auth issues.

3. **LiteLLM request latency p95 high**: histogram-quantile alert
   for downstream provider response-time degradation.
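
For orientation, the three alerts above could look roughly like the sketch below. This is not the content of `rules.yml` from this commit. The metric names are assumed from the LiteLLM Prometheus documentation referenced in this message, and the group name, alert identifiers, model regex, time windows, and thresholds are all hypothetical placeholders to adapt.

```yaml
groups:
  - name: litellm-sketch  # hypothetical group name
    rules:
      # 1. Soft warning on cumulative 24h spend for models matching a regex.
      #    The regex "gpt-.*" and the 50 (assumed USD) threshold are placeholders.
      - alert: LitellmProviderSpendOverBudget
        expr: sum(increase(litellm_spend_metric_total{model=~"gpt-.*"}[24h])) > 50
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: LiteLLM provider spend over budget
          description: "Spend over the last 24h is {{ $value }}"

      # 2. Ratio of failed to total proxy requests over 5 minutes.
      #    The 5% threshold is a placeholder.
      - alert: LitellmProxyFailedRequestsRateHigh
        expr: >
          sum(rate(litellm_proxy_failed_requests_metric_total[5m]))
          /
          sum(rate(litellm_proxy_total_requests_metric_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: LiteLLM proxy failed requests rate high
          description: "Failed request ratio is {{ $value }}"

      # 3. p95 of total request latency from histogram buckets.
      #    Assumes the histogram is in seconds; 30 is a placeholder.
      - alert: LitellmRequestLatencyP95High
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(litellm_request_total_latency_metric_bucket[5m])))
          > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: LiteLLM request latency p95 high
          description: "p95 request latency is {{ $value }}s"
```

A file of this shape can be syntax-checked with `promtool check rules`. Whether the expressions return data depends on the metric and label names your LiteLLM version actually exports, so confirm them against the `/metrics` output first.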

All 3 rules pass `promtool check rules` (SUCCESS) and were validated
on a real LiteLLM v1.83.7 production deployment.

Reference: https://docs.litellm.ai/docs/proxy/prometheus
