awesome-prometheus-alerts

mirror of https://github.com/samber/awesome-prometheus-alerts.git synced 2026-06-21 00:47:18 +08:00

Author	SHA1	Message	Date
Samuel Berthe	5c166e8403	docs: update tagline and clean up README	2026-04-10 21:45:27 +02:00
Samuel Berthe	aa7d93ce95	chore: migrate assets/ to site/public/images/ (#549) Remove legacy assets/ directory (pre-Astro era). Images were already duplicated under site/public/images/; update README sponsor URLs to point to the new location.	2026-04-10 21:28:38 +02:00
Samuel Berthe	79afa21610	feat/astro migration (#538) * feat: migrate website from Jekyll to Astro Rebuilds the site using Astro (SSG) with Tailwind CSS v4, replacing the Jekyll/Cayman theme. Key changes: - Splits the monolithic /rules page into 110 statically-generated pages (92 per-service + 13 group index + homepage + guide pages) for SEO - URL structure: /rules/[group-slug]/[service-slug]/ with backward-compatibility redirect map for old anchor-based URLs (/rules#redis) - Modern UI: Prometheus-orange accent, dark mode (system + toggle), sticky sidebar, responsive layout, copy-to-clipboard per rule/section - SEO: per-page <title>, <meta description>, Open Graph, Twitter Card, canonical URLs, sitemap.xml via @astrojs/sitemap - GEO: FAQPage JSON-LD schema on each service page (rules as Q&A pairs for AI search engines), TechArticle schema, BreadcrumbList - Search: Pagefind (build-time index, lazy-loaded, ~200KB) - Zero JS by default; copy buttons and theme toggle use inline scripts - New CI: .github/workflows/deploy.yml builds Astro + Pagefind and deploys to GitHub Pages via actions/deploy-pages - Existing dist.yml and test.yml workflows are untouched - _data/rules.yml remains the single source of truth Note: GitHub Pages source must be changed from "Build from branch" (Jekyll) to "GitHub Actions" in repository settings. * doc: new website based on astro * refactor: remove previous website * chore: add npm dependabot for Astro site + scope CI to _data changes * Update site/astro.config.mjs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update site/src/components/CopyButton.astro Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * oops * fix: strip trailing slash from BASE_URL to prevent double slashes in URLs Agent-Logs-Url: https://github.com/samber/awesome-prometheus-alerts/sessions/c85937ba-1855-4b8a-a72b-847eab1c8639 Co-authored-by: samber <2951285+samber@users.noreply.github.com> * fix: resolve Astro build errors in astro.config.mjs - Remove assetsInclude yml which caused Vite to treat YAML files as static assets instead of running them through the custom YAML transform plugin; data.groups was undefined at runtime because the import resolved to a URL rather than parsed content - Deduplicate old-path redirects: emit only the slash-less variant per service to avoid Astro router collision warnings (trailing-slash variant is handled automatically) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: samber <2951285+samber@users.noreply.github.com>	2026-04-10 21:08:06 +02:00
Samuel Berthe	b8fd051a55	Update README.md	2026-03-31 16:41:19 +02:00
Samuel Berthe	03963ef6f9	refactor(categories): change categories and move some exporters (#528)	2026-03-17 13:30:13 +01:00
Samuel Berthe	2b99cf1f76	Feat/cilium alerting rules (#526) * Add .worktrees/ to .gitignore * feat: add Cilium alerting rules (32 rules across agent, operator, ClusterMesh, KVStoreMesh, Hubble) * fix: use job label instead of k8s_app, switch to single-quoted YAML strings * remove Cilium agent high restart rate alert	2026-03-16 17:10:59 +01:00
Samuel Berthe	5071e01ad9	Feature/spinnaker alerts (#527) * Add .worktrees/ to .gitignore * feat: add Spinnaker alerting rules (12 rules) Add Prometheus alerting rules for Spinnaker built-in exporter covering Orca queue health, circuit breakers, Igor polling monitors, Gate API throttling, Clouddriver errors, and AWS rate limiting. Metric names validated against uneeq-oss/spinnaker-mixin dashboards. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 16:52:31 +01:00
Samuel Berthe	1455e0fd77	feat: add Oracle Database alerting rules (8 rules) (#525) Add Prometheus alerting rules for Oracle Database using iamseth/oracledb_exporter. Rules based on Grafana oracledb-mixin and exporter default metrics: - DB down, session/process limit, tablespace capacity (warning+critical), high rollbacks, active sessions, user I/O wait time.	2026-03-16 16:39:35 +01:00
Samuel Berthe	d8315eb3bc	Feature/cert manager rules (#524) * Add .worktrees/ to .gitignore * feat: add cert-manager alerting rules (4 rules) Add Prometheus alerting rules for cert-manager under the "Network, security and storage" category: - Cert-Manager absent (service down detection) - Certificate expiring soon (21-day threshold) - Certificate not ready (readiness check) - Hitting ACME rate limits (rate limit detection) Based on imusmanmalik/cert-manager-mixin and official cert-manager metrics documentation. * docs: add cert-manager to README	2026-03-16 15:01:07 +01:00
Samuel Berthe	b58b498bbb	feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) (#523) * feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins. Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more. * fix: address PR review comments on Tempo/Mimir rules - Fix Tempo no tenant index builders: add on() for cross-label-set and - Fix Tempo block list rising: output percentage instead of ratio - Fix Mimir memory map areas: multiply by 100 to match % description - Fix all instance limit rules: multiply by 100 to match % descriptions - Fix distributor inflight requests: add % to description	2026-03-16 14:36:50 +01:00
Samuel Berthe	7ee16641ac	feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520) * feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) * fix: grammar in WireGuard rule comment	2026-03-16 14:20:17 +01:00
Samuel Berthe	f974552ef1	Feat/jaeger alerting rules (#521) * Add .worktrees/ to .gitignore * feat: add Jaeger alerting rules (8 rules from official jaeger-mixin) Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops, sampling update failures, throttling update failures, and query request failures. All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin * fix: rename Jaeger agent RPC alert to Jaeger client RPC The jaeger_client_jaeger_rpc_http_requests metric is client-side, not agent-side. Rename alert to match the actual metric source.	2026-03-16 14:09:03 +01:00
Samuel Berthe	8b443be6d2	feat: add systemd_exporter alerting rules (7 rules) (#522) * feat: add systemd_exporter alerting rules (7 rules) Add new Systemd service under Basic resource monitoring with rules for: - Unit failed/inactive state detection - Service crash loop detection - Task limit exhaustion - Socket refused/high connections - Timer missed trigger * fix: narrow systemd unit inactive query to reduce noise Add type="service" and name filter to the inactive unit alert to avoid false positives from legitimately inactive units.	2026-03-16 14:07:14 +01:00
Samuel Berthe	30bbedbc79	feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519) * feat: add Cloud providers alerting rules (33 rules across 4 exporters) New "Cloud providers" category with rules for: - AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda - Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness - DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents - Azure (5 rules): API errors, rate limits, collection performance * fix: address PR review - move Cloud providers before Other, fix service name - Move "Cloud providers" group before "Other" in rules.yml for consistent ordering - Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid awkward /-/ in generated anchors and dist/rules/ paths - Fix README anchor link to match the new service name	2026-03-16 14:06:59 +01:00
Samuel Berthe	97aae5dabf	feat: add GitLab alerting rules (28 rules across 3 exporters) (#518) Add new GitLab service under "Other" category with 3 exporters: - Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs, database connection pool, CI/CD pipelines, Ruby process health - Workhorse (3 rules): HTTP error rate, latency, in-flight requests - Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency, CPU throttling, auth failures, circuit breaker All metrics verified against gitlabhq/gitlabhq source code. Several rules derived from GitLab Omnibus default alerting rules.	2026-03-16 04:48:52 +01:00
Samuel Berthe	e6cdcdb9e5	feat: add Apache Flink and Apache Spark alerting rules Add 20 new alerting rules under the Runtimes category: - Apache Flink (12 rules): job status, TaskManager registration, slot availability, restarts, checkpoints, backpressure, heap memory, GC, and record processing - Apache Spark (8 rules): worker health, waiting apps, memory/cores exhaustion, executor GC, task failures, and disk spill	2026-03-16 04:46:00 +01:00
Samuel Berthe	88e2c19017	feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517) * feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) * fix: correct Keycloak metrics-spi metric names and query grouping	2026-03-16 04:40:15 +01:00
Samuel Berthe	20651aa10d	feat: add OpenStack alerting rules (openstack-exporter) (#515) * feat: add OpenStack alerting rules (openstack-exporter) Add 20 alerting rules for openstack-exporter/openstack-exporter covering Nova, Neutron, Cinder, Octavia, and Placement services. * docs: add OpenStack to README services list * fix: align OpenStack load balancer alert name with operating_status semantics The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values, not ACTIVE. Rename alert to "not online" and use the label in the description for clarity.	2026-03-16 03:43:51 +01:00
Samuel Berthe	bf7b902881	feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514) * feat: add process-exporter alerting rules (ncabatoff/process-exporter) * docs: add Process to README services list * fix: address PR review feedback for process-exporter rules - Rename service from "Process" to "Process Exporter" for clarity - Fix grammar: "file descriptors usage" → "file descriptor usage" - Clarify CPU alert description as core-equivalent percentage - Rename "high disk IO" to "high disk write IO" for accuracy	2026-03-16 03:31:18 +01:00
Samuel Berthe	2b239736cf	feat: add alerting rules for prometheus/memcached_exporter (#512)	2026-03-16 03:25:38 +01:00
Samuel Berthe	f97f692596	feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509) Add 9 alerting rules for Proxmox VE covering node/guest status, CPU, memory, storage, backup coverage, replication, and cluster quorum.	2026-03-16 03:12:06 +01:00
Samuel Berthe	be7a2e4d5d	feat: add IPMI exporter alerting rules (#510) * feat: add IPMI exporter alerting rules Add 17 alerting rules for prometheus-community/ipmi_exporter covering temperature, fan, voltage, current, power sensors, chassis status, and system event log monitoring. * docs: add IPMI to README service list * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-16 03:10:10 +01:00
Samuel Berthe	c064d2264e	feat: add Envoy proxy alerting rules using built-in metrics (#511) Add 19 alerting rules for Envoy proxy under "Reverse proxies and load balancers" using native metrics from /stats/prometheus endpoint. Covers: server health, HTTP error rates (downstream/upstream), connection saturation, cluster membership, health checks, outlier detection, SSL/TLS certificate expiry, circuit breakers, and request timeouts.	2026-03-16 03:03:57 +01:00
Samuel Berthe	89e703d763	feat: add alerting rules for cloudflare/ebpf_exporter (#508) * feat: add alerting rules for cloudflare/ebpf_exporter * docs: add eBPF to README service list	2026-03-16 02:56:04 +01:00
Samuel Berthe	3db9281508	feat: add SNMP exporter alerting rules (#507) Add 7 alerting rules for prometheus/snmp_exporter covering device availability, interface status, error rates, bandwidth utilization, and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.	2026-03-16 02:34:34 +01:00
Samuel Berthe	8f462ce962	adding claude.md	2026-03-15 19:59:01 +01:00
Samuel Berthe	080a792777	data: adding python/ruby/golang (#502) * data: adding python/ruby/golang * fix: address review feedback on runtime alerts - JVM non-heap: guard against unbounded metaspace (max_bytes = -1) - JVM old gen GC: note regex only matches CMS/G1/Parallel collectors - JVM/Python file descriptors: note process_* metrics are generic - Go memory usage: fix description (sys_bytes is runtime memory, not host) - Go goroutine spike: use deriv() instead of rate() on gauge - Go GC CPU fraction: note deprecation since Go 1.20 - Go GC duration: clarify quantile="1" is max, not p99 - Python uncollectable: use increase() on counter instead of raw threshold - Add threshold comments for workload-dependent defaults	2026-03-15 19:46:39 +01:00
Samuel Berthe	f0107caf9e	Update README.md	2026-01-15 12:33:35 +01:00
Samuel Berthe	65551ae19f	Update README.md	2026-01-15 02:42:42 +01:00
Samuel Berthe	2b5c8b0ec7	Update README.md	2026-01-15 02:39:24 +01:00
Samuel Berthe	d0d1b00a7b	Fix typo in OpenTelemetry Collector link	2025-11-05 17:15:10 +01:00
Samuel Berthe	e617c07179	Update README.md	2025-11-05 17:14:47 +01:00
Samuel Berthe	dfac84209d	Update README.md	2025-09-01 15:41:07 +02:00
Samuel Berthe	4be87d7796	Update README.md	2025-05-03 22:53:51 +02:00
Felix Bühler	10d00c66da	Add caddy.yml (#450)	2025-02-04 14:23:14 +01:00
Samuel Berthe	fff8a80ae5	Update README.md	2024-12-08 21:24:45 +01:00
Samuel Berthe	b6a6c2e313	Update README.md	2024-07-02 09:33:01 +02:00
Samuel Berthe	847143ecc9	Update README.md	2024-05-13 10:42:04 +02:00
Samuel Berthe	85b102df08	Welcome @betterstack-community ✌️	2024-03-21 16:25:24 +01:00
Samuel Berthe	854688d17a	Update README.md	2024-02-09 20:24:10 +01:00
josedev-union	c6ff5a59dc	feat: Add rules for Graph Node (#387) Co-authored-by: josedev-union <josedev-union@users.noreply.github.com>	2024-01-20 20:33:26 +01:00
Samuel Berthe	32a097836a	Update README.md	2023-10-06 18:48:38 +02:00
Samuel Berthe	b19b403862	Update README.md	2023-08-15 20:05:13 +02:00
Samuel Berthe	5b6a86fa00	Update README.md	2023-08-15 20:03:06 +02:00
Samuel Berthe	ab7e29cfc0	Update README.md	2023-08-15 20:01:45 +02:00
Samuel Berthe	9efec14d26	chore: move from "https://awesome-prometheus-alerts.grep.to" to "https://samber.github.io/awesome-prometheus-alerts/"	2023-04-23 23:32:26 +02:00
Samuel Berthe	6ba9eb104c	feat: adding cloudflare exporter (#310)	2022-10-03 16:57:24 +02:00
Yonah Dissen	55b049eb28	add argocd rules (#309) * add argocd rules * fix(argocd): move contrib into _data/rules.yml instead of dist/... Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>	2022-10-02 18:05:30 +02:00
Samuel Berthe	4662cd2812	doc: improve pulsar doc	2022-06-07 01:29:31 +02:00
Samuel Berthe	37722256d5	Adding jenkins	2021-12-27 12:49:32 +01:00

95 commits