Compare commits

285 commits

Author SHA1 Message Date
dependabot[bot]
5cc052fc0a
build(deps): bump @astrojs/sitemap from 3.7.2 to 3.7.3 in /site (#567)
Bumps [@astrojs/sitemap](https://github.com/withastro/astro/tree/HEAD/packages/integrations/sitemap) from 3.7.2 to 3.7.3.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/integrations/sitemap/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/@astrojs/sitemap@3.7.3/packages/integrations/sitemap)

---
updated-dependencies:
- dependency-name: "@astrojs/sitemap"
  dependency-version: 3.7.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:39:30 +02:00
dependabot[bot]
63b80c8078
build(deps): bump posthog-js from 1.372.6 to 1.378.1 in /site (#566)
Bumps [posthog-js](https://github.com/PostHog/posthog-js) from 1.372.6 to 1.378.1.
- [Release notes](https://github.com/PostHog/posthog-js/releases)
- [Changelog](https://github.com/PostHog/posthog-js/blob/main/CHANGELOG.md)
- [Commits](https://github.com/PostHog/posthog-js/compare/posthog-js@1.372.6...posthog-js@1.378.1)

---
updated-dependencies:
- dependency-name: posthog-js
  dependency-version: 1.378.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:38:03 +02:00
dependabot[bot]
3a847e3d02
build(deps): bump astro from 6.2.1 to 6.4.2 in /site (#568)
Bumps [astro](https://github.com/withastro/astro/tree/HEAD/packages/astro) from 6.2.1 to 6.4.2.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/astro/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/astro@6.4.2/packages/astro)

---
updated-dependencies:
- dependency-name: astro
  dependency-version: 6.4.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:37:56 +02:00
dependabot[bot]
96fc299432
build(deps): bump @iconify-json/lucide from 1.2.102 to 1.2.111 in /site (#569)
Bumps [@iconify-json/lucide](https://github.com/iconify/icon-sets) from 1.2.102 to 1.2.111.
- [Commits](https://github.com/iconify/icon-sets/commits)

---
updated-dependencies:
- dependency-name: "@iconify-json/lucide"
  dependency-version: 1.2.111
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:37:48 +02:00
dependabot[bot]
074736db2c
build(deps): bump yaml from 2.8.3 to 2.9.0 in /site (#570)
Bumps [yaml](https://github.com/eemeli/yaml) from 2.8.3 to 2.9.0.
- [Release notes](https://github.com/eemeli/yaml/releases)
- [Commits](https://github.com/eemeli/yaml/compare/v2.8.3...v2.9.0)

---
updated-dependencies:
- dependency-name: yaml
  dependency-version: 2.9.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-02 18:36:52 +02:00
samber (headless)
832376c598
fix(ci): fix Publish workflow startup_failure (#565)
* fix(ci): fix Publish workflow startup_failure

* fix(ci): fix Publish workflow startup_failure
2026-05-22 20:33:32 +02:00
Samuel Berthe
5c41e54297
chore: add Bing Webmaster Tools verification meta tag 2026-05-16 21:44:37 +02:00
samber (headless)
49dbf0309f
ci: add dependabot automerge workflow (#564)
Co-authored-by: headless-samber <150833725+headless-samber@users.noreply.github.com>
2026-05-15 18:51:23 +02:00
dependabot[bot]
0cb56fdcfc
build(deps): bump devalue from 5.6.4 to 5.8.1 in /site (#563)
Bumps [devalue](https://github.com/sveltejs/devalue) from 5.6.4 to 5.8.1.
- [Release notes](https://github.com/sveltejs/devalue/releases)
- [Changelog](https://github.com/sveltejs/devalue/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sveltejs/devalue/compare/v5.6.4...v5.8.1)

---
updated-dependencies:
- dependency-name: devalue
  dependency-version: 5.8.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-15 18:49:32 +02:00
dependabot[bot]
56c10ee930
build(deps): bump protobufjs from 7.5.5 to 7.5.8 in /site (#562)
Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) from 7.5.5 to 7.5.8.
- [Release notes](https://github.com/protobufjs/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/protobufjs-v7.5.8/CHANGELOG.md)
- [Commits](https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.5.5...protobufjs-v7.5.8)

---
updated-dependencies:
- dependency-name: protobufjs
  dependency-version: 7.5.8
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-13 12:02:59 +02:00
dependabot[bot]
2bdcdbb54e
build(deps): bump @protobufjs/utf8 from 1.1.0 to 1.1.1 in /site (#561)
Bumps [@protobufjs/utf8](https://github.com/dcodeIO/protobuf.js) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/dcodeIO/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/master/CHANGELOG.md)
- [Commits](https://github.com/dcodeIO/protobuf.js/compare/protobufjs-cli-v1.1.0...protobufjs-cli-v1.1.1)

---
updated-dependencies:
- dependency-name: "@protobufjs/utf8"
  dependency-version: 1.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 20:16:18 +02:00
dependabot[bot]
1fb78854d4
build(deps): bump postcss from 8.5.8 to 8.5.13 in /site (#560)
Bumps [postcss](https://github.com/postcss/postcss) from 8.5.8 to 8.5.13.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/8.5.8...8.5.13)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.13
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:52:23 +02:00
dependabot[bot]
07b24067f3
build(deps): bump astro from 6.1.6 to 6.2.1 in /site (#555)
Bumps [astro](https://github.com/withastro/astro/tree/HEAD/packages/astro) from 6.1.6 to 6.2.1.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/astro/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/astro@6.2.1/packages/astro)

---
updated-dependencies:
- dependency-name: astro
  dependency-version: 6.2.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:51:51 +02:00
dependabot[bot]
09a0755bee
build(deps): bump posthog-js from 1.369.2 to 1.372.6 in /site (#556)
Bumps [posthog-js](https://github.com/PostHog/posthog-js) from 1.369.2 to 1.372.6.
- [Release notes](https://github.com/PostHog/posthog-js/releases)
- [Changelog](https://github.com/PostHog/posthog-js/blob/main/CHANGELOG.md)
- [Commits](https://github.com/PostHog/posthog-js/compare/posthog-js@1.369.2...posthog-js@1.372.6)

---
updated-dependencies:
- dependency-name: posthog-js
  dependency-version: 1.372.6
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:50:57 +02:00
dependabot[bot]
bbdcbb7956
build(deps): bump actions/upload-pages-artifact from 4 to 5 (#554)
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](https://github.com/actions/upload-pages-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:50:54 +02:00
dependabot[bot]
d1021e7c8b
build(deps): bump @tailwindcss/vite from 4.2.2 to 4.2.4 in /site (#557)
Bumps [@tailwindcss/vite](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/@tailwindcss-vite) from 4.2.2 to 4.2.4.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.4/packages/@tailwindcss-vite)

---
updated-dependencies:
- dependency-name: "@tailwindcss/vite"
  dependency-version: 4.2.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:49:36 +02:00
dependabot[bot]
adc5477b1e
build(deps): bump tailwindcss from 4.2.2 to 4.2.4 in /site (#558)
Bumps [tailwindcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/tailwindcss) from 4.2.2 to 4.2.4.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.4/packages/tailwindcss)

---
updated-dependencies:
- dependency-name: tailwindcss
  dependency-version: 4.2.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:49:23 +02:00
dependabot[bot]
b14bfb236b
build(deps): bump pagefind from 1.5.0 to 1.5.2 in /site (#559)
Bumps [pagefind](https://github.com/Pagefind/pagefind) from 1.5.0 to 1.5.2.
- [Release notes](https://github.com/Pagefind/pagefind/releases)
- [Changelog](https://github.com/Pagefind/pagefind/blob/main/CHANGELOG.md)
- [Commits](https://github.com/Pagefind/pagefind/compare/v1.5.0...v1.5.2)

---
updated-dependencies:
- dependency-name: pagefind
  dependency-version: 1.5.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-01 16:49:15 +02:00
samber
43427987af Publish 2026-04-29 13:03:37 +00:00
nucocloud
4c9da9ed24
Add LiteLLM section to Other group with 3 alerting rules (#553)
LiteLLM (https://github.com/BerriAI/litellm) is a popular LLM-gateway/proxy
that exposes Prometheus metrics via its built-in callback. There were no
existing alerting rules for LiteLLM in this repo, despite its growing
adoption as an OpenAI/Anthropic-compatible proxy.

Added 3 alerts covering the most common operational concerns:

1. **LiteLLM provider spend over budget** — soft-warning on cumulative
   24h spend per model-name regex. Useful when LiteLLM's native
   `provider_budget_config` hard-cap is unavailable, disabled, or
   buggy (e.g. BerriAI/litellm#26701).

2. **LiteLLM proxy failed requests rate high** — error-rate ratio
   alert for downstream LLM provider availability/auth issues.

3. **LiteLLM request latency p95 high** — histogram-quantile alert
   for downstream provider response-time degradation.

All 3 rules tested via `promtool check rules` (SUCCESS) and validated
on a real LiteLLM v1.83.7 production deployment.

Reference: https://docs.litellm.ai/docs/proxy/prometheus
2026-04-29 15:03:07 +02:00
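The error-rate ratio and p95 alerts described in the commit above rest on two small calculations. The sketch below shows them in TypeScript, assuming cumulative histogram buckets in the usual Prometheus shape. It is a simplified illustration, not Prometheus source code and not the rules from the PR; the bucket bounds and counts are made-up sample data.

```typescript
// Illustrative only: a simplified re-implementation with made-up sample data.

/** One cumulative histogram bucket: `count` observations were <= `le`. */
interface Bucket {
  le: number;
  count: number;
}

/** Failed / total, or null when there is no traffic (avoids dividing by zero). */
function errorRatio(failed: number, total: number): number | null {
  return total > 0 ? failed / total : null;
}

/**
 * Estimate the q-quantile (0 < q <= 1) from cumulative buckets sorted by `le`
 * and ending with the +Inf bucket, interpolating linearly inside the bucket.
 */
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (!(q > 0 && q <= 1) || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical request latencies in seconds.
const latency: Bucket[] = [
  { le: 0.5, count: 50 },
  { le: 1, count: 80 },
  { le: 2.5, count: 95 },
  { le: 5, count: 99 },
  { le: Infinity, count: 100 },
];

const p95 = histogramQuantile(0.95, latency);
const failureRatio = errorRatio(3, 100);
```

With this sample data, `p95` lands at about 2.5 seconds and `failureRatio` at 0.03. An alert would compare each value against a threshold chosen for the deployment.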
Samuel Berthe
8ca1fe591f
chore: improve seo 2026-04-26 16:52:07 +02:00
Samuel Berthe
f5f4fdfba4
ci: pin Node.js to 24 for Astro 6 compatibility
Astro 6 requires Node.js >=22.12.0; 'latest' was resolving to v20.
2026-04-22 01:49:13 +02:00
Samuel Berthe
73fff11969
chore: fix Astro deployment 2026-04-22 01:44:01 +02:00
Samuel Berthe
7fd73364a0
ci: pin Node.js to 24 for Astro 6 compatibility
Astro 6 requires Node.js >=22.12.0; 'latest' was resolving to v20.
2026-04-22 01:40:32 +02:00
dependabot[bot]
b2563bb228
build(deps): bump astro from 5.18.1 to 6.1.6 in /site (#551)
Bumps [astro](https://github.com/withastro/astro/tree/HEAD/packages/astro) from 5.18.1 to 6.1.6.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/astro/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/astro@6.1.6/packages/astro)

---
updated-dependencies:
- dependency-name: astro
  dependency-version: 6.1.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-22 01:00:59 +02:00
samber
90f0a63450 Publish 2026-04-21 22:56:04 +00:00
Samuel Berthe
353133d23f
jaeger v2 otel exporter alerts (#552)
* feat(jaeger): add v2 OTEL-based alerts and keep v1 as legacy

Jaeger v2 is built on OpenTelemetry Collector and no longer exposes
jaeger_agent_* / jaeger_collector_* / jaeger_client_* metrics.

- Add "Embedded exporter (v2+)" with 8 rules targeting:
  - jaeger_storage_requests_total (error rate, unavailability, no reads)
  - jaeger_storage_latency_seconds_bucket (p99 latency)
  - http_server_request_duration_seconds_* via otelhttp (search errors,
    search latency, single-trace retrieval latency, service discovery errors)
- Rename existing exporter to "Embedded exporter (legacy, <v2)" with
  slug embedded-exporter-legacy and a v1 EOL notice (Dec 31 2025)

* chore: adding node version to github action
2026-04-22 00:55:36 +02:00
Samuel Berthe
eccf556bdb
fix: remove dead legacyHtmlRedirects and clean up sitemap/SEO config
- Drop legacyHtmlRedirects from astro.config.mjs (no-op on static GitHub Pages host; superseded by *.html.astro pages from e0311c3)
- Remove lastmod: new Date() from sitemap serializer (generates unstable dates on every build)
- Add sitemap .html filter comment, tighten service page meta description, include rule count in titles
2026-04-21 17:18:36 +02:00
Samuel Berthe
e0311c3c09
fix: replace Astro redirects with static meta-refresh pages for legacy .html URLs
GitHub Pages is a static host and does not support server-side redirects.
Astro redirects config only works for SSR targets, so legacyHtmlRedirects had
no effect. Replace with real .html.astro pages using meta http-equiv=refresh
and link rel=canonical. Also disallow legacy URLs in robots.txt.
2026-04-21 16:39:01 +02:00
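A static meta-refresh page of the kind this commit describes can be sketched as a plain string renderer. This is an assumption-laden illustration: the real pages are `.html.astro` files, and `renderRedirectPage`, `escapeHtml`, and the example URL are hypothetical names that do not come from the repository.

```typescript
// Hypothetical sketch: function names and the example URL are illustrative.

/** Escape characters that are unsafe inside HTML attribute values. */
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

/** Render a minimal page that forwards browsers and points crawlers at `target`. */
function renderRedirectPage(target: string): string {
  const href = escapeHtml(target);
  return [
    "<!doctype html>",
    '<html lang="en">',
    "<head>",
    '<meta charset="utf-8">',
    // Client-side redirect: works on static hosts with no server-side rules.
    `<meta http-equiv="refresh" content="0; url=${href}">`,
    // Tells search engines which URL is the canonical one.
    `<link rel="canonical" href="${href}">`,
    "<title>Redirecting</title>",
    "</head>",
    "<body>",
    `<a href="${href}">Continue to the new page</a>`,
    "</body>",
    "</html>",
  ].join("\n");
}

const redirectHtml = renderRedirectPage("https://example.com/rules/");
```

The visible link in the body is a fallback for clients that ignore the refresh directive.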
Samuel Berthe
6d8b2b3671
doc(seo): improve seo after migration 2026-04-21 16:24:57 +02:00
Samuel Berthe
bb8ac9b0cd
fix: always include item field in BreadcrumbList JSON-LD
Fixes Google Search Console error: missing field "item" in itemListElement.
Also removes unused ahrefs site verification meta tag.
2026-04-21 16:00:30 +02:00
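The fix above amounts to emitting `item` for every breadcrumb entry, including the last one. A minimal sketch of such a builder follows; the `Crumb` type, the function name, and the example URLs are hypothetical, and only the schema.org property names are standard.

```typescript
// Hypothetical sketch: `Crumb`, `breadcrumbJsonLd`, and the URLs are illustrative.

interface Crumb {
  name: string;
  url: string;
}

/** Build a schema.org BreadcrumbList where every entry carries an `item` URL. */
function breadcrumbJsonLd(crumbs: Crumb[]) {
  return {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    itemListElement: crumbs.map((crumb, index) => ({
      "@type": "ListItem",
      position: index + 1, // positions are 1-based
      name: crumb.name,
      item: crumb.url, // always present, even for the current page
    })),
  };
}

const breadcrumbs = breadcrumbJsonLd([
  { name: "Home", url: "https://example.com/" },
  { name: "Rules", url: "https://example.com/rules/" },
  { name: "Redis", url: "https://example.com/rules/databases/redis/" },
]);
const breadcrumbScriptBody = JSON.stringify(breadcrumbs);
```

The serialized string would go inside a `<script type="application/ld+json">` element in the page head.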
Samuel Berthe
6b2a5af9f9
oops 2026-04-17 20:01:55 +02:00
Samuel Berthe
b58c180dcb
improve seo 2026-04-17 19:59:24 +02:00
Samuel Berthe
6070e81097
improve seo 2026-04-17 19:53:36 +02:00
Samuel Berthe
4481bb3276
oops 2026-04-17 13:40:37 +02:00
Samuel Berthe
b4324742be
feat: replace Tinybird tracking with PostHog
- Remove Tinybird fetch pipeline from pipe.ts, keep only session/lifetime copy counters
- Wire session_copy_count and lifetime_copy_count into posthog.capture calls
- Remove Tinybird calls from sponsor click tracking, use posthog only
- Hardcode PostHog project ID and reverse proxy host (hogpost.samber.dev)
2026-04-17 12:07:50 +02:00
Samuel Berthe
5a5976c9a3
feat: track sponsor clicks with blocking event before navigation
- Add recordCopy() for copy events (bumps session/lifetime counters)
- Add recordAndWait() for blocking events (1500ms timeout, errors swallowed)
- Extract shared sponsor click handler into site/src/scripts/sponsor.ts
- Plain left-click blocks navigation until HTTP response; modifier/middle
  clicks track fire-and-forget and let the browser navigate natively
- Distinguish header vs footer placement via data-sponsor-slot attribute
2026-04-15 16:28:25 +02:00
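The blocking behaviour described in this commit (wait for the tracking request, but never longer than 1500 ms, and never fail) can be sketched as below. Only the name `recordAndWait` and the 1500 ms figure come from the commit message; the signature, the `Outcome` type, and `isPlainLeftClick` are assumptions made for illustration.

```typescript
// Hypothetical sketch: signatures and helper names are assumptions.

type Outcome = "sent" | "failed" | "timeout";

interface ClickLike {
  button: number;
  metaKey: boolean;
  ctrlKey: boolean;
  shiftKey: boolean;
  altKey: boolean;
}

/** True for an unmodified primary-button click, the only case worth blocking. */
function isPlainLeftClick(e: ClickLike): boolean {
  return e.button === 0 && !e.metaKey && !e.ctrlKey && !e.shiftKey && !e.altKey;
}

/**
 * Resolve when `send` settles or after `timeoutMs`, whichever comes first.
 * Never rejects: a tracking failure must not prevent navigation.
 */
function recordAndWait(send: () => Promise<unknown>, timeoutMs = 1500): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    Promise.resolve()
      .then(send) // also catches synchronous throws from `send`
      .then(() => "sent" as const, () => "failed" as const)
      .then((outcome) => {
        clearTimeout(timer);
        resolve(outcome);
      });
  });
}
```

A click handler built on this would cancel the default navigation only when `isPlainLeftClick` is true, await `recordAndWait`, and then navigate. Modifier and middle clicks would send the event without awaiting it.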
Samuel Berthe
1c5f626046
feat: add first-party copy event pipe to Tinybird
Sends rule_copy, wget_copy events on clipboard interactions,
bypassing ad blockers. Tracks user_id (localStorage apa_uid),
session_id (sessionStorage apa_sid), session/lifetime copy counts,
full rule coordinates (group/service/exporter/rule slugs + indices),
page context, and browser environment. Event name is the Tinybird
data source name, scoped to "rule" or "exporter" per copy type.
2026-04-15 11:34:12 +02:00
Samuel Berthe
bb055773b4
feat: add GitHub star nudges across the site
- Prepend attribution comment to "Copy all" exporter clipboard
- Show inline Star nudge on individual rule copy (3s, dismisses automatically)
- Change StatsBar stars label to "engineers starred" for social proof
- Add milestone progress bar toward 10k stars in StatsBar
- Fix header/StatsBar showing "0" when SSR GitHub API fetch fails (use "—" placeholder)
2026-04-14 21:52:27 +02:00
Samuel Berthe
d38511d7cb
chore: generate pagefind index at build time, not committed to git
- Add pagefind run step to build script in site/package.json
- Add site/public/pagefind/ to .gitignore (generated at deploy time)
2026-04-14 20:33:29 +02:00
Samuel Berthe
a56d8cf2a4
feat: refine star toast — brand orange, idle trigger, 15s auto-hide
- Style: brand orange background with white text (visible on any bg)
- Trigger: every 5 copies OR after 10 minutes of inactivity on page
- Auto-hide: 15s (reset if toast re-triggers before expiry)
- Idle timer resets on each copy
2026-04-14 20:30:08 +02:00
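The trigger and auto-hide rules listed in this commit can be modelled as a small state machine that is independent of the DOM. The sketch below uses the three figures from the commit message (5 copies, 10 minutes, 15 seconds); the state shape and function names are hypothetical. Restarting the idle timer after an idle-triggered toast is an assumption, since the commit does not say what happens next.

```typescript
// Hypothetical sketch: state shape and function names are illustrative.

const COPIES_PER_NUDGE = 5;
const IDLE_MS = 10 * 60 * 1000;
const AUTO_HIDE_MS = 15 * 1000;

interface ToastState {
  copies: number;
  idleSince: number; // timestamp of the last copy (or of page load)
  hideAt: number | null; // timestamp at which the toast disappears
}

function initialState(now: number): ToastState {
  return { copies: 0, idleSince: now, hideAt: null };
}

/** Show the toast, or extend it if it is already visible. */
function show(state: ToastState, now: number): ToastState {
  return { ...state, hideAt: now + AUTO_HIDE_MS };
}

/** A copy resets the idle timer and triggers the toast on every fifth copy. */
function onCopy(state: ToastState, now: number): ToastState {
  const next = { ...state, copies: state.copies + 1, idleSince: now };
  return next.copies % COPIES_PER_NUDGE === 0 ? show(next, now) : next;
}

/** Periodic check: trigger the toast after ten minutes without a copy. */
function onTick(state: ToastState, now: number): ToastState {
  if (now - state.idleSince < IDLE_MS) return state;
  // Assumption: restart the idle timer after an idle-triggered toast.
  return show({ ...state, idleSince: now }, now);
}

function isVisible(state: ToastState, now: number): boolean {
  return state.hideAt !== null && now < state.hideAt;
}
```

Keeping time as an explicit argument makes the rules testable without real timers. A component would call `onCopy` from its copy-success listener and `onTick` from an interval.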
Samuel Berthe
25418c5db2
feat: add star nudge toast after every 5 rule copies
Show a dismissible toast (bottom-right, 20s auto-hide) nudging users
to star the GitHub repo. Fires every 5 copies via a sessionStorage
counter. CopyButton dispatches a copy-success custom event; StarToast
listens for it and manages display logic.
2026-04-14 20:09:30 +02:00
Samuel Berthe
5366d4b9ae
fix: replace invalid top-level return with isFresh flag in star scripts
Top-level return is a syntax error in ES modules. Replace the early
return pattern with an isFresh boolean guard. Also revert the hero
"Star on GitHub" button change.
2026-04-14 19:59:36 +02:00
Samuel Berthe
1f8bcca779
feat: add GitHub stars to StatsBar and fix cache early-return
Add a 4th stat (GitHub stars) to StatsBar with build-time fallback
and live client-side fetch. Both Header and StatsBar share the same
sessionStorage cache key and skip the API call when the cache is fresh
(1h TTL), reducing fetches to at most one per session.
2026-04-14 19:51:12 +02:00
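A shared cache with a one-hour TTL, as described above, might look like the following sketch. The storage interface mirrors the `getItem`/`setItem` subset of `sessionStorage` so the logic can run outside a browser. The key name, the stored JSON shape, and the function names are assumptions, not the site's actual values.

```typescript
// Hypothetical sketch: key name, JSON shape, and function names are assumptions.

interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const CACHE_KEY = "github-stars"; // hypothetical key shared by both components
const TTL_MS = 60 * 60 * 1000; // 1 hour

/** Return the cached count if present, well-formed, and younger than the TTL. */
function readFreshCount(store: KeyValueStore, now: number): number | null {
  const raw = store.getItem(CACHE_KEY);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as { count?: unknown; fetchedAt?: unknown };
    if (typeof parsed.count !== "number" || typeof parsed.fetchedAt !== "number") return null;
    return now - parsed.fetchedAt < TTL_MS ? parsed.count : null;
  } catch {
    return null; // corrupt entry: treat as a cache miss
  }
}

/** Use the cache when fresh; otherwise fetch once and store the result. */
async function getStarCount(
  store: KeyValueStore,
  fetchCount: () => Promise<number>,
  now: number = Date.now(),
): Promise<number> {
  const cached = readFreshCount(store, now);
  const isFresh = cached !== null; // boolean guard instead of an early return
  const count = cached ?? (await fetchCount());
  if (!isFresh) {
    store.setItem(CACHE_KEY, JSON.stringify({ count, fetchedAt: now }));
  }
  return count;
}
```

In the browser, `sessionStorage` itself satisfies `KeyValueStore`, so two components calling `getStarCount(sessionStorage, ...)` with the same key would share one entry.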
Samuel Berthe
954999dfa9
feat: replace GitHub icon with Star button and live star count
Replace the plain GitHub icon+count in the header with a proper two-zone
star button (★ Star | 8.4k). The count is seeded at build time from the
GitHub API and refreshed client-side on page load with a 1-hour
sessionStorage cache.
2026-04-14 19:47:49 +02:00
Samuel Berthe
297fd9864c
fix: use https in CC BY URL and trigger site build on _data changes 2026-04-14 16:27:01 +02:00
Samuel Berthe
5c166e8403
docs: update tagline and clean up README 2026-04-10 21:45:27 +02:00
Samuel Berthe
ab87fdcf30
feat/dual license (#550)
* ci: remove node version pin in site build workflow

* docs: clarify dual license (CC BY 4.0 for content, MIT for site code)

Alert rules and content (_data/rules.yml, dist/) are licensed under
Creative Commons CC BY 4.0. The site source code (site/) is licensed
under MIT. Both are now documented in LICENSE, site/LICENSE, the footer,
and the FAQ.
2026-04-10 21:36:57 +02:00
Samuel Berthe
aa7d93ce95
chore: migrate assets/ to site/public/images/ (#549)
Remove legacy assets/ directory (pre-Astro era). Images were already
duplicated under site/public/images/; update README sponsor URLs to
point to the new location.
2026-04-10 21:28:38 +02:00
Samuel Berthe
a4d0b1370c
ci: add site build workflow (#548) 2026-04-10 21:21:04 +02:00
dependabot[bot]
d31b3f9ba0
build(deps): bump @iconify-json/lucide from 1.2.101 to 1.2.102 in /site (#545)
Bumps [@iconify-json/lucide](https://github.com/iconify/icon-sets) from 1.2.101 to 1.2.102.
- [Commits](https://github.com/iconify/icon-sets/commits)

---
updated-dependencies:
- dependency-name: "@iconify-json/lucide"
  dependency-version: 1.2.102
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:19:49 +02:00
Samuel Berthe
89d8423d93
build(deps): migrate from @astrojs/tailwind to @tailwindcss/vite for Tailwind v4 (#547)
@astrojs/tailwind v6 still requires tailwindcss@^3; replace it with the
official @tailwindcss/vite Vite plugin. Update global.css to v4 syntax
(@import "tailwindcss", @custom-variant dark, @theme tokens) and drop
the now-unused tailwind.config.mjs.
2026-04-10 21:18:13 +02:00
dependabot[bot]
814dd5d3fb
build(deps): bump @astrojs/tailwind from 5.1.5 to 6.0.2 in /site (#543)
Bumps [@astrojs/tailwind](https://github.com/withastro/astro/tree/HEAD/packages/integrations/tailwind) from 5.1.5 to 6.0.2.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/integrations/tailwind/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/@astrojs/tailwind@6.0.2/packages/integrations/tailwind)

---
updated-dependencies:
- dependency-name: "@astrojs/tailwind"
  dependency-version: 6.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:11:05 +02:00
dependabot[bot]
e6ea45aec1
build(deps): bump tailwindcss from 3.4.19 to 4.2.2 in /site (#544)
Bumps [tailwindcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/tailwindcss) from 3.4.19 to 4.2.2.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.2/packages/tailwindcss)

---
updated-dependencies:
- dependency-name: tailwindcss
  dependency-version: 4.2.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:56 +02:00
dependabot[bot]
bea2dc45b4
build(deps): bump actions/upload-pages-artifact from 3 to 4 (#540)
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](https://github.com/actions/upload-pages-artifact/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:38 +02:00
dependabot[bot]
dd0c8372f9
build(deps): bump actions/setup-node from 4 to 6 (#541)
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 4 to 6.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:31 +02:00
dependabot[bot]
132329abd8
build(deps): bump actions/deploy-pages from 4 to 5 (#542)
Bumps [actions/deploy-pages](https://github.com/actions/deploy-pages) from 4 to 5.
- [Release notes](https://github.com/actions/deploy-pages/releases)
- [Commits](https://github.com/actions/deploy-pages/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/deploy-pages
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:24 +02:00
dependabot[bot]
9e80bb910e
build(deps): bump actions/checkout from 4 to 6 (#539)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:09 +02:00
Samuel Berthe
79afa21610
feat/astro migration (#538)
* feat: migrate website from Jekyll to Astro

Rebuilds the site using Astro (SSG) with Tailwind CSS v4, replacing the
Jekyll/Cayman theme. Key changes:

- Splits the monolithic /rules page into 110 statically-generated pages
  (92 per-service + 13 group index + homepage + guide pages) for SEO
- URL structure: /rules/[group-slug]/[service-slug]/ with backward-
  compatibility redirect map for old anchor-based URLs (/rules#redis)
- Modern UI: Prometheus-orange accent, dark mode (system + toggle),
  sticky sidebar, responsive layout, copy-to-clipboard per rule/section
- SEO: per-page <title>, <meta description>, Open Graph, Twitter Card,
  canonical URLs, sitemap.xml via @astrojs/sitemap
- GEO: FAQPage JSON-LD schema on each service page (rules as Q&A pairs
  for AI search engines), TechArticle schema, BreadcrumbList
- Search: Pagefind (build-time index, lazy-loaded, ~200KB)
- Zero JS by default; copy buttons and theme toggle use inline scripts
- New CI: .github/workflows/deploy.yml builds Astro + Pagefind and
  deploys to GitHub Pages via actions/deploy-pages
- Existing dist.yml and test.yml workflows are untouched
- _data/rules.yml remains the single source of truth

Note: GitHub Pages source must be changed from "Build from branch"
(Jekyll) to "GitHub Actions" in repository settings.

* doc: new website based on astro

* refactor: remove previous website

* chore: add npm dependabot for Astro site + scope CI to _data changes

* Update site/astro.config.mjs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update site/src/components/CopyButton.astro

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* oops

* fix: strip trailing slash from BASE_URL to prevent double slashes in URLs

Agent-Logs-Url: https://github.com/samber/awesome-prometheus-alerts/sessions/c85937ba-1855-4b8a-a72b-847eab1c8639

Co-authored-by: samber <2951285+samber@users.noreply.github.com>

* fix: resolve Astro build errors in astro.config.mjs

- Remove assetsInclude yml which caused Vite to treat YAML files as static assets instead of running them through the custom YAML transform plugin; data.groups was undefined at runtime because the import resolved to a URL rather than parsed content
- Deduplicate old-path redirects: emit only the slash-less variant per service to avoid Astro router collision warnings (trailing-slash variant is handled automatically)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: samber <2951285+samber@users.noreply.github.com>
2026-04-10 21:08:06 +02:00
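The double-slash fix mentioned in the commit above ("strip trailing slash from BASE_URL") comes down to normalizing both sides of a join. A minimal sketch, with hypothetical helper names and an example base path:

```typescript
// Hypothetical sketch: helper names and the example paths are illustrative.

/** Remove any trailing slashes so the base can be joined safely. */
function stripTrailingSlash(base: string): string {
  return base.replace(/\/+$/, "");
}

/** Join a base and a path with exactly one slash between them. */
function joinUrl(base: string, path: string): string {
  return `${stripTrailingSlash(base)}/${path.replace(/^\/+/, "")}`;
}

const joined = joinUrl("/awesome-prometheus-alerts/", "/rules/");
```

Without the normalization, concatenating the same two strings would produce `/awesome-prometheus-alerts//rules/`.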
samber
0d148832d3 Publish 2026-04-06 19:14:46 +00:00
Samuel Berthe
c2615fae52
fix/promql rules review 2 (#534)
* fix(data): fix queries and thresholds across multiple exporters

- Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace
  ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add for: durations
- ZFS: improve description, remove incorrect ON() join on readonly check
- Thanos: filter gRPC errors to actual error codes only (drop NotFound, Cancelled, etc.)
- Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job, route, le)
- Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations
- OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments
- Tempo/Cortex: raise >0 thresholds to avoid transient spikes
- APC UPS: add division-by-zero guard on battery voltage ratio
- DigitalOcean: raise increase()>0 to >3
- Grafana Alloy: fix missing name: field on exporter
- Graph Node: add threshold comments

* fix(data): remove official mixin reference from Ceph OSD comment

* fix(data): remove official mixin references from comments
2026-04-06 21:14:15 +02:00
Samuel Berthe
72c9e922c0
docs: update CLAUDE.md with lessons from PromQL review
Add guidance on untyped metrics with counter semantics (node_vmstat_*,
MySQL SHOW GLOBAL STATUS), YAML duplicate key pitfall, permanent-firing
cumulative counter alerts, and updated category structure.
2026-04-06 21:08:48 +02:00
samber
ed1515015a Publish 2026-04-06 18:38:45 +00:00
Samuel Berthe
2258835c30
fix/promql rules review (#533)
* fix(data): comprehensive PromQL review across all ~937 rules

Query fixes:
- Replace rate()/increase() with deriv()/delta() on gauge metrics exposed
  as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_*,
  systemd_socket_refused_connections_total)
- Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds
  → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module)
- Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn)
  → gnatsd_varz_subscriptions (server total)
- Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0
- Fix RabbitMQ total connections metric: connectionsTotal → connections
- Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct
  gauge comparison (deriv > 0 misses stable non-zero failure states)
- Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count
  → certmanager_acme_client_request_count (renamed in v1.19+)
- Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes
- Fix Flink duplicate comments: field (YAML last-write-wins bug)
- Add datid!="0" filter to PostgreSQL dead locks query
- Fix PostgreSQL high rollback rate: restructure division-by-zero guard
  and move ratio calculation outside sum()
- Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager
  memory, Hadoop HBase heap, Vault cluster health
- Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/
  OSD Down/PG unavailable

Threshold fixes:
- Replace > 0 with meaningful thresholds on rate()/increase() queries
  across: Alertmanager, eBPF decoder errors, systemd refused connections,
  Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed
  inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ
  unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager,
  OTel Collector receiver refused metrics
- Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s
  (0.5ms was below normal operating range; 10ms is more realistic)

Description fixes:
- Fix MySQL slow queries: remove duplicate "mysql" word
- Fix SMART device description: remove trailing stray ")" (6 rules)
- Fix host disk IO description: remove duplicate "Check storage for issues."
- Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute"
- Fix EDAC uncorrectable errors: remove time-window claim (raw counter)
- Fix Mimir store-gateway sync description: said "10 minutes" but
  threshold is 1800s (30 minutes)
- Fix Vault description false "%" suffix on count values
- Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy,
  Istio rules to include {{ $labels }} and {{ $value }} template vars
- Downgrade Cassandra key cache hit rate: critical → warning

Comments:
- Add note on node_vmstat_oom_kill gauge type (delta vs increase)
- Add note on systemd_socket_refused_connections_total gauge type
- Add note on mysql_global_status_* gauge type (delta/deriv vs rate)
- Add note on pg_txid_current requiring a custom postgres_exporter query
- Add note on pg_stat_ssl_compression availability (PG 9.5-13 only)
- Add note on cert-manager legacy metric name for users on v1.18 and older
- Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules
- Add note on NATS leaf node spurious fires when leaf nodes not configured

* fix(data): PromQL type fixes, job filter cleanup, query correctness review

- Replace rate()/increase() with deriv()/delta() on gauge metrics:
  node_vmstat_pgmajfault, cassandra_stats (criteo exporter),
  gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn
- Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay
- Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause
- Fix Thanos query latency: use _count instead of _bucket for guard clause
- Restore job filter in Thanos objstore guard clauses (compact + store)
- Remove redundant job= filters from unique metrics: ~30 Thanos rules,
  kube_persistentvolume_status_phase, otelcol_process_runtime_*
- Fix high-cardinality Istio latency grouping (drop source labels from by())
- Add division-by-zero guard to host context switch ratio
- Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10
- Remove redundant for: 1m from HAProxy check failure rules
- Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel)
- Remove external mixin references from comments
- Fix Tempo dropped spans metric name: add missing _total suffix
- Fix Thanos bucket replicate run latency: add missing le label in by()
2026-04-06 20:38:12 +02:00
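The switch from rate()/increase() to deriv()/delta() in the commit above rests on how counter functions treat a falling value. The sketch below is a simplified model, not Prometheus code: it ignores the extrapolation Prometheus applies at the edges of the range window, and the function names and sample values are invented for illustration.

```python
def simple_increase(samples: list[float]) -> float:
    """Counter-style increase: any drop is treated as a reset to zero.

    Simplified sketch: ignores the extrapolation Prometheus applies
    at the edges of the range window.
    """
    total = 0.0
    prev = samples[0]
    for value in samples[1:]:
        if value < prev:
            # A counter that went down is assumed to have restarted from 0,
            # so the whole new value counts as growth.
            total += value
        else:
            total += value - prev
        prev = value
    return total


def simple_delta(samples: list[float]) -> float:
    """Gauge-style delta: last sample minus first sample."""
    return samples[-1] - samples[0]


# Invented gauge readings that rise and then fall.
gauge = [10.0, 12.0, 8.0, 9.0]
print(simple_increase(gauge))  # 11.0
print(simple_delta(gauge))     # -1.0
```

On a series that can legitimately decrease, the counter-style function reports growth of 11 even though the value ended lower than it started, which is the kind of false signal the gauge-oriented functions avoid.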
Samuel Berthe
b8fd051a55
Update README.md 2026-03-31 16:41:19 +02:00
samber
87d0610246 Publish 2026-03-31 14:40:08 +00:00
Emil Bostijancic
7ba6b2d367
feat: add OpenSearch alerting rules (OpenSearch exporter plugin) (#532) 2026-03-31 16:39:38 +02:00
dependabot[bot]
b13d59bce6
build(deps-dev): bump activesupport from 7.2.3 to 7.2.3.1 (#531)
Bumps [activesupport](https://github.com/rails/rails) from 7.2.3 to 7.2.3.1.
- [Release notes](https://github.com/rails/rails/releases)
- [Changelog](https://github.com/rails/rails/blob/v8.1.2.1/activesupport/CHANGELOG.md)
- [Commits](https://github.com/rails/rails/compare/v7.2.3...v7.2.3.1)

---
updated-dependencies:
- dependency-name: activesupport
  dependency-version: 7.2.3.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:24:41 +01:00
dependabot[bot]
9d9c648cdd
build(deps-dev): bump json from 2.18.1 to 2.19.2 (#530)
Bumps [json](https://github.com/ruby/json) from 2.18.1 to 2.19.2.
- [Release notes](https://github.com/ruby/json/releases)
- [Changelog](https://github.com/ruby/json/blob/master/CHANGES.md)
- [Commits](https://github.com/ruby/json/compare/v2.18.1...v2.19.2)

---
updated-dependencies:
- dependency-name: json
  dependency-version: 2.19.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-20 18:22:18 +01:00
samber
af2f277830 Publish 2026-03-18 20:41:01 +00:00
Samuel Berthe
e3a7165a65
fix(data): remove malformed summary fields, replace increase() by rate(), remove redundant avg_over_time 2026-03-18 21:40:30 +01:00
samber
c0e1f7a5f5 Publish 2026-03-18 17:06:34 +00:00
Samuel Berthe
1aafa40913
fix(data): prevent division by 0 2026-03-18 18:06:00 +01:00
samber
4fb1aa9ae4 Publish 2026-03-18 11:23:25 +00:00
Samuel Berthe
a4581ed322
fix(data): fix thresholds, comments, intervals, units... (#529) 2026-03-18 12:22:55 +01:00
samber
f36c23e393 Publish 2026-03-17 12:30:42 +00:00
Samuel Berthe
03963ef6f9
refactor(categories): change categories and move some exporters (#528) 2026-03-17 13:30:13 +01:00
Samuel Berthe
06f8b048a3
fix ci 2026-03-16 19:17:05 +01:00
Samuel Berthe
5d099fcae1
fix ci 2026-03-16 17:44:00 +01:00
samber
9d00396bc8 Publish 2026-03-16 16:11:31 +00:00
Samuel Berthe
2b99cf1f76
Feat/cilium alerting rules (#526)
* Add .worktrees/ to .gitignore

* feat: add Cilium alerting rules (32 rules across agent, operator, ClusterMesh, KVStoreMesh, Hubble)

* fix: use job label instead of k8s_app, switch to single-quoted YAML strings

* remove Cilium agent high restart rate alert
2026-03-16 17:10:59 +01:00
samber
e8eb75c2e2 Publish 2026-03-16 15:53:03 +00:00
Samuel Berthe
5071e01ad9
Feature/spinnaker alerts (#527)
* Add .worktrees/ to .gitignore

* feat: add Spinnaker alerting rules (12 rules)

Add Prometheus alerting rules for Spinnaker built-in exporter
covering Orca queue health, circuit breakers, Igor polling monitors,
Gate API throttling, Clouddriver errors, and AWS rate limiting.

Metric names validated against uneeq-oss/spinnaker-mixin dashboards.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 16:52:31 +01:00
samber
6423f93ba7 Publish 2026-03-16 15:40:08 +00:00
Samuel Berthe
1455e0fd77
feat: add Oracle Database alerting rules (8 rules) (#525)
Add Prometheus alerting rules for Oracle Database using iamseth/oracledb_exporter.
Rules based on Grafana oracledb-mixin and exporter default metrics:
- DB down, session/process limit, tablespace capacity (warning+critical),
  high rollbacks, active sessions, user I/O wait time.
2026-03-16 16:39:35 +01:00
samber
ba5c9a3280 Publish 2026-03-16 14:01:45 +00:00
Samuel Berthe
d8315eb3bc
Feature/cert manager rules (#524)
* Add .worktrees/ to .gitignore

* feat: add cert-manager alerting rules (4 rules)

Add Prometheus alerting rules for cert-manager under the
"Network, security and storage" category:
- Cert-Manager absent (service down detection)
- Certificate expiring soon (21-day threshold)
- Certificate not ready (readiness check)
- Hitting ACME rate limits (rate limit detection)

Based on imusmanmalik/cert-manager-mixin and official
cert-manager metrics documentation.

* docs: add cert-manager to README
2026-03-16 15:01:07 +01:00
samber
7f346ede99 Publish 2026-03-16 13:37:19 +00:00
Samuel Berthe
b58b498bbb
feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) (#523)
* feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules)

Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins.
Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more.

* fix: address PR review comments on Tempo/Mimir rules

- Fix Tempo no tenant index builders: add on() for cross-label-set and
- Fix Tempo block list rising: output percentage instead of ratio
- Fix Mimir memory map areas: multiply by 100 to match % description
- Fix all instance limit rules: multiply by 100 to match % descriptions
- Fix distributor inflight requests: add % to description
2026-03-16 14:36:50 +01:00
samber
ff17e9c69b Publish 2026-03-16 13:20:46 +00:00
Samuel Berthe
7ee16641ac
feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520)
* feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter)

* fix: grammar in WireGuard rule comment
2026-03-16 14:20:17 +01:00
samber
4da60669d0 Publish 2026-03-16 13:09:31 +00:00
Samuel Berthe
f974552ef1
Feat/jaeger alerting rules (#521)
* Add .worktrees/ to .gitignore

* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)

Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin

* fix: rename Jaeger agent RPC alert to Jaeger client RPC

The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
2026-03-16 14:09:03 +01:00
samber
eeba1ebbaa Publish 2026-03-16 13:07:45 +00:00
Samuel Berthe
8b443be6d2
feat: add systemd_exporter alerting rules (7 rules) (#522)
* feat: add systemd_exporter alerting rules (7 rules)

Add new Systemd service under Basic resource monitoring with rules for:
- Unit failed/inactive state detection
- Service crash loop detection
- Task limit exhaustion
- Socket refused/high connections
- Timer missed trigger

* fix: narrow systemd unit inactive query to reduce noise

Add type="service" and name filter to the inactive unit alert
to avoid false positives from legitimately inactive units.
2026-03-16 14:07:14 +01:00
Samuel Berthe
30bbedbc79
feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519)
* feat: add Cloud providers alerting rules (33 rules across 4 exporters)

New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance

* fix: address PR review - move Cloud providers before Other, fix service name

- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
  awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
2026-03-16 14:06:59 +01:00
samber
577c36d9ae Publish 2026-03-16 03:50:27 +00:00
Samuel Berthe
fd3bfb02c0
Some fix (#516)
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)

Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.

* feat: add alerting rules for prometheus/memcached_exporter

* fix: add division-by-zero guards and improve quoting in memcached rules (#512)

- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability

* fix: fix SNMP interface down query and add job scoping (#507)

- Fix ifOperStatus query to use vector matching instead of label filter
  since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
  and interface down rules to prevent matching non-SNMP series

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 04:50:01 +01:00
samber
fab9193407 Publish 2026-03-16 03:49:20 +00:00
Samuel Berthe
97aae5dabf
feat: add GitLab alerting rules (28 rules across 3 exporters) (#518)
Add new GitLab service under "Other" category with 3 exporters:
- Built-in exporter (18 rules): Puma, HTTP errors/latency, Sidekiq jobs,
  database connection pool, CI/CD pipelines, Ruby process health
- Workhorse (3 rules): HTTP error rate, latency, in-flight requests
- Gitaly (7 rules): gRPC errors, ResourceExhausted, RPC latency,
  CPU throttling, auth failures, circuit breaker

All metrics verified against gitlabhq/gitlabhq source code.
Several rules derived from GitLab Omnibus default alerting rules.
2026-03-16 04:48:52 +01:00
samber
c390641203 Publish 2026-03-16 03:46:30 +00:00
Samuel Berthe
e6cdcdb9e5 feat: add Apache Flink and Apache Spark alerting rules
Add 20 new alerting rules under the Runtimes category:
- Apache Flink (12 rules): job status, TaskManager registration, slot
  availability, restarts, checkpoints, backpressure, heap memory, GC,
  and record processing
- Apache Spark (8 rules): worker health, waiting apps, memory/cores
  exhaustion, executor GC, task failures, and disk spill
2026-03-16 04:46:00 +01:00
samber
1db2c6f196 Publish 2026-03-16 03:40:43 +00:00
Samuel Berthe
88e2c19017
feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi) (#517)
* feat: add Keycloak alerting rules (aerogear/keycloak-metrics-spi)

* fix: correct Keycloak metrics-spi metric names and query grouping
2026-03-16 04:40:15 +01:00
samber
258220b4f0 Publish 2026-03-16 02:44:20 +00:00
Samuel Berthe
20651aa10d
feat: add OpenStack alerting rules (openstack-exporter) (#515)
* feat: add OpenStack alerting rules (openstack-exporter)

Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.

* docs: add OpenStack to README services list

* fix: align OpenStack load balancer alert name with operating_status semantics

The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
2026-03-16 03:43:51 +01:00
samber
32f639da3b Publish 2026-03-16 02:31:48 +00:00
Samuel Berthe
bf7b902881
feat: add process-exporter alerting rules (ncabatoff/process-exporter) (#514)
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)

* docs: add Process to README services list

* fix: address PR review feedback for process-exporter rules

- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
2026-03-16 03:31:18 +01:00
samber
d44bfd4c4b Publish 2026-03-16 02:26:04 +00:00
Samuel Berthe
2b239736cf
feat: add alerting rules for prometheus/memcached_exporter (#512) 2026-03-16 03:25:38 +01:00
Samuel Berthe
281142567c
fix: use proper zero-traffic guard in Envoy ratio alerts (#511) (#513)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
2026-03-16 03:25:27 +01:00
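The difference between the `+ 1` denominator hack and a `> 0` guard can be shown with plain arithmetic. This is a minimal sketch with invented function names: `None` stands in for PromQL returning no sample when the guard filters the series out.

```python
from typing import Optional


def ratio_plus_one(errors: float, total: float) -> float:
    """The `+ 1` hack: never divides by zero, but skews the ratio."""
    return errors / (total + 1)


def ratio_guarded(errors: float, total: float) -> Optional[float]:
    """Guarded form: no result at all when there is no traffic."""
    if total > 0:
        return errors / total
    return None  # stands in for PromQL returning no sample
```

With one failing request out of one, `ratio_plus_one(1, 1)` reports 0.5 while `ratio_guarded(1, 1)` reports the true ratio of 1.0. The distortion is largest at low traffic, which is often when a ratio alert matters most.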
samber
6bec57ae96 Publish 2026-03-16 02:12:41 +00:00
Samuel Berthe
f97f692596
feat: add Proxmox VE alerting rules (prometheus-pve-exporter) (#509)
Add 9 alerting rules for Proxmox VE covering node/guest status,
CPU, memory, storage, backup coverage, replication, and cluster quorum.
2026-03-16 03:12:06 +01:00
samber
7397eb24ec Publish 2026-03-16 02:10:36 +00:00
Samuel Berthe
be7a2e4d5d
feat: add IPMI exporter alerting rules (#510)
* feat: add IPMI exporter alerting rules

Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.

* docs: add IPMI to README service list

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 03:10:10 +01:00
samber
8bd2265fe1 Publish 2026-03-16 02:04:26 +00:00
Samuel Berthe
c064d2264e
feat: add Envoy proxy alerting rules using built-in metrics (#511)
Add 19 alerting rules for Envoy proxy under "Reverse proxies and load
balancers" using native metrics from /stats/prometheus endpoint.

Covers: server health, HTTP error rates (downstream/upstream), connection
saturation, cluster membership, health checks, outlier detection,
SSL/TLS certificate expiry, circuit breakers, and request timeouts.
2026-03-16 03:03:57 +01:00
samber
375a36f82a Publish 2026-03-16 01:56:27 +00:00
Samuel Berthe
89e703d763
feat: add alerting rules for cloudflare/ebpf_exporter (#508)
* feat: add alerting rules for cloudflare/ebpf_exporter

* docs: add eBPF to README service list
2026-03-16 02:56:04 +01:00
samber
9f6d4fd2a2 Publish 2026-03-16 01:34:59 +00:00
Samuel Berthe
3db9281508
feat: add SNMP exporter alerting rules (#507)
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
2026-03-16 02:34:34 +01:00
dependabot[bot]
b039066277
build(deps-dev): bump nokogiri from 1.18.10 to 1.19.1 (#506)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.10 to 1.19.1.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.10...v1.19.1)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-version: 1.19.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-16 01:51:52 +01:00
Samuel Berthe
01a5791376
fix: fix GitHub Actions workflow issues (#505)
- Replace deprecated ::set-output with $GITHUB_OUTPUT
- Pin mikefarah/yq from @master to @v4
- Add explicit permissions: contents: write to publish workflow
- Limit test workflow push trigger to master branch only
2026-03-16 01:47:09 +01:00
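The `$GITHUB_OUTPUT` mechanism that replaces `::set-output` works by appending `name=value` lines to the file whose path GitHub Actions puts in that environment variable. The sketch below shows the idea from a Python step; `set_output` is a hypothetical helper, not part of this repository's workflows, and it only covers single-line values.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append `name=value` to the file named by $GITHUB_OUTPUT.

    Hypothetical helper; single-line values only.
    """
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

A later step would then read the value through the usual `steps.<id>.outputs.<name>` expression.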
samber
e2af1325c6 Publish 2026-03-16 00:27:40 +00:00
Samuel Berthe
c37ef8f50c
fix: review and fix 74 database & broker alert rules (#504)
* fix: review and fix 74 database & broker alert rules

Comprehensive review of all database and broker alerts covering 16 services.

Typos & descriptions (8 fixes):
- PGBouncer: "a a server" → "a server"
- RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ",
  "unactive" → "inactive"
- Cassandra: write failure said "Read failures", "bad hacker" →
  "authentication failures"
- Solr: replication errors said "failed updates"
- Meilisearch: "index is empty" said "instance is down"

Duplicates removed (5 fixes):
- PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
- ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
- NATS: 2 rules with low thresholds duplicated better rules

Broken queries (20 fixes):
- Patroni: patroni_master → patroni_primary (renamed in v3)
- MongoDB: rate() on gauge → direct ratio for connection queries
- MongoDB: removed WiredTiger-incompatible virtual memory rule
- Cassandra instaclustr: avg() on counter → rate()[5m]
- Cassandra criteo: increase() on JMX rate metric → direct threshold
- ClickHouse: increase() on gauge → direct threshold
- NATS: rate() on gauge → direct comparison, removed 4 config-value rules
- SQL Server: increase() on gauge → direct threshold
- Pulsar: moved comparison outside sum() (4 rules)
- Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]

Severity adjustments (7 fixes):
- Redis: backup threshold 24h → 48h, rejected connections → warning > 5
- RabbitMQ: no consumer for: 5m with comment
- Elasticsearch: unassigned shards added for: 2m
- CouchDB: process restarted critical → info
- Kafka: consumer group lag → warning, threshold 10000, better description
- Hadoop: HBase heap low critical → warning

Missing for duration (18 fixes):
- Added for: 1m to service-down alerts across MySQL, PostgreSQL,
  SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch,
  Cassandra, Zookeeper with restart-tolerance comments

Division by zero guards (9 fixes):
- Added denominator > 0 guards to ratio queries in PostgreSQL,
  RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS

Query design improvements (5 fixes):
- Cassandra: removed unnecessary sum() and redundant avg_over_time()
- ClickHouse: ZooKeeper avg() → per-instance check
- PostgreSQL: sum() → sum by (instance) for SSL and locks
- PGBouncer: 30s range window → 2m

Hardcoded labels (2 fixes):
- ClickHouse: added comment about job="clickhouse"
- Cassandra criteo: removed hardcoded service="cas"

* fix: address PR review comments

- Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
- Elasticsearch query latency: add division-by-zero guard
- Redis backup: "backuped" → "backed up"
2026-03-16 01:27:18 +01:00
Samuel Berthe
89842beb5c fix: fix favicon path 2026-03-15 23:54:05 +01:00
Samuel Berthe
8f462ce962 adding claude.md 2026-03-15 19:59:01 +01:00
samber
879436f440 Publish 2026-03-15 18:47:04 +00:00
Samuel Berthe
080a792777
data: adding python/ruby/golang (#502)
* data: adding python/ruby/golang

* fix: address review feedback on runtime alerts

- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
2026-03-15 19:46:39 +01:00
samber
1e4e3d17bc Publish 2026-03-15 17:08:32 +00:00
Samuel Berthe
9ae17eca97
Fix broken and misleading alert rules (#503)
- Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos)
- Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*)
- Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical
- Simplify PostgreSQL config change query (giant regex -> negative matcher)
- Downgrade PostgreSQL SSL compression severity from critical to warning
- Fix misleading "Host unusual disk read rate" name and description
2026-03-15 18:08:06 +01:00
Mattias Bengtsson
bc41215c8f
Website: Support dark mode (#501)
* Update Gemfile.lock

Running Jekyll according to `CONTRIBUTING.md` fails complaining about
missing a `nokogiri` dependency. Updating `Gemfile.lock` seems to solve
this issue.

Fixes: #500

* Website: Support dark mode

Support `prefers-color-scheme: dark` by employing some more or less
hacky CSS overrides.

One should perhaps just use a different off-the-shelf Jekyll theme that
does this properly from the start.
2026-03-01 22:54:42 +01:00
samber
80400e9a56 Publish 2026-03-01 19:15:42 +00:00
Marcin Morawski
eeebb90e6f
Add systemd service name to HostSystemdServiceCrashed summary (#499)
* Add systemd service name to HostSystemdServiceCrashed summary

* Modify systemd service crash rule description

Updated the description for the systemd service crash rule to include the service name.

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-03-01 20:15:17 +01:00
samber
0693ed168e Publish 2026-02-21 18:40:35 +00:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
* tune Targets Missing rules

* reworked query logic

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
dependabot[bot]
9998e22145
build(deps-dev): bump nokogiri from 1.18.9 to 1.19.1 (#498)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.9 to 1.19.1.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.9...v1.19.1)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-version: 1.19.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-20 01:58:02 +01:00
dependabot[bot]
52cc00fc4c
build(deps-dev): bump faraday from 2.12.0 to 2.14.1 (#496)
Bumps [faraday](https://github.com/lostisland/faraday) from 2.12.0 to 2.14.1.
- [Release notes](https://github.com/lostisland/faraday/releases)
- [Changelog](https://github.com/lostisland/faraday/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lostisland/faraday/compare/v2.12.0...v2.14.1)

---
updated-dependencies:
- dependency-name: faraday
  dependency-version: 2.14.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-10 00:26:42 +01:00
samber
dd10c7ef05 Publish 2026-01-30 11:15:52 +00:00
Per Lundberg
51aea96ba7
Adjust OOM kill detected rule (#495)
* Adjust OOM kill detected rule

When a machine runs out of memory, the node exporter can stop
responding for several minutes. The rule now takes this into
account: even if the machine takes 15-20 minutes to become
responsive again, the alert should still fire.

* Update rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-01-30 12:15:27 +01:00
Andreyev Dias de Melo
1d69457017
fix: corrects download URL for rules files (#494) 2026-01-30 01:40:38 +01:00
Samuel Berthe
f0107caf9e
Update README.md 2026-01-15 12:33:35 +01:00
Samuel Berthe
34cc80ffea
Update app.css 2026-01-15 02:48:16 +01:00
Samuel Berthe
a5d1c04955
Update default.html 2026-01-15 02:43:57 +01:00
Samuel Berthe
65551ae19f
Update README.md 2026-01-15 02:42:42 +01:00
Samuel Berthe
570521429e
Update default.html 2026-01-15 02:42:00 +01:00
Samuel Berthe
55f16705eb
Add files via upload 2026-01-15 02:40:58 +01:00
Samuel Berthe
2b5c8b0ec7
Update README.md 2026-01-15 02:39:24 +01:00
samber
81081bdda5 Publish 2026-01-07 12:58:08 +00:00
Samuel Berthe
d400e3e64d
feat(k8s): cronjob rule (#491) 2026-01-07 13:57:42 +01:00
Samuel Berthe
1136aa3a87
remove file 2026-01-07 13:29:12 +01:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
* Jenkins node offline for clause (#2)

* Convert cpu alert expressions to without() rather than on()

* Remove on() expression from network throughput alerts as labels fully match

---------

Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
dependabot[bot]
74ba870f05
build(deps-dev): bump uri from 0.13.2 to 0.13.3 (#489)
Bumps [uri](https://github.com/ruby/uri) from 0.13.2 to 0.13.3.
- [Release notes](https://github.com/ruby/uri/releases)
- [Commits](https://github.com/ruby/uri/compare/v0.13.2...v0.13.3)

---
updated-dependencies:
- dependency-name: uri
  dependency-version: 0.13.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-06 00:55:03 +01:00
5bentz
ffa260b39d
Update sleep-peacefully.md (#487)
Fix business hours (9:00 to 18:00)
2025-12-08 15:19:11 +01:00
dependabot[bot]
766b224c67
build(deps): bump actions/checkout from 5 to 6 (#485)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-01 21:34:15 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484) 2025-11-17 14:56:04 +01:00
Samuel Berthe
d6589237e1
Update CONTRIBUTING.md 2025-11-13 16:24:49 +01:00
Samuel Berthe
d0d1b00a7b
Fix typo in OpenTelemetry Collector link 2025-11-05 17:15:10 +01:00
Samuel Berthe
e617c07179
Update README.md 2025-11-05 17:14:47 +01:00
Samuel Berthe
48f2dde80c
feat: use /ref/head/ instead of /master/ for yaml url (#482) 2025-11-05 17:12:50 +01:00
samber
cea78d7fd6 Publish 2025-11-05 16:08:52 +00:00
Arve Knudsen
d58bc324ad
Add OpenTelemetry Collector monitoring alerts (#480)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-11-05 17:08:26 +01:00
samber
4acbddb21a Publish 2025-11-05 16:04:56 +00:00
Samuel Berthe
6e2db98590
feat: add support for exporter-level comments (#481) 2025-11-05 17:04:30 +01:00
samber
ae8cfb0366 Publish 2025-10-13 12:24:59 +00:00
andrii.k
9edef74e73
update kafka alerts (#478) 2025-10-13 14:24:37 +02:00
dependabot[bot]
2f9279d707
build(deps-dev): bump rexml from 3.3.9 to 3.4.2 (#476)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.9 to 3.4.2.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.9...v3.4.2)

---
updated-dependencies:
- dependency-name: rexml
  dependency-version: 3.4.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-19 04:17:09 +02:00
samber
606d6fc592 Publish 2025-09-15 13:04:10 +00:00
Riccardo Cannella
7832e01082
haproxy: align v1 and v2 HAProxy backend max active session > 80% alerts (#475)
* haproxy: align v1 and v2 max current session alerts

* fix: remove non-existing label

---------

Co-authored-by: Riccardo Cannella <riccardo.cannella@reevo.it>
2025-09-15 15:03:44 +02:00
samber
b158ebb551 Publish 2025-09-14 17:22:29 +00:00
Samuel Berthe
237e89babc
Update query for unused replication slot rule 2025-09-14 19:22:05 +02:00
dependabot[bot]
264bcb82be
build(deps): bump actions/checkout from 4 to 5 (#473)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-02 17:06:00 +02:00
Samuel Berthe
dfac84209d
Update README.md 2025-09-01 15:41:07 +02:00
samber
5fbce5f513 Publish 2025-09-01 13:41:06 +00:00
Sajjad hassanzadeh
a2c31358d1
Add couchdb alerts (#472)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

* add : couchdb roles config in rules.yml

* add : couchdb alerts in rules directory

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-09-01 15:40:42 +02:00
Samuel Berthe
edae18b8df
Remove Screeb tag 2025-08-29 15:20:48 +02:00
Samuel Berthe
0a55137e6a
Remove Screeb 2025-08-29 15:20:21 +02:00
samber
3abc7144aa Publish 2025-08-28 21:07:00 +00:00
Sajjad hassanzadeh
7bced89d2d
add : additional essential clickhouse alerts (#471)
* add : additional essential clickhouse alerts

* Add new ClickHouse alert rules for monitoring

* linting

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-08-28 23:06:31 +02:00
dependabot[bot]
52e4ba143c
build(deps-dev): bump nokogiri from 1.18.8 to 1.18.9 (#469)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.8 to 1.18.9.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.8...v1.18.9)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-version: 1.18.9
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-22 16:59:26 +02:00
samber
b04b11ce1d Publish 2025-06-25 11:32:39 +00:00
Samuel Berthe
554850df41
Update rules.yml 2025-06-25 13:32:16 +02:00
samber
ea63d8001a Publish 2025-06-17 17:16:15 +00:00
Samuel Berthe
748524d580
Update rules.yml 2025-06-17 19:15:52 +02:00
samber
6ebe6d8a8e Publish 2025-06-17 15:07:35 +00:00
Samuel Berthe
a5a3c2cd92
fix: HostHighCpuUsage (#466)
closes #457
2025-06-17 17:07:05 +02:00
samber
a3325114ea Publish 2025-05-21 21:04:42 +00:00
Samuel Berthe
4b1b8242cb
Update rules.yml 2025-05-21 23:04:12 +02:00
samber
67cf6892a4 Publish 2025-05-20 06:21:45 +00:00
jaqxues
98d6e7db05
Alloy: Fix incorrect alert (#464) 2025-05-20 08:21:14 +02:00
samber
becbe1be3b Publish 2025-05-08 17:49:45 +00:00
andrii.k
e0e3cdda1d
update istio 4xx alert description (#463) 2025-05-08 19:49:18 +02:00
Samuel Berthe
4be87d7796
Update README.md 2025-05-03 22:53:51 +02:00
samber
fd9da90c1d Publish 2025-05-03 20:52:49 +00:00
Carsten Thiel
79f45a5146
Adding rules for checking FluxCD (#458) 2025-05-03 22:52:26 +02:00
samber
9f5c641bdd Publish 2025-04-23 08:31:10 +00:00
samber
aca1bdf1fb Publish 2025-04-23 08:28:06 +00:00
Samuel Berthe
4666830538
Update rules.yml 2025-04-23 10:18:08 +02:00
samber
198035eaf4 Publish 2025-04-23 07:58:55 +00:00
Roger
b3d25fafcf
feature/kubestate exporter check if node is scheduling disabeld (#462)
* feature/kubestate-exporter-check-if-node-is-scheduling-disabeld

* commented added

* typo in expr

* move code to right file


---------

Co-authored-by: Roger Sikorski <roger.sikorski@zweiloewen.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-04-23 09:58:29 +02:00
dependabot[bot]
6446bb44be
build(deps-dev): bump nokogiri from 1.18.4 to 1.18.8 (#460)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.4 to 1.18.8.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.4...v1.18.8)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-version: 1.18.8
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-22 11:37:56 +02:00
samber
a75d5124c5 Publish 2025-04-17 15:26:25 +00:00
Samuel Berthe
3b440fec7b
Remove buggy HostRequiresReboot rule
Closing #459
2025-04-17 17:26:00 +02:00
samber
32a4bfb19b Publish 2025-03-27 16:23:49 +00:00
Samuel Berthe
8b730ef059
Update rules.yml 2025-03-27 17:23:19 +01:00
samber
93f9daecee Publish 2025-03-27 13:42:51 +00:00
Motte
69c8208e3c
Added PostgresqlReplicationLagHigh rule (#456)
* Added PostgresqlReplicationLagHigh rule

* Update PostgreSQL replication lag alert settings

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-03-27 14:42:22 +01:00
Pigueiras
97a31f34e5
Fix queries in elasticsearch latency alerts (#455)
The `elasticsearch_indices_search_fetch_total`,
`elasticsearch_indices_search_fetch_time_seconds`,
`elasticsearch_indices_indexing_index_time_seconds_total`
and `elasticsearch_indices_indexing_index_total` metrics
are counters.

Dividing these metrics doesn't make sense because a spike in
numerator would cause the alert to persist, even if subsequent
fetch/index operations are normal. Adding `increase` changes the query
to check if operations took, on average, more than X over
a 1-minute interval, which was likely the original intent of
this alert.
2025-03-26 22:15:24 +01:00
dependabot[bot]
242054f7dc
build(deps-dev): bump uri from 0.13.1 to 0.13.2 (#454)
Bumps [uri](https://github.com/ruby/uri) from 0.13.1 to 0.13.2.
- [Release notes](https://github.com/ruby/uri/releases)
- [Commits](https://github.com/ruby/uri/compare/v0.13.1...v0.13.2)

---
updated-dependencies:
- dependency-name: uri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-23 16:30:56 +01:00
dependabot[bot]
4335f85830
build(deps-dev): bump nokogiri from 1.18.3 to 1.18.4 (#453)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.3 to 1.18.4.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.3...v1.18.4)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-23 16:26:08 +01:00
samber
7bcae33011 Publish 2025-02-20 15:18:08 +00:00
Samuel Berthe
2127c4ce90
Update rules.yml 2025-02-20 16:17:39 +01:00
samber
9963b750ac Publish 2025-02-20 14:06:17 +00:00
Roman
c189984d0f
fix node-exporter.yaml missing parentheses (#452) 2025-02-20 15:05:48 +01:00
samber
807db03d0d Publish 2025-02-19 14:25:58 +00:00
Samuel Berthe
6838196343
fix: remove duplicated rule 2025-02-19 15:25:29 +01:00
dependabot[bot]
0f4b45d127
build(deps-dev): bump nokogiri from 1.16.7 to 1.18.3 (#451)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.16.7 to 1.18.3.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/v1.18.3/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.16.7...v1.18.3)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-19 14:33:37 +01:00
samber
4e49e77d29 Publish 2025-02-16 22:47:17 +00:00
dzaczek
11a78f0f06
Update google-cadvisor.yml (#382)
* Update google-cadvisor.yml

    Expression Explanation:
    The expression calculates the absolute change in CPU usage for containers by comparing the current rate of CPU usage (within the last 1 minute) with the rate of CPU usage from the previous minute. If this change exceeds 25%, the alert is triggered. Additionally, it compares the current rate of CPU usage with the rate from the previous 5 minutes to capture larger trends. If any of these conditions are met, the alert fires.
    
    Alert Details:
    - Alert Name: ContainerHighLowChangeCpuUsage
    - Trigger Condition: Absolute change in CPU usage exceeding 25%
    - Alert Severity: Informational (info)

* Add alert rule for high CPU usage change

* Change alert severity from warning to info

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:46:53 +01:00
samber
7889a9a29b Publish 2025-02-16 22:37:09 +00:00
Samuel Berthe
add097c489
data: revert 5f57f09 (see #398) 2025-02-16 23:36:44 +01:00
samber
12b8acb1b8 Publish 2025-02-16 22:29:24 +00:00
asdf1234
4a7b9b5c72
Update mysqld-exporter.yml (#442)
* Update mysqld-exporter.yml

add some rules

* Add new MySQL monitoring rules

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-02-16 23:29:00 +01:00
samber
20f9a36615 Publish 2025-02-16 22:17:02 +00:00
Samuel Berthe
fb857e8b39
data: fix rules 2025-02-16 23:16:36 +01:00
Samuel Berthe
2f9c0c0483
upgrade ruby version 2025-02-16 23:15:43 +01:00
Samuel Berthe
eb92a79898
upgrade github action ruby version 2025-02-04 16:44:40 +01:00
Samuel Berthe
ae12871fa9
Update rules.yml 2025-02-04 16:40:21 +01:00
Felix Bühler
10d00c66da
Add caddy.yml (#450) 2025-02-04 14:23:14 +01:00
guruevi
70ac7d9cae
Various updates and quality of life changes (#405)
* smartctl_exporter publishes both drive_trip and current drive temperatures. Since most of the alerts are going to be permanent, it does not make sense to wait for the alert to be on for a certain time. Temperature sensors likewise vary, using the last sample is not sufficient to alert on potential issues.

* Add an option to run GitHub Action manually

* Add an option to force running the action for testing purposes

* Set variables correctly

* Set variables correctly

* Publish

* Clean up some more metrics

* Publish

* Minor bug fixes

* Publish

* Removed queries that throw errors when systems are upgraded. Also fixed and simplified a few Postgres queries.

* Publish

* Refined some more queries

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* PostgreSQL now has optimized autovacuum behavior

* Publish

* Publish

* Query fails if instance names are not unique across jobs. This fixes it.

* Publish

* Ruby is out of date

---------

Co-authored-by: samber <samber@users.noreply.github.com>
2025-01-28 06:06:47 +01:00
Samuel Berthe
fc6b3faadc
Fix from #405 2025-01-28 06:04:10 +01:00
Samuel Berthe
d916b7c6ab
Fix from #405 2025-01-28 05:58:49 +01:00
sunlei
cbb2337438
fix: formatting errors (#448)
* fix: formatting errors

* Update query format in rules.yml

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2025-01-12 22:01:21 +01:00
samber
53a369769d Publish 2024-12-16 11:19:08 +00:00
Samuel Berthe
bdcc67c04e
Update rules.yml 2024-12-16 12:17:59 +01:00
Samuel Berthe
84a3b517a8
Update rules.yml 2024-12-16 12:17:26 +01:00
samber
4533f23b79 Publish 2024-12-16 11:17:17 +00:00
dxrayz
52d4a8c744
Update postgres-exporter.yml (#444)
Modify PostgresqlConfigurationChanged for prevent error: "many-to-many matching not allowed: matching labels must be unique on one side" in cases when you have multiple instances of postgres
2024-12-16 12:16:05 +01:00
samber
c5203e94d0 Publish 2024-12-08 20:29:15 +00:00
Samuel Berthe
a8d7c43b30
Update rules.yml 2024-12-08 21:28:07 +01:00
Samuel Berthe
fff8a80ae5
Update README.md 2024-12-08 21:24:45 +01:00
samber
4e38ae2087 Publish 2024-12-05 22:38:38 +00:00
Samuel Berthe
8c3d06502f
Update rules.yml 2024-12-05 23:37:28 +01:00
samber
8a220b1b8a Publish 2024-11-30 09:31:05 +00:00
Martin Anderson
353ef1ed95
RabbitMQ: add too many ready messages alert (#441)
* RabbitMQ: add too many ready messages alert

* Add RabbitMQ ready messages alert rule

---------

Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2024-11-30 10:29:57 +01:00
samber
14949721ba Publish 2024-10-28 21:25:18 +00:00
sipr-invivo
bb75cb2c68
feat: Add rule to Kubernetes Job not starting (#436) 2024-10-28 22:24:10 +01:00
dependabot[bot]
f9e683896f
build(deps-dev): bump rexml from 3.3.7 to 3.3.9 (#438)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.7 to 3.3.9.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.7...v3.3.9)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 20:17:58 +01:00
Samuel Berthe
c41fda1d92
Update alertmanager.md 2024-10-06 17:31:23 +02:00
Samuel Berthe
7313acce36
Create FUNDING.json 2024-10-05 18:57:43 +02:00
Samuel Berthe
640f06588d
Delete FUNDING.json 2024-10-05 18:21:35 +02:00
Samuel Berthe
cd5b39a1f0
Create FUNDING.json 2024-10-05 18:06:22 +02:00
dependabot[bot]
35596c866f
build(deps): bump webrick from 1.7.0 to 1.8.2 (#435)
Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2.
- [Release notes](https://github.com/ruby/webrick/releases)
- [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2)

---
updated-dependencies:
- dependency-name: webrick
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-27 22:24:21 +02:00
Samuel Berthe
d6d6ae4ef8
fix: Gemfile to reduce vulnerabilities (#434)
The following vulnerabilities are fixed with an upgrade:
- https://snyk.io/vuln/SNYK-RUBY-WEBRICK-8068535

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
2024-09-26 11:31:21 +02:00
dependabot[bot]
65a5f586cb
build(deps-dev): bump rexml from 3.3.3 to 3.3.6 (#431)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.3 to 3.3.6.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.3...v3.3.6)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-09 20:09:20 +02:00
samber
4aa45dee05 Publish 2024-08-28 06:49:52 +00:00
Samuel Berthe
f08e8df514
oops 2024-08-28 08:48:42 +02:00
Samuel Berthe
995ab4d27a
Update rules.yml 2024-08-28 08:46:41 +02:00
Samuel Berthe
3bf8d6d824
fix: Gemfile to reduce vulnerabilities (#432)
The following vulnerabilities are fixed with an upgrade:
- https://snyk.io/vuln/SNYK-RUBY-REXML-7814166

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
2024-08-24 10:42:21 +02:00
Somrat Dutta
8c0bdc2b24
feat: Add NATS and JetStream Prometheus alert rules (#430)
* feat: Add comprehensive NATS and JetStream Prometheus alert rules

- Added multiple Prometheus alert rules for monitoring NATS server and JetStream metrics.
- Included alerts for:
  - High connection count
  - High pending bytes
  - High subscriptions count
  - High routes count
  - High memory usage
  - Slow consumers
  - NATS server downtime
  - High CPU usage
  - High number of active connections
  - High JetStream store and memory usage
  - Subscription limits exceeded
  - High pending messages
  - Authentication timeouts
  - Errors in NATS (JetStream API errors)
  - JetStream consumers limit exceeded
  - Exceeding max payload size
  - Leaf node connection issues
  - Ping operations limit exceeded
  - Write deadline exceeded
- Ensured consistency between `exporter.yml` and `rules.yml` files.
- Improved overall NATS and JetStream monitoring to prevent performance degradation and ensure system reliability.

This commit enhances the visibility of NATS and JetStream operations by providing key metrics to alert on potential issues and optimize system performance.

* Update rules.yml

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* - minor changes, rollback rules.yml
- address comment changes
- revert to old rules.yml as they are generated

* fix indentation

---------

Co-authored-by: somratdutta <duttasomratand.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
Co-authored-by: somrat.dutta <somrat.dutta@nutanix.com>
2024-08-20 20:37:03 +02:00
samber
02687db33d Publish 2024-08-20 16:32:36 +00:00
Samuel Berthe
d1715de751
fix PostgresqlInvalidIndex rule 2024-08-20 18:31:18 +02:00
dependabot[bot]
61da73d517
build(deps-dev): bump rexml from 3.3.2 to 3.3.3 (#428)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.2 to 3.3.3.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.2...v3.3.3)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-02 14:14:26 +02:00
dependabot[bot]
225607cf7f
build(deps-dev): bump nokogiri from 1.15.6 to 1.16.5 (#427)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.15.6 to 1.16.5.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.15.6...v1.16.5)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-30 17:25:23 +02:00
Samuel Berthe
2c764df932
fix: Gemfile & Gemfile.lock to reduce vulnerabilities (#426)
The following vulnerabilities are fixed with an upgrade:
- https://snyk.io/vuln/SNYK-RUBY-REXML-7462086

Co-authored-by: snyk-bot <snyk-bot@snyk.io>
2024-07-18 10:14:45 +02:00
samber
58ade95b8b Publish 2024-07-02 07:34:59 +00:00
Samuel Berthe
47e74f65e0
Update rules.yml 2024-07-02 09:33:51 +02:00
Greg
9557d4b50e
feat(meilisearch): add basic set of rules (#425)
* feat(meilisearch): add basic meilisearch rules

* fix(query): use == instead of =

* fix(data): set correct name and use ==

* chore(meilisearch): remove index filter
2024-07-02 09:33:08 +02:00
Samuel Berthe
b6a6c2e313
Update README.md 2024-07-02 09:33:01 +02:00
samber
60c235975c Publish 2024-06-14 18:16:53 +00:00
Samuel Berthe
ca4fb01c6d
Update rules.yml 2024-06-14 20:15:44 +02:00
samber
1ee046b739 Publish 2024-06-06 20:54:49 +00:00
Samuel Berthe
1e4ea0b3e7
Update rules.yml 2024-06-06 22:53:29 +02:00
samber
8759c50440 Publish 2024-05-23 12:45:56 +00:00
Samuel Berthe
9b0ac7d230
Update rules.yml 2024-05-23 14:44:45 +02:00
dependabot[bot]
61a40270d9
build(deps-dev): bump rexml from 3.2.5 to 3.2.8 (#420)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.5 to 3.2.8.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.2.5...v3.2.8)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-16 23:28:17 +02:00
samber
7dd767c4b4 Publish 2024-05-15 06:10:06 +00:00
Samuel Berthe
1adecd9ee7
Update rules.yml 2024-05-15 08:08:58 +02:00
Enes Yalınkaya
9877561b6c
fix elasticsearch rate rules (#418)
* fix elasticsearch rate rules

* fix

* fix

* fix
2024-05-15 08:07:55 +02:00
samber
826be5877f Publish 2024-05-14 18:44:11 +00:00
R.Sicart
262e451625
kube hpa lint and improvement (#417)
* fix: hpa alerts are using  label but the queries remove it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: hpa alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* feat: hpa scale max should not alert when min and max are the same

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

---------

Signed-off-by: R.Sicart <roger.sicart@gmail.com>
2024-05-14 20:43:00 +02:00
samber
81079a2a7e Publish 2024-05-14 18:35:54 +00:00
R.Sicart
8460f9008e
fix: some kube api alert lint (#416)
* fix: apiserver regexp matchers are automatically fully anchored

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: apiserver errors alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

* fix: apiserver latency alert is using  label but the query removes it

Signed-off-by: R.Sicart <roger.sicart@gmail.com>

---------

Signed-off-by: R.Sicart <roger.sicart@gmail.com>
2024-05-14 20:34:43 +02:00
dependabot[bot]
4963331101
build(deps-dev): bump nokogiri from 1.16.2 to 1.16.5 (#415)
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.16.2 to 1.16.5.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.16.2...v1.16.5)

---
updated-dependencies:
- dependency-name: nokogiri
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-14 01:41:57 +02:00
samber
04886da968 Publish 2024-05-13 10:10:12 +00:00
191 changed files with 20558 additions and 2955 deletions

1
.github/FUNDING.yml vendored

@ -1 +1,2 @@
github: [samber]
ko_fi: samuelberthe

View file

@ -5,3 +5,8 @@ updates:
directory: "/"
schedule:
interval: "monthly"
- package-ecosystem: "npm"
directory: "/site"
schedule:
interval: "monthly"

View file

@ -0,0 +1,25 @@
name: Dependabot automerge
on:
pull_request:
types: [opened, synchronize]
jobs:
automerge:
runs-on: ubuntu-latest
if: github.actor == 'dependabot[bot]'
permissions:
contents: write
pull-requests: write
steps:
- name: Fetch Dependabot metadata
id: metadata
uses: dependabot/fetch-metadata@v3
- name: Enable auto-merge for github-actions updates
if: steps.metadata.outputs.package-ecosystem == 'github_actions'
run: gh pr merge --auto --squash "$PR_URL"
env:
PR_URL: ${{ github.event.pull_request.html_url }}
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

62
.github/workflows/deploy.yml vendored Normal file

@ -0,0 +1,62 @@
name: Deploy Astro site to GitHub Pages
on:
push:
branches: [master]
workflow_dispatch:
# Only allow one concurrent deployment
concurrency:
group: pages
cancel-in-progress: false
permissions:
contents: read
pages: write
id-token: write
jobs:
build:
name: Build
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: 'latest'
cache: npm
cache-dependency-path: site/package-lock.json
- name: Install dependencies
working-directory: site
run: npm ci
- name: Build Astro site
working-directory: site
env:
ASTRO_TELEMETRY_DISABLED: "1"
run: npm run build
- name: Build Pagefind search index
working-directory: site
run: npx pagefind --site dist
- name: Upload Pages artifact
uses: actions/upload-pages-artifact@v5
with:
path: site/dist
deploy:
name: Deploy
needs: build
runs-on: ubuntu-latest
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
steps:
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v5

View file

@ -1,34 +1,38 @@
name: Publish
on:
workflow_dispatch:
push:
branches:
- master
permissions:
contents: write
jobs:
publish:
name: Publish
# Check if the PR is not from a fork
if: github.repository_owner == 'samber'
runs-on: ubuntu-latest
steps:
- name: Checkout Repo
uses: actions/checkout@v4
uses: actions/checkout@v6
- name: Set up Ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: 2.7
ruby-version: '3.4'
- name: Set up yq
uses: mikefarah/yq@master
uses: mikefarah/yq@v4
- name: Install liquid
run: gem install liquid-cli
run: |
gem install liquid -v 5.5.1
gem install liquid-cli
- name: Build rule configuration
run: |
gem install liquid-cli
cat _data/rules.yml | yq -I 0 -o json > _data/rules.json
rm -rf dist/rules
@ -38,7 +42,7 @@ jobs:
mkdir -p "${subdir}"
# groupName=$(echo "{% assign groupName = name | split: ' ' %}{% capture groupNameCamelcase %}{% for word in groupName %}{{ word | capitalize }} {% endfor %}{% endcapture %} {{ groupNameCamelcase | remove: ' ' | remove: '-' }}" | liquid $(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(" ") | join("-")'))
for exporter in $(echo ${service} | base64 --decode | jq -r '.exporters[] | @base64'); do
exporterName=$(echo ${exporter} | base64 --decode | jq -r '.slug')
cat dist/template.yml | liquid "$(echo ${exporter} | base64 --decode)" > ${subdir}/${exporterName}.yml
@ -51,7 +55,7 @@ jobs:
# https://peterevans.dev/posts/github-actions-how-to-automate-code-formatting-in-pull-requests/
- name: Check for modified files
id: git-check
run: echo ::set-output name=modified::$(git status -s --porcelain | wc -l | awk '{$1=$1};1')
run: echo "modified=$(git status -s --porcelain | wc -l | awk '{$1=$1};1')" >> $GITHUB_OUTPUT
- name: Push changes
if: steps.git-check.outputs.modified != '0'
run: |

38
.github/workflows/site.yml vendored Normal file

@ -0,0 +1,38 @@
name: Site build
on:
pull_request:
paths:
- site/**
- _data/**
push:
branches:
- master
paths:
- site/**
- _data/**
jobs:
site-build:
name: Build Astro site
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: 'latest'
cache: npm
cache-dependency-path: site/package-lock.json
- name: Install dependencies
working-directory: site
run: npm ci
- name: Build Astro site
working-directory: site
env:
ASTRO_TELEMETRY_DISABLED: "1"
run: npm run build

View file

@ -1,6 +1,14 @@
name: Promtool check
on: [pull_request, push]
on:
pull_request:
paths:
- _data/**
push:
branches:
- master
paths:
- _data/**
jobs:
promtool-check:
@ -8,22 +16,21 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout Repo
uses: actions/checkout@v4
uses: actions/checkout@v6
- name: Set up Ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: 2.7
ruby-version: 3.4
- name: Set up yq
uses: mikefarah/yq@master
uses: mikefarah/yq@v4
- name: Install liquid
run: gem install liquid-cli
- name: Build rule configuration
run: |
gem install liquid-cli
cat _data/rules.yml | yq -I 0 -o json > _data/rules.json
for service in $(cat _data/rules.json | jq -r '.groups[].services[] | @base64'); do
@ -31,7 +38,7 @@ jobs:
mkdir -p "${subdir}"
# groupName=$(echo "{% assign groupName = name | split: ' ' %}{% capture groupNameCamelcase %}{% for word in groupName %}{{ word | capitalize }} {% endfor %}{% endcapture %} {{ groupNameCamelcase | remove: ' ' | remove: '-' }}" | liquid $(echo ${service} | base64 --decode | jq -r '.name | ascii_downcase | split(" ") | join("-")'))
for exporter in $(echo ${service} | base64 --decode | jq -r '.exporters[] | @base64'); do
exporterName=$(echo ${exporter} | base64 --decode | jq -r '.slug')
cat dist/template.yml | liquid "$(echo ${exporter} | base64 --decode)" > ${subdir}/${exporterName}.yml

15
.gitignore vendored

@ -1,6 +1,13 @@
_site/
.sass-cache/
.jekyll-cache/
.jekyll-metadata
# Generated data
_data/rules.json
test/rules/
# Node / Astro
/node_modules
site/node_modules/
site/dist/
site/.astro/
site/public/pagefind/
# Misc
.worktrees/

216
CLAUDE.md Normal file

@ -0,0 +1,216 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A curated collection of ~940 Prometheus alerting rules covering 90+ services across 100+ exporters, organized in categories: basic resource monitoring (Prometheus, host/hardware, SMART, Docker, Blackbox, Windows, VMware, Netdata), databases (MySQL, PostgreSQL, Redis, MongoDB, Elasticsearch, Cassandra, ClickHouse, CouchDB, etc.), message brokers (RabbitMQ, Kafka, Pulsar, NATS, Zookeeper), proxies/load balancers/service meshes (Nginx, Apache, HAProxy, Traefik, Caddy, Linkerd, Istio), runtimes (PHP-FPM, JVM, Sidekiq), data engineering (Apache Flink, Apache Spark, Hadoop), orchestrators (Kubernetes, Nomad, Consul, Etcd, OpenStack), CI/CD (Jenkins, ArgoCD, FluxCD, GitLab CI, Spinnaker), network and security (SSL/TLS, CoreDNS, Vault, Cloudflare, Cilium, eBPF), storage (Ceph, ZFS, OpenEBS, MinIO), cloud providers (AWS, Azure, DigitalOcean), observability (Thanos, Loki, Cortex, OpenTelemetry Collector, Grafana Tempo/Mimir/Alloy, Jaeger), and other (APC UPS, Graph Node).
All rules are stored in a single YAML data file (`_data/rules.yml`) and rendered as a static site built with Astro + TypeScript (located in `site/`). The site provides copy-pasteable Prometheus alert snippets and downloadable rule files per exporter.
The project is community-driven. Most contributions are PRs adding or updating rules in `_data/rules.yml`. Files in `dist/rules/` are auto-generated on merge — never edit them manually.
## Architecture
- **`_data/rules.yml`** — The single source of truth for all alerting rules. This is the main file contributors edit. It is NOT a valid Prometheus config; the site renders each rule into copy-pasteable Prometheus alert format.
- **`site/`** — Astro + TypeScript static site. Run `npm run dev` inside this directory to develop locally.
- **`site/src/data/rules.ts`** — Typed wrappers and helper functions over `_data/rules.yml`.
- **`site/src/data/site.ts`** — Shared site metadata constants (URLs, author, schema objects).
- **`site/src/pages/`** — Astro page routes: `index.astro` (homepage), `rules/[group]/[service].astro` (per-service rule pages), `alertmanager.astro`, `blackbox-exporter.astro`, `sleep-peacefully.astro` (guides).
- **`site/src/layouts/BaseLayout.astro`** — Root HTML layout (SEO, GA, dark mode).
- **`site/src/layouts/GuideLayout.astro`** — Layout for guide pages (TOC, hero, related guides).
- **`site/src/components/`** — Shared Astro components (Header, Footer, Sidebar, RuleCard, ExporterSection, etc.).
- **`site/astro.config.mjs`** — Astro configuration (sitemap, Vite YAML plugin, base URL).
- **`dist/rules/`** — Pre-built downloadable rule files organized by service/exporter (referenced in the site for `wget` commands).
## Rules YAML Structure
Services are listed in README.md.
`_data/rules.yml` hierarchy:
```
groups:
  - name: "<category>" # e.g. "Basic resource monitoring"
    services:
      - name: "<service>" # e.g. "Host and hardware"
        exporters:
          - name: "<exporter>"
            slug: "<slug>" # used for download URLs
            doc_url: "<url>" # optional link to exporter docs
            comments: # optional, exporter-level multiline notes rendered before rules
              "<comment>"
            rules:
              - name: "<alert name>"
                description: "<text>"
                query: "<PromQL>"
                severity: warning|critical|info
                for: "<duration>" # optional, defaults to 0m
                comments: # optional, rendered as multiline YAML comments
                  "<comment>"
```
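As an illustration of the hierarchy above, a single filled-in entry might look like the following. This is a sketch only: the exporter name, slug, URL, alert name, description, and `for` duration are placeholders invented for the example, not entries copied from `_data/rules.yml`. The query `up == 0` is used simply because it is valid PromQL for any scraped target.

```yaml
groups:
  - name: "Basic resource monitoring"
    services:
      - name: "Host and hardware"
        exporters:
          - name: "example-exporter"                      # hypothetical exporter name
            slug: "example-exporter"                      # hypothetical slug
            doc_url: "https://example.com/exporter-docs"  # placeholder URL
            rules:
              - name: "Example target down"               # hypothetical alert name
                # Description states the "what" first, then a root cause hint.
                description: "The target has not been scraped successfully for 5 minutes. The exporter may have crashed or be unreachable."
                query: "up == 0"
                severity: critical
                for: 5m                                   # optional; omitted means 0m
```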
Services are grouped by category. If you are not sure how to classify a service, ask the developer.
## Running Locally
```bash
cd site
npm install
npm run dev
```
The site is served at http://localhost:4321/awesome-prometheus-alerts.
To build for production:
```bash
cd site
npm run build
npm run preview
```
## Contributing Rules
All rule changes go in `_data/rules.yml`. Each rule needs: `name`, `description`, `query` (valid PromQL), and `severity`. The `for` field is optional. Descriptions should be factual ("what") and include root cause hints ("why"). Queries must be tested against the latest exporter version. Never modify files in `dist/` — they are auto-generated on merge.
## Query Validation
- When adding or updating an alert, verify that the PromQL query references metric series that actually exist in the related exporter. Check the exporter's documentation or source code to confirm series names.
- If a metric series has been deprecated or removed in a newer version of the exporter, update the query to use the replacement series, or remove the rule if no replacement exists. Known examples: `kube_hpa_*` renamed to `kube_horizontalpodautoscaler_*` in kube-state-metrics 2.x; `node_hwmon_temp_alarm` does not exist (correct: `node_hwmon_temp_crit_alarm_celsius`); node-exporter CLI flags get renamed across versions.
- When writing or reviewing a query, search the internet (exporter docs, GitHub issues, changelogs) to validate correctness and catch outdated series names. When you are not sure about a metric name, always search the internet to confirm it exists and is spelled correctly before using it.
- Pay special attention to metric naming conventions: many exporters add `_total` suffixes for counters and `_seconds_total` for time-based counters. Verify the exact name from source code, not just docs. Known examples: Spark's PrometheusResource adds `_total` and `_seconds_total` suffixes (e.g., `metrics_executor_failedTasks_total`, not `metrics_executor_failedTasks`); Oracle's `oracledb_sessions_value` not `oracledb_sessions_activity`.
- Verify that label names used in `{{ $labels.xxx }}` template variables actually exist on the metric. Check the exporter source code for the exact label names. Known examples: cloudflare/ebpf_exporter uses `id` not `name` for programs, and `config` not `name` for decoder errors.
- When a metric uses info-style patterns (value always 1, information carried in labels), `== 0` will never be true — the metric simply won't exist. Use `absent()` instead. Known example: `ebpf_exporter_enabled_configs`.
- Some metrics are version-dependent. When a metric was renamed or removed in a newer version, add a comment noting the version requirement. Known examples: `go_memstats_gc_cpu_fraction` removed in client_golang v1.12+; cert-manager renamed `certmanager_http_acme_client_request_count` to `certmanager_acme_client_request_count` in v1.19+.
- Verify the unit of a metric before setting thresholds. Some metrics use milliseconds while descriptions assume seconds. Known example: Keycloak's `keycloak_request_duration` is in milliseconds, so `> 2` means 2ms not 2s.
- Some exporters expose labels that differ between services even within the same ecosystem. Known example: OpenStack Neutron uses `adminState="up"` while Nova and Cinder use `adminState="enabled"`.
- When an official mixin exists for a service, compare thresholds and time windows against it. Known deviations to watch for: Mimir store-gateway sync uses 1800s (not 600s), Mimir compactor skipped blocks uses `[24h]` (not `[5m]`), Tempo normalizes outstanding blocks per worker.
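The info-style pitfall above can be sketched as follows. Only the metric name comes from the list; the rule name, severity, and `for` value are illustrative assumptions, not the rule as it exists in the repository.

```yaml
# Sketch only: rule name, severity and `for` are illustrative assumptions.
- name: eBPF exporter has no enabled config
  description: "No ebpf_exporter_enabled_configs series is present, so no eBPF config appears to be loaded."
  # An info-style series is either present with value 1 or missing entirely,
  # so `ebpf_exporter_enabled_configs == 0` would never match.
  query: "absent(ebpf_exporter_enabled_configs)"
  severity: warning
  for: 5m
```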
## Common Review Pitfalls (learned from PR history)
These are the most frequent issues raised during code review on this repo:
### Severity levels
- `critical` = requires immediate human attention. Do not use for informational/security notifications.
- `warning` = needs attention soon but not urgent.
- `info` = awareness only (e.g., config changes, underutilized resources).
- Authentication failures, security notifications, and config-change detections are typically `info`, not `critical`.
### `for` duration
- Omit `for` when the default (0m) is intentional and appropriate — do not add `for: 0m` explicitly.
- Add a `for` duration (e.g., `for: 2m` or `for: 5m`) to tolerate brief unavailability from restarts or transient spikes. Most "service down" rules should have at least `for: 1m` to `for: 2m`.
- Do not blanket-change all `for: 0m` to `for: 1m` — it depends on the alert's semantics and the range window used in `increase()`/`rate()`.
### Query design
- Prefer symptom-based alerts over cause-based alerts to reduce alert fatigue. Example: "service is unreachable" is better than "specific internal counter changed". Metrics like heap object count, allocation rate, or free heap slots are causes, not symptoms — prefer GC duration, latency, or error rate alerts instead.
- Don't add unnecessary aggregation (`avg()`, `avg_over_time()`) on metrics that are local to a single node/instance. Only aggregate when the alert is cluster-wide.
- Don't combine `min_over_time()[1m]` with `for: 2m` redundantly — pick one mechanism for smoothing. Same applies to `avg_over_time()[5m]` with `for: 5m`.
- Remove unnecessary label filters (e.g., `job="cassandra"` or `cluster=~".*"`) that add noise without value.
- Verify comparison operators match the intent — e.g., "high snapshot count" must use `> N`, not `< N`.
- When dividing counters (e.g., error rate = errors / total), guard against division by zero with `and total > 0` or filter appropriately. This is the most common issue in new PRs — check every ratio query.
- Filter out system/template databases explicitly in DB queries (e.g., PostgreSQL: add `datid!="0"` alongside `datname!~"template.*|postgres"`).
- Never use `rate()` on a gauge metric — use `deriv()` instead. `rate()` is for monotonically increasing counters only.
- Conversely, never use `deriv()` or `delta()` on a metric that is a cumulative counter, even if the exporter declares it as `untyped`. The only reliable way to determine whether a metric is a counter or a gauge is to check whether it monotonically increases and resets on restart — not just the declared type. Known examples of untyped metrics with counter semantics: `node_vmstat_*` (e.g., `node_vmstat_pgmajfault`, `node_vmstat_oom_kill`) from node_exporter (cumulative values from /proc/vmstat — the official node_exporter mixin uses `rate()`); MySQL `SHOW GLOBAL STATUS` variables via mysqld_exporter (e.g., `mysql_global_status_slow_queries`, `mysql_global_status_innodb_log_waits`, `mysql_global_status_questions` — all monotonically increasing, use `rate()`/`increase()`).
- When using `increase()` for ratio calculations, prefer `rate()` instead — `increase()` can produce incorrect results when counters reset mid-window.
- When filtering gRPC error codes, don't use `grpc_code!="OK"` — this includes normal application responses like `NotFound`, `AlreadyExists`, and `Cancelled`. Filter to actual errors: `grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"`.
- When computing ratios with `rate()` on a metric that is itself already a normalized rate (e.g., Oracle's `v$waitclassmetric`), applying `rate()` computes the rate-of-change of a rate, which is not meaningful.
- When a multi-label metric is used in a binary operation with a metric that has fewer labels, use `ignoring(extra_label)` to avoid join failures. Known example: `systemd_unit_tasks_current / ignoring(type) systemd_unit_tasks_max`.
- When a query groups by labels (e.g., `by (le, worker)`), consider the cardinality impact — hundreds of label values means hundreds of independent alerts.
- Ensure `{{ $value | humanizeDuration }}` is only used on values in seconds. If the metric is in milliseconds, divide by 1000 first or use `{{ $value | humanize }}ms`.
- Avoid using `up{job=~"exporter-name"} == 0` or `absent(up{job=~"exporter-name"})` to detect whether a service is down. When targets are managed via service discovery or a job reaches multiple targets, a disappeared target causes the `up` series to become stale and vanish rather than drop to 0, so the alert never fires. Prefer application-level or cluster-level metrics instead (e.g., "number of consul cluster members < 3", "PostgreSQL primary node absent").
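Several of the points above (use `rate()` for ratios, guard the denominator, filter gRPC codes to real errors) can be combined in one query. The sketch below assumes a `grpc_server_handled_total` counter with `grpc_code` and `grpc_service` labels, as exposed by common gRPC Prometheus interceptors; the rule name, 5% threshold, and 5m window are arbitrary placeholders.

```yaml
# Sketch under stated assumptions: metric and label names depend on the gRPC instrumentation in use.
- name: gRPC high error ratio
  description: "gRPC error ratio is {{ $value }} for {{ $labels.grpc_service }}."
  # The trailing `and ... > 0` keeps only services that actually received traffic,
  # which guards the division against a zero denominator.
  query: |
    sum by (grpc_service) (rate(grpc_server_handled_total{grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[5m]))
    /
    sum by (grpc_service) (rate(grpc_server_handled_total[5m]))
    > 0.05
    and
    sum by (grpc_service) (rate(grpc_server_handled_total[5m])) > 0
  severity: warning
  for: 5m
```

Grouping by `grpc_service` only keeps cardinality low; adding `grpc_method` would produce one alert per method.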
### Thresholds
- Alert thresholds are inherently arbitrary and depend on workload. Use `comments:` to note this when a threshold is a rough default.
- When threshold values in a PR seem unreasonable (too high or too low), challenge them with real-world reasoning or exporter docs.
- Watch for thresholds that are so high they only catch catastrophic scenarios and miss real problems. Examples: Go goroutine spike at 100/s (misses gradual leaks), Ruby major GC at 5/s (only fires if app is non-functional), Python gen2 GC at >1/s (extremely rare).
- Watch for thresholds that will fire on normal healthy operation. Examples: Memcached at 90% memory is desired (it's a cache), Flink TaskManager at 90% JVM heap is normal, cache hit rate < 80% is common for cold caches.
- For SNMP bandwidth utilization, `ifSpeed` (Gauge32) maxes at ~4.29 Gbps. For 10G+ interfaces, use `ifHighSpeed * 1000000` instead.
- For alerts using `> 0` on counters with `rate()` or `increase()`, consider whether a single event truly warrants alerting. In most cases, a small threshold (e.g., `> 0.05` for rate, `> 3` for increase) better distinguishes real problems from transient noise.
- When checking a cumulative total metric (one that only resets on process restart) with `> 0`, the alert will fire permanently after the first occurrence and never resolve. Always wrap such metrics in `increase()` or `rate()` to detect new events. Known example: `opensearch_circuitbreaker_tripped_count > 0` fires forever after the first circuit breaker trip.
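For the cumulative-counter case, a minimal sketch of the `increase()` wrapping might look like this. The metric name is taken from the example above; the rule name, 5m window, and severity are assumptions.

```yaml
# Sketch only: window and severity are placeholders to adapt per workload.
- name: OpenSearch circuit breaker tripped
  description: "A circuit breaker tripped {{ $value }} time(s) in the last 5 minutes on {{ $labels.instance }}."
  # A bare `opensearch_circuitbreaker_tripped_count > 0` would keep firing until the process restarts.
  query: "increase(opensearch_circuitbreaker_tripped_count[5m]) > 0"
  severity: warning
```

Because `increase()` only counts events inside the window, the alert resolves once no new trips occur.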
### Comments
- When an alert or its query needs explanation (e.g., non-obvious PromQL logic, threshold rationale, edge cases), use the rule-level `comments:` field. Use multiline comments when needed.
- Use the exporter-level `comments:` field for notes that apply to all rules under that exporter (e.g., exporter version requirements, known quirks, setup prerequisites).
- Comments are rendered as YAML `#` comments in the output, so they are visible to users who copy-paste the rules.
- Never add two `comments:` keys to the same rule or exporter block. YAML silently discards the first when there are duplicate keys in the same mapping. Always merge multiple comment paragraphs into a single `comments:` field using the multiline `|` block scalar.
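A single merged `comments:` field using a block scalar could look like the sketch below. The rule itself, including the `example_queue_depth` metric, is hypothetical; only the shape of the `comments:` field matters here.

```yaml
# Hypothetical rule; `example_queue_depth` is not a real exporter metric.
- name: Example queue backlog
  description: "Queue depth is {{ $value }} on {{ $labels.instance }}."
  query: "example_queue_depth > 1000"
  severity: warning
  for: 5m
  # One `comments:` key only; a second key in the same mapping would silently replace this one.
  comments: |
    The threshold is a rough default and depends on workload.
    Tune it against your normal queue depth before relying on this alert.
```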
### Descriptions
- Keep descriptions short, factual, and actionable.
- Include what is happening ("Disk is almost full") and why it matters or what to check.
- Use `{{ $labels.instance }}`, `{{ $value }}`, and other template variables in descriptions when useful.
- If the description says "average" but the query uses `histogram_quantile(0.95, ...)`, fix the description to say "p95" (or vice versa).
- When alerting on rates or ratios that may not be intuitive, include `{{ $value }}` in the description so operators can see the actual number.
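To illustrate a description that matches its query, here is a sketch using a hypothetical histogram `example_request_duration_seconds_bucket`. The metric name, threshold, and window are assumptions; the point is that the description says "p95" because the query uses `histogram_quantile(0.95, ...)`, and `humanizeDuration` is applied only because the value is in seconds.

```yaml
# Hypothetical metric name; substitute the histogram your exporter actually exposes.
- name: Example high p95 latency
  description: "p95 request latency is {{ $value | humanizeDuration }} on {{ $labels.instance }}. Check downstream dependencies and recent deploys."
  query: "histogram_quantile(0.95, sum by (le, instance) (rate(example_request_duration_seconds_bucket[5m]))) > 2"
  severity: warning
  for: 5m
```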
### Structure
- Some services have multiple exporters (e.g., MongoDB has `percona/mongodb_exporter` and `dcu/mongodb_exporter`). Place rules under the correct exporter.
- Search for duplicates before adding a new rule — a similar alert may already exist under a different exporter or with different thresholds.
- The `slug` field must be unique per exporter and is used for download URLs.
## Reference Sources for Cross-Checking Alerts
Use these sources to criticize and validate PromQL queries, compare thresholds, and find inspiration for new rules.
Every time you use an external resource to change a PromQL query, compare the query before and after the change, and explain why you think the external source is right.
### Official project mixins (alerts maintained by the project itself)
- https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin/alerts
- https://github.com/prometheus/prometheus/tree/main/documentation/prometheus-mixin
- https://github.com/prometheus/alertmanager/tree/main/doc/alertmanager-mixin
- https://github.com/prometheus/snmp_exporter/tree/main/snmp-mixin
- https://github.com/prometheus/mysqld_exporter/tree/main/mysqld-mixin
- https://github.com/prometheus-community/postgres_exporter/tree/master/postgres_mixin
- https://github.com/prometheus-community/elasticsearch_exporter (mixin via Grafana docs)
- https://github.com/etcd-io/etcd/tree/main/contrib/mixin
- https://github.com/thanos-io/thanos/tree/main/mixin (also: examples/alerts/)
- https://github.com/grafana/loki/tree/main/production/loki-mixin (also: promtail-mixin/)
- https://github.com/grafana/mimir/tree/main/operations/mimir-mixin
- https://github.com/grafana/tempo/tree/main/operations/tempo-mixin
- https://github.com/grafana/grafana/tree/main/grafana-mixin
- https://github.com/ceph/ceph/tree/main/monitoring/ceph-mixin (in-tree; also https://github.com/ceph/ceph-mixins)
- https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin
- https://github.com/kubernetes-monitoring/kubernetes-mixin (includes runbook.md)
- https://github.com/kubernetes/kube-state-metrics/tree/main/jsonnet/kube-state-metrics-mixin
- https://github.com/prometheus-operator/prometheus-operator/tree/main/jsonnet/mixin
- https://github.com/prometheus-operator/kube-prometheus
- https://github.com/cortexproject/cortex-jsonnet
- https://github.com/gluster/gluster-mixins
### Standalone mixin repositories
- https://github.com/povilasv/coredns-mixin
- https://github.com/adinhodovic/rabbitmq-mixin
- https://github.com/adinhodovic/blackbox-exporter-mixin
- https://github.com/adinhodovic/django-mixin
- https://github.com/adinhodovic/argo-cd-mixin
- https://github.com/adinhodovic/ingress-nginx-mixin
- https://github.com/adinhodovic/kubernetes-autoscaling-mixin
- https://github.com/metalmatze/kube-cockroachdb (CockroachDB on Kubernetes)
- https://github.com/bitnami-labs/sealed-secrets (sealed-secrets mixin)
- https://github.com/lukas-vlcek/elasticsearch-mixin (includes runbook.md)
- https://github.com/opensearch-project/opensearch-prometheus-exporter (OpenSearch exporter — check metric names here)
- https://github.com/adinhodovic/postgresql-mixin
- https://github.com/imusmanmalik/cert-manager-mixin
- https://gitlab.com/uneeq-oss/cert-manager-mixin (alternative cert-manager mixin)
- https://github.com/uneeq-oss/spinnaker-mixin
- https://github.com/metalmatze/slo-libsonnet (SLO alerting/recording rules generation library)
### Grafana jsonnet-libs (93 mixins — browse for specific services)
- https://github.com/grafana/jsonnet-libs
- Notable mixins with alerts: consul, memcached, elasticsearch, haproxy, clickhouse, opensearch, redis, mongodb, kafka, nginx, rabbitmq, jvm, vault, envoy, istio, jenkins, caddy, cloudflare, docker, traefik, windows, snmp, argocd, nomad, pgbouncer, minio, ceph, and 60+ more.
### Mixin aggregators
- https://monitoring.mixins.dev/ (central registry of all monitoring mixins)
- https://github.com/monitoring-mixins/website/blob/master/mixins.json (machine-readable list of all mixins with source URLs)
- https://github.com/nlamirault/monitoring-mixins (hub aggregating many mixins)
### GitLab monitoring & infrastructure
- https://gitlab.com/gitlab-com/runbooks (GitLab.com SRE runbooks — production alert rules, runbook docs, alertmanager config)
- https://gitlab.com/gitlab-com/runbooks/-/tree/master/mimir-rules (production Mimir alerting rules organized by tenant/environment)
- https://gitlab.com/gitlab-com/runbooks/-/tree/master/mimir-rules-jsonnet (jsonnet sources for GitLab alerting rules)
- https://gitlab.com/gitlab-org/omnibus-gitlab/-/tree/master/files/gitlab-cookbooks/monitoring/templates/rules (default Prometheus rules shipped with GitLab Omnibus)
### Community alert collections
- https://github.com/jpweber/prometheus-alert-rules
- https://github.com/bdossantos/prometheus-alert-rules
- https://github.com/giantswarm/prometheus-rules
- https://github.com/last9/awesome-prometheus-toolkit
- https://github.com/warpnet/awesome-prometheus (meta-list of Prometheus resources)


@ -16,24 +16,16 @@ Please ensure your pull request adheres to the following guidelines:
- Description must be factual (the "what?") and should provide root cause suggestions (the "why?"), for faster resolution.
- Queries must be tested on latest exporter version.
## Improving Github page
## Improving the website
### Run localy
The site is built with Astro + TypeScript, located in `site/`.
### Run locally
```
gem install bundler
bundle install
jekyll serve
cd site
npm install
npm run dev
```
Or with Docker:
```
docker run --rm -it -p 4000:4000 -v $(pwd):/srv/jekyll jekyll/jekyll jekyll serve
```
Or with Docker-Compose:
```
docker-compose up -d
```
Site serves at http://localhost:4321/awesome-prometheus-alerts.


@ -1,3 +0,0 @@
source 'https://rubygems.org'
gem 'github-pages', group: :jekyll_plugins
gem 'webrick', '~> 1.3', '>= 1.3.1'


@ -1,284 +0,0 @@
GEM
remote: https://rubygems.org/
specs:
activesupport (6.0.6.1)
concurrent-ruby (~> 1.0, >= 1.0.2)
i18n (>= 0.7, < 2)
minitest (~> 5.1)
tzinfo (~> 1.1)
zeitwerk (~> 2.2, >= 2.2.2)
addressable (2.8.0)
public_suffix (>= 2.0.2, < 5.0)
coffee-script (2.4.1)
coffee-script-source
execjs
coffee-script-source (1.11.1)
colorator (1.1.0)
commonmarker (0.23.10)
concurrent-ruby (1.2.0)
dnsruby (1.61.9)
simpleidn (~> 0.1)
em-websocket (0.5.3)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0)
ethon (0.15.0)
ffi (>= 1.15.0)
eventmachine (1.2.7)
execjs (2.8.1)
faraday (1.10.0)
faraday-em_http (~> 1.0)
faraday-em_synchrony (~> 1.0)
faraday-excon (~> 1.1)
faraday-httpclient (~> 1.0)
faraday-multipart (~> 1.0)
faraday-net_http (~> 1.0)
faraday-net_http_persistent (~> 1.0)
faraday-patron (~> 1.0)
faraday-rack (~> 1.0)
faraday-retry (~> 1.0)
ruby2_keywords (>= 0.0.4)
faraday-em_http (1.0.0)
faraday-em_synchrony (1.0.0)
faraday-excon (1.1.0)
faraday-httpclient (1.0.1)
faraday-multipart (1.0.3)
multipart-post (>= 1.2, < 3)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.2.0)
faraday-patron (1.0.0)
faraday-rack (1.0.0)
faraday-retry (1.0.3)
ffi (1.15.5)
forwardable-extended (2.6.0)
gemoji (3.0.1)
github-pages (226)
github-pages-health-check (= 1.17.9)
jekyll (= 3.9.2)
jekyll-avatar (= 0.7.0)
jekyll-coffeescript (= 1.1.1)
jekyll-commonmark-ghpages (= 0.2.0)
jekyll-default-layout (= 0.1.4)
jekyll-feed (= 0.15.1)
jekyll-gist (= 1.5.0)
jekyll-github-metadata (= 2.13.0)
jekyll-include-cache (= 0.2.1)
jekyll-mentions (= 1.6.0)
jekyll-optional-front-matter (= 0.3.2)
jekyll-paginate (= 1.1.0)
jekyll-readme-index (= 0.3.0)
jekyll-redirect-from (= 0.16.0)
jekyll-relative-links (= 0.6.1)
jekyll-remote-theme (= 0.4.3)
jekyll-sass-converter (= 1.5.2)
jekyll-seo-tag (= 2.8.0)
jekyll-sitemap (= 1.4.0)
jekyll-swiss (= 1.0.0)
jekyll-theme-architect (= 0.2.0)
jekyll-theme-cayman (= 0.2.0)
jekyll-theme-dinky (= 0.2.0)
jekyll-theme-hacker (= 0.2.0)
jekyll-theme-leap-day (= 0.2.0)
jekyll-theme-merlot (= 0.2.0)
jekyll-theme-midnight (= 0.2.0)
jekyll-theme-minimal (= 0.2.0)
jekyll-theme-modernist (= 0.2.0)
jekyll-theme-primer (= 0.6.0)
jekyll-theme-slate (= 0.2.0)
jekyll-theme-tactile (= 0.2.0)
jekyll-theme-time-machine (= 0.2.0)
jekyll-titles-from-headings (= 0.5.3)
jemoji (= 0.12.0)
kramdown (= 2.3.2)
kramdown-parser-gfm (= 1.1.0)
liquid (= 4.0.3)
mercenary (~> 0.3)
minima (= 2.5.1)
nokogiri (>= 1.13.4, < 2.0)
rouge (= 3.26.0)
terminal-table (~> 1.4)
github-pages-health-check (1.17.9)
addressable (~> 2.3)
dnsruby (~> 1.60)
octokit (~> 4.0)
public_suffix (>= 3.0, < 5.0)
typhoeus (~> 1.3)
html-pipeline (2.14.1)
activesupport (>= 2)
nokogiri (>= 1.4)
http_parser.rb (0.8.0)
i18n (0.9.5)
concurrent-ruby (~> 1.0)
jekyll (3.9.2)
addressable (~> 2.4)
colorator (~> 1.0)
em-websocket (~> 0.5)
i18n (~> 0.7)
jekyll-sass-converter (~> 1.0)
jekyll-watch (~> 2.0)
kramdown (>= 1.17, < 3)
liquid (~> 4.0)
mercenary (~> 0.3.3)
pathutil (~> 0.9)
rouge (>= 1.7, < 4)
safe_yaml (~> 1.0)
jekyll-avatar (0.7.0)
jekyll (>= 3.0, < 5.0)
jekyll-coffeescript (1.1.1)
coffee-script (~> 2.2)
coffee-script-source (~> 1.11.1)
jekyll-commonmark (1.4.0)
commonmarker (~> 0.22)
jekyll-commonmark-ghpages (0.2.0)
commonmarker (~> 0.23.4)
jekyll (~> 3.9.0)
jekyll-commonmark (~> 1.4.0)
rouge (>= 2.0, < 4.0)
jekyll-default-layout (0.1.4)
jekyll (~> 3.0)
jekyll-feed (0.15.1)
jekyll (>= 3.7, < 5.0)
jekyll-gist (1.5.0)
octokit (~> 4.2)
jekyll-github-metadata (2.13.0)
jekyll (>= 3.4, < 5.0)
octokit (~> 4.0, != 4.4.0)
jekyll-include-cache (0.2.1)
jekyll (>= 3.7, < 5.0)
jekyll-mentions (1.6.0)
html-pipeline (~> 2.3)
jekyll (>= 3.7, < 5.0)
jekyll-optional-front-matter (0.3.2)
jekyll (>= 3.0, < 5.0)
jekyll-paginate (1.1.0)
jekyll-readme-index (0.3.0)
jekyll (>= 3.0, < 5.0)
jekyll-redirect-from (0.16.0)
jekyll (>= 3.3, < 5.0)
jekyll-relative-links (0.6.1)
jekyll (>= 3.3, < 5.0)
jekyll-remote-theme (0.4.3)
addressable (~> 2.0)
jekyll (>= 3.5, < 5.0)
jekyll-sass-converter (>= 1.0, <= 3.0.0, != 2.0.0)
rubyzip (>= 1.3.0, < 3.0)
jekyll-sass-converter (1.5.2)
sass (~> 3.4)
jekyll-seo-tag (2.8.0)
jekyll (>= 3.8, < 5.0)
jekyll-sitemap (1.4.0)
jekyll (>= 3.7, < 5.0)
jekyll-swiss (1.0.0)
jekyll-theme-architect (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-cayman (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-dinky (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-hacker (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-leap-day (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-merlot (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-midnight (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-minimal (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-modernist (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-primer (0.6.0)
jekyll (> 3.5, < 5.0)
jekyll-github-metadata (~> 2.9)
jekyll-seo-tag (~> 2.0)
jekyll-theme-slate (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-tactile (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-theme-time-machine (0.2.0)
jekyll (> 3.5, < 5.0)
jekyll-seo-tag (~> 2.0)
jekyll-titles-from-headings (0.5.3)
jekyll (>= 3.3, < 5.0)
jekyll-watch (2.2.1)
listen (~> 3.0)
jemoji (0.12.0)
gemoji (~> 3.0)
html-pipeline (~> 2.2)
jekyll (>= 3.0, < 5.0)
kramdown (2.3.2)
rexml
kramdown-parser-gfm (1.1.0)
kramdown (~> 2.0)
liquid (4.0.3)
listen (3.7.1)
rb-fsevent (~> 0.10, >= 0.10.3)
rb-inotify (~> 0.9, >= 0.9.10)
mercenary (0.3.6)
minima (2.5.1)
jekyll (>= 3.5, < 5.0)
jekyll-feed (~> 0.9)
jekyll-seo-tag (~> 2.1)
minitest (5.17.0)
multipart-post (2.1.1)
nokogiri (1.16.2-x86_64-linux)
racc (~> 1.4)
octokit (4.22.0)
faraday (>= 0.9)
sawyer (~> 0.8.0, >= 0.5.3)
pathutil (0.16.2)
forwardable-extended (~> 2.6)
public_suffix (4.0.7)
racc (1.7.3)
rb-fsevent (0.11.1)
rb-inotify (0.10.1)
ffi (~> 1.0)
rexml (3.2.5)
rouge (3.26.0)
ruby2_keywords (0.0.5)
rubyzip (2.3.2)
safe_yaml (1.0.5)
sass (3.7.4)
sass-listen (~> 4.0.0)
sass-listen (4.0.0)
rb-fsevent (~> 0.9, >= 0.9.4)
rb-inotify (~> 0.9, >= 0.9.7)
sawyer (0.8.2)
addressable (>= 2.3.5)
faraday (> 0.8, < 2.0)
simpleidn (0.2.1)
unf (~> 0.1.4)
terminal-table (1.8.0)
unicode-display_width (~> 1.1, >= 1.1.1)
thread_safe (0.3.6)
typhoeus (1.4.0)
ethon (>= 0.9.0)
tzinfo (1.2.11)
thread_safe (~> 0.1)
unf (0.1.4)
unf_ext
unf_ext (0.0.8.1)
unicode-display_width (1.8.0)
webrick (1.7.0)
zeitwerk (2.6.6)
PLATFORMS
x86_64-linux
x86_64-linux-musl
DEPENDENCIES
github-pages
webrick (~> 1.3, >= 1.3.1)
BUNDLED WITH
2.3.13

38
LICENSE

@ -1,3 +1,39 @@
This repository uses a dual license:
- Alert rules and content (_data/rules.yml, dist/rules/, README.md):
Creative Commons Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
- Site source code (site/):
MIT License
https://opensource.org/licenses/MIT
---
Creative Commons Attribution 4.0 International License (CC BY 4.0)
http://creativecommons.org/licenses/by/4.0/
https://creativecommons.org/licenses/by/4.0/
---
MIT License (site source code)
Copyright (c) 2018 Samuel Berthe
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

117
README.md

@ -1,6 +1,6 @@
# 👋 Awesome Prometheus Alerts [![Awesome](https://awesome.re/badge-flat.svg)](https://awesome.re)
> Most alerting rules are common to every Prometheus setup. We need a place to find them all. 🤘 🚨 📊
> **940+ production-ready Prometheus alerting rules for 90+ services** — copy-paste YAML for Kubernetes, MySQL, Redis, Kafka, and more.
Collection available here: **[https://samber.github.io/awesome-prometheus-alerts](https://samber.github.io/awesome-prometheus-alerts)**
@ -8,9 +8,18 @@ Collection available here: **[https://samber.github.io/awesome-prometheus-alerts
<hr>
<sup><b>Sponsored by:</b></sup>
<br>
<a href="https://cast.ai/samuel">
<div>
<img src="https://samber.github.io/awesome-prometheus-alerts/images/sponsor-cast-ai.png" width="200" alt="Cast AI">
</div>
<div>
Cut Kubernetes & AI costs, boost application stability.
</div>
</a>
<br>
<a href="https://betterstack.com">
<div>
<img src="https://samber.github.io/awesome-prometheus-alerts/assets/sponsor-betterstack.png" width="200" alt="Better Stack">
<img src="https://samber.github.io/awesome-prometheus-alerts/images/sponsor-betterstack.png" width="200" alt="Better Stack">
</div>
<div>
Better Stack lets you centralize, search, and visualize your logs.
@ -34,74 +43,130 @@ Collection available here: **[https://samber.github.io/awesome-prometheus-alerts
- [Prometheus self-monitoring](https://samber.github.io/awesome-prometheus-alerts/rules#prometheus-internals)
- [Host/Hardware](https://samber.github.io/awesome-prometheus-alerts/rules#host-and-hardware)
- [SMART](https://samber.github.io/awesome-prometheus-alerts/rules#smart)
- [IPMI](https://samber.github.io/awesome-prometheus-alerts/rules#ipmi)
- [Docker Containers](https://samber.github.io/awesome-prometheus-alerts/rules#docker-containers)
- [Blackbox](https://samber.github.io/awesome-prometheus-alerts/rules#blackbox)
- [Windows](https://samber.github.io/awesome-prometheus-alerts/rules#windows-server)
- [VMWare](https://samber.github.io/awesome-prometheus-alerts/rules#vmware)
- [Proxmox VE](https://samber.github.io/awesome-prometheus-alerts/rules#proxmox-ve)
- [Netdata](https://samber.github.io/awesome-prometheus-alerts/rules#netdata)
- [eBPF](https://samber.github.io/awesome-prometheus-alerts/rules#ebpf)
- [Process Exporter](https://samber.github.io/awesome-prometheus-alerts/rules#process-exporter)
- [Systemd](https://samber.github.io/awesome-prometheus-alerts/rules#systemd)
#### Databases and brokers
#### Databases
- [MySQL](https://samber.github.io/awesome-prometheus-alerts/rules#mysql)
- [PostgreSQL](https://samber.github.io/awesome-prometheus-alerts/rules#postgresql)
- [SQL Server](https://samber.github.io/awesome-prometheus-alerts/rules#sql-server)
- [Oracle Database](https://samber.github.io/awesome-prometheus-alerts/rules#oracle-database)
- [Patroni](https://samber.github.io/awesome-prometheus-alerts/rules#patroni)
- [PGBouncer](https://samber.github.io/awesome-prometheus-alerts/rules#pgbouncer)
- [Redis](https://samber.github.io/awesome-prometheus-alerts/rules#redis)
- [Memcached](https://samber.github.io/awesome-prometheus-alerts/rules#memcached)
- [MongoDB](https://samber.github.io/awesome-prometheus-alerts/rules#mongodb)
- [RabbitMQ](https://samber.github.io/awesome-prometheus-alerts/rules#rabbitmq)
- [Elasticsearch](https://samber.github.io/awesome-prometheus-alerts/rules#elasticsearch)
- [OpenSearch](https://samber.github.io/awesome-prometheus-alerts/rules#opensearch)
- [Meilisearch](https://samber.github.io/awesome-prometheus-alerts/rules#meilisearch)
- [Cassandra](https://samber.github.io/awesome-prometheus-alerts/rules#cassandra)
- [Clickhouse](https://samber.github.io/awesome-prometheus-alerts/rules#clickhouse)
- [CouchDB](https://samber.github.io/awesome-prometheus-alerts/rules#couchdb)
- [Solr](https://samber.github.io/awesome-prometheus-alerts/rules#solr)
#### Message brokers
- [RabbitMQ](https://samber.github.io/awesome-prometheus-alerts/rules#rabbitmq)
- [Zookeeper](https://samber.github.io/awesome-prometheus-alerts/rules#zookeeper)
- [Kafka](https://samber.github.io/awesome-prometheus-alerts/rules#kafka)
- [Pulsar](https://samber.github.io/awesome-prometheus-alerts/rules#pulsar)
- [Nats](https://samber.github.io/awesome-prometheus-alerts/rules#nats)
- [Solr](https://samber.github.io/awesome-prometheus-alerts/rules#solr)
- [Hadoop](https://samber.github.io/awesome-prometheus-alerts/rules#hadoop)
#### Reverse proxies and load balancers
#### Proxies, load balancers and service meshes
- [Nginx](https://samber.github.io/awesome-prometheus-alerts/rules#nginx)
- [Apache](https://samber.github.io/awesome-prometheus-alerts/rules#apache)
- [HaProxy](https://samber.github.io/awesome-prometheus-alerts/rules#haproxy)
- [Traefik](https://samber.github.io/awesome-prometheus-alerts/rules#traefik)
- [Caddy](https://samber.github.io/awesome-prometheus-alerts/rules#caddy)
- [Envoy](https://samber.github.io/awesome-prometheus-alerts/rules#envoy)
- [Linkerd](https://samber.github.io/awesome-prometheus-alerts/rules#linkerd)
- [Istio](https://samber.github.io/awesome-prometheus-alerts/rules#istio)
#### Runtimes
- [PHP-FPM](https://samber.github.io/awesome-prometheus-alerts/rules#php-fpm)
- [JVM](https://samber.github.io/awesome-prometheus-alerts/rules#jvm)
- [Golang](https://samber.github.io/awesome-prometheus-alerts/rules#golang)
- [Ruby](https://samber.github.io/awesome-prometheus-alerts/rules#ruby)
- [Python](https://samber.github.io/awesome-prometheus-alerts/rules#python)
- [Sidekiq](https://samber.github.io/awesome-prometheus-alerts/rules#sidekiq)
#### Data engineering
- [Apache Flink](https://samber.github.io/awesome-prometheus-alerts/rules#apache-flink)
- [Apache Spark](https://samber.github.io/awesome-prometheus-alerts/rules#apache-spark)
- [Hadoop](https://samber.github.io/awesome-prometheus-alerts/rules#hadoop)
#### Orchestrators
- [Kubernetes](https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes)
- [Nomad](https://samber.github.io/awesome-prometheus-alerts/rules#nomad)
- [Consul](https://samber.github.io/awesome-prometheus-alerts/rules#consul)
- [Etcd](https://samber.github.io/awesome-prometheus-alerts/rules#etcd)
- [Linkerd](https://samber.github.io/awesome-prometheus-alerts/rules#linkerd)
- [Istio](https://samber.github.io/awesome-prometheus-alerts/rules#istio)
- [ArgoCD](https://samber.github.io/awesome-prometheus-alerts/rules#argocd)
- [OpenStack](https://samber.github.io/awesome-prometheus-alerts/rules#openstack)
#### Network, security and storage
#### CI/CD
- [Jenkins](https://samber.github.io/awesome-prometheus-alerts/rules#jenkins)
- [ArgoCD](https://samber.github.io/awesome-prometheus-alerts/rules#argocd)
- [FluxCD](https://samber.github.io/awesome-prometheus-alerts/rules#fluxcd)
- [GitLab CI](https://samber.github.io/awesome-prometheus-alerts/rules#gitlab-ci)
- [Spinnaker](https://samber.github.io/awesome-prometheus-alerts/rules#spinnaker)
#### Network and security
- [SpeedTest](https://samber.github.io/awesome-prometheus-alerts/rules#speedtest)
- [SSL/TLS](https://samber.github.io/awesome-prometheus-alerts/rules#ssl/tls)
- [cert-manager](https://samber.github.io/awesome-prometheus-alerts/rules#cert-manager)
- [Juniper](https://samber.github.io/awesome-prometheus-alerts/rules#juniper)
- [CoreDNS](https://samber.github.io/awesome-prometheus-alerts/rules#coredns)
- [FreeSwitch](https://samber.github.io/awesome-prometheus-alerts/rules#freeswitch)
- [Hashicorp Vault](https://samber.github.io/awesome-prometheus-alerts/rules#hashicorp-vault)
- [Keycloak](https://samber.github.io/awesome-prometheus-alerts/rules#keycloak)
- [Cloudflare](https://samber.github.io/awesome-prometheus-alerts/rules#cloudflare)
- [SNMP](https://samber.github.io/awesome-prometheus-alerts/rules#snmp)
- [Cilium](https://samber.github.io/awesome-prometheus-alerts/rules#cilium)
- [WireGuard](https://samber.github.io/awesome-prometheus-alerts/rules#wireguard)
#### Storage
- [Ceph](https://samber.github.io/awesome-prometheus-alerts/rules#ceph)
- [ZFS](https://samber.github.io/awesome-prometheus-alerts/rules#zfs)
- [OpenEBS](https://samber.github.io/awesome-prometheus-alerts/rules#openebs)
- [Minio](https://samber.github.io/awesome-prometheus-alerts/rules#minio)
- [SSL/TLS](https://samber.github.io/awesome-prometheus-alerts/rules#ssl/tls)
- [Juniper](https://samber.github.io/awesome-prometheus-alerts/rules#juniper)
- [CoreDNS](https://samber.github.io/awesome-prometheus-alerts/rules#coredns)
- [FreeSwitch](https://samber.github.io/awesome-prometheus-alerts/rules#freeswitch)
- [Hashicorp Vault](https://samber.github.io/awesome-prometheus-alerts/rules#hashicorp-vault)
- [Cloudflare](https://samber.github.io/awesome-prometheus-alerts/rules#cloudflare)
#### Other
#### Cloud providers
- [AWS CloudWatch](https://samber.github.io/awesome-prometheus-alerts/rules#aws-cloudwatch)
- [Google Cloud Stackdriver](https://samber.github.io/awesome-prometheus-alerts/rules#google-cloud-stackdriver)
- [DigitalOcean](https://samber.github.io/awesome-prometheus-alerts/rules#digitalocean)
- [Azure](https://samber.github.io/awesome-prometheus-alerts/rules#azure)
#### Observability
- [Thanos](https://samber.github.io/awesome-prometheus-alerts/rules#thanos)
- [Loki](https://samber.github.io/awesome-prometheus-alerts/rules#loki)
- [Promtail](https://samber.github.io/awesome-prometheus-alerts/rules#promtail)
- [Cortex](https://samber.github.io/awesome-prometheus-alerts/rules#cortex)
- [Jenkins](https://samber.github.io/awesome-prometheus-alerts/rules#jenkins)
- [Grafana Tempo](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-tempo)
- [Grafana Mimir](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-mimir)
- [Grafana Alloy](https://samber.github.io/awesome-prometheus-alerts/rules#grafana-alloy)
- [OpenTelemetry Collector](https://samber.github.io/awesome-prometheus-alerts/rules#opentelemetry-collector)
- [Jaeger](https://samber.github.io/awesome-prometheus-alerts/rules#jaeger)
#### Other
- [APC UPS](https://samber.github.io/awesome-prometheus-alerts/rules#apc-ups)
- [Graph Node](https://samber.github.io/awesome-prometheus-alerts/rules#graph-node)
## 🤝 Contributing
@ -112,23 +177,15 @@ There are many ways to contribute: writing code, alerting rules, documentation,
[Instructions here](CONTRIBUTING.md)
## 🏋️ Improvements
- Create an alert rule builder in Jekyll for custom alerts (severity, thresholds, instances...)
- Add resolution suggestions to rule descriptions, for faster incident resolution ([#85](https://github.com/samber/awesome-prometheus-alerts/issues/85)).
## 💫 Show your support
Give a ⭐️ if this project helped you!
[![support us](https://c5.patreon.com/external/logo/become_a_patron_button.png)](https://www.patreon.com/samber)
## 👏 Thanks
Thanks to the GitLab operations team, who provided 50+ rules. \o/
## 📝 License
[![CC4](https://mirrors.creativecommons.org/presskit/cc.srr.primary.svg)](https://creativecommons.org/licenses/by/4.0/legalcode)
- Alert rules and content: [Creative Commons CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- Site source code: [MIT](site/LICENSE)
Licensed under the Creative Commons 4.0 License, see LICENSE file for more detail.
See [LICENSE](LICENSE) for details.

View file

@ -1,8 +0,0 @@
theme: jekyll-theme-cayman
title: Awesome Prometheus alerts
description: Collection of alerting rules
repository: samber/awesome-prometheus-alerts
baseurl: /awesome-prometheus-alerts

File diff suppressed because it is too large

View file

@ -1,170 +0,0 @@
<!DOCTYPE html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset="UTF-8">
{% seo %}
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
<link rel="stylesheet" href="{{ '/assets/css/app.css?v=' | append: site.github.build_revision | relative_url }}">
<link rel="icon" type="image/png" href="/assets/favicon.ico">
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.4/clipboard.min.js"></script>
<script src="{{ '/assets/js/app.js?v=' | append: site.github.build_revision | relative_url }}"></script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-118604063-2"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'UA-118604063-2');
</script>
</head>
<body>
<style>
#skip-to-content {
height: 1px;
width: 1px;
position: absolute;
overflow: hidden;
top: -10px;
&:focus {
position: fixed;
top: 10px;
left: 10px;
height: auto;
width: auto;
background: invert($body-link-color);
outline: thick solid invert($body-link-color);
}
}
ul.github-buttons-cta li {
display: inline-block;
height: 20px;
padding: 0px 15px;
}
ul.github-buttons-cta li a {
/* width: 100px; */
text-decoration: none;
}
.fa {
/* padding: 14px;
width: 50px;
height: 50px; */
font-size: 25px;
text-align: center;
text-decoration: none;
border-radius: 50%;
}
.fa:hover {
opacity: 0.8;
}
.fa-twitter,
.fa-linkedin {
/* background: #55ACEE; */
color: white;
}
</style>
<a id="skip-to-content" href="#content">Skip to the content.</a>
<header class="page-header" role="banner">
<h1 class="project-name">
<a href="{{ '/' | relative_url }}" style="color: white">
{{ site.title | default: site.github.repository_name }}
</a>
</h1>
<h2 class="project-tagline">{{ site.description | default: site.github.project_tagline }}</h2>
<a href="{{ '/alertmanager' | relative_url }}" class="btn">Global configuration</a>
<a href="{{ '/rules' | relative_url }}" class="btn">Rules</a>
<a href="{{ '/sleep-peacefully' | relative_url }}" class="btn">Sleep peacefully</a>
<a href="{{ '/blackbox-exporter' | relative_url }}" class="btn">Blackbox</a>
<a href="https://github.com/samber/awesome-prometheus-alerts/blob/master/CONTRIBUTING.md" class="btn">
Contribute on GitHub
</a>
<ul class="github-buttons-cta">
<li>
<a href="https://github.com/samber/awesome-prometheus-alerts">
<img alt="GitHub Repo Watchers" src="https://img.shields.io/github/watchers/samber/awesome-prometheus-alerts?style=social">
</a>
</li>
<li>
<a href="https://github.com/samber/awesome-prometheus-alerts">
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/samber/awesome-prometheus-alerts?style=social">
</a>
</li>
<li>
<a href="https://github.com/samber/awesome-prometheus-alerts">
<img alt="GitHub Repo forks" src="https://img.shields.io/github/forks/samber/awesome-prometheus-alerts?style=social">
</a>
</li>
<li>
<a href="https://twitter.com/share?via=samuelberthe&related=samuelberthe&text=🚨 📊 Here is a collection of Awesome Prometheus Alerts&url=https://samber.github.io/awesome-prometheus-alerts"
class="fa fa-twitter" target="_blank"></a>
</li>
<li>
<a href="http://www.linkedin.com/shareArticle?mini=true&url=https://samber.github.io/awesome-prometheus-alerts/"
class="fa fa-linkedin" target="_blank"></a>
</li>
</ul>
<ul id="sponsoring">
<li>
Kindly supported by&nbsp; 👉
</li>
<li>
<a href="https://betterstack.com/">
<img width="" src="assets/sponsor-betterstack.png" />
</a>
</li>
</ul>
</header>
<main id="content" class="main-content" role="main">
{{ content }}
<footer class="site-footer">
{% if site.github.is_project_page %}
<span class="site-footer-owner">
<a href="{{ site.github.repository_url }}">{{ site.title }}</a> is maintained by
<a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a>.
</span>
{% endif %}
</footer>
</main>
<!-- Screeb tag -->
<script type="text/javascript">
(function (s,c,r,ee,b) {
s['ScreebObject']=r;s[r]=s[r]||function(){(s[r].q=s[r].q||[]).push(arguments)};
b=c.createElement('script');b.type='text/javascript';
b.id=r;b.src=ee;b.async=1;c.getElementsByTagName("head")[0].appendChild(b);
}(window,document,'$screeb','https://t2.screeb.app/tag.js'));
$screeb('init', '232450e3-d3fe-4240-b543-649a5041a7db');
</script>
<!-- End of Screeb tag -->
</body>
</html>

View file

@ -1,141 +0,0 @@
<h1 style="text-align: center;">
Global configuration
</h1>
If you notice a delay between an event and the first notification, read the following blog post => [https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html).
## Prometheus configuration
{% highlight yaml %}
# prometheus.yml
global:
  scrape_interval: 20s
  # A short evaluation_interval will check alerting rules very often.
  # It can be costly if you run Prometheus with 100+ alerts.
  evaluation_interval: 20s
  ...
rule_files:
  - 'alerts/*.yml'
scrape_configs:
  ...
{% endhighlight %}
{% highlight yaml %}
# alerts/example-redis.yml
groups:
  - name: ExampleRedisGroup
    rules:
      - alert: ExampleRedisDown
        expr: redis_up{} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance down"
          description: "Whatever"
{% endhighlight %}
## AlertManager configuration
{% highlight yaml %}
{% raw %}
# alertmanager.yml
route:
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that alerts for the same group that start firing shortly
  # after one another are batched together in the first notification.
  group_wait: 10s
  # After the first notification has been sent, wait 'group_interval' before
  # sending a batch of new alerts that started firing for that group.
  group_interval: 30s
  # If a notification has already been sent successfully, wait
  # 'repeat_interval' before resending it.
  repeat_interval: 30m
  # A default receiver
  receiver: "slack"
  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.
  routes:
    - receiver: "slack"
      group_wait: 10s
      match_re:
        severity: critical|warning
      continue: true
    - receiver: "pager"
      group_wait: 10s
      match_re:
        severity: critical
      continue: true
receivers:
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/xxxxxxxxxxxxxxxxxxxxxxxxxxx'
        send_resolved: true
        channel: 'monitoring'
        text: "{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
  - name: "pager"
    webhook_configs:
      - url: http://a.b.c.d:8080/send/sms
        send_resolved: true
{% endraw %}
{% endhighlight %}
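To make the effect of `continue: true` concrete, here is a minimal Python sketch of how the two child routes above pick receivers. It is a simplified model of a single routing level, not Alertmanager's actual implementation; the `ROUTES` table and the `receivers_for` helper are hypothetical names used only for illustration. It assumes that `match_re` patterns are fully anchored and that an alert matching no child route falls back to the default receiver.

```python
import re

# Hypothetical mirror of the child routes in alertmanager.yml above.
ROUTES = [
    {"receiver": "slack", "match_re": {"severity": "critical|warning"}, "continue": True},
    {"receiver": "pager", "match_re": {"severity": "critical"}, "continue": True},
]
DEFAULT_RECEIVER = "slack"  # the top-level 'receiver'


def receivers_for(labels):
    """Return the receivers notified for an alert with the given labels."""
    matched = []
    for route in ROUTES:
        # Every matcher of the route must match the whole label value.
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in route["match_re"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # without 'continue', the first match wins
    # No child route matched: fall back to the parent route's receiver.
    return matched or [DEFAULT_RECEIVER]


for severity in ("critical", "warning", "info"):
    print(severity, "->", receivers_for({"severity": severity}))
```

In this model a `critical` alert reaches both Slack and the pager, while a `warning` alert only reaches Slack.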
## Reduce Prometheus server load
For expensive or frequent PromQL queries, Prometheus lets you precompute expressions with recording rules.
{% highlight yaml %}
{% raw %}
groups:
  # first define the recording rule
  - name: ExampleRecordedGroup
    rules:
      - record: job:rabbitmq_queue_messages_delivered_total:rate:5m
        expr: rate(rabbitmq_queue_messages_delivered_total[5m])
  # then use it in alerts
  - name: ExampleAlertingGroup
    rules:
      - alert: ExampleRabbitmqLowMessageDelivery
        expr: sum(job:rabbitmq_queue_messages_delivered_total:rate:5m) < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low delivery rate in Rabbitmq queues"
{% endraw %}
{% endhighlight %}
## Troubleshooting
If a notification takes too long to be triggered, check the following delays:
- `scrape_interval = 20s` (prometheus.yml)
- `evaluation_interval = 20s` (prometheus.yml)
- `increase(mysql_global_status_slow_queries[1m]) > 0` (alerts/example-mysql.yml)
- `for: 5m` (alerts/example-mysql.yml)
- `group_wait = 10s` (alertmanager.yml)
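As a rough illustration of how these settings add up, the sketch below sums them into an approximate worst-case delay between an event and the first notification. It assumes the delays simply accumulate, and it leaves out the `[1m]` range window and the time the receiver takes to deliver the message. The function name is hypothetical.

```python
def worst_case_delay_seconds(scrape_interval, evaluation_interval, for_duration, group_wait):
    """Approximate upper bound, in seconds, before the first notification is sent."""
    # Assumption: each stage waits its full interval before the next one starts.
    return scrape_interval + evaluation_interval + for_duration + group_wait


delay = worst_case_delay_seconds(
    scrape_interval=20,      # prometheus.yml
    evaluation_interval=20,  # prometheus.yml
    for_duration=5 * 60,     # 'for: 5m' in alerts/example-mysql.yml
    group_wait=10,           # alertmanager.yml
)
minutes, seconds = divmod(delay, 60)
print(f"worst case: about {minutes}m{seconds}s")
```

With the values listed above, the `for` clause dominates: shortening the scrape and evaluation intervals only trims a few seconds.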
Also read:
- [https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html).
- [https://hodovi.cc/blog/creating-awesome-alertmanager-templates-for-slack/](https://hodovi.cc/blog/creating-awesome-alertmanager-templates-for-slack/)

View file

@ -1,143 +0,0 @@
a.anchor {
font-size: 15px;
vertical-align: middle;
color: darkblue;
display: inline-block;
padding-bottom: 5px;
margin-right: 5px;
opacity: 0;
transition: opacity 0.4s;
}
h2:hover a.anchor,
h3:hover a.anchor,
h4:hover a.anchor {
opacity: 1;
}
summary {
position: relative;
padding-left: 60px;
padding-right: 50px;
margin-bottom: 15px;
font-size: 15px;
}
h2 {
position: relative;
}
.clipboard-single,
.clipboard-multiple {
right: 0;
position: absolute;
cursor: pointer;
font-size: 14px;
color: #606c71;
}
/* NAVBAR */
#rules-navbar.affix {
/* showed by JS */
display: none;
position: fixed;
overflow: auto;
top: 0;
right: 0;
max-width: 250px;
max-height: 100%;
padding-top: 20px;
padding-bottom: 20px;
padding-left: 20px;
padding-right: 10px;
background-color: #f3f6fa;
}
/* hide menu on small screens */
@media screen and (max-width: 1350px) {
#rules-navbar.affix {
display: none !important;
}
}
/* hide menu scrollbar */
#rules-navbar.affix::-webkit-scrollbar {
display: none;
}
#rules-navbar.affix {
-ms-overflow-style: none;
/* IE and Edge */
scrollbar-width: none;
/* Firefox */
}
#rules-navbar.affix h3 {
margin-bottom: 10px;
}
#rules-navbar.affix h4 {
margin: 0;
font-weight: bold;
font-size: 14px;
line-height: 14px;
}
#rules-navbar.affix ul,
#rules-navbar.affix ul li {
margin: 0;
padding-top: 0;
padding-bottom: 0;
line-height: normal;
}
#rules-navbar.affix>ul {
padding-left: 0;
padding-right: 0;
}
#rules-navbar.affix>ul>li {
margin-bottom: 10px;
padding-left: 0;
padding-right: 0;
}
#rules-navbar.affix a {
font-size: 14px;
line-height: 14px;
}
/* https://github.com/samber/awesome-prometheus-alerts/issues/356 */
@media screen and (min-width: 64em) {
.main-content {
max-width: 85rem;
}
}
ul#sponsoring {
display: flex;
align-items: center;
justify-content: center;
margin-top: 50px;
}
ul#sponsoring li {
display: flex;
padding: 0px 15px;
font-size: 16px;
}
ul#sponsoring li a {
display: flex;
}
ul#sponsoring li a img {
max-width: 180px;
max-height: 80px;
}
.page-header {
padding-bottom: 30px;
}

Binary file not shown.

View file

@ -1,16 +0,0 @@
$(function () {
var clipboardRules = new ClipboardJS('.clipboard-single', {
text: function (trigger) {
const id = trigger.getAttribute('data-clipboard-target-id');
const html = $("#" + id + " .highlight");
return html.text() + '\n';
},
});
var clipboardCategories = new ClipboardJS('.clipboard-multiple', {
text: function (trigger) {
const id = trigger.getAttribute('data-clipboard-target-id');
const html = $("[id^=" + id + "] .highlight");
return Array.from(html.map((i, target) => $(target).text())).join('\n\n');
},
});
});

View file

@ -1,125 +0,0 @@
<h1 style="text-align: center;">
Blackbox exporter
</h1>
## Worldwide probes
<a href="https://github.com/prometheus/blackbox_exporter" target="_blank">Blackbox Exporter</a> gives you the ability to probe endpoints over HTTP, HTTPS, DNS, TCP and ICMP.
You should deploy blackbox exporters in multiple points of presence around the globe to monitor latency. Feel free to use the following endpoints for your own projects:
- https://screeb-probe-<b>montreal</b>.cleverapps.io
- https://screeb-probe-<b>paris</b>.cleverapps.io
- https://screeb-probe-<b>jeddah</b>.cleverapps.io
- https://screeb-probe-<b>singapore</b>.cleverapps.io
- https://screeb-probe-<b>sydney</b>.cleverapps.io
- https://screeb-probe-<b>warsaw</b>.cleverapps.io
☝️ Logs have been disabled. More probes from the community would be appreciated; please contribute <a href="https://github.com/samber/awesome-prometheus-alerts/" target="_blank">here</a>! These blackbox exporters use the following <a href="https://github.com/ScreebApp/blackbox_exporter/blob/master/screeb.yml" target="_blank">configuration</a>.
## Prometheus Configuration
Blackbox exporters and endpoints must be declared in Prometheus. Here is a simple configuration, inspired by [Hayk Davtyan's Medium post](https://medium.com/geekculture/single-prometheus-job-for-dozens-of-blackbox-exporters-2a7ba492d6c8):
```yml
# sd/blackbox.yml
- targets:
    #
    # Montreal
    #
    # http
    - screeb-probe-montreal.cleverapps.io:_:http_2xx:_:Montreal:_:f229cy:_:https://api.screeb.app
    - screeb-probe-montreal.cleverapps.io:_:http_2xx:_:Montreal:_:f229cy:_:https://t.screeb.app/tag.js
    # icmp
    - screeb-probe-montreal.cleverapps.io:_:icmp_ipv4:_:Montreal:_:f229cy:_:api.screeb.app
    - screeb-probe-montreal.cleverapps.io:_:icmp_ipv4:_:Montreal:_:f229cy:_:t.screeb.app
    #
    # Paris
    #
    # http
    - screeb-probe-paris.cleverapps.io:_:http_2xx:_:Paris:_:u09tgy:_:https://api.screeb.app
    - screeb-probe-paris.cleverapps.io:_:http_2xx:_:Paris:_:u09tgy:_:https://t.screeb.app/tag.js
    # icmp
    - screeb-probe-paris.cleverapps.io:_:icmp_ipv4:_:Paris:_:u09tgy:_:api.screeb.app
    - screeb-probe-paris.cleverapps.io:_:icmp_ipv4:_:Paris:_:u09tgy:_:t.screeb.app
    #
    # Sydney
    #
    # http
    - screeb-probe-sydney.cleverapps.io:_:http_2xx:_:Sydney:_:r3gpkn:_:https://api.screeb.app
    - screeb-probe-sydney.cleverapps.io:_:http_2xx:_:Sydney:_:r3gpkn:_:https://t.screeb.app/tag.js
    # icmp
    - screeb-probe-sydney.cleverapps.io:_:icmp_ipv4:_:Sydney:_:r3gpkn:_:api.screeb.app
    - screeb-probe-sydney.cleverapps.io:_:icmp_ipv4:_:Sydney:_:r3gpkn:_:t.screeb.app
    # ...
```
```yml
# prometheus.yml
global:
  # ...
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    scrape_interval: 30s
    scheme: https
    file_sd_configs:
      - files:
          - /etc/prometheus/sd/blackbox.yml
    relabel_configs:
      # adds "module" label in the final labelset
      - source_labels: [__address__]
        regex: '.*:_:(.*):_:.*:_:.*:_:.*'
        target_label: module
      # adds "geohash" label in the final labelset
      - source_labels: [__address__]
        regex: '.*:_:.*:_:.*:_:(.*):_:.*'
        target_label: geohash
      # rewrites "instance" label with corresponding URL
      - source_labels: [__address__]
        regex: '.*:_:.*:_:.*:_:.*:_:(.*)'
        target_label: instance
      # adds "pop" label with corresponding location name
      - source_labels: [__address__]
        regex: '.*:_:.*:_:(.*):_:.*:_:.*'
        target_label: pop
      # passes "module" parameter to Blackbox exporter
      - source_labels: [module]
        target_label: __param_module
      # passes "target" parameter to Blackbox exporter
      - source_labels: [instance]
        target_label: __param_target
      # the Blackbox exporter's real hostname:port
      - source_labels: [__address__]
        regex: '(.*):_:.*:_:.*:_:.*:_:.*'
        target_label: __address__
# ...
```
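To check how the `:_:`-delimited target string is split, the relabeling regexes can be tried outside Prometheus. The Python sketch below applies the same patterns to one of the targets from `sd/blackbox.yml`. It relies on Prometheus anchoring relabel regexes at both ends, which `re.fullmatch` approximates here. The `RELABEL_RULES` table and `split_target` helper are hypothetical names, not part of Prometheus.

```python
import re

# Same patterns as in relabel_configs above, keyed by target label.
RELABEL_RULES = {
    "__address__": r"(.*):_:.*:_:.*:_:.*:_:.*",
    "module": r".*:_:(.*):_:.*:_:.*:_:.*",
    "pop": r".*:_:.*:_:(.*):_:.*:_:.*",
    "geohash": r".*:_:.*:_:.*:_:(.*):_:.*",
    "instance": r".*:_:.*:_:.*:_:.*:_:(.*)",
}


def split_target(target):
    """Extract each label from a target by full-matching its regex."""
    labels = {}
    for label, pattern in RELABEL_RULES.items():
        match = re.fullmatch(pattern, target)
        if match:
            labels[label] = match.group(1)
    return labels


labels = split_target(
    "screeb-probe-paris.cleverapps.io:_:http_2xx:_:Paris:_:u09tgy:_:https://api.screeb.app"
)
for name, value in labels.items():
    print(f"{name} = {value}")
```

Because each target contains exactly four `:_:` separators, every pattern has a single possible match, so the greedy `.*` groups cannot swallow a neighbouring field.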
## Geohash
![](assets/grafana-map-panel.png)
To display nice maps in Grafana, you need to instruct blackbox exporters about the location. Grafana map panel speaks the "geohash" format:
- go to Google Maps
- extract the lat/long from the url
- convert lat/long to geohash here: http://geohash.co
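If you prefer to convert coordinates locally, the steps above can be scripted. The following is a minimal sketch of the standard geohash encoding, which interleaves longitude and latitude bits and emits one base32 character per 5 bits. The function name `geohash_encode` and the sample Paris coordinates are assumptions made for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    value = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # 5 bits make one base32 character
            chars.append(_BASE32[value])
            bits = 0
            value = 0
    return "".join(chars)


# Approximate coordinates of central Paris, used as sample input.
print(geohash_encode(48.8566, 2.3522))
```

Six characters, as used in the target list above, correspond to a cell of roughly one kilometre, which is usually enough for a map panel.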
## Grafana
Some great dashboards have been created by the community: https://grafana.com/grafana/dashboards/?search=blackbox
Since Grafana v5.0.0, a map panel is available: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/geomap/

View file

@ -0,0 +1,123 @@
groups:
- name: FlinkPrometheusReporter
rules:
- alert: FlinkJobIsNotRunning
expr: 'flink_jobmanager_numRunningJobs == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Flink job is not running (instance {{ $labels.instance }})
description: "No Flink jobs are currently running. All jobs may have failed or been cancelled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FlinkNoTaskmanagersRegistered
expr: 'flink_jobmanager_numRegisteredTaskManagers == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Flink no TaskManagers registered (instance {{ $labels.instance }})
description: "No TaskManagers are registered with the JobManager. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This alert fires when there are no available task slots. Adjust the threshold if your cluster is expected to run at full capacity.
- alert: FlinkAllTaskSlotsUsed
expr: 'flink_jobmanager_taskSlotsAvailable == 0'
for: 5m
labels:
severity: warning
annotations:
summary: Flink all task slots used (instance {{ $labels.instance }})
description: "All Flink task slots are in use ({{ $value }} available). New jobs cannot be scheduled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A single restart may be normal during deployments. Adjust threshold based on restart tolerance.
- alert: FlinkJobRestartIncreasing
expr: 'delta(flink_jobmanager_job_numRestarts[5m]) > 1'
for: 5m
labels:
severity: warning
annotations:
summary: Flink job restart increasing (instance {{ $labels.instance }})
description: "Flink job {{ $labels.job_name }} has restarted {{ $value }} times in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FlinkCheckpointFailures
expr: 'delta(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 1'
for: 5m
labels:
severity: warning
annotations:
summary: Flink checkpoint failures (instance {{ $labels.instance }})
description: "Flink job {{ $labels.job_name }} has {{ $value }} failed checkpoints in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Value is converted from milliseconds to seconds for correct humanizeDuration display.
# Threshold is 60 seconds. Adjust based on your checkpoint interval and state size.
- alert: FlinkCheckpointDurationHigh
expr: 'flink_jobmanager_job_lastCheckpointDuration / 1000 > 60'
for: 5m
labels:
severity: warning
annotations:
summary: Flink checkpoint duration high (instance {{ $labels.instance }})
description: "Flink job {{ $labels.job_name }} last checkpoint took {{ $value | humanizeDuration }} to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FlinkTaskBackpressured
expr: 'flink_taskmanager_job_task_isBackPressured == 1'
for: 5m
labels:
severity: warning
annotations:
summary: Flink task backpressured (instance {{ $labels.instance }})
description: "Flink task {{ $labels.task_name }} in job {{ $labels.job_name }} is backpressured.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when a task spends more than 500ms/sec backpressured. This indicates the task cannot keep up with upstream data rate.
- alert: FlinkTaskHighBackpressureTime
expr: 'flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500'
for: 5m
labels:
severity: warning
annotations:
summary: Flink task high backpressure time (instance {{ $labels.instance }})
description: "Flink task {{ $labels.task_name }} is spending {{ $value | humanize }}ms/sec in backpressure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Flink TaskManagers manage their own memory pool. High JVM heap usage (outside managed memory) may indicate memory leaks or misconfiguration.
- alert: FlinkTaskmanagerHeapMemoryHigh
expr: 'flink_taskmanager_Status_JVM_Memory_Heap_Used / flink_taskmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_taskmanager_Status_JVM_Memory_Heap_Max > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Flink TaskManager heap memory high (instance {{ $labels.instance }})
description: "Flink TaskManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FlinkJobmanagerHeapMemoryHigh
expr: 'flink_jobmanager_Status_JVM_Memory_Heap_Used / flink_jobmanager_Status_JVM_Memory_Heap_Max > 0.9 and flink_jobmanager_Status_JVM_Memory_Heap_Max > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Flink JobManager heap memory high (instance {{ $labels.instance }})
description: "Flink JobManager {{ $labels.instance }} heap memory usage is above 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Flink exposes GC time as a gauge (cumulative milliseconds), so deriv() is used instead of rate().
# Threshold: more than 100ms/sec of GC time (10% of wall clock). Adjust based on your workload.
- alert: FlinkTaskmanagerGcTimeHigh
expr: 'deriv(flink_taskmanager_Status_JVM_GarbageCollector_All_Time[5m]) > 100'
for: 5m
labels:
severity: warning
annotations:
summary: Flink TaskManager GC time high (instance {{ $labels.instance }})
description: "Flink TaskManager {{ $labels.instance }} is spending more than 10% of time in garbage collection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Only fires for tasks that have previously received records, to avoid false positives during startup.
- alert: FlinkNoRecordsProcessed
expr: 'delta(flink_taskmanager_job_task_numRecordsIn[5m]) == 0 and flink_taskmanager_job_task_numRecordsIn > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Flink no records processed (instance {{ $labels.instance }})
description: "Flink task {{ $labels.task_name }} has not processed any records in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

View file

@ -0,0 +1,89 @@
groups:
- name: SparkPrometheus
# Spark exposes metrics via two built-in endpoints:
# - PrometheusServlet: master/worker/driver metrics at /metrics/prometheus/ (ports 8080, 8081, 4040)
# - PrometheusResource: executor metrics at /metrics/executors/prometheus/ (port 4040, requires spark.ui.prometheus.enabled=true in Spark 3.x)
# Metric names from PrometheusServlet include a dynamic namespace (application ID), making static PromQL queries challenging.
# Configuration: spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
rules:
- alert: SparkNoAliveWorkers
expr: 'metrics_master_aliveWorkers_Value == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Spark no alive workers (instance {{ $labels.instance }})
description: "No Spark workers are alive. The cluster has no processing capacity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Adjust the threshold based on your cluster's typical queuing behavior.
- alert: SparkTooManyWaitingApps
expr: 'metrics_master_waitingApps_Value > 10'
for: 5m
labels:
severity: warning
annotations:
summary: Spark too many waiting apps (instance {{ $labels.instance }})
description: "Spark has {{ $value }} applications waiting for resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SparkWorkerMemoryExhausted
expr: 'metrics_worker_memFree_MB_Value == 0'
for: 2m
labels:
severity: warning
annotations:
summary: Spark worker memory exhausted (instance {{ $labels.instance }})
description: "Spark worker {{ $labels.instance }} has no free memory ({{ $value }}MB free).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when a worker has no free cores. This may be normal under high load but can indicate capacity issues.
- alert: SparkWorkerCoresExhausted
expr: 'metrics_worker_coresFree_Value == 0'
for: 5m
labels:
severity: warning
annotations:
summary: Spark worker cores exhausted (instance {{ $labels.instance }})
description: "Spark worker {{ $labels.instance }} has no free cores.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when more than 10% of executor time is spent in garbage collection.
# This metric comes from the PrometheusResource endpoint (/metrics/executors/prometheus/).
- alert: SparkExecutorHighGcTime
expr: 'metrics_executor_totalGCTime_seconds_total / metrics_executor_totalDuration > 0.1 and metrics_executor_totalDuration > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Spark executor high GC time (instance {{ $labels.instance }})
description: "Spark executor {{ $labels.executor_id }} in {{ $labels.application_name }} is spending too much time in GC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SparkExecutorAllTasksFailing
expr: 'metrics_executor_failedTasks_total > 0 and metrics_executor_completedTasks_total == 0'
for: 5m
labels:
severity: critical
annotations:
summary: Spark executor all tasks failing (instance {{ $labels.instance }})
description: "Spark executor {{ $labels.executor_id }} has only failing tasks ({{ $value }} failed, 0 completed).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: SparkExecutorHighTaskFailureRate
expr: 'metrics_executor_failedTasks_total / metrics_executor_totalTasks_total > 0.1 and metrics_executor_totalTasks_total > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Spark executor high task failure rate (instance {{ $labels.instance }})
description: "Spark executor {{ $labels.executor_id }} has a task failure rate above 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# diskUsed is a gauge, not a counter — do not use rate(). Threshold of 1GB is a rough default.
# Disk spilling indicates insufficient memory for the workload.
- alert: SparkExecutorHighDiskSpill
expr: 'metrics_executor_diskUsed_bytes > 1e9'
for: 5m
labels:
severity: warning
annotations:
summary: Spark executor high disk spill (instance {{ $labels.instance }})
description: "Spark executor {{ $labels.executor_id }} is spilling data to disk. Consider increasing executor memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

View file

@ -2,6 +2,7 @@ groups:
- name: LusitaniaeApacheExporter
rules:
- alert: ApacheDown
@ -14,7 +15,7 @@ groups:
description: "Apache down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ApacheWorkersLoad
expr: '(sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80'
expr: '(sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80 and sum by (instance) (apache_scoreboard) > 0'
for: 2m
labels:
severity: warning
@ -26,7 +27,7 @@ groups:
expr: 'apache_uptime_seconds_total / 60 < 1'
for: 0m
labels:
severity: warning
severity: info
annotations:
summary: Apache restart (instance {{ $labels.instance }})
description: "Apache has just been restarted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

View file

@ -2,6 +2,7 @@ groups:
- name: Apcupsd_exporter
rules:
- alert: ApcUpsBatteryNearlyEmpty
@ -32,7 +33,7 @@ groups:
description: "UPS now running on battery (since {{$value | humanizeDuration}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ApcUpsLowBatteryVoltage
expr: '(apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95'
expr: '(apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95 and apcupsd_battery_nominal_volts > 0'
for: 0m
labels:
severity: warning

View file

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: ArgocdServiceNotSynced

View file

@ -0,0 +1,141 @@
groups:
- name: PrometheusCloudwatchExporter
# CloudWatch metrics are exported as aws_{namespace}_{metric_name}_{statistic} gauges.
# The rules below cover both exporter health and common AWS service alerts.
# Adjust thresholds and label filters to match your CloudWatch exporter configuration.
rules:
- alert: CloudwatchExporterScrapeError
expr: 'cloudwatch_exporter_scrape_error > 0'
for: 5m
labels:
severity: warning
annotations:
summary: CloudWatch exporter scrape error (instance {{ $labels.instance }})
description: "CloudWatch exporter on {{ $labels.instance }} failed to scrape metrics from AWS CloudWatch API.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CloudwatchExporterSlowScrape
expr: 'cloudwatch_exporter_scrape_duration_seconds > 300'
for: 5m
labels:
severity: warning
annotations:
summary: CloudWatch exporter slow scrape (instance {{ $labels.instance }})
description: "CloudWatch exporter on {{ $labels.instance }} scrape is taking more than 5 minutes ({{ $value }}s). Consider reducing the number of metrics or splitting across multiple exporters.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# CloudWatch API calls cost money (~$0.01 per 1000 GetMetricData requests).
# 100 requests/minute ≈ $45/month. Adjust the threshold based on your budget.
- alert: CloudwatchApiHighRequestRate
expr: 'sum by (instance, namespace) (rate(cloudwatch_requests_total[5m])) * 60 > 100'
for: 0m
labels:
severity: warning
annotations:
summary: CloudWatch API high request rate (instance {{ $labels.instance }})
description: "CloudWatch exporter on {{ $labels.instance }} is making {{ $value }} API calls per minute to namespace {{ $labels.namespace }}. This can lead to high AWS costs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires EC2 CPUUtilization metric configured in the CloudWatch exporter.
- alert: AwsEc2HighCpuUtilization
expr: 'aws_ec2_cpuutilization_average > 90'
for: 15m
labels:
severity: warning
annotations:
summary: AWS EC2 high CPU utilization (instance {{ $labels.instance }})
description: "EC2 instance {{ $labels.instance_id }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires RDS FreeStorageSpace metric. The threshold of 2GB is a rough default.
# Adjust based on your database size.
- alert: AwsRdsLowFreeStorageSpace
expr: 'aws_rds_free_storage_space_average < 2000000000'
for: 5m
labels:
severity: warning
annotations:
summary: AWS RDS low free storage space (instance {{ $labels.instance }})
description: "RDS instance {{ $labels.dbinstance_identifier }} has less than 2GB free storage ({{ $value }} bytes remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires RDS CPUUtilization metric configured in the CloudWatch exporter.
- alert: AwsRdsHighCpuUtilization
expr: 'aws_rds_cpuutilization_average > 90'
for: 15m
labels:
severity: warning
annotations:
summary: AWS RDS high CPU utilization (instance {{ $labels.instance }})
description: "RDS instance {{ $labels.dbinstance_identifier }} CPU utilization is above 90% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The threshold depends on the RDS instance class. Adjust based on your
# instance type's max_connections parameter.
- alert: AwsRdsHighDatabaseConnections
expr: 'aws_rds_database_connections_average > 100'
for: 5m
labels:
severity: warning
annotations:
summary: AWS RDS high database connections (instance {{ $labels.instance }})
description: "RDS instance {{ $labels.dbinstance_identifier }} has {{ $value }} active connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires SQS ApproximateNumberOfMessagesVisible metric. The threshold of 1000
# is a rough default. Adjust based on your expected queue depth.
- alert: AwsSqsQueueMessagesVisible
expr: 'aws_sqs_approximate_number_of_messages_visible_average > 1000'
for: 10m
labels:
severity: warning
annotations:
summary: AWS SQS queue messages visible (instance {{ $labels.instance }})
description: "SQS queue {{ $labels.queue_name }} has {{ $value }} messages waiting to be processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires SQS ApproximateAgeOfOldestMessage metric.
- alert: AwsSqsMessageAgeTooOld
expr: 'aws_sqs_approximate_age_of_oldest_message_maximum > 3600'
for: 0m
labels:
severity: warning
annotations:
summary: AWS SQS message age too old (instance {{ $labels.instance }})
description: "SQS queue {{ $labels.queue_name }} has messages older than 1 hour ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires ApplicationELB UnHealthyHostCount metric.
- alert: AwsAlbUnhealthyTargets
expr: 'aws_applicationelb_unhealthy_host_count_average > 0'
for: 5m
labels:
severity: critical
annotations:
summary: AWS ALB unhealthy targets (instance {{ $labels.instance }})
description: "ALB {{ $labels.load_balancer }} has {{ $value }} unhealthy target(s) in target group {{ $labels.target_group }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires ApplicationELB HTTPCode_ELB_5XX_Count and RequestCount metrics.
- alert: AwsAlbHigh5xxErrorRate
expr: '(aws_applicationelb_httpcode_elb_5_xx_count_sum / aws_applicationelb_request_count_sum) * 100 > 5 and aws_applicationelb_request_count_sum > 0'
for: 5m
labels:
severity: critical
annotations:
summary: AWS ALB high 5xx error rate (instance {{ $labels.instance }})
description: "ALB {{ $labels.load_balancer }} 5xx error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires ApplicationELB TargetResponseTime metric.
- alert: AwsAlbHighTargetResponseTime
expr: 'aws_applicationelb_target_response_time_average > 2'
for: 5m
labels:
severity: warning
annotations:
summary: AWS ALB high target response time (instance {{ $labels.instance }})
description: "ALB {{ $labels.load_balancer }} average target response time is above 2 seconds ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Requires Lambda Errors and Invocations metrics.
- alert: AwsLambdaHighErrorRate
expr: '(aws_lambda_errors_sum / aws_lambda_invocations_sum) * 100 > 5 and aws_lambda_invocations_sum > 0'
for: 5m
labels:
severity: warning
annotations:
summary: AWS Lambda high error rate (instance {{ $labels.instance }})
description: "Lambda function {{ $labels.function_name }} error rate is above 5% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
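The `aws_*` rules in this file only fire if the CloudWatch exporter is configured to export the metrics they reference. A minimal sketch of such a configuration for the EC2 CPU and RDS storage rules follows, assuming the prometheus/cloudwatch_exporter YAML config format. The region is a placeholder, and each extra metric adds API calls, which `CloudwatchApiHighRequestRate` tracks.

```yaml
# Sketch of a CloudWatch exporter config. The region is a placeholder.
region: us-east-1
metrics:
  # Expected to surface as aws_ec2_cpuutilization_average with an instance_id label.
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
  # Expected to surface as aws_rds_free_storage_space_average with a dbinstance_identifier label.
  - aws_namespace: AWS/RDS
    aws_metric_name: FreeStorageSpace
    aws_dimensions: [DBInstanceIdentifier]
    aws_statistics: [Average]
```

The other rules (SQS, ApplicationELB, Lambda) would need analogous entries for the metrics and statistics named in their comments.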

@ -0,0 +1,57 @@
groups:
- name: AzureMetricsExporter
# The exporter uses azurerm_resource_metric as the default metric name for forwarded Azure Monitor metrics.
# The metric name can be customized via the name parameter in probe configuration.
# Self-monitoring metrics use the azurerm_stats_* and azurerm_api_* prefixes.
rules:
- alert: AzureExporterRequestErrors
expr: 'increase(azurerm_stats_metric_requests{result="error"}[15m]) > 5'
for: 0m
labels:
severity: warning
annotations:
summary: Azure exporter request errors (instance {{ $labels.instance }})
description: "Azure metrics exporter on {{ $labels.instance }} has {{ $value }} API request errors in the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: AzureExporterHighErrorRate
expr: 'sum by (instance) (rate(azurerm_stats_metric_requests{result="error"}[5m])) / sum by (instance) (rate(azurerm_stats_metric_requests[5m])) * 100 > 10 and sum by (instance) (rate(azurerm_stats_metric_requests[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Azure exporter high error rate (instance {{ $labels.instance }})
description: "Azure metrics exporter on {{ $labels.instance }} has an error rate above 10% ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Azure Resource Manager enforces rate limits per subscription.
# The threshold of 100 remaining calls is a rough default. Adjust based on your
# scrape interval and number of monitored resources.
- alert: AzureApiReadRateLimitApproaching
expr: 'azurerm_api_ratelimit{type="read"} < 100'
for: 0m
labels:
severity: warning
annotations:
summary: Azure API read rate limit approaching (instance {{ $labels.instance }})
description: "Azure API read rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: AzureApiWriteRateLimitApproaching
expr: 'azurerm_api_ratelimit{type="write"} < 50'
for: 0m
labels:
severity: warning
annotations:
summary: Azure API write rate limit approaching (instance {{ $labels.instance }})
description: "Azure API write rate limit for subscription {{ $labels.subscriptionID }} is running low ({{ $value }} remaining).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: AzureExporterSlowCollection
expr: 'azurerm_stats_metric_collecttime > 300'
for: 5m
labels:
severity: warning
annotations:
summary: Azure exporter slow collection (instance {{ $labels.instance }})
description: "Azure metrics exporter on {{ $labels.instance }} metric collection is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,11 +2,12 @@ groups:
- name: BlackboxExporter
rules:
- alert: BlackboxProbeFailed
expr: 'probe_success == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -23,7 +24,7 @@ groups:
description: "Blackbox configuration reload failure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: BlackboxSlowProbe
expr: 'avg_over_time(probe_duration_seconds[1m]) > 1'
expr: 'probe_duration_seconds > 1'
for: 1m
labels:
severity: warning
@ -33,7 +34,7 @@ groups:
- alert: BlackboxProbeHttpFailure
expr: 'probe_http_status_code <= 199 OR probe_http_status_code >= 400'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -49,15 +50,19 @@ groups:
summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
description: "SSL certificate expires in less than 20 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: BlackboxSslCertificateWillExpireSoon
- alert: BlackboxSslCertificateWillExpireVerySoon
expr: '0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3'
for: 0m
labels:
severity: critical
annotations:
summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})
description: "SSL certificate expires in less than 3 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# For probe_ssl_earliest_cert_expiry to be exposed after expiration, you
# need to enable insecure_skip_verify. Note that this will disable
# certificate validation.
# See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config
- alert: BlackboxSslCertificateExpired
expr: 'round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0'
for: 0m
@ -68,7 +73,7 @@ groups:
description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: BlackboxProbeSlowHttp
expr: 'avg_over_time(probe_http_duration_seconds[1m]) > 1'
expr: 'probe_http_duration_seconds > 1'
for: 1m
labels:
severity: warning
@ -77,7 +82,7 @@ groups:
description: "HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: BlackboxProbeSlowPing
expr: 'avg_over_time(probe_icmp_duration_seconds[1m]) > 1'
expr: 'probe_icmp_duration_seconds > 1'
for: 1m
labels:
severity: warning
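The comment on `BlackboxSslCertificateExpired` notes that `probe_ssl_earliest_cert_expiry` is only exposed after expiration when `insecure_skip_verify` is enabled. A minimal sketch of a Blackbox exporter module with that setting follows; the module name is hypothetical, and a dedicated module is assumed so that certificate validation stays enabled for other probes.

```yaml
modules:
  # Hypothetical module name, used only for expiry monitoring.
  http_2xx_cert_expiry:
    prober: http
    http:
      tls_config:
        # Disables certificate validation for probes using this module.
        insecure_skip_verify: true
```

Because validation is off, `probe_success` from this module no longer reflects certificate validity, so it is best paired with a normally validated probe of the same target.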

dist/rules/caddy/embedded-exporter.yml vendored Normal file
@ -0,0 +1,33 @@
groups:
- name: EmbeddedExporter
rules:
- alert: CaddyReverseProxyDown
expr: 'caddy_reverse_proxy_upstreams_healthy == 0'
for: 0m
labels:
severity: critical
annotations:
summary: Caddy Reverse Proxy Down (instance {{ $labels.instance }})
description: "Caddy reverse proxy upstream {{ $labels.upstream }} is unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CaddyHighHttp4xxErrorRateService
expr: 'sum(rate(caddy_http_request_duration_seconds_count{code=~"4.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: Caddy high HTTP 4xx error rate service (instance {{ $labels.instance }})
description: "Caddy service 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CaddyHighHttp5xxErrorRateService
expr: 'sum(rate(caddy_http_request_duration_seconds_count{code=~"5.."}[3m])) by (instance) / sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) * 100 > 5 and sum(rate(caddy_http_request_duration_seconds_count[3m])) by (instance) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: Caddy high HTTP 5xx error rate service (instance {{ $labels.instance }})
description: "Caddy service 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: CriteoCassandraExporter
rules:
- alert: CassandraHintsCount
@ -14,7 +15,7 @@ groups:
description: "Cassandra hints count has changed on {{ $labels.instance }}; some nodes may go down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraCompactionTaskPending
expr: 'avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[1m]) > 100'
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"} > 100'
for: 2m
labels:
severity: warning
@ -23,7 +24,7 @@ groups:
description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraViewwriteLatency
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000'
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile"} > 100000'
for: 2m
labels:
severity: warning
@ -31,49 +32,50 @@ groups:
summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
description: "High viewwrite latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraBadHacker
expr: 'rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5'
- alert: CassandraAuthenticationFailures
expr: 'delta(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra bad hacker (instance {{ $labels.instance }})
summary: Cassandra authentication failures (instance {{ $labels.instance }})
description: "Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# 1m delay allows a restart without triggering an alert.
- alert: CassandraNodeDown
expr: 'sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
summary: Cassandra node down (instance {{ $labels.instance }})
description: "Cassandra node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraCommitlogPendingTasks
- alert: CassandraCommitlogPendingTasks(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
summary: Cassandra commitlog pending tasks (Criteo) (instance {{ $labels.instance }})
description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraCompactionExecutorBlockedTasks
- alert: CassandraCompactionExecutorBlockedTasks(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
summary: Cassandra compaction executor blocked tasks (Criteo) (instance {{ $labels.instance }})
description: "Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraFlushWriterBlockedTasks
- alert: CassandraFlushWriterBlockedTasks(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
summary: Cassandra flush writer blocked tasks (Criteo) (instance {{ $labels.instance }})
description: "Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraRepairPendingTasks
@ -94,74 +96,75 @@ groups:
summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
description: "Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraConnectionTimeoutsTotal
expr: 'rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5'
- alert: CassandraConnectionTimeoutsTotal(criteo)
expr: 'delta(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
summary: Cassandra connection timeouts total (Criteo) (instance {{ $labels.instance }})
description: "Some connections between nodes are ending in timeouts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraStorageExceptions
- alert: CassandraStorageExceptions(criteo)
expr: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra storage exceptions (instance {{ $labels.instance }})
summary: Cassandra storage exceptions (Criteo) (instance {{ $labels.instance }})
description: "Something is going wrong with cassandra storage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraTombstoneDump
- alert: CassandraTombstoneDump(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra tombstone dump (instance {{ $labels.instance }})
summary: Cassandra tombstone dump (Criteo) (instance {{ $labels.instance }})
description: "Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestUnavailableWrite
- alert: CassandraClientRequestUnavailableWrite(criteo)
expr: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
summary: Cassandra client request unavailable write (Criteo) (instance {{ $labels.instance }})
description: "Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestUnavailableRead
- alert: CassandraClientRequestUnavailableRead(criteo)
expr: 'changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
summary: Cassandra client request unavailable read (Criteo) (instance {{ $labels.instance }})
description: "Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestWriteFailure
expr: 'increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0'
- alert: CassandraClientRequestWriteFailure(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"} > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra client request write failure (instance {{ $labels.instance }})
summary: Cassandra client request write failure (Criteo) (instance {{ $labels.instance }})
description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestReadFailure
expr: 'increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0'
- alert: CassandraClientRequestReadFailure(criteo)
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"} > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra client request read failure (instance {{ $labels.instance }})
summary: Cassandra client request read failure (Criteo) (instance {{ $labels.instance }})
description: "A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A low key cache hit rate increases disk I/O. Threshold is workload-dependent — adjust based on your data access patterns.
- alert: CassandraCacheHitRateKeyCache
expr: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85'
for: 2m
labels:
severity: critical
severity: warning
annotations:
summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
description: "Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,11 +2,13 @@ groups:
- name: InstaclustrCassandraExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: CassandraNodeIsUnavailable
expr: 'sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1'
for: 0m
expr: 'cassandra_endpoint_active < 1'
for: 1m
labels:
severity: critical
annotations:
@ -22,92 +24,92 @@ groups:
summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }})
description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraCommitlogPendingTasks
- alert: CassandraCommitlogPendingTasks(instaclustr)
expr: 'cassandra_commit_log_pending_tasks > 15'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
summary: Cassandra commitlog pending tasks (Instaclustr) (instance {{ $labels.instance }})
description: "Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraCompactionExecutorBlockedTasks
- alert: CassandraCompactionExecutorBlockedTasks(instaclustr)
expr: 'cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
summary: Cassandra compaction executor blocked tasks (Instaclustr) (instance {{ $labels.instance }})
description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraFlushWriterBlockedTasks
- alert: CassandraFlushWriterBlockedTasks(instaclustr)
expr: 'cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15'
for: 2m
labels:
severity: warning
annotations:
summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
summary: Cassandra flush writer blocked tasks (Instaclustr) (instance {{ $labels.instance }})
description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraConnectionTimeoutsTotal
expr: 'avg(cassandra_client_request_timeouts_total) by (cassandra_cluster,instance) > 5'
- alert: CassandraConnectionTimeoutsTotal(instaclustr)
expr: 'sum by (cassandra_cluster,instance) (rate(cassandra_client_request_timeouts_total[5m])) > 5'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
summary: Cassandra connection timeouts total (Instaclustr) (instance {{ $labels.instance }})
description: "Some connections between nodes are ending in timeouts - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraStorageExceptions
- alert: CassandraStorageExceptions(instaclustr)
expr: 'changes(cassandra_storage_exceptions_total[1m]) > 1'
for: 0m
labels:
severity: critical
annotations:
summary: Cassandra storage exceptions (instance {{ $labels.instance }})
summary: Cassandra storage exceptions (Instaclustr) (instance {{ $labels.instance }})
description: "Something is going wrong with cassandra storage - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraTombstoneDump
- alert: CassandraTombstoneDump(instaclustr)
expr: 'avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra tombstone dump (instance {{ $labels.instance }})
summary: Cassandra tombstone dump (Instaclustr) (instance {{ $labels.instance }})
description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestUnavailableWrite
- alert: CassandraClientRequestUnavailableWrite(instaclustr)
expr: 'changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
summary: Cassandra client request unavailable write (Instaclustr) (instance {{ $labels.instance }})
description: "Some Cassandra client requests are unavailable to write - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestUnavailableRead
- alert: CassandraClientRequestUnavailableRead(instaclustr)
expr: 'changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
summary: Cassandra client request unavailable read (Instaclustr) (instance {{ $labels.instance }})
description: "Some Cassandra client requests are unavailable to read - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestWriteFailure
expr: 'increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0'
- alert: CassandraClientRequestWriteFailure(instaclustr)
expr: 'increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 5'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra client request write failure (instance {{ $labels.instance }})
description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Cassandra client request write failure (Instaclustr) (instance {{ $labels.instance }})
description: "Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CassandraClientRequestReadFailure
expr: 'increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0'
- alert: CassandraClientRequestReadFailure(instaclustr)
expr: 'increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 5'
for: 2m
labels:
severity: critical
annotations:
summary: Cassandra client request read failure (instance {{ $labels.instance }})
summary: Cassandra client request read failure (Instaclustr) (instance {{ $labels.instance }})
description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,11 +2,14 @@ groups:
- name: EmbeddedExporter
rules:
# ceph_health_status: 0=HEALTH_OK, 1=HEALTH_WARN, 2=HEALTH_ERR.
# This rule fires on any non-OK state. Split into ==1 (warning) and ==2 (critical) if you want separate severity levels.
- alert: CephState
expr: 'ceph_health_status != 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -33,15 +36,16 @@ groups:
- alert: CephOsdDown
expr: 'ceph_osd_up == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
summary: Ceph OSD Down (instance {{ $labels.instance }})
description: "Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 5000ms (5 seconds). Adjust based on your expected OSD performance.
- alert: CephHighOsdLatency
expr: 'ceph_osd_perf_apply_latency_seconds > 5'
expr: 'ceph_osd_apply_latency_ms > 5000'
for: 1m
labels:
severity: warning
@ -49,14 +53,16 @@ groups:
summary: Ceph high OSD latency (instance {{ $labels.instance }})
description: "Ceph Object Storage Daemon latency is high. Please check that it is not stuck in a weird state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CephOsdLowSpace
expr: 'ceph_osd_utilization > 90'
for: 2m
# Ceph internally triggers OSD_NEARFULL based on the nearfull_ratio (default 85%).
# ceph_health_detail exposes named health checks as individual time series.
- alert: CephOsdNearFull
expr: 'ceph_health_detail{name="OSD_NEARFULL"} == 1'
for: 5m
labels:
severity: warning
annotations:
summary: Ceph OSD low space (instance {{ $labels.instance }})
description: "Ceph Object Storage Daemon is going out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Ceph OSD near full (instance {{ $labels.instance }})
description: "A Ceph OSD is dangerously full. Please add more disks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CephOsdReweighted
expr: 'ceph_osd_weight < 1'
@ -114,7 +120,7 @@ groups:
- alert: CephPgUnavailable
expr: 'ceph_pg_total - ceph_pg_active > 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
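The comment on `CephState` suggests splitting the rule into `== 1` (warning) and `== 2` (critical) for separate severity levels. A minimal sketch of that split follows. The group name, alert names, and annotation texts are illustrative and not part of this change, and the `for` durations simply mirror the 1m used by `CephState`.

```yaml
groups:
  - name: CephHealthSplit  # hypothetical group name
    rules:
      # ceph_health_status == 1 means HEALTH_WARN.
      - alert: CephStateWarning
        expr: 'ceph_health_status == 1'
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Ceph health warning (instance {{ $labels.instance }})
          description: "Ceph cluster reports HEALTH_WARN\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      # ceph_health_status == 2 means HEALTH_ERR.
      - alert: CephStateError
        expr: 'ceph_health_status == 2'
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Ceph health error (instance {{ $labels.instance }})
          description: "Ceph cluster reports HEALTH_ERR\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

If both variants are deployed, the combined `ceph_health_status != 0` rule would presumably be dropped to avoid duplicate notifications.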

@ -0,0 +1,45 @@
groups:
- name: EmbeddedExporter
rules:
- alert: Cert-managerAbsent
expr: 'absent(up{job="cert-manager"})'
for: 10m
labels:
severity: critical
annotations:
summary: Cert-Manager absent (instance {{ $labels.instance }})
description: "Cert-Manager has disappeared from Prometheus service discovery. New certificates will not be able to be minted, and existing ones can't be renewed until cert-manager is back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 21 days is a rough default. ACME certificates are typically renewed 30 days before expiry, so expiring within 21 days may indicate issuer misconfiguration.
- alert: Cert-managerCertificateExpiringSoon
expr: 'avg by (exported_namespace, namespace, name) (certmanager_certificate_expiration_timestamp_seconds - time()) < (21 * 24 * 3600)'
for: 1h
labels:
severity: warning
annotations:
summary: Cert-Manager certificate expiring soon (instance {{ $labels.instance }})
description: "The certificate {{ $labels.name }} is expiring in less than 21 days.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: Cert-managerCertificateNotReady
expr: 'max by (name, exported_namespace, namespace, condition) (certmanager_certificate_ready_status{condition!="True"} == 1)'
for: 10m
labels:
severity: critical
annotations:
summary: Cert-Manager certificate not ready (instance {{ $labels.instance }})
description: "The certificate {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready to serve traffic.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Metric renamed in cert-manager v1.19+ (dropped the http_ prefix): certmanager_acme_client_request_count.
# For cert-manager < v1.19, use: certmanager_http_acme_client_request_count.
- alert: Cert-managerHittingAcmeRateLimits
expr: 'sum by (host) (rate(certmanager_acme_client_request_count{status="429"}[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cert-Manager hitting ACME rate limits (instance {{ $labels.instance }})
description: "Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
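For clusters running cert-manager older than v1.19, the comment on `Cert-managerHittingAcmeRateLimits` says to use the `http_`-prefixed metric name. A sketch of that variant follows; only the metric name differs from the rule above, and the group name is hypothetical.

```yaml
groups:
  - name: CertManagerLegacy  # hypothetical group name
    rules:
      # Variant for cert-manager < v1.19, which still uses the http_ prefix.
      - alert: Cert-managerHittingAcmeRateLimits
        expr: 'sum by (host) (rate(certmanager_http_acme_client_request_count{status="429"}[5m])) > 0'
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Cert-Manager hitting ACME rate limits (instance {{ $labels.instance }})
          description: "Cert-Manager is being rate-limited by the ACME provider. Certificate issuance and renewal may be blocked for up to a week.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

Only one of the two variants should be loaded at a time, since both use the same alert name.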

dist/rules/cilium/embedded-exporter.yml vendored Normal file
@ -0,0 +1,294 @@
groups:
- name: EmbeddedExporter
rules:
# Metric name depends on Cilium version. Use cilium_unreachable_nodes (older) or cilium_node_connectivity_status (1.14+).
- alert: CiliumAgentUnreachableNodes
expr: 'sum(cilium_unreachable_nodes{}) by (pod) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Cilium agent unreachable nodes (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} node(s). Check network connectivity and node health.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Metric name depends on Cilium version. Use cilium_unreachable_health_endpoints (older) or cilium_node_connectivity_status (1.14+).
- alert: CiliumAgentUnreachableHealthEndpoints
expr: 'sum(cilium_unreachable_health_endpoints{}) by (pod) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Cilium agent unreachable health endpoints (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} cannot reach {{ $value }} health endpoint(s). Node-to-node health probes are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Metric name depends on Cilium version. Use cilium_controllers_failing (older) or cilium_controllers_runs_total (1.14+).
- alert: CiliumAgentFailingControllers
expr: 'sum(cilium_controllers_failing{}) by (pod) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent failing controllers (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} has {{ $value }} failing controller(s). Check cilium-agent logs for details.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentEndpointFailures
expr: 'sum(cilium_endpoint_state{endpoint_state="invalid"}) by (pod) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent endpoint failures (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} has {{ $value }} endpoint(s) in invalid state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentEndpointRegenerationFailures
expr: 'sum(rate(cilium_endpoint_regenerations_total{outcome="fail"}[5m])) by (pod) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent endpoint regeneration failures (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is failing to regenerate endpoints. Network policy enforcement may be stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentEndpointUpdateFailure
expr: 'sum(rate(cilium_k8s_client_api_calls_total{method=~"(PUT|POST|PATCH)", endpoint="endpoint", return_code!~"2[0-9][0-9]"}[5m])) by (pod, method, return_code) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent endpoint update failure (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is failing K8s endpoint update API calls ({{ $labels.method }} {{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentEndpointCreateFailure
expr: 'sum(rate(cilium_api_limiter_processed_requests_total{api_call=~"endpoint-create", outcome="fail"}[1m])) by (pod, api_call) > 0.05'
for: 5m
labels:
severity: info
annotations:
summary: Cilium agent endpoint create failure (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is failing CNI endpoint-create calls. New pods may fail to get networking.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentMapOperationFailures
expr: 'sum(rate(cilium_bpf_map_ops_total{outcome="fail"}[5m])) by (map_name, pod) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent map operation failures (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} has eBPF map operation failures on {{ $labels.map_name }}. Datapath may be degraded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Map pressure is a ratio from 0 to 1. At 1.0, the map is full and new entries will be dropped.
- alert: CiliumAgentBpfMapPressure
expr: 'cilium_bpf_map_pressure{} > 0.9'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent BPF map pressure (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} eBPF map {{ $labels.map_name }} is above 90% utilization. Map may become full.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentConntrackTableFull
expr: 'sum(rate(cilium_drop_count_total{reason="CT: Map insertion failed"}[5m])) by (pod) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium agent conntrack table full (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} conntrack table is full, causing packet drops. Increase CT map size or investigate connection leaks.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentConntrackFailedGarbageCollection
expr: 'sum(rate(cilium_datapath_conntrack_gc_runs_total{status="uncompleted"}[5m])) by (pod) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent conntrack failed garbage collection (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} conntrack garbage collection is failing. Stale entries may accumulate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentNatTableFull
expr: 'sum(rate(cilium_drop_count_total{reason="No mapping for NAT masquerade"}[1m])) by (pod) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium agent NAT table full (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} NAT table is full, causing masquerade failures. Increase NAT map size or investigate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Policy denials may be expected behavior. Investigate only if unexpected traffic is being blocked.
- alert: CiliumAgentHighDeniedRate
expr: 'sum(rate(cilium_drop_count_total{reason="Policy denied"}[1m])) by (pod) > 0'
for: 10m
labels:
severity: info
annotations:
summary: Cilium agent high denied rate (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is dropping packets due to policy denial. Verify network policies are correct.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentHighDropRate
expr: 'sum(rate(cilium_drop_count_total{reason!~"Policy denied"}[5m])) by (pod, reason) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent high drop rate (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is dropping packets for reason {{ $labels.reason }}. This indicates infrastructure issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentPolicyMapPressure
expr: 'max(cilium_bpf_map_pressure{map_name=~"cilium_policy_.*"}) by (pod) > 0.9'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent policy map pressure (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} policy BPF map is above 90% utilization. New policies may fail to apply.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentPolicyImportErrors
expr: 'sum(rate(cilium_policy_change_total{outcome="fail"}[5m])) by (pod) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent policy import errors (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is failing to import network policies. Policy enforcement may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 60s is a rough default. Adjust based on cluster size and policy complexity.
- alert: CiliumAgentPolicyImplementationDelay
expr: 'histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[5m])) by (le, pod)) > 60'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent policy implementation delay (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} P99 policy deployment latency exceeds 60 seconds. Endpoints may run with stale policies.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumNode-localHighIdentityAllocation
expr: '(sum(cilium_identity{type="node_local"}) by (pod) / (2^16-1)) > 0.8'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium node-local high identity allocation (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} node-local identity allocation is above 80%. Approaching the 65535 identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumClusterHighIdentityAllocation
expr: '(sum(cilium_identity{type="cluster_local"}) by () / (2^16-256)) > 0.8'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium cluster high identity allocation (instance {{ $labels.instance }})
description: "Cilium cluster-wide identity allocation is above 80%. Approaching the maximum identity limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumOperatorExhaustedIpamIps
expr: 'sum(cilium_operator_ipam_ips{type="available"}) by () <= 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium operator exhausted IPAM IPs (instance {{ $labels.instance }})
description: "Cilium operator has no available IPAM IPs. New pods will fail to obtain an IP address.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 90% is a rough default. Adjust based on your pod churn rate and IP pool size.
- alert: CiliumOperatorLowAvailableIpamIps
expr: 'sum(cilium_operator_ipam_ips{type!="available"}) by () / sum(cilium_operator_ipam_ips) by () > 0.9 and sum(cilium_operator_ipam_ips) by () > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium operator low available IPAM IPs (instance {{ $labels.instance }})
description: "Cilium operator IPAM IP pool is over 90% utilized. Allocate more IPs to avoid exhaustion.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Some Cilium versions may not have a status label on this metric. Verify against your Cilium version.
- alert: CiliumOperatorIpamInterfaceCreationFailures
expr: 'sum(rate(cilium_operator_ipam_interface_creation_ops{status!="success"}[5m])) by () > 0.05'
for: 10m
labels:
severity: warning
annotations:
summary: Cilium operator IPAM interface creation failures (instance {{ $labels.instance }})
description: "Cilium operator is failing to create IPAM network interfaces. IP allocation may be impacted.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentApiErrors
expr: 'sum(rate(cilium_agent_api_process_time_seconds_count{return_code=~"5[0-9][0-9]"}[5m])) by (pod, return_code) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium agent API errors (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} API is returning 5xx errors ({{ $labels.return_code }}). Agent may be unhealthy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumAgentKubernetesClientErrors
expr: 'sum(rate(cilium_k8s_client_api_calls_total{endpoint!="metrics", return_code!~"2[0-9][0-9]"}[5m])) by (pod, endpoint, return_code) > 0.05'
for: 5m
labels:
severity: info
annotations:
summary: Cilium agent Kubernetes client errors (instance {{ $labels.instance }})
description: "Cilium agent {{ $labels.pod }} is receiving errors from K8s API for endpoint {{ $labels.endpoint }} ({{ $labels.return_code }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumClustermeshRemoteClusterNotReady
expr: 'count(cilium_clustermesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium ClusterMesh remote cluster not ready (instance {{ $labels.instance }})
description: "Cilium ClusterMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumClustermeshRemoteClusterFailing
expr: 'sum(cilium_clustermesh_remote_cluster_failures) by (source_cluster, target_cluster) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium ClusterMesh remote cluster failing (instance {{ $labels.instance }})
description: "Cilium ClusterMesh connectivity to remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is failing ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumKvstoremeshRemoteClusterNotReady
expr: 'count(cilium_kvstoremesh_remote_cluster_readiness_status < 1) by (source_cluster, target_cluster) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium KVStoreMesh remote cluster not ready (instance {{ $labels.instance }})
description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} is not ready from {{ $labels.source_cluster }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumKvstoremeshRemoteClusterFailing
expr: 'sum(cilium_kvstoremesh_remote_cluster_failures) by (source_cluster, target_cluster) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium KVStoreMesh remote cluster failing (instance {{ $labels.instance }})
description: "Cilium KVStoreMesh remote cluster {{ $labels.target_cluster }} from {{ $labels.source_cluster }} is experiencing failures ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumKvstoremeshSyncErrors
expr: 'sum(rate(cilium_kvstoremesh_kvstore_sync_errors_total[5m])) by (source_cluster) > 0.05'
for: 5m
labels:
severity: critical
annotations:
summary: Cilium KVStoreMesh sync errors (instance {{ $labels.instance }})
description: "Cilium KVStoreMesh from {{ $labels.source_cluster }} is experiencing kvstore sync errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CiliumHubbleLostEvents
expr: 'sum(rate(hubble_lost_events_total[5m])) by (pod) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium Hubble lost events (instance {{ $labels.instance }})
description: "Cilium Hubble on {{ $labels.pod }} is losing flow events. Observability data may be incomplete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 10% is a rough default. Some DNS errors may be normal depending on your workload.
- alert: CiliumHubbleHighDnsErrorRate
expr: 'sum(rate(hubble_dns_responses_total{rcode!="No Error"}[5m])) by (pod) / sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0.1 and sum(rate(hubble_dns_responses_total[5m])) by (pod) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Cilium Hubble high DNS error rate (instance {{ $labels.instance }})
description: "Cilium Hubble on {{ $labels.pod }} is observing more than 10% DNS error responses.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"


@ -2,10 +2,21 @@ groups:
- name: EmbeddedExporter
rules:
# Adjust the job label to match your Prometheus configuration.
- alert: ClickhouseNodeDown
expr: 'up{job="clickhouse"} == 0'
for: 2m
labels:
severity: critical
annotations:
summary: ClickHouse node down (instance {{ $labels.instance }})
description: "No metrics received from ClickHouse exporter for over 2 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseMemoryUsageCritical
expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90'
expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0'
for: 5m
labels:
severity: critical
@ -14,7 +25,7 @@ groups:
description: "Memory usage is critically high, over 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseMemoryUsageWarning
expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80'
expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80 and ClickHouseAsyncMetrics_CGroupMemoryTotal > 0'
for: 5m
labels:
severity: warning
@ -23,7 +34,7 @@ groups:
description: "Memory usage is over 80%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseDiskSpaceLowOnDefault
expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20'
expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0'
for: 2m
labels:
severity: warning
@ -32,7 +43,7 @@ groups:
description: "Disk space on default is below 20%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseDiskSpaceCriticalOnDefault
expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10'
expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10 and (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) > 0'
for: 2m
labels:
severity: critical
@ -41,7 +52,7 @@ groups:
description: "Disk space on default disk is critically low, below 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseDiskSpaceLowOnBackups
expr: 'ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20'
expr: 'ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20 and (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) > 0'
for: 2m
labels:
severity: warning
@ -76,15 +87,7 @@ groups:
summary: ClickHouse No Live Replicas (instance {{ $labels.instance }})
description: "There are too few live replicas available, risking data loss and service disruption.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseHighNetworkTraffic
expr: 'ClickHouseMetrics_NetworkSend > 250 or ClickHouseMetrics_NetworkReceive > 250'
for: 5m
labels:
severity: warning
annotations:
summary: ClickHouse High Network Traffic (instance {{ $labels.instance }})
description: "Network traffic is unusually high, may affect cluster performance.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Please replace the threshold with an appropriate value
- alert: ClickhouseHighTcpConnections
expr: 'ClickHouseMetrics_TCPConnection > 400'
for: 5m
@ -94,17 +97,18 @@ groups:
summary: ClickHouse High TCP Connections (instance {{ $labels.instance }})
description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Adjust the threshold based on your cluster size and expected replication traffic.
- alert: ClickhouseInterserverConnectionIssues
expr: 'increase(ClickHouseMetrics_InterserverConnection[5m]) > 0'
for: 1m
expr: 'ClickHouseMetrics_InterserverConnection > 50'
for: 5m
labels:
severity: warning
annotations:
summary: ClickHouse Interserver Connection Issues (instance {{ $labels.instance }})
description: "An increase in interserver connections may indicate replication or distributed query handling issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "High number of interserver connections may indicate replication or distributed query handling issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseZookeeperConnectionIssues
expr: 'avg(ClickHouseMetrics_ZooKeeperSession) != 1'
expr: 'ClickHouseMetrics_ZooKeeperSession != 1'
for: 3m
labels:
severity: warning
@ -113,7 +117,7 @@ groups:
description: "ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseAuthenticationFailures
expr: 'increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0'
expr: 'increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 3'
for: 0m
labels:
severity: info
@ -122,10 +126,56 @@ groups:
description: "Authentication failures detected, indicating potential security issues or misconfiguration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseAccessDeniedErrors
expr: 'increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0'
expr: 'increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 3'
for: 0m
labels:
severity: info
annotations:
summary: ClickHouse Access Denied Errors (instance {{ $labels.instance }})
description: "Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseRejectedInsertQueries
expr: 'increase(ClickHouseProfileEvents_RejectedInserts[1m]) > 2'
for: 1m
labels:
severity: warning
annotations:
summary: ClickHouse rejected insert queries (instance {{ $labels.instance }})
description: "INSERTs rejected due to too many active data parts. Reduce insert frequency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseDelayedInsertQueries
expr: 'increase(ClickHouseProfileEvents_DelayedInserts[5m]) > 10'
for: 2m
labels:
severity: warning
annotations:
summary: ClickHouse delayed insert queries (instance {{ $labels.instance }})
description: "INSERTs delayed due to high number of active parts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseZookeeperHardwareException
expr: 'increase(ClickHouseProfileEvents_ZooKeeperHardwareExceptions[1m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: ClickHouse ZooKeeper hardware exception (instance {{ $labels.instance }})
description: "ZooKeeper hardware exception: network issues communicating with ZooKeeper\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Please replace the threshold with an appropriate value
- alert: ClickhouseHighNetworkUsage
expr: 'rate(ClickHouseProfileEvents_NetworkSendBytes[1m]) > 100*1024*1024 or rate(ClickHouseProfileEvents_NetworkReceiveBytes[1m]) > 100*1024*1024'
for: 2m
labels:
severity: warning
annotations:
summary: ClickHouse high network usage (instance {{ $labels.instance }})
description: "ClickHouse network send or receive rate exceeds 100 MiB/s.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ClickhouseDistributedRejectedInserts
expr: 'increase(ClickHouseProfileEvents_DistributedRejectedInserts[5m]) > 3'
for: 2m
labels:
severity: critical
annotations:
summary: ClickHouse distributed rejected inserts (instance {{ $labels.instance }})
description: "INSERTs into Distributed tables rejected due to pending bytes limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"


@ -2,10 +2,11 @@ groups:
- name: LablabsCloudflareExporter
rules:
- alert: CloudflareHttp4xxErrorRate
expr: '(sum by(zone) (rate(cloudflare_zone_requests_status{status=~"^4.."}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5'
expr: '(sum by(zone) (rate(cloudflare_zone_requests_status{status=~"^4.."}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[15m])) > 0'
for: 0m
labels:
severity: warning
@ -14,7 +15,7 @@ groups:
description: "Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CloudflareHttp5xxErrorRate
expr: '(sum by (zone) (rate(cloudflare_zone_requests_status{status=~"^5.."}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5'
expr: '(sum by (zone) (rate(cloudflare_zone_requests_status{status=~"^5.."}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5 and sum by (zone) (rate(cloudflare_zone_requests_status[5m])) > 0'
for: 0m
labels:
severity: critical


@ -2,6 +2,7 @@ groups:
- name: ConsulExporter
rules:
- alert: ConsulServiceHealthcheckFailed


@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: CorednsPanicCount


@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: CortexRulerConfigurationReloadFailure
@ -22,23 +23,25 @@ groups:
summary: Cortex not connected to Alertmanager (instance {{ $labels.instance }})
description: "Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CortexNotificationAreBeingDropped
expr: 'rate(cortex_prometheus_notifications_dropped_total[5m]) > 0'
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: CortexNotificationsAreBeingDropped
expr: 'rate(cortex_prometheus_notifications_dropped_total[5m]) > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Cortex notification are being dropped (instance {{ $labels.instance }})
description: "Cortex notification are being dropped due to errors (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Cortex notifications are being dropped (instance {{ $labels.instance }})
description: "Cortex notifications are being dropped due to errors (instance {{ $labels.instance }}, {{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CortexNotificationError
expr: 'rate(cortex_prometheus_notifications_errors_total[5m]) > 0'
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: CortexNotificationErrors
expr: 'rate(cortex_prometheus_notifications_errors_total[5m]) > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Cortex notification error (instance {{ $labels.instance }})
description: "Cortex is failing when sending alert notifications (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Cortex notification errors (instance {{ $labels.instance }})
description: "Cortex is failing when sending alert notifications (instance {{ $labels.instance }}, {{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CortexIngesterUnhealthy
expr: 'cortex_ring_members{state="Unhealthy", name="ingester"} > 0'


@ -0,0 +1,170 @@
groups:
- name: GesellixCouchdbPrometheusExporter
rules:
- alert: CouchdbNodeDown
expr: 'couchdb_httpd_node_up == 0 or couchdb_httpd_up == 0'
for: 2m
labels:
severity: critical
annotations:
summary: CouchDB node down (instance {{ $labels.instance }})
description: "CouchDB node is not responding (node_up metric is 0) for more than 2 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbAtomMemoryUsageCritical
expr: 'couchdb_erlang_memory_atom_used > 0.9 * couchdb_erlang_memory_atom'
for: 5m
labels:
severity: critical
annotations:
summary: CouchDB atom memory usage critical (instance {{ $labels.instance }})
description: "Atom memory usage is above 90% of limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The CouchDB default for max_dbs_open is 500, but the threshold below assumes 1000 (0.9 * 1000). Adjust it to match your max_dbs_open setting.
- alert: CouchdbOpenDatabasesCritical
expr: 'couchdb_httpd_open_databases > 0.9 * 1000'
for: 5m
labels:
severity: critical
annotations:
summary: CouchDB open databases critical (instance {{ $labels.instance }})
description: "Number of open databases exceeds 90% of node capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Adjust 65535 to match your system's file descriptor limit (ulimit -n).
- alert: CouchdbOpenOsFilesCritical
expr: 'couchdb_httpd_open_os_files > 0.9 * 65535'
for: 5m
labels:
severity: critical
annotations:
summary: CouchDB open OS files critical (instance {{ $labels.instance }})
description: "CouchDB is using more than 90% of allowed OS file descriptors, may fail to open new files\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: Couchdb5xxErrorRatioHigh
expr: 'rate(couchdb_httpd_status_codes{code=~"5.."}[5m]) / rate(couchdb_httpd_requests[5m]) > 0.05 and rate(couchdb_httpd_requests[5m]) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: CouchDB 5xx error ratio high (instance {{ $labels.instance }})
description: "More than 5% of HTTP requests are returning 5xx errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbTemporaryViewReadRateCritical
expr: 'rate(couchdb_httpd_temporary_view_reads[5m]) > 100'
for: 5m
labels:
severity: critical
annotations:
summary: CouchDB temporary view read rate critical (instance {{ $labels.instance }})
description: "Temporary view read rate exceeds 100 reads/sec, high risk of performance degradation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbMangoQueriesScanningTooManyDocs
expr: 'rate(couchdb_mango_too_many_docs_scanned[5m]) > 50'
for: 5m
labels:
severity: warning
annotations:
summary: CouchDB Mango queries scanning too many docs (instance {{ $labels.instance }})
description: "Some Mango queries are scanning too many documents, consider adding indexes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbMangoQueriesFailedDueToInvalidIndex
expr: 'rate(couchdb_mango_query_invalid_index[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: CouchDB Mango queries failed due to invalid index (instance {{ $labels.instance }})
description: "Some Mango queries failed to execute because the index was missing or invalid\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbMangoDocsExaminedHigh
expr: 'rate(couchdb_mango_docs_examined[5m]) > 1000'
for: 5m
labels:
severity: warning
annotations:
summary: CouchDB Mango docs examined high (instance {{ $labels.instance }})
description: "High number of documents examined by Mango queries, consider indexing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicatorManagerDied
expr: 'increase(couchdb_replicator_changes_manager_deaths[5m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: CouchDB Replicator manager died (instance {{ $labels.instance }})
description: "Replication manager process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicatorQueueProcessDied
expr: 'increase(couchdb_replicator_changes_queue_deaths[5m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: CouchDB Replicator queue process died (instance {{ $labels.instance }})
description: "Replication queue process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicatorReaderProcessDied
expr: 'increase(couchdb_replicator_changes_reader_deaths[5m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: CouchDB Replicator reader process died (instance {{ $labels.instance }})
description: "Replication reader process has crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicatorFailedToStart
expr: 'increase(couchdb_replicator_failed_starts[5m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: CouchDB Replicator failed to start (instance {{ $labels.instance }})
description: "One or more replication tasks failed to start\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicationClusterUnstable
expr: 'couchdb_replicator_cluster_is_stable == 0'
for: 2m
labels:
severity: critical
annotations:
summary: CouchDB replication cluster unstable (instance {{ $labels.instance }})
description: "The replication cluster is unstable, replication may be interrupted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbReplicationReadFailures
expr: 'increase(couchdb_replicator_changes_read_failures[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: CouchDB replication read failures (instance {{ $labels.instance }})
description: "Replication changes feed has failed reads more than 5 times in 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbFileDescriptorsHigh
expr: 'process_open_fds / process_max_fds > 0.85 and process_max_fds > 0'
for: 5m
labels:
severity: warning
annotations:
summary: CouchDB file descriptors high (instance {{ $labels.instance }})
description: "Process is using more than 85% of allowed file descriptors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbProcessRestarted
expr: 'changes(process_start_time_seconds[1h]) > 0'
for: 1m
labels:
severity: info
annotations:
summary: CouchDB process restarted (instance {{ $labels.instance }})
description: "CouchDB process has restarted recently\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: CouchdbCriticalLogEntries
expr: 'increase(couchdb_server_couch_log{level=~"error|critical"}[5m]) > 5'
for: 1m
labels:
severity: critical
annotations:
summary: CouchDB critical log entries (instance {{ $labels.instance }})
description: "Critical or error log entries detected in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"


@ -0,0 +1,97 @@
groups:
- name: DigitaloceanExporter
rules:
- alert: DigitaloceanDropletDown
expr: 'digitalocean_droplet_up == 0'
for: 5m
labels:
severity: critical
annotations:
summary: DigitalOcean droplet down (instance {{ $labels.instance }})
description: "DigitalOcean droplet {{ $labels.name }} ({{ $labels.id }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanAccountNotActive
expr: 'digitalocean_account_active != 1'
for: 5m
labels:
severity: critical
annotations:
summary: DigitalOcean account not active (instance {{ $labels.instance }})
description: "DigitalOcean account is not active. It may be suspended or locked.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanDatabaseDown
expr: 'digitalocean_database_status == 0'
for: 2m
labels:
severity: critical
annotations:
summary: DigitalOcean database down (instance {{ $labels.instance }})
description: "DigitalOcean managed database {{ $labels.name }} ({{ $labels.engine }}) in {{ $labels.region }} is offline.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanKubernetesClusterDown
expr: 'digitalocean_kubernetes_cluster_up == 0'
for: 5m
labels:
severity: critical
annotations:
summary: DigitalOcean Kubernetes cluster down (instance {{ $labels.instance }})
description: "DigitalOcean Kubernetes cluster {{ $labels.name }} ({{ $labels.version }}) in {{ $labels.region }} is not running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanLoadBalancerDown
expr: 'digitalocean_loadbalancer_status == 0'
for: 2m
labels:
severity: critical
annotations:
summary: DigitalOcean load balancer down (instance {{ $labels.instance }})
description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) is not active.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanLoadBalancerNoBackends
expr: 'digitalocean_loadbalancer_droplets == 0'
for: 1m
labels:
severity: warning
annotations:
summary: DigitalOcean load balancer no backends (instance {{ $labels.instance }})
description: "DigitalOcean load balancer {{ $labels.name }} ({{ $labels.ip }}) has no droplets attached.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanFloatingIpNotAssigned
expr: 'digitalocean_floating_ipv4_active == 0'
for: 0m
labels:
severity: warning
annotations:
summary: DigitalOcean floating IP not assigned (instance {{ $labels.instance }})
description: "DigitalOcean floating IP {{ $labels.ipv4 }} in {{ $labels.region }} is not assigned to any droplet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanActiveIncidents
expr: 'digitalocean_incidents_total > 0'
for: 0m
labels:
severity: warning
annotations:
summary: DigitalOcean active incidents (instance {{ $labels.instance }})
description: "DigitalOcean platform has {{ $value }} active incident(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: DigitaloceanExporterCollectionErrors
expr: 'increase(digitalocean_errors_total[5m]) > 3'
for: 5m
labels:
severity: warning
annotations:
summary: DigitalOcean exporter collection errors (instance {{ $labels.instance }})
description: "DigitalOcean exporter {{ $labels.collector }} collector has {{ $value }} errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when more than 80% of the account's droplet limit is in use.
- alert: DigitaloceanDropletLimitApproaching
expr: '(count(digitalocean_droplet_up) / digitalocean_account_droplet_limit) * 100 > 80 and digitalocean_account_droplet_limit > 0'
for: 0m
labels:
severity: warning
annotations:
summary: DigitalOcean droplet limit approaching (instance {{ $labels.instance }})
description: "DigitalOcean account is using {{ $value }}% of its droplet quota.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,8 +2,10 @@ groups:
- name: GoogleCadvisor
rules:
# This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
- alert: ContainerKilled
expr: 'time() - container_last_seen > 60'
for: 0m
@ -13,6 +15,7 @@ groups:
summary: Container killed (instance {{ $labels.instance }})
description: "A container has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
- alert: ContainerAbsent
expr: 'absent(container_last_seen)'
for: 5m
@ -22,15 +25,17 @@ groups:
summary: Container absent (instance {{ $labels.instance }})
description: "A container is absent for 5 min\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# Only fires for containers with explicit CPU limits. Containers without limits have cpu_quota=0, which is filtered out by the guard.
- alert: ContainerHighCpuUtilization
-expr: '(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80'
+expr: '(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80 and sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Container High CPU utilization (instance {{ $labels.instance }})
-description: "Container CPU utilization is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Container CPU utilization is above 80% (current: {{ $value | printf \"%.2f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- alert: ContainerHighMemoryUsage
expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80'
for: 2m
@ -41,7 +46,7 @@ groups:
description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ContainerVolumeUsage
-expr: '(1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80'
+expr: '(1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80 and sum(container_fs_inodes_total) BY (instance) > 0'
for: 2m
labels:
severity: warning
@ -50,22 +55,31 @@ groups:
description: "Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ContainerHighThrottleRate
-expr: 'sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 )'
+expr: 'sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 ) and sum(rate(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Container high throttle rate (instance {{ $labels.instance }})
-description: "Container is being throttled\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Container is being throttled ({{ $value | humanizePercentage }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ContainerHighLowChangeCpuUsage
expr: '(abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m] offset 1m)) * 100)) or abs((sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100) - (sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[5m] offset 1m)) * 100))) > 25'
for: 0m
labels:
severity: info
annotations:
summary: Container high low change CPU usage (instance {{ $labels.instance }})
description: "This alert rule monitors the absolute change in CPU usage within a time window and triggers an alert when the change exceeds 25%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ContainerLowCpuUtilization
-expr: '(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20'
+expr: '(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20 and sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) > 0'
for: 7d
labels:
severity: info
annotations:
summary: Container Low CPU utilization (instance {{ $labels.instance }})
-description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU. (current: {{ $value | printf \"%.2f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ContainerLowMemoryUsage
expr: '(sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20'
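
Several rules in this diff gain an `and <denominator> > 0` clause, and the comment on ContainerHighCpuUtilization says the guard filters out containers whose CPU quota is 0. A minimal Python sketch of why that matters, assuming PromQL's IEEE 754 float behaviour (division by zero yields +Inf or NaN instead of raising an error). The `promql_div`, `fires_unguarded`, and `fires_guarded` helpers and the sample values are hypothetical; they only emulate the arithmetic, not Prometheus itself:

```python
import math


def promql_div(numerator: float, denominator: float) -> float:
    """Emulate IEEE 754 division: x/0 is +/-Inf and 0/0 is NaN (no exception)."""
    if denominator == 0:
        if numerator == 0 or math.isnan(numerator):
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def fires_unguarded(usage: float, quota: float, threshold: float = 80.0) -> bool:
    """Shape of the old expression: (usage / quota) * 100 > threshold."""
    return promql_div(usage, quota) * 100 > threshold


def fires_guarded(usage: float, quota: float, threshold: float = 80.0) -> bool:
    """Shape of the new expression: ... > threshold and quota > 0."""
    return quota > 0 and fires_unguarded(usage, quota, threshold)


# Hypothetical containers: (CPU cores used, CPU cores allowed; 0 means "no limit set").
containers = {
    "no-limit": (0.3, 0.0),
    "idle-no-limit": (0.0, 0.0),
    "busy-with-limit": (1.9, 2.0),
}

for name, (usage, quota) in containers.items():
    print(name, fires_unguarded(usage, quota), fires_guarded(usage, quota))
```

Under these assumptions the container without a limit trips the unguarded comparison because +Inf is greater than any threshold, while the guarded form stays silent for it and still fires for the container that is genuinely near its limit.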

dist/rules/ebpf/ebpf-exporter.yml

@ -0,0 +1,34 @@
groups:
- name: EbpfExporter
rules:
# The exporter uses loose attachment: if a program fails to load (missing BTF, kernel incompatibility), it sets this metric to 0 and continues running.
- alert: EbpfExporterProgramNotAttached
expr: 'ebpf_exporter_ebpf_program_attached == 0'
for: 5m
labels:
severity: warning
annotations:
summary: eBPF exporter program not attached (instance {{ $labels.instance }})
description: "eBPF program {{ $labels.id }} failed to attach. The program is not collecting data. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EbpfExporterDecoderErrors
expr: 'rate(ebpf_exporter_decoder_errors_total[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: eBPF exporter decoder errors (instance {{ $labels.instance }})
description: "eBPF exporter is experiencing decoder errors for config {{ $labels.config }}. Kernel data is not being correctly transformed into labels. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EbpfExporterNoEnabledConfigs
expr: 'ebpf_exporter_enabled_configs == 0 or absent(ebpf_exporter_enabled_configs)'
for: 5m
labels:
severity: warning
annotations:
summary: eBPF exporter no enabled configs (instance {{ $labels.instance }})
description: "eBPF exporter has no enabled configurations. No eBPF programs are being run. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,10 +2,11 @@ groups:
- name: PrometheusCommunityElasticsearchExporter
rules:
- alert: ElasticsearchHeapUsageTooHigh
-expr: '(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90'
+expr: '(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90 and elasticsearch_jvm_memory_max_bytes{area="heap"} > 0'
for: 2m
labels:
severity: critical
@ -14,7 +15,7 @@ groups:
description: "The heap usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ElasticsearchHeapUsageWarning
-expr: '(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80'
+expr: '(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80 and elasticsearch_jvm_memory_max_bytes{area="heap"} > 0'
for: 2m
labels:
severity: warning
@ -23,7 +24,7 @@ groups:
description: "The heap usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ElasticsearchDiskOutOfSpace
-expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10'
+expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10 and elasticsearch_filesystem_data_size_bytes > 0'
for: 0m
labels:
severity: critical
@ -32,7 +33,7 @@ groups:
description: "The disk usage is over 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ElasticsearchDiskSpaceLow
-expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20'
+expr: 'elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20 and elasticsearch_filesystem_data_size_bytes > 0'
for: 2m
labels:
severity: warning
@ -58,18 +59,20 @@ groups:
summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})
description: "Elastic Cluster Yellow status\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# 1m delay allows a restart without triggering an alert.
- alert: ElasticsearchHealthyNodes
expr: 'elasticsearch_cluster_health_number_of_nodes < 3'
-for: 0m
+for: 1m
labels:
severity: critical
annotations:
summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
description: "Missing node in Elasticsearch cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# 1m delay allows a restart without triggering an alert.
- alert: ElasticsearchHealthyDataNodes
expr: 'elasticsearch_cluster_health_number_of_data_nodes < 3'
-for: 0m
+for: 1m
labels:
severity: critical
annotations:
@ -114,7 +117,7 @@ groups:
- alert: ElasticsearchUnassignedShards
expr: 'elasticsearch_cluster_health_unassigned_shards > 0'
-for: 0m
+for: 2m
labels:
severity: critical
annotations:
@ -139,17 +142,19 @@ groups:
summary: Elasticsearch no new documents (instance {{ $labels.instance }})
description: "No new documents for 10 min!\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# Threshold of 10ms (0.01s) per indexing operation is a rough default. Adjust based on your document size and cluster performance.
- alert: ElasticsearchHighIndexingLatency
-expr: 'elasticsearch_indices_indexing_index_time_seconds_total / elasticsearch_indices_indexing_index_total > 0.0005'
+expr: 'rate(elasticsearch_indices_indexing_index_time_seconds_total[5m]) / rate(elasticsearch_indices_indexing_index_total[5m]) > 0.01 and rate(elasticsearch_indices_indexing_index_total[5m]) > 0'
for: 10m
labels:
severity: warning
annotations:
summary: Elasticsearch High Indexing Latency (instance {{ $labels.instance }})
-description: "The indexing latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "The indexing latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# Threshold of 10000 ops/s is a rough default. Adjust based on your cluster capacity and expected workload.
- alert: ElasticsearchHighIndexingRate
-expr: 'elasticsearch_indices_indexing_index_total > 100000'
+expr: 'sum(rate(elasticsearch_indices_indexing_index_total[1m]))> 10000'
for: 5m
labels:
severity: warning
@ -157,8 +162,9 @@ groups:
summary: Elasticsearch High Indexing Rate (instance {{ $labels.instance }})
description: "The indexing rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# Threshold of 100 queries/s is very low for most production clusters. Adjust based on your expected query volume.
- alert: ElasticsearchHighQueryRate
-expr: 'elasticsearch_indices_search_query_total > 100000'
+expr: 'sum(rate(elasticsearch_indices_search_query_total[1m])) > 100'
for: 5m
labels:
severity: warning
@ -167,10 +173,10 @@ groups:
description: "The query rate on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ElasticsearchHighQueryLatency
-expr: 'elasticsearch_indices_search_fetch_time_seconds / elasticsearch_indices_search_fetch_total > 1'
+expr: 'rate(elasticsearch_indices_search_query_time_seconds[1m]) / rate(elasticsearch_indices_search_query_total[1m]) > 1 and rate(elasticsearch_indices_search_query_total[1m]) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Elasticsearch High Query Latency (instance {{ $labels.instance }})
-description: "The query latency on Elasticsearch cluster is higher than the threshold.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "The query latency on Elasticsearch cluster is higher than the threshold (current value: {{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
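
The Elasticsearch latency rules in this file move from dividing raw counters to dividing `rate()` results. A minimal Python sketch of the difference, using made-up counter samples. `avg_rate` is a hypothetical simplification of `rate()` that ignores counter resets and extrapolation, and the series names only echo the metrics in the rules:

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def avg_rate(samples: List[Sample], window: float) -> float:
    """Per-second increase between the first and last sample in the trailing window."""
    end_ts = samples[-1][0]
    in_window = [s for s in samples if s[0] >= end_ts - window]
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def cumulative_index_time(minute: int) -> float:
    """Hypothetical workload: 1 ms per operation for 1000 minutes, then 50 ms per operation."""
    if minute <= 1000:
        return 6.0 * minute
    return 6000.0 + 300.0 * (minute - 1000)


# One sample per minute, 6000 indexing operations per minute (100 ops/s) throughout.
index_total = [(60.0 * m, 6000.0 * m) for m in range(1006)]
index_time = [(60.0 * m, cumulative_index_time(m)) for m in range(1006)]

THRESHOLD = 0.01  # seconds per operation, as in the new rule

# Old shape: lifetime counters divided directly.
lifetime_latency = index_time[-1][1] / index_total[-1][1]

# New shape: rate over the last 5 minutes divided by rate over the last 5 minutes.
recent_latency = avg_rate(index_time, 300.0) / avg_rate(index_total, 300.0)

print("lifetime average:", lifetime_latency, "fires:", lifetime_latency > THRESHOLD)
print("last 5 minutes:  ", recent_latency, "fires:", recent_latency > THRESHOLD)
```

In this sketch the lifetime average stays diluted by the long healthy history, while the windowed ratio reflects only the recent slowdown, which is the behaviour the `rate()`-based expressions aim for.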

dist/rules/envoy/embedded-exporter.yml

@ -0,0 +1,177 @@
groups:
- name: EmbeddedExporter
rules:
- alert: EnvoyServerNotLive
expr: 'envoy_server_live != 1'
for: 1m
labels:
severity: critical
annotations:
summary: Envoy server not live (instance {{ $labels.instance }})
description: "Envoy server is not live (draining or shutting down) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighMemoryUsage
expr: 'envoy_server_memory_allocated / envoy_server_memory_heap_size * 100 > 90 and envoy_server_memory_heap_size > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy high memory usage (instance {{ $labels.instance }})
description: "Envoy memory allocated is above 90% of heap size on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighDownstreamHttp5xxErrorRate
expr: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 5 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: Envoy high downstream HTTP 5xx error rate (instance {{ $labels.instance }})
description: "More than 5% of downstream HTTP responses are 5xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighDownstreamHttp4xxErrorRate
expr: 'sum by (instance) (rate(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m])) / sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) * 100 > 10 and sum by (instance) (rate(envoy_http_downstream_rq_completed[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy high downstream HTTP 4xx error rate (instance {{ $labels.instance }})
description: "More than 10% of downstream HTTP responses are 4xx on {{ $labels.instance }} ({{ $value | printf \"%.1f\" }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyDownstreamConnectionsOverflowing
expr: 'increase(envoy_listener_downstream_cx_overflow[5m]) > 5'
for: 0m
labels:
severity: warning
annotations:
summary: Envoy downstream connections overflowing (instance {{ $labels.instance }})
description: "Downstream connections are being rejected due to listener overflow on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyClusterMembershipEmpty
expr: 'envoy_cluster_membership_healthy == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Envoy cluster membership empty (instance {{ $labels.instance }})
description: "Envoy cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} has no healthy members\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyClusterMembershipDegraded
expr: 'envoy_cluster_membership_healthy / envoy_cluster_membership_total * 100 < 75 and envoy_cluster_membership_total > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy cluster membership degraded (instance {{ $labels.instance }})
description: "Only {{ $value | printf \"%.1f\" }}% of members in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} are healthy (threshold: 75%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighClusterUpstreamConnectionFailures
expr: 'increase(envoy_cluster_upstream_cx_connect_fail[5m]) > 10'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy high cluster upstream connection failures (instance {{ $labels.instance }})
description: "High rate of upstream connection failures in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighClusterUpstreamRequestTimeoutRate
expr: 'rate(envoy_cluster_upstream_rq_timeout[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy high cluster upstream request timeout rate (instance {{ $labels.instance }})
description: "More than 5% of upstream requests are timing out in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighClusterUpstream5xxErrorRate
expr: 'rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m]) / rate(envoy_cluster_upstream_rq_completed[5m]) * 100 > 5 and rate(envoy_cluster_upstream_rq_completed[5m]) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: Envoy high cluster upstream 5xx error rate (instance {{ $labels.instance }})
description: "More than 5% of upstream requests return 5xx in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyClusterHealthCheckFailures
expr: 'increase(envoy_cluster_health_check_failure[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy cluster health check failures (instance {{ $labels.instance }})
description: "Health checks are consistently failing in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyClusterOutlierDetectionEjectionsActive
expr: 'envoy_cluster_outlier_detection_ejections_active > 0'
for: 5m
labels:
severity: info
annotations:
summary: Envoy cluster outlier detection ejections active (instance {{ $labels.instance }})
description: "There are active outlier detection ejections in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyListenerSslConnectionErrors
expr: 'increase(envoy_listener_ssl_connection_error[5m]) > 5'
for: 0m
labels:
severity: warning
annotations:
summary: Envoy listener SSL connection errors (instance {{ $labels.instance }})
description: "Envoy listener is experiencing SSL/TLS connection errors on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyGlobalDownstreamConnectionsOverflowing
expr: 'increase(envoy_listener_downstream_global_cx_overflow[5m]) > 5'
for: 0m
labels:
severity: critical
annotations:
summary: Envoy global downstream connections overflowing (instance {{ $labels.instance }})
description: "Downstream connections are being rejected due to global connection limit on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoySslCertificateExpiringSoon
expr: 'envoy_server_days_until_first_cert_expiring < 7'
for: 0m
labels:
severity: warning
annotations:
summary: Envoy SSL certificate expiring soon (instance {{ $labels.instance }})
description: "SSL certificate loaded by Envoy on {{ $labels.instance }} expires in less than 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoySslCertificateExpired
expr: 'envoy_server_days_until_first_cert_expiring < 0'
for: 0m
labels:
severity: critical
annotations:
summary: Envoy SSL certificate expired (instance {{ $labels.instance }})
description: "SSL certificate loaded by Envoy on {{ $labels.instance }} has expired\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyClusterCircuitBreakerTripped
expr: 'envoy_cluster_circuit_breakers_default_cx_open == 1 or envoy_cluster_circuit_breakers_default_rq_open == 1'
for: 0m
labels:
severity: critical
annotations:
summary: Envoy cluster circuit breaker tripped (instance {{ $labels.instance }})
description: "Circuit breaker is open for cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyNoHealthyUpstream
expr: 'increase(envoy_cluster_upstream_cx_none_healthy[5m]) > 3'
for: 0m
labels:
severity: critical
annotations:
summary: Envoy no healthy upstream (instance {{ $labels.instance }})
description: "Upstream connection attempts failed because no healthy upstream was available in cluster {{ $labels.envoy_cluster_name }} on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EnvoyHighDownstreamRequestTimeoutRate
expr: 'increase(envoy_http_downstream_rq_timeout[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: Envoy high downstream request timeout rate (instance {{ $labels.instance }})
description: "Downstream requests are timing out on {{ $labels.instance }} ({{ $value }} in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: EtcdInsufficientMembers
@ -29,24 +30,26 @@ groups:
severity: warning
annotations:
summary: Etcd high number of leader changes (instance {{ $labels.instance }})
-description: "Etcd leader changed more than 2 times during 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Etcd leader changed {{ $value }} times during 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-- alert: EtcdHighNumberOfFailedGrpcRequests
-expr: 'sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01'
+# Filters to actual error codes. grpc_code!="OK" includes benign codes like NotFound, AlreadyExists, and Cancelled.
+- alert: EtcdHighNumberOfFailedGrpcRequestsWarning
+expr: 'sum(rate(grpc_server_handled_total{grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'
for: 2m
labels:
severity: warning
annotations:
-summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
+summary: Etcd high number of failed GRPC requests warning (instance {{ $labels.instance }})
description: "More than 1% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-- alert: EtcdHighNumberOfFailedGrpcRequests
-expr: 'sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05'
+# Filters to actual error codes. grpc_code!="OK" includes benign codes like NotFound, AlreadyExists, and Cancelled.
+- alert: EtcdHighNumberOfFailedGrpcRequestsCritical
+expr: 'sum(rate(grpc_server_handled_total{grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05 and sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0'
for: 2m
labels:
severity: critical
annotations:
-summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
+summary: Etcd high number of failed GRPC requests critical (instance {{ $labels.instance }})
description: "More than 5% GRPC request failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EtcdGrpcRequestsSlow
@ -58,24 +61,27 @@ groups:
summary: Etcd GRPC requests slow (instance {{ $labels.instance }})
description: "GRPC requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-- alert: EtcdHighNumberOfFailedHttpRequests
-expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01'
+# These etcd_http_* metrics are from the etcd v2 API and do not exist in etcd 3.x. Remove these rules if running etcd 3.x.
+- alert: EtcdHighNumberOfFailedHttpRequestsWarning
+expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0'
for: 2m
labels:
severity: warning
annotations:
-summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
+summary: Etcd high number of failed HTTP requests warning (instance {{ $labels.instance }})
description: "More than 1% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
-- alert: EtcdHighNumberOfFailedHttpRequests
-expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05'
+# These etcd_http_* metrics are from the etcd v2 API and do not exist in etcd 3.x. Remove these rules if running etcd 3.x.
+- alert: EtcdHighNumberOfFailedHttpRequestsCritical
+expr: 'sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05 and sum(rate(etcd_http_received_total[1m])) BY (method) > 0'
for: 2m
labels:
severity: critical
annotations:
-summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
+summary: Etcd high number of failed HTTP requests critical (instance {{ $labels.instance }})
description: "More than 5% HTTP failure detected in Etcd\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+# This etcd_http_* metric is from the etcd v2 API and does not exist in etcd 3.x. Remove this rule if running etcd 3.x.
- alert: EtcdHttpRequestsSlow
expr: 'histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15'
for: 2m
@ -86,7 +92,7 @@ groups:
description: "HTTP requests slowing down, 99th percentile is over 0.15s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EtcdMemberCommunicationSlow
-expr: 'histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15'
+expr: 'histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) by (instance, le)) > 0.15'
for: 2m
labels:
severity: warning
@ -101,10 +107,10 @@ groups:
severity: warning
annotations:
summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
-description: "Etcd server got more than 5 failed proposals past hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Etcd server got {{ $value }} failed proposals in the past hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EtcdHighFsyncDurations
-expr: 'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5'
+expr: 'histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) > 0.5'
for: 2m
labels:
severity: warning
@ -113,7 +119,7 @@ groups:
description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: EtcdHighCommitDurations
-expr: 'histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25'
+expr: 'histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le)) > 0.25'
for: 2m
labels:
severity: warning

dist/rules/fluxcd/embedded-exporter.yml

@ -0,0 +1,42 @@
groups:
- name: EmbeddedExporter
rules:
- alert: FluxKustomizationFailure
expr: 'gotk_resource_info{ready="False", customresource_kind="Kustomization"} > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Flux Kustomization Failure (instance {{ $labels.instance }})
description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FluxHelmreleaseFailure
expr: 'gotk_resource_info{ready="False", customresource_kind="HelmRelease"} > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Flux HelmRelease Failure (instance {{ $labels.instance }})
description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' in namespace {{ $labels.exported_namespace }} is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FluxSourceIssue
expr: 'gotk_resource_info{ready="False", customresource_kind=~"GitRepository|HelmRepository|Bucket|OCIRepository"} > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Flux Source Issue (instance {{ $labels.instance }})
description: "Flux source {{ $labels.customresource_kind }} '{{ $labels.name }}' has issue(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FluxImageIssue
expr: 'gotk_resource_info{ready="False", customresource_kind=~"ImagePolicy|ImageRepository|ImageUpdateAutomation"} > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Flux Image Issue (instance {{ $labels.instance }})
description: "The {{ $labels.customresource_kind }} '{{ $labels.name }}' is marked as not ready.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@@ -2,19 +2,20 @@ groups:
- name: ZnerolFreeswitchExporter
rules:
- alert: FreeswitchDown
expr: 'freeswitch_up == 0'
- for: 0m
+ for: 1m
labels:
severity: critical
annotations:
summary: Freeswitch down (instance {{ $labels.instance }})
- description: "Freeswitch is unresponsive\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+ description: "Freeswitch {{ $labels.instance }} is unresponsive.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FreeswitchSessionsWarning
- expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 80'
+ expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 80 and freeswitch_session_limit > 0'
for: 10m
labels:
severity: warning
@@ -23,7 +24,7 @@ groups:
description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: FreeswitchSessionsCritical
- expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 90'
+ expr: '(freeswitch_session_active * 100 / freeswitch_session_limit) > 90 and freeswitch_session_limit > 0'
for: 5m
labels:
severity: critical
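
A promtool unit test could exercise the new `freeswitch_session_limit > 0` guard. The sketch below assumes the rule group is saved next to the test as `freeswitch-rules.yml` (a hypothetical name, since the actual path is not shown in this diff) and uses a made-up instance label `fs1`. With a reported limit of 0 the percentage evaluates to +Inf, which would exceed both thresholds if the guard were absent.

```yaml
# freeswitch_test.yml (hypothetical file name)
# Run with: promtool test rules freeswitch_test.yml
rule_files:
  - freeswitch-rules.yml  # assumed location of the ZnerolFreeswitchExporter group
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The exporter reports active sessions but a session limit of 0.
      - series: 'freeswitch_session_active{instance="fs1"}'
        values: '5x30'
      - series: 'freeswitch_session_limit{instance="fs1"}'
        values: '0x30'
    alert_rule_test:
      # 5 * 100 / 0 is +Inf, but the "and freeswitch_session_limit > 0"
      # clause drops the series, so neither alert is expected to fire.
      - eval_time: 20m
        alertname: FreeswitchSessionsWarning
        exp_alerts: []
      - eval_time: 20m
        alertname: FreeswitchSessionsCritical
        exp_alerts: []
```

Leaving `exp_alerts` empty asserts only that nothing is firing at the given evaluation time, which keeps the test independent of the exact annotation text.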

66
dist/rules/gitlab-ci/gitaly.yml vendored Normal file
@@ -0,0 +1,66 @@
groups:
- name: Gitaly
rules:
# Filters to actual error codes. grpc_code!="OK" includes benign codes like NotFound, AlreadyExists, and Cancelled.
- alert: GitlabGitalyHighGrpcErrorRate
expr: 'sum(rate(grpc_server_handled_total{job="gitaly",grpc_code=~"Internal|Unavailable|DeadlineExceeded|ResourceExhausted|Aborted|Unknown|DataLoss"}[5m])) / sum(rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 5 and sum(rate(grpc_server_handled_total{job="gitaly"}[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Gitaly high gRPC error rate (instance {{ $labels.instance }})
description: "Gitaly on {{ $labels.instance }} is returning more than 5% gRPC errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# ResourceExhausted errors from Gitaly mean Git operations are being rejected due to
# concurrency limits. This directly impacts users trying to push, pull, or clone.
- alert: GitlabGitalyResourceExhausted
expr: 'sum(rate(grpc_server_handled_total{job="gitaly",grpc_code="ResourceExhausted"}[5m])) / sum(rate(grpc_server_handled_total{job="gitaly"}[5m])) * 100 > 1 and sum(rate(grpc_server_handled_total{job="gitaly"}[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: GitLab Gitaly resource exhausted (instance {{ $labels.instance }})
description: "Gitaly on {{ $labels.instance }} is returning ResourceExhausted errors, indicating overload ({{ $value }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabGitalyHighRpcLatency
expr: 'histogram_quantile(0.95, sum(rate(grpc_server_handling_seconds_bucket{job="gitaly",grpc_type="unary"}[5m])) by (le)) > 1'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Gitaly high RPC latency (instance {{ $labels.instance }})
description: "Gitaly on {{ $labels.instance }} p95 unary RPC latency exceeds 1 second ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Brief throttling spikes are normal. Threshold of 0.1s/s (10% of CPU time throttled) filters out transient noise.
- alert: GitlabGitalyCpuThrottled
expr: 'rate(gitaly_cgroup_cpu_cfs_throttled_seconds_total[5m]) > 0.1'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Gitaly CPU throttled (instance {{ $labels.instance }})
description: "Gitaly processes on {{ $labels.instance }} are being CPU throttled by cgroups.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabGitalyAuthenticationFailures
expr: 'increase(gitaly_authentications_total{status="failed"}[5m]) > 3'
for: 0m
labels:
severity: warning
annotations:
summary: GitLab Gitaly authentication failures (instance {{ $labels.instance }})
description: "Gitaly on {{ $labels.instance }} has authentication failures ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# When the circuit breaker trips to "open" state, Git operations (push, pull, clone) will fail.
# Check Gitaly service health and logs.
- alert: GitlabGitalyCircuitBreakerTripped
expr: 'increase(gitaly_circuit_breaker_transitions_total{to_state="open"}[5m]) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: GitLab Gitaly circuit breaker tripped (instance {{ $labels.instance }})
description: "Gitaly circuit breaker has tripped on {{ $labels.instance }}. Git operations are failing.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@@ -0,0 +1,216 @@
groups:
- name: GitlabBuiltInExporter
rules:
# Queued connections indicate Puma workers are saturated.
# Consider increasing puma['worker_processes'] or puma['max_threads'] in gitlab.rb.
- alert: GitlabPumaHighQueuedConnections
expr: 'puma_queued_connections > 5'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Puma high queued connections (instance {{ $labels.instance }})
description: "GitLab Puma has {{ $value }} queued connections on {{ $labels.instance }}. Requests are waiting for an available worker thread.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabPumaNoAvailablePoolCapacity
expr: 'puma_pool_capacity == 0'
for: 5m
labels:
severity: critical
annotations:
summary: GitLab Puma no available pool capacity (instance {{ $labels.instance }})
description: "GitLab Puma pool capacity on {{ $labels.instance }} has been at 0 for 5 minutes. All threads are busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabPumaWorkersNotRunning
expr: 'puma_running_workers < puma_workers'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Puma workers not running (instance {{ $labels.instance }})
description: "GitLab Puma on {{ $labels.instance }} has {{ $value }} running workers out of expected total.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is 5% of all requests returning server errors.
# Check GitLab logs at /var/log/gitlab/ for root cause.
- alert: GitlabHighHttpErrorRate
expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5 and sum(rate(http_requests_total[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: GitLab high HTTP error rate (instance {{ $labels.instance }})
description: "GitLab is returning more than 5% HTTP 5xx errors on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 10s may need adjustment based on your instance size and workload.
- alert: GitlabHighHttpRequestLatency
expr: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 10'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab high HTTP request latency (instance {{ $labels.instance }})
description: "GitLab p95 HTTP request latency on {{ $labels.instance }} is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.
# A sustained failure rate indicates background processing issues.
- alert: GitlabSidekiqJobsFailing
expr: 'rate(sidekiq_jobs_failed_total[5m]) > 0.1'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab Sidekiq jobs failing (instance {{ $labels.instance }})
description: "GitLab Sidekiq jobs are failing at a rate of {{ $value }} per second on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# When running jobs approach the concurrency limit, new jobs will queue up.
# Consider scaling Sidekiq workers or increasing concurrency.
- alert: GitlabSidekiqQueueTooLarge
expr: 'sum(sidekiq_running_jobs) >= sum(sidekiq_concurrency) * 0.9'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab Sidekiq queue too large (instance {{ $labels.instance }})
description: "GitLab Sidekiq has {{ $value }} running jobs, approaching concurrency limit on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.
- alert: GitlabSidekiqHighJobCompletionTime
expr: 'histogram_quantile(0.95, sum(rate(sidekiq_jobs_completion_seconds_bucket[5m])) by (le, worker)) > 300'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab Sidekiq high job completion time (instance {{ $labels.instance }})
description: "GitLab Sidekiq job p95 completion time on {{ $labels.instance }} is above 5 minutes ({{ $value | humanizeDuration }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This metric requires the emit_sidekiq_histogram_metrics feature flag to be enabled.
# High queue latency means jobs are stuck waiting. Check Sidekiq concurrency and queue sizes.
- alert: GitlabSidekiqHighQueueLatency
expr: 'histogram_quantile(0.95, sum(rate(sidekiq_jobs_queue_duration_seconds_bucket[5m])) by (le)) > 60'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Sidekiq high queue latency (instance {{ $labels.instance }})
description: "GitLab Sidekiq jobs on {{ $labels.instance }} are waiting more than 60 seconds before being processed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# When the pool is near saturation, requests may block waiting for a connection.
# Increase db_pool_size in gitlab.rb or investigate slow queries.
- alert: GitlabDatabaseConnectionPoolSaturation
expr: 'gitlab_database_connection_pool_busy / gitlab_database_connection_pool_size * 100 > 90 and gitlab_database_connection_pool_size > 0'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab database connection pool saturation (instance {{ $labels.instance }})
description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) is {{ $value }}% busy.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabDatabaseConnectionPoolDeadConnections
expr: 'gitlab_database_connection_pool_dead > 0'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab database connection pool dead connections (instance {{ $labels.instance }})
description: "GitLab database connection pool on {{ $labels.instance }} ({{ $labels.class }}) has {{ $value }} dead connections.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabDatabaseConnectionPoolWaiting
expr: 'gitlab_database_connection_pool_waiting > 0'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab database connection pool waiting (instance {{ $labels.instance }})
description: "GitLab on {{ $labels.instance }} has {{ $value }} threads waiting for a database connection.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabCiPipelineCreationSlow
expr: 'histogram_quantile(0.95, sum(rate(gitlab_ci_pipeline_creation_duration_seconds_bucket[5m])) by (le)) > 30'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab CI pipeline creation slow (instance {{ $labels.instance }})
description: "GitLab CI pipeline creation p95 latency on {{ $labels.instance }} is above 30 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This metric may not exist in all GitLab versions. Verify against your GitLab installation.
- alert: GitlabCiPipelineFailuresIncreasing
expr: 'deriv(gitlab_ci_pipeline_failure_reasons[5m]) > 0.05'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab CI pipeline failures increasing (instance {{ $labels.instance }})
description: "GitLab CI pipeline failures are increasing on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Frequent runner auth failures may indicate expired tokens or misconfigured runners.
- alert: GitlabCiRunnerAuthenticationFailures
expr: 'increase(gitlab_ci_runner_authentication_failure_total[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab CI runner authentication failures (instance {{ $labels.instance }})
description: "GitLab CI runners are experiencing authentication failures on {{ $labels.instance }} ({{ $value }} failures).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 2GB may need adjustment based on your instance size.
# High memory usage can lead to OOM kills and service disruptions.
- alert: GitlabHighMemoryUsage
expr: 'process_resident_memory_bytes{job=~".*gitlab.*"} > 2e+9'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab high memory usage (instance {{ $labels.instance }})
description: "GitLab process on {{ $labels.instance }} is using {{ $value | humanize1024 }}B of RSS memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Heap fragmentation above 50% means a significant amount of memory is wasted.
# A Puma worker restart may help reclaim memory.
- alert: GitlabRubyHeapFragmentation
expr: 'ruby_gc_stat_ext_heap_fragmentation{job=~".*gitlab.*"} > 0.5'
for: 15m
labels:
severity: warning
annotations:
summary: GitLab Ruby heap fragmentation (instance {{ $labels.instance }})
description: "GitLab Ruby heap fragmentation on {{ $labels.instance }} is {{ $value }}. High fragmentation wastes memory.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabRackUncaughtErrors
expr: 'rate(rack_uncaught_errors_total[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab rack uncaught errors (instance {{ $labels.instance }})
description: "GitLab is experiencing uncaught errors in the Rack layer on {{ $labels.instance }} ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This may happen during a rolling deployment. If it persists, investigate incomplete upgrades.
- alert: GitlabVersionMismatch
expr: 'count(count by (version) (gitlab_build_info)) > 1'
for: 0m
labels:
severity: warning
annotations:
summary: GitLab version mismatch (instance {{ $labels.instance }})
description: "Multiple GitLab versions are running across the fleet.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabHighFileDescriptorUsage
expr: 'process_open_fds{job=~".*gitlab.*"} / process_max_fds * 100 > 80 and process_max_fds > 0'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab high file descriptor usage (instance {{ $labels.instance }})
description: "GitLab on {{ $labels.instance }} is using {{ $value }}% of available file descriptors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabRubyThreadsSaturated
expr: 'sum by (instance) (gitlab_ruby_threads_running_threads) > on(instance) gitlab_ruby_threads_max_expected_threads * 1.5'
for: 10m
labels:
severity: warning
annotations:
summary: GitLab Ruby threads saturated (instance {{ $labels.instance }})
description: "GitLab running threads on {{ $labels.instance }} have exceeded the expected maximum ({{ $value }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

36
dist/rules/gitlab-ci/workhorse.yml vendored Normal file
@@ -0,0 +1,36 @@
groups:
- name: Workhorse
rules:
# Workhorse sits in front of Puma and handles Git HTTP, file uploads, and proxying.
# Threshold from GitLab Omnibus default rules: 10% for high-traffic instances.
- alert: GitlabWorkhorseHighErrorRate
expr: 'sum(rate(gitlab_workhorse_http_request_duration_seconds_count{code=~"5.."}[5m])) / sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) * 100 > 10 and sum(rate(gitlab_workhorse_http_request_duration_seconds_count[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: GitLab Workhorse high error rate (instance {{ $labels.instance }})
description: "GitLab Workhorse on {{ $labels.instance }} is returning more than 10% HTTP 5xx errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GitlabWorkhorseHighLatency
expr: 'histogram_quantile(0.95, sum(rate(gitlab_workhorse_http_request_duration_seconds_bucket[5m])) by (le)) > 10'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Workhorse high latency (instance {{ $labels.instance }})
description: "GitLab Workhorse on {{ $labels.instance }} p95 request latency is above 10 seconds.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 100 may need adjustment based on instance size.
- alert: GitlabWorkhorseHighIn-flightRequests
expr: 'gitlab_workhorse_http_in_flight_requests > 100'
for: 5m
labels:
severity: warning
annotations:
summary: GitLab Workhorse high in-flight requests (instance {{ $labels.instance }})
description: "GitLab Workhorse on {{ $labels.instance }} has {{ $value }} in-flight requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

109
dist/rules/golang/golang-exporter.yml vendored Normal file
@@ -0,0 +1,109 @@
groups:
- name: GolangExporter
rules:
# Threshold is a rough default. High-concurrency servers may legitimately run thousands of goroutines. Adjust to match your baseline.
- alert: GoGoroutineCountHigh
expr: 'go_goroutines > 1000'
for: 5m
labels:
severity: warning
annotations:
summary: Go goroutine count high (instance {{ $labels.instance }})
description: "Go application has too many goroutines (> 1000), potential goroutine leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# quantile="1" is the maximum observed GC pause in the current summary window, not p99.
# A single outlier pause can push this above 1s. The for: 5m ensures the max stays elevated.
- alert: GoGcDurationHigh
expr: 'go_gc_duration_seconds{quantile="1"} > 1'
for: 5m
labels:
severity: warning
annotations:
summary: Go GC duration high (instance {{ $labels.instance }})
description: "Go GC pause duration is too high (max > 1s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# go_memstats_sys_bytes is the total memory obtained from the OS by the Go runtime, not total host memory.
# This ratio measures Go-internal memory utilization, not system-level memory pressure.
- alert: GoMemoryUsageHigh
expr: '(go_memstats_heap_alloc_bytes / go_memstats_sys_bytes) * 100 > 90'
for: 5m
labels:
severity: warning
annotations:
summary: Go memory usage high (instance {{ $labels.instance }})
description: "Go heap allocation is using most of the runtime's reserved memory (> 90%), indicating the process may need more memory or has a leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is workload-dependent. Applications with heavy CGo or blocking I/O may legitimately use more OS threads. Adjust to match your baseline.
- alert: GoThreadCountHigh
expr: 'go_threads > 500'
for: 5m
labels:
severity: warning
annotations:
summary: Go thread count high (instance {{ $labels.instance }})
description: "Go OS thread count is high (> 500), potential blocking syscall or CGo leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is a rough default. Adjust based on your application's normal object count.
- alert: GoHeapObjectsCountHigh
expr: 'go_memstats_heap_objects > 10000000'
for: 5m
labels:
severity: warning
annotations:
summary: Go heap objects count high (instance {{ $labels.instance }})
description: "Go heap has too many live objects (> 10M), high GC pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# rate(go_gc_duration_seconds_sum) approximates the fraction of wall-clock time spent in GC.
# This replaces go_memstats_gc_cpu_fraction which was removed in client_golang v1.12+.
- alert: GoGcCpuFractionHigh
expr: 'rate(go_gc_duration_seconds_sum[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Go GC CPU fraction high (instance {{ $labels.instance }})
description: "Go GC is consuming too much CPU (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A threshold of 100/s only catches catastrophic leaks (30k goroutines in 5m). 10/s catches gradual leaks (~3k in 5m).
# Adjust based on your application's expected concurrency patterns.
- alert: GoGoroutineSpike
expr: 'deriv(go_goroutines[5m]) > 10'
for: 5m
labels:
severity: warning
annotations:
summary: Go goroutine spike (instance {{ $labels.instance }})
description: "Go goroutine count is growing rapidly ({{ $value | printf \"%.0f\" }} goroutines/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Alerts when heap in-use grows by more than 10MB/s sustained over 10 minutes.
# Adjust threshold based on your workload.
- alert: GoHeapIn-useGrowing
expr: 'deriv(go_memstats_heap_inuse_bytes[10m]) > 1e7'
for: 0m
labels:
severity: warning
annotations:
summary: Go heap in-use growing (instance {{ $labels.instance }})
description: "Go heap in-use memory is growing steadily, potential memory leak or under-sized heap\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GoMemoryLeak
expr: 'rate(go_memstats_alloc_bytes_total[5m]) > 1e9'
for: 5m
labels:
severity: warning
annotations:
summary: Go memory leak (instance {{ $labels.instance }})
description: "Go application has sustained high allocation rate (> 1GB/s), potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: GoStackMemoryHigh
expr: 'go_memstats_stack_inuse_bytes > 1e9'
for: 5m
labels:
severity: warning
annotations:
summary: Go stack memory high (instance {{ $labels.instance }})
description: "Go stack memory usage is high (> 1GB), likely excessive goroutines or deep recursion\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
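
The `GoGoroutineSpike` threshold is expressed in goroutines per second, so a quick promtool unit test may help confirm that ordinary growth stays below it. This is a minimal sketch, assuming the test file sits in the same directory as `golang-exporter.yml`; the instance label `app1` and the sample values are invented for illustration.

```yaml
# golang_exporter_test.yml (hypothetical file name)
# Run with: promtool test rules golang_exporter_test.yml
rule_files:
  - golang-exporter.yml  # assumed to be in the same directory as this test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Starts at 100 and adds 60 goroutines per minute, i.e. 1 goroutine/s.
      - series: 'go_goroutines{instance="app1"}'
        values: '100+60x30'
    alert_rule_test:
      # deriv(go_goroutines[5m]) is about 1/s here, well under the 10/s
      # threshold, so the spike alert is not expected to fire.
      - eval_time: 20m
        alertname: GoGoroutineSpike
        exp_alerts: []
```

The same series eventually exceeds 1000 goroutines, so `GoGoroutineCountHigh` would fire on it; the test only checks the alert named in `alertname`.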

@@ -0,0 +1,53 @@
groups:
- name: StackdriverExporter
# Self-monitoring metrics use the stackdriver_monitoring_* prefix.
# All self-monitoring metrics include a project_id label.
rules:
- alert: StackdriverExporterScrapeError
expr: 'stackdriver_monitoring_last_scrape_error > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Stackdriver exporter scrape error (instance {{ $labels.instance }})
description: "Stackdriver exporter failed to scrape metrics from Google Cloud Monitoring API for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StackdriverExporterSlowScrape
expr: 'stackdriver_monitoring_last_scrape_duration_seconds > 300'
for: 5m
labels:
severity: warning
annotations:
summary: Stackdriver exporter slow scrape (instance {{ $labels.instance }})
description: "Stackdriver exporter scrape for project {{ $labels.project_id }} is taking more than 5 minutes ({{ $value }}s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StackdriverExporterScrapeErrorsIncreasing
expr: 'increase(stackdriver_monitoring_scrape_errors_total[15m]) > 5'
for: 0m
labels:
severity: warning
annotations:
summary: Stackdriver exporter scrape errors increasing (instance {{ $labels.instance }})
description: "Stackdriver exporter has had {{ $value }} scrape errors in the last 15 minutes for project {{ $labels.project_id }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StackdriverExporterHighApiCalls
expr: 'rate(stackdriver_monitoring_api_calls_total[5m]) * 60 > 100'
for: 0m
labels:
severity: warning
annotations:
summary: Stackdriver exporter high API calls (instance {{ $labels.instance }})
description: "Stackdriver exporter is making {{ $value }} API calls per minute for project {{ $labels.project_id }}. This may hit Google Cloud Monitoring API quotas.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StackdriverExporterScrapeStale
expr: 'time() - stackdriver_monitoring_last_scrape_timestamp > 600'
for: 0m
labels:
severity: warning
annotations:
summary: Stackdriver exporter scrape stale (instance {{ $labels.instance }})
description: "Stackdriver exporter has not successfully scraped metrics for project {{ $labels.project_id }} in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@@ -0,0 +1,15 @@
groups:
- name: EmbeddedExporter
rules:
- alert: GrafanaAlloyServiceDown
expr: 'count by (instance) (alloy_build_info offset 2h) unless count by (instance) (alloy_build_info)'
for: 0m
labels:
severity: critical
annotations:
summary: Grafana Alloy service down (instance {{ $labels.instance }})
description: "Alloy on instance {{ $labels.instance }} is not responding or has stopped running.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@@ -0,0 +1,465 @@
groups:
- name: EmbeddedExporter
# Mimir uses the `cortex_` metric prefix for backward compatibility with Cortex. This is intentional and expected.
rules:
- alert: MimirIngesterUnhealthy
expr: 'min by (job) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir ingester unhealthy (instance {{ $labels.instance }})
description: "Mimir has {{ $value }} unhealthy ingester(s) in the ring.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRequestErrors
expr: '100 * sum by (job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route!~"ready|debug_pprof"}[5m])) / sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[5m])) > 1 and sum by (job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[5m])) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir request errors (instance {{ $labels.instance }})
description: "Mimir {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirInconsistentRuntimeConfig
expr: 'count(count by (job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1'
for: 1h
labels:
severity: critical
annotations:
summary: Mimir inconsistent runtime config (instance {{ $labels.instance }})
description: "An inconsistent runtime config file is used across Mimir instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirBadRuntimeConfig
expr: 'sum by (job) (cortex_runtime_config_last_reload_successful == 0) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir bad runtime config (instance {{ $labels.instance }})
description: "{{ $labels.job }} failed to reload runtime config.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirSchedulerQueriesStuck
expr: 'sum by (job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0'
for: 7m
labels:
severity: critical
annotations:
summary: Mimir scheduler queries stuck (instance {{ $labels.instance }})
description: "There are {{ $value }} queued up queries in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirCacheRequestErrors
expr: '(sum by (name, operation, job) (rate(thanos_cache_operation_failures_total[5m])) / sum by (name, operation, job) (rate(thanos_cache_operations_total[5m]))) * 100 > 5 and sum by (name, operation, job) (rate(thanos_cache_operations_total[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir cache request errors (instance {{ $labels.instance }})
description: "Mimir cache {{ $labels.name }} is experiencing {{ printf \"%.2f\" $value }}% errors for {{ $labels.operation }} operation.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirKvStoreFailure
expr: '(sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.."}[5m])) / sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m]))) == 1 and sum by (job, kv_name) (rate(cortex_kv_request_duration_seconds_count[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir KV store failure (instance {{ $labels.instance }})
description: "Mimir {{ $labels.job }} KV store {{ $labels.kv_name }} is failing with 100% error rate.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirMemoryMapAreasTooHigh
expr: 'process_memory_map_areas{job=~".*(ingester|cortex|mimir|store-gateway).*"} / process_memory_map_areas_limit{job=~".*(ingester|cortex|mimir|store-gateway).*"} * 100 > 80 and process_memory_map_areas_limit{job=~".*(ingester|cortex|mimir|store-gateway).*"} > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir memory map areas too high (instance {{ $labels.instance }})
description: "Mimir {{ $labels.job }} is using {{ printf \"%.0f\" $value }}% of its memory map area limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngesterInstanceHasNoTenants
expr: '(cortex_ingester_memory_users == 0) and on (instance) (cortex_ingester_memory_users offset 1h > 0)'
for: 1h
labels:
severity: warning
annotations:
summary: Mimir ingester instance has no tenants (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRulerInstanceHasNoRuleGroups
expr: '(cortex_ruler_managers_total == 0) and on (instance) (cortex_ruler_managers_total offset 1h > 0)'
for: 1h
labels:
severity: warning
annotations:
summary: Mimir ruler instance has no rule groups (instance {{ $labels.instance }})
description: "Mimir ruler {{ $labels.instance }} has no rule groups assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngestedDataTooFarInTheFuture
expr: 'max by (job) (cortex_ingester_tsdb_head_max_timestamp_seconds - time() and cortex_ingester_tsdb_head_max_timestamp_seconds > 0) > 3600'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir ingested data too far in the future (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.job }} has ingested samples with timestamps more than 1 hour in the future.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirStoreGatewayTooManyFailedOperations
expr: 'sum by (job) (rate(thanos_objstore_bucket_operation_failures_total[5m])) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir store gateway too many failed operations (instance {{ $labels.instance }})
description: "Mimir store-gateway {{ $labels.job }} bucket operations are failing ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRingMembersMismatch
expr: 'max by (name, job) (sum by (name, job, instance) (cortex_ring_members)) != min by (name, job) (sum by (name, job, instance) (cortex_ring_members))'
for: 15m
labels:
severity: warning
annotations:
summary: Mimir ring members mismatch (instance {{ $labels.instance }})
description: "Mimir {{ $labels.name }} ring has inconsistent member counts across instances.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngesterReachingSeriesLimitWarning
expr: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_series"} > 0'
for: 3h
labels:
severity: warning
annotations:
summary: Mimir ingester reaching series limit warning (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngesterReachingSeriesLimitCritical
expr: '(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"} * 100 > 90) and cortex_ingester_instance_limits{limit="max_series"} > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir ingester reaching series limit critical (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its series limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngesterReachingTenantsLimitWarning
expr: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 70) and cortex_ingester_instance_limits{limit="max_tenants"} > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir ingester reaching tenants limit warning (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirIngesterReachingTenantsLimitCritical
expr: '(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"} * 100 > 80) and cortex_ingester_instance_limits{limit="max_tenants"} > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir ingester reaching tenants limit critical (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} has reached {{ printf \"%.0f\" $value }}% of its tenants limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirReachingTcpConnectionsLimit
expr: 'cortex_tcp_connections / cortex_tcp_connections_limit * 100 > 80 and cortex_tcp_connections_limit > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir reaching TCP connections limit (instance {{ $labels.instance }})
description: "Mimir instance {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its TCP connections limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirDistributorInflightRequestsHigh
expr: '(cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"} * 100 > 80) and cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir distributor inflight requests high (instance {{ $labels.instance }})
description: "Mimir distributor {{ $labels.instance }} is using {{ printf \"%.0f\" $value }}% of its inflight push requests limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
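# Worked arithmetic for this threshold (a sketch): rate(x[5m]) > 0.05 is the same condition as
# increase(x[5m]) > 15, because increase() is rate() multiplied by the window length (0.05 * 300 s = 15).
# An equivalent count-based form, shown only for illustration:
#   expr: 'increase(cortex_ingester_tsdb_compactions_failed_total[5m]) > 15'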
- alert: MimirIngesterTsdbHeadCompactionFailed
expr: 'rate(cortex_ingester_tsdb_compactions_failed_total[5m]) > 0.05'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir ingester TSDB head compaction failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to compact TSDB head ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirIngesterTsdbHeadTruncationFailed
expr: 'rate(cortex_ingester_tsdb_head_truncations_failed_total[5m]) > 0.05'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir ingester TSDB head truncation failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB head ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirIngesterTsdbCheckpointCreationFailed
expr: 'rate(cortex_ingester_tsdb_checkpoint_creations_failed_total[5m]) > 0.05'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir ingester TSDB checkpoint creation failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to create TSDB checkpoints ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirIngesterTsdbCheckpointDeletionFailed
expr: 'rate(cortex_ingester_tsdb_checkpoint_deletions_failed_total[5m]) > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Mimir ingester TSDB checkpoint deletion failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to delete TSDB checkpoints ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirIngesterTsdbWalTruncationFailed
expr: 'rate(cortex_ingester_tsdb_wal_truncations_failed_total[5m]) > 0.05'
for: 0m
labels:
severity: warning
annotations:
summary: Mimir ingester TSDB WAL truncation failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to truncate TSDB WAL ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirIngesterTsdbWalWritesFailed
expr: 'rate(cortex_ingester_tsdb_wal_writes_failed_total[1m]) > 0.05'
for: 3m
labels:
severity: critical
annotations:
summary: Mimir ingester TSDB WAL writes failed (instance {{ $labels.instance }})
description: "Mimir ingester {{ $labels.instance }} is failing to write to TSDB WAL ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 30 minutes. Adjust based on your sync interval.
- alert: MimirStoreGatewayHasNotSyncedBucket
expr: '(time() - cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 1800) and cortex_bucket_stores_blocks_last_successful_sync_timestamp_seconds{component="store-gateway"} > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir store gateway has not synced bucket (instance {{ $labels.instance }})
description: "Mimir store-gateway {{ $labels.instance }} has not synced the bucket for more than 30 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
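# The lookback offset (2h) must exceed the `for` duration (1h); with equal values the offset clause turns false just as the alert would fire.
- alert: MimirStoreGatewayNoSyncedTenants
expr: '(min by (instance, job) (cortex_bucket_stores_tenants_synced{component="store-gateway"}) == 0) and on (instance) (cortex_bucket_stores_tenants_synced{component="store-gateway"} offset 2h > 0)'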
for: 1h
labels:
severity: warning
annotations:
summary: Mimir store gateway no synced tenants (instance {{ $labels.instance }})
description: "Mimir store-gateway {{ $labels.instance }} has no synced tenants.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirBucketIndexNotUpdated
expr: 'min by (user, job) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100'
for: 0m
labels:
severity: critical
annotations:
summary: Mimir bucket index not updated (instance {{ $labels.instance }})
description: "Mimir bucket index for tenant {{ $labels.user }} has not been updated for more than 35 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirCompactorNotCleaningUpBlocks
expr: '(time() - cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 21600) and cortex_compactor_block_cleanup_last_successful_run_timestamp_seconds > 0'
for: 1h
labels:
severity: critical
annotations:
summary: Mimir compactor not cleaning up blocks (instance {{ $labels.instance }})
description: "Mimir compactor {{ $labels.instance }} has not cleaned up blocks in the last 6 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirCompactorNotRunningCompaction
expr: '(time() - cortex_compactor_last_successful_run_timestamp_seconds > 86400) and cortex_compactor_last_successful_run_timestamp_seconds > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir compactor not running compaction (instance {{ $labels.instance }})
description: "Mimir compactor {{ $labels.instance }} has not run compaction in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirCompactorHasConsecutiveFailures
expr: 'increase(cortex_compactor_runs_failed_total{reason!="shutdown"}[2h]) > 1'
for: 0m
labels:
severity: critical
annotations:
summary: Mimir compactor has consecutive failures (instance {{ $labels.instance }})
description: "Mimir compactor {{ $labels.instance }} has had {{ $value }} compaction failures in the last 2 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# cortex_compactor_disk_out_of_space_errors_total is declared as gauge by Mimir despite the _total suffix, so delta() is used instead of increase().
- alert: MimirCompactorHasRunOutOfDiskSpace
expr: 'delta(cortex_compactor_disk_out_of_space_errors_total[24h]) >= 1'
for: 0m
labels:
severity: critical
annotations:
summary: Mimir compactor has run out of disk space (instance {{ $labels.instance }})
description: "Mimir compactor {{ $labels.instance }} has run out of disk space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirCompactorHasNotUploadedBlocks
expr: '(time() - thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 86400) and thanos_objstore_bucket_last_successful_upload_time{component="compactor"} > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir compactor has not uploaded blocks (instance {{ $labels.instance }})
description: "Mimir compactor {{ $labels.instance }} has not uploaded any block in the last 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Using a 24h window as compaction skips are rare events.
- alert: MimirCompactorSkippedBlocks
expr: 'increase(cortex_compactor_blocks_marked_for_no_compaction_total[24h]) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir compactor skipped blocks (instance {{ $labels.instance }})
description: "Mimir compactor has found {{ $value }} blocks that cannot be compacted (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRulerTooManyFailedPushes
expr: '100 * sum by (instance, job) (rate(cortex_ruler_write_requests_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_write_requests_total[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir ruler too many failed pushes (instance {{ $labels.instance }})
description: "Mimir ruler {{ $labels.instance }} is failing to push {{ printf \"%.2f\" $value }}% of write requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRulerTooManyFailedQueries
expr: '100 * sum by (instance, job) (rate(cortex_ruler_queries_failed_total[5m])) / sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 1 and sum by (instance, job) (rate(cortex_ruler_queries_total[5m])) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir ruler too many failed queries (instance {{ $labels.instance }})
description: "Mimir ruler {{ $labels.instance }} is failing {{ printf \"%.2f\" $value }}% of query evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirRulerMissedEvaluations
expr: '100 * sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_missed_total[5m])) / sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 1 and sum by (instance, job) (rate(cortex_prometheus_rule_group_iterations_total[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Mimir ruler missed evaluations (instance {{ $labels.instance }})
description: "Mimir ruler {{ $labels.instance }} is missing {{ printf \"%.2f\" $value }}% of rule group evaluations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirRulerFailedRingCheck
expr: 'sum by (job) (rate(cortex_ruler_ring_check_errors_total[5m])) > 0.05'
for: 5m
labels:
severity: critical
annotations:
summary: Mimir ruler failed ring check (instance {{ $labels.instance }})
description: "Mimir ruler {{ $labels.job }} is failing ring checks ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirAlertmanagerSyncConfigsFailing
expr: 'rate(cortex_alertmanager_sync_configs_failed_total[5m]) > 0.05'
for: 30m
labels:
severity: critical
annotations:
summary: Mimir alertmanager sync configs failing (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} is failing to sync configs ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirAlertmanagerRingCheckFailing
expr: 'rate(cortex_alertmanager_ring_check_errors_total[5m]) > 0.05'
for: 10m
labels:
severity: critical
annotations:
summary: Mimir alertmanager ring check failing (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} is failing ring checks ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirAlertmanagerStateMergeFailing
expr: 'rate(cortex_alertmanager_partial_state_merges_failed_total[5m]) > 0.05'
for: 10m
labels:
severity: critical
annotations:
summary: Mimir alertmanager state merge failing (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} is failing to merge state updates ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirAlertmanagerReplicationFailing
expr: 'rate(cortex_alertmanager_state_replication_failed_total[5m]) > 0.05'
for: 10m
labels:
severity: critical
annotations:
summary: Mimir alertmanager replication failing (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} is failing to replicate state ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: MimirAlertmanagerPersistStateFailing
expr: 'rate(cortex_alertmanager_state_persist_failed_total[15m]) > 0.05'
for: 1h
labels:
severity: critical
annotations:
summary: Mimir alertmanager persist state failing (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} is failing to persist state ({{ $value | humanize }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirAlertmanagerInitialSyncFailed
expr: 'increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[1m]) > 0'
for: 0m
labels:
severity: warning
annotations:
summary: Mimir alertmanager initial sync failed (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.job }} failed initial state sync.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
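# The lookback offset (2h) must exceed the `for` duration (1h); with equal values the offset clause turns false just as the alert would fire.
- alert: MimirAlertmanagerInstanceHasNoTenants
expr: '(cortex_alertmanager_tenants_owned == 0) and on (instance) (cortex_alertmanager_tenants_owned offset 2h > 0)'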
for: 1h
labels:
severity: warning
annotations:
summary: Mimir alertmanager instance has no tenants (instance {{ $labels.instance }})
description: "Mimir alertmanager {{ $labels.instance }} has no tenants assigned.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirGossipMembersCountTooHigh
expr: 'avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 1.15 + 10 < max(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job)'
for: 20m
labels:
severity: warning
annotations:
summary: Mimir gossip members count too high (instance {{ $labels.instance }})
description: "Mimir gossip cluster has more members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirGossipMembersCountTooLow
expr: 'avg(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job) * 0.5 > min(memberlist_client_cluster_members_count{job=~".*(mimir|cortex).*"}) by (job)'
for: 20m
labels:
severity: warning
annotations:
summary: Mimir gossip members count too low (instance {{ $labels.instance }})
description: "Mimir gossip cluster has fewer members than expected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# go_threads counts OS threads, not goroutines. A high value usually means many goroutines are blocked in system calls or cgo calls.
- alert: MimirGoThreadsTooHighWarning
expr: 'go_threads{job=~".*(mimir|cortex).*"} > 5000'
for: 15m
labels:
severity: warning
annotations:
summary: Mimir go threads too high warning (instance {{ $labels.instance }})
description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MimirGoThreadsTooHighCritical
expr: 'go_threads{job=~".*(mimir|cortex).*"} > 8000'
for: 15m
labels:
severity: critical
annotations:
summary: Mimir go threads too high critical (instance {{ $labels.instance }})
description: "Mimir {{ $labels.instance }} has {{ $value }} Go threads.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,175 @@
groups:
- name: EmbeddedExporter
rules:
- alert: TempoDistributorUnhealthy
expr: 'max by (job) (tempo_ring_members{state="Unhealthy", name="distributor"}) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Tempo distributor unhealthy (instance {{ $labels.instance }})
description: "Tempo has {{ $value }} unhealthy distributor(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoLiveStoreUnhealthy
expr: 'max by (job) (tempo_ring_members{state="Unhealthy", name="live-store"}) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Tempo live store unhealthy (instance {{ $labels.instance }})
description: "Tempo has {{ $value }} unhealthy live store(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoMetricsGeneratorUnhealthy
expr: 'max by (job) (tempo_ring_members{state="Unhealthy", name="metrics-generator"}) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Tempo metrics generator unhealthy (instance {{ $labels.instance }})
description: "Tempo has {{ $value }} unhealthy metrics generator(s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Uses a two-window approach: 1h for historical count and 5m to confirm the issue is ongoing.
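# Reading the two clauses (a sketch with illustrative numbers): the 1h clause needs more than 2 errors in the last hour,
# and the 5m clause needs at least one of them in the last 5 minutes.
# Example: 3 errors 40 minutes ago and none since satisfy the 1h clause but not the 5m clause, so the expression returns nothing.
# Example: 3 errors 40 minutes ago plus 1 error two minutes ago satisfy both clauses, so the alert becomes pending.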
- alert: TempoCompactionsFailing
expr: 'sum by (job) (increase(tempodb_compaction_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_compaction_errors_total[5m])) > 0'
for: 1h
labels:
severity: critical
annotations:
summary: Tempo compactions failing (instance {{ $labels.instance }})
description: "{{ $value }} compactions have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoPollsFailing
expr: 'sum by (job) (increase(tempodb_blocklist_poll_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_poll_errors_total[5m])) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Tempo polls failing (instance {{ $labels.instance }})
description: "{{ $value }} blocklist polls have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoTenantIndexFailures
expr: 'sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[1h])) > 2 and sum by (job) (increase(tempodb_blocklist_tenant_index_errors_total[5m])) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Tempo tenant index failures (instance {{ $labels.instance }})
description: "{{ $value }} tenant index failures in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoNoTenantIndexBuilders
expr: 'sum by (tenant) (tempodb_blocklist_tenant_index_builder) == 0 and on() max(tempodb_blocklist_length) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Tempo no tenant index builders (instance {{ $labels.instance }})
description: "No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 600s (10 minutes). Adjust based on your tenant index build interval.
- alert: TempoTenantIndexTooOld
expr: 'max by (tenant) (tempodb_blocklist_tenant_index_age_seconds) > 600'
for: 5m
labels:
severity: critical
annotations:
summary: Tempo tenant index too old (instance {{ $labels.instance }})
description: "Tenant index for {{ $labels.tenant }} is {{ $value }}s old.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when the blocklist grows more than 40% over 7 days.
- alert: TempoBlockListRisingQuickly
expr: '(avg(tempodb_blocklist_length) / avg(tempodb_blocklist_length offset 7d) - 1) * 100 > 40 and avg(tempodb_blocklist_length offset 7d) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Tempo block list rising quickly (instance {{ $labels.instance }})
description: "Tempo blocklist length is up {{ printf \"%.0f\" $value }}% over the last 7 days. Consider scaling compactors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoBadOverrides
expr: 'sum by (job) (tempo_runtime_config_last_reload_successful == 0) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Tempo bad overrides (instance {{ $labels.instance }})
description: "{{ $labels.job }} failed to reload runtime overrides.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoUserConfigurableOverridesReloadFailing
expr: 'sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[1h])) > 5 and sum by (job) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total[5m])) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Tempo user configurable overrides reload failing (instance {{ $labels.instance }})
description: "{{ $value }} user-configurable overrides reloads have failed in the past hour.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 100 blocks per compactor instance. Adjust based on your environment.
- alert: TempoCompactionTooManyOutstandingBlocksWarning
expr: 'sum by (instance) (tempodb_compaction_outstanding_blocks) > 100'
for: 6h
labels:
severity: warning
annotations:
summary: Tempo compaction too many outstanding blocks warning (instance {{ $labels.instance }})
description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Consider increasing compactor resources.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 250 blocks per compactor instance. Normalize by backend-worker count if needed. Adjust based on your environment.
- alert: TempoCompactionTooManyOutstandingBlocksCritical
expr: 'sum by (instance) (tempodb_compaction_outstanding_blocks) > 250'
for: 24h
labels:
severity: critical
annotations:
summary: Tempo compaction too many outstanding blocks critical (instance {{ $labels.instance }})
description: "There are too many outstanding compaction blocks for {{ $labels.instance }}. Increase compactor resources immediately.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: TempoDistributorUsageTrackerErrors
expr: 'sum by (job, reason) (rate(tempo_distributor_usage_tracker_errors_total[5m])) > 0.05'
for: 30m
labels:
severity: critical
annotations:
summary: Tempo distributor usage tracker errors (instance {{ $labels.instance }})
description: "Tempo distributor usage tracker errors for {{ $labels.job }} at {{ $value | humanize }}/s (reason {{ $labels.reason }}).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoMetricsGeneratorProcessorUpdatesFailing
expr: 'sum by (job) (increase(tempo_metrics_generator_active_processors_update_failed_total[5m])) > 2'
for: 15m
labels:
severity: critical
annotations:
summary: Tempo metrics generator processor updates failing (instance {{ $labels.instance }})
description: "Tempo metrics generator processor updates are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoMetricsGeneratorServiceGraphsDroppingSpans
expr: '100 * sum by (job) (rate(tempo_metrics_generator_processor_service_graphs_dropped_spans_total[5m])) / sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0.5 and sum by (job) (rate(tempo_metrics_generator_spans_received_total[5m])) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Tempo metrics generator service graphs dropping spans (instance {{ $labels.instance }})
description: "Tempo metrics generator is dropping {{ printf \"%.2f\" $value }}% of spans in service graphs for {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: TempoMetricsGeneratorCollectionsFailing
expr: 'sum by (job) (increase(tempo_metrics_generator_registry_collections_failed_total[5m])) > 2'
for: 5m
labels:
severity: critical
annotations:
summary: Tempo metrics generator collections failing (instance {{ $labels.instance }})
description: "Tempo metrics generator collections are failing for {{ $labels.job }} ({{ $value }} failures in 5m).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when the memcached error rate exceeds 20%. Only relevant if Tempo is configured with memcached caching.
- alert: TempoMemcachedErrorsElevated
expr: '100 * sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count{status_code="500"}[5m])) / sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 20 and sum by (name, job) (rate(tempo_memcache_request_duration_seconds_count[5m])) > 0'
for: 10m
labels:
severity: warning
annotations:
summary: Tempo memcached errors elevated (instance {{ $labels.instance }})
description: "Tempo memcached error rate is {{ printf \"%.2f\" $value }}% for {{ $labels.name }} in {{ $labels.job }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: ProviderFailedBecauseNet_versionFailed
@ -40,20 +41,22 @@ groups:
summary: Provider failed because get genesis timeout (instance {{ $labels.instance }})
description: "Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StoreConnectionIsTooSlow
# Threshold of 10ms. Adjust based on your expected database latency.
- alert: StoreConnectionSlow
expr: 'store_connection_wait_time_ms > 10'
for: 0m
labels:
severity: warning
annotations:
summary: Store connection is too slow (instance {{ $labels.instance }})
summary: Store connection slow (instance {{ $labels.instance }})
description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: StoreConnectionIsTooSlow
# Threshold of 20ms. Adjust based on your expected database latency.
- alert: StoreConnectionVerySlow
expr: 'store_connection_wait_time_ms > 20'
for: 0m
labels:
severity: critical
annotations:
summary: Store connection is too slow (instance {{ $labels.instance }})
description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Store connection very slow (instance {{ $labels.instance }})
description: "Store connection is very slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,8 +2,12 @@ groups:
- name: Jmx_exporter
rules:
# When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
# so this alert may not fire. Prefer application-level availability metrics if available.
# Rename job="hadoop-namenode" to match the actual job name in your Prometheus scrape config.
- alert: HadoopNameNodeDown
expr: 'up{job="hadoop-namenode"} == 0'
for: 5m
@ -13,6 +17,9 @@ groups:
summary: Hadoop Name Node Down (instance {{ $labels.instance }})
description: "The Hadoop NameNode service is unavailable.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# When targets are managed via service discovery, a disappeared target goes stale rather than reporting up==0,
# so this alert may not fire. Prefer application-level availability metrics if available.
# Rename job="hadoop-resourcemanager" to match the actual job name in your Prometheus scrape config.
- alert: HadoopResourceManagerDown
expr: 'up{job="hadoop-resourcemanager"} == 0'
for: 5m
@ -32,7 +39,7 @@ groups:
description: "The Hadoop DataNode is not sending heartbeats.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HadoopHdfsDiskSpaceLow
expr: '(hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1'
expr: '(hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1 and hadoop_hdfs_bytes_total > 0'
for: 15m
labels:
severity: warning
@ -41,7 +48,7 @@ groups:
description: "Available HDFS disk space is running low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HadoopMapReduceTaskFailures
expr: 'hadoop_mapreduce_task_failures_total > 100'
expr: 'increase(hadoop_mapreduce_task_failures_total[1h]) > 100'
for: 10m
labels:
severity: critical
@ -50,7 +57,7 @@ groups:
description: "There is an unusually high number of MapReduce task failures.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HadoopResourceManagerMemoryHigh
expr: 'hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8'
expr: 'hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8 and hadoop_resourcemanager_memory_max_bytes > 0'
for: 15m
labels:
severity: warning
@ -59,7 +66,7 @@ groups:
description: "The Hadoop ResourceManager is approaching its memory limit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HadoopYarnContainerAllocationFailures
expr: 'hadoop_yarn_container_allocation_failures_total > 10'
expr: 'increase(hadoop_yarn_container_allocation_failures_total[1h]) > 10'
for: 10m
labels:
severity: warning
@ -77,10 +84,10 @@ groups:
description: "The HBase cluster has an unusually high number of regions.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HadoopHbaseRegionServerHeapLow
expr: 'hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes < 0.2'
expr: 'hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes > 0.8 and hadoop_hbase_region_server_max_heap_bytes > 0'
for: 10m
labels:
severity: critical
severity: warning
annotations:
summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }})
description: "HBase Region Servers are running low on heap space.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,28 +2,29 @@ groups:
- name: EmbeddedExporterV2
rules:
- alert: HaproxyHighHttp4xxErrorRateBackend
expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5'
expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (proxy) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp5xxErrorRateBackend
expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5'
expr: '((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (proxy) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.proxy }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp4xxErrorRateServer
expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5'
expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
@ -32,7 +33,7 @@ groups:
description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp5xxErrorRateServer
expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5'
expr: '((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
@@ -41,7 +42,7 @@ groups:
description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerResponseErrors
expr: '(sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5'
expr: '(sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
@@ -56,7 +57,7 @@ groups:
severity: critical
annotations:
summary: HAProxy backend connection errors (instance {{ $labels.instance }})
description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many connection errors to {{ $labels.proxy }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerConnectionErrors
expr: '(sum by (proxy) (rate(haproxy_server_connection_errors_total[1m]))) > 100'
@@ -65,19 +66,20 @@ groups:
severity: critical
annotations:
summary: HAProxy server connection errors (instance {{ $labels.instance }})
description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many connection errors to {{ $labels.proxy }} (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyBackendMaxActiveSession>80%
expr: '((haproxy_server_max_sessions >0) * 100) / (haproxy_server_limit_sessions > 0) > 80'
expr: '(haproxy_backend_current_sessions / haproxy_backend_limit_sessions * 100) > 80 and haproxy_backend_limit_sessions > 0'
for: 2m
labels:
severity: warning
annotations:
summary: HAProxy backend max active session > 80% (instance {{ $labels.instance }})
description: "Session limit from backend {{ $labels.proxy }} to server {{ $labels.server }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Session limit from backend {{ $labels.proxy }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# haproxy_backend_current_queue is a gauge (current queue depth), not a counter.
- alert: HaproxyPendingRequests
expr: 'sum by (proxy) (rate(haproxy_backend_current_queue[2m])) > 0'
expr: 'sum by (proxy) (haproxy_backend_current_queue) > 0'
for: 2m
labels:
severity: warning
@@ -92,7 +94,7 @@ groups:
severity: warning
annotations:
summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
description: "Average request time is increasing - {{ $value | printf \"%.2f\"}}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "HAProxy backend max total time is above 1s on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyRetryHigh
expr: 'sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10'
@@ -122,10 +124,10 @@ groups:
description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerHealthcheckFailure
expr: 'increase(haproxy_server_check_failures_total[1m]) > 0'
for: 1m
expr: 'increase(haproxy_server_check_failures_total[1m]) > 2'
for: 0m
labels:
severity: warning
annotations:
summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
description: "Some server healthcheck are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
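
The `HaproxyPendingRequests` change in this file replaces `rate()` on `haproxy_backend_current_queue` with the gauge value itself. A minimal sketch of why that matters, assuming a simplified stand-in for `rate()` (the `simple_rate` helper and the sample values are hypothetical, and extrapolation is omitted):

```python
def simple_rate(samples, window_seconds):
    """Simplified per-second increase over a window, treating any drop as a counter reset.

    Hypothetical stand-in for PromQL rate(); it omits extrapolation.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A counter never decreases, so a drop is read as a reset to zero
        # and the whole new value is counted as increase.
        increase += (cur - prev) if cur >= prev else cur
    return increase / window_seconds


# Hypothetical queue-depth samples over a 2-minute window.
stuck_queue = [50, 50, 50, 50, 50]      # requests stay queued the whole time
draining_queue = [50, 40, 30, 20, 10]   # queue is shrinking

# Old rule shape: rate(...) > 0. A steadily full queue has zero rate.
old_fires_when_stuck = simple_rate(stuck_queue, 120) > 0

# New rule shape: compare the gauge value itself.
new_fires_when_stuck = stuck_queue[-1] > 0

# Drops are misread as resets, so a draining queue shows a positive "rate".
draining_rate = simple_rate(draining_queue, 120)

print(old_fires_when_stuck, new_fires_when_stuck, round(draining_rate, 3))
```

Under these assumptions the old expression stays silent while 50 requests sit in the queue, and reports a positive value while the queue is actually draining. Comparing the gauge directly avoids both problems.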


@@ -2,6 +2,7 @@ groups:
- name: HaproxyExporterV1
rules:
- alert: HaproxyDown
@@ -13,104 +14,104 @@ groups:
summary: HAProxy down (instance {{ $labels.instance }})
description: "HAProxy down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp4xxErrorRateBackend
expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5'
- alert: HaproxyHighHttp4xxErrorRateBackend(v1)
expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy high HTTP 4xx error rate backend (v1) (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp5xxErrorRateBackend
expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5'
- alert: HaproxyHighHttp5xxErrorRateBackend(v1)
expr: 'sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) * 100 > 5 and sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy high HTTP 5xx error rate backend (v1) (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp4xxErrorRateServer
expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5'
- alert: HaproxyHighHttp4xxErrorRateServer(v1)
expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
summary: HAProxy high HTTP 4xx error rate server (v1) (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHighHttp5xxErrorRateServer
expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5'
- alert: HaproxyHighHttp5xxErrorRateServer(v1)
expr: 'sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
summary: HAProxy high HTTP 5xx error rate server (v1) (instance {{ $labels.instance }})
description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerResponseErrors
expr: 'sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5'
- alert: HaproxyServerResponseErrors(v1)
expr: 'sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5 and sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 0'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy server response errors (instance {{ $labels.instance }})
summary: HAProxy server response errors (v1) (instance {{ $labels.instance }})
description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyBackendConnectionErrors
- alert: HaproxyBackendConnectionErrors(v1)
expr: 'sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100'
for: 1m
labels:
severity: critical
annotations:
summary: HAProxy backend connection errors (instance {{ $labels.instance }})
description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy backend connection errors (v1) (instance {{ $labels.instance }})
description: "Too many connection errors to {{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerConnectionErrors
- alert: HaproxyServerConnectionErrors(v1)
expr: 'sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100'
for: 0m
labels:
severity: critical
annotations:
summary: HAProxy server connection errors (instance {{ $labels.instance }})
summary: HAProxy server connection errors (v1) (instance {{ $labels.instance }})
description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyBackendMaxActiveSession
expr: '((sum by (backend) (avg_over_time(haproxy_backend_max_sessions[2m]) * 100) / sum by (backend) (avg_over_time(haproxy_backend_limit_sessions[2m])))) > 80'
expr: '((sum by (backend) (haproxy_backend_current_sessions * 100) / sum by (backend) (haproxy_backend_limit_sessions))) > 80 and sum by (backend) (haproxy_backend_limit_sessions) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: HAProxy backend max active session (instance {{ $labels.instance }})
description: "HAproxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "HAProxy backend {{ $labels.backend }} is reaching session limit (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyPendingRequests
- alert: HaproxyPendingRequests(v1)
expr: 'sum by (backend) (haproxy_backend_current_queue) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: HAProxy pending requests (instance {{ $labels.instance }})
description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy pending requests (v1) (instance {{ $labels.instance }})
description: "Some HAProxy requests are pending on {{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyHttpSlowingDown
- alert: HaproxyHttpSlowingDown(v1)
expr: 'avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1'
for: 1m
labels:
severity: warning
annotations:
summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
summary: HAProxy HTTP slowing down (v1) (instance {{ $labels.instance }})
description: "Average request time is increasing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyRetryHigh
- alert: HaproxyRetryHigh(v1)
expr: 'sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10'
for: 2m
labels:
severity: warning
annotations:
summary: HAProxy retry high (instance {{ $labels.instance }})
description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy retry high (v1) (instance {{ $labels.instance }})
description: "High rate of retry on {{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyBackendDown
expr: 'haproxy_backend_up == 0'
@@ -130,20 +131,20 @@ groups:
summary: HAProxy server down (instance {{ $labels.instance }})
description: "HAProxy server is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyFrontendSecurityBlockedRequests
- alert: HaproxyFrontendSecurityBlockedRequests(v1)
expr: 'sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10'
for: 2m
labels:
severity: warning
annotations:
summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})
summary: HAProxy frontend security blocked requests (v1) (instance {{ $labels.instance }})
description: "HAProxy is blocking requests for security reason\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HaproxyServerHealthcheckFailure
expr: 'increase(haproxy_server_check_failures_total[1m]) > 0'
for: 1m
- alert: HaproxyServerHealthcheckFailure(v1)
expr: 'increase(haproxy_server_check_failures_total[1m]) > 2'
for: 0m
labels:
severity: warning
annotations:
summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
description: "Some server healthcheck are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: HAProxy server healthcheck failure (v1) (instance {{ $labels.instance }})
description: "Some server healthcheck are failing on {{ $labels.server }} ({{ $value }} in the last 1m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
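
The v1 backend error-rate rules in this file previously compared a raw ratio against `5`. The new expressions multiply by 100 and add a traffic guard. A small arithmetic sketch of the difference, using hypothetical function names and request rates:

```python
def old_v1_fires(errors_per_s, total_per_s):
    # Old expression shape: errors / total > 5.
    # The ratio is between 0 and 1, so it can never exceed 5.
    return errors_per_s / total_per_s > 5


def new_v1_fires(errors_per_s, total_per_s):
    # New expression shape: errors / total * 100 > 5 and total > 0.
    # The guard is checked first so that idle backends are skipped.
    return total_per_s > 0 and errors_per_s / total_per_s * 100 > 5


# Hypothetical backend serving 100 req/s, 20 of them errors (20%).
print(old_v1_fires(20, 100))   # ratio 0.2 compared against 5
print(new_v1_fires(20, 100))   # percentage 20.0 compared against 5
print(new_v1_fires(1, 100))    # 1% is below the threshold
print(new_v1_fires(0, 0))      # no traffic, guard short-circuits
```

With the old form, even a backend returning only errors yields a ratio of 1, which is below the threshold, so the alert could not fire. The percentage form matches the "> 5%" wording in the descriptions.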


@@ -2,11 +2,12 @@ groups:
- name: EmbeddedExporter
rules:
- alert: VaultSealed
expr: 'vault_core_unsealed == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@@ -20,7 +21,7 @@ groups:
severity: warning
annotations:
summary: Vault too many pending tokens (instance {{ $labels.instance }})
description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many pending tokens on {{ $labels.instance }}: {{ $value }} tokens created but not yet stored.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: VaultTooManyInfinityTokens
expr: 'vault_token_count_by_ttl{creation_ttl="+Inf"} > 3'
@@ -29,13 +30,13 @@ groups:
severity: warning
annotations:
summary: Vault too many infinity tokens (instance {{ $labels.instance }})
description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Too many non-expiring tokens on {{ $labels.instance }}: {{ $value }} tokens with infinite TTL.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: VaultClusterHealth
expr: 'sum(vault_core_active) / count(vault_core_active) <= 0.5'
expr: 'sum(vault_core_active) / count(vault_core_active) <= 0.5 and count(vault_core_active) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Vault cluster health (instance {{ $labels.instance }})
description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Vault cluster is not healthy: only {{ $value | humanizePercentage }} of nodes are active.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"


@@ -2,10 +2,11 @@ groups:
- name: NodeExporter
rules:
- alert: HostOutOfMemory
expr: '(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < .10)'
for: 2m
labels:
severity: warning
@@ -13,107 +14,106 @@ groups:
summary: Host out of memory (instance {{ $labels.instance }})
description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# node_vmstat_pgmajfault is exposed as untyped/gauge by node_exporter (from /proc/vmstat), so deriv() is used instead of rate().
- alert: HostMemoryUnderMemoryPressure
expr: '(rate(node_vmstat_pgmajfault[1m]) > 1000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 2m
expr: '(deriv(node_vmstat_pgmajfault[5m]) > 1000)'
for: 0m
labels:
severity: warning
annotations:
summary: Host memory under memory pressure (instance {{ $labels.instance }})
description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "The node is under heavy memory pressure. High rate of major page faults ({{ $value }}/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostMemoryIsUnderutilized
expr: '(100 - (avg_over_time(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 1w
expr: 'min_over_time(node_memory_MemFree_bytes[1w]) > node_memory_MemTotal_bytes * .8'
for: 0m
labels:
severity: info
annotations:
summary: Host Memory is underutilized (instance {{ $labels.instance }})
description: "Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Node memory usage is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualNetworkThroughputIn
expr: '(sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 5m
expr: '((rate(node_network_receive_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0'
for: 0m
labels:
severity: warning
annotations:
summary: Host unusual network throughput in (instance {{ $labels.instance }})
description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Host receive bandwidth is high (>80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualNetworkThroughputOut
expr: '(sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 5m
expr: '((rate(node_network_transmit_bytes_total[5m]) / node_network_speed_bytes) > .80) and node_network_speed_bytes > 0'
for: 0m
labels:
severity: warning
annotations:
summary: Host unusual network throughput out (instance {{ $labels.instance }})
description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Host transmit bandwidth is high (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualDiskReadRate
expr: '(sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 5m
- alert: HostDiskIoUtilizationHigh
expr: '(rate(node_disk_io_time_seconds_total[5m]) > .80)'
for: 0m
labels:
severity: warning
annotations:
summary: Host unusual disk read rate (instance {{ $labels.instance }})
description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualDiskWriteRate
expr: '(sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 2m
labels:
severity: warning
annotations:
summary: Host unusual disk write rate (instance {{ $labels.instance }})
description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host disk IO utilization high (instance {{ $labels.instance }})
description: "Disk utilization is high (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostOutOfDiskSpace
expr: '((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} / node_filesystem_size_bytes < .10 and on (instance, device, mountpoint) node_filesystem_readonly == 0)'
for: 2m
labels:
severity: warning
severity: critical
annotations:
summary: Host out of disk space (instance {{ $labels.instance }})
description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostDiskWillFillIn24Hours
expr: '((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostDiskMayFillIn24Hours
expr: 'predict_linear(node_filesystem_avail_bytes{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[3h], 86400) <= 0 and node_filesystem_avail_bytes > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host disk may fill in 24 hours (instance {{ $labels.instance }})
description: "Filesystem will likely run out of space within the next 24 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostOutOfInodes
expr: '(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_filesystem_files_free / node_filesystem_files < .10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) and node_filesystem_files > 0'
for: 2m
labels:
severity: warning
severity: critical
annotations:
summary: Host out of inodes (instance {{ $labels.instance }})
description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostFilesystemDeviceError
expr: 'node_filesystem_device_error == 1'
expr: 'node_filesystem_device_error{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"} == 1'
for: 2m
labels:
severity: critical
annotations:
summary: Host filesystem device error (instance {{ $labels.instance }})
description: "{{ $labels.instance }}: Device error with the {{ $labels.mountpoint }} filesystem\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Error stat-ing the {{ $labels.mountpoint }} filesystem\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostInodesWillFillIn24Hours
expr: '(node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
- alert: HostInodesMayFillIn24Hours
expr: 'predict_linear(node_filesystem_files_free{fstype!~"^(fuse.*|tmpfs|cifs|nfs)"}[1h], 86400) <= 0 and node_filesystem_files_free > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host inodes may fill in 24 hours (instance {{ $labels.instance }})
description: "Filesystem will likely run out of inodes within the next 24 hours at current write rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualDiskReadLatency
expr: '(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0)'
for: 2m
labels:
severity: warning
@@ -122,7 +122,7 @@ groups:
description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualDiskWriteLatency
expr: '(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0)'
for: 2m
labels:
severity: warning
@@ -131,7 +131,7 @@ groups:
description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostHighCpuLoad
expr: '(sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '1 - (avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > .80'
for: 10m
labels:
severity: warning
@@ -139,17 +139,18 @@ groups:
summary: Host high CPU load (instance {{ $labels.instance }})
description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostCpuIsUnderutilized
expr: '(100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(min without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1h]))) > 0.8'
for: 1w
labels:
severity: info
annotations:
summary: Host CPU is underutilized (instance {{ $labels.instance }})
description: "CPU load is < 20% for 1 week. Consider reducing the number of CPUs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "CPU load has been < 20% for 1 week. Consider reducing the number of CPUs.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostCpuStealNoisyNeighbor
expr: '(avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: 'avg without (cpu) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10'
for: 0m
labels:
severity: warning
@@ -158,34 +159,37 @@ groups:
description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostCpuHighIowait
expr: '(avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: 'avg without (cpu) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > .10'
for: 0m
labels:
severity: warning
annotations:
summary: Host CPU high iowait (instance {{ $labels.instance }})
description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "CPU iowait > 10%. Your CPU is idling waiting for storage to respond.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostUnusualDiskIo
expr: '(rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: 'rate(node_disk_io_time_seconds_total[5m]) > 0.8'
for: 5m
labels:
severity: warning
annotations:
summary: Host unusual disk IO (instance {{ $labels.instance }})
description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Disk usage >80%. Check storage for issues or increase IOPS capabilities.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostContextSwitching
expr: '((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
# x2 context switches is an arbitrary number.
# The alert threshold depends on the nature of the application.
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
- alert: HostContextSwitchingHigh
expr: '(rate(node_context_switches_total[15m])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) / (rate(node_context_switches_total[1d])/count without(mode,cpu) (node_cpu_seconds_total{mode="idle"})) > 2 and rate(node_context_switches_total[1d]) > 0'
for: 0m
labels:
severity: warning
annotations:
summary: Host context switching (instance {{ $labels.instance }})
description: "Context switching is growing on the node (> 10000 / CPU / s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host context switching high (instance {{ $labels.instance }})
description: "Context switching is growing on the node (twice the daily average during the last 15m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostSwapIsFillingUp
expr: '((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) and node_memory_SwapTotal_bytes > 0'
for: 2m
labels:
severity: warning
@@ -194,16 +198,16 @@ groups:
description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostSystemdServiceCrashed
expr: '(node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_systemd_unit_state{state="failed"} == 1)'
for: 0m
labels:
severity: warning
annotations:
summary: Host systemd service crashed (instance {{ $labels.instance }})
description: "systemd service crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "systemd service {{ $labels.name }} crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostPhysicalComponentTooHot
expr: '((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node, sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: 'node_hwmon_temp_celsius > node_hwmon_temp_max_celsius'
for: 5m
labels:
severity: warning
@@ -212,7 +216,7 @@ groups:
description: "Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostNodeOvertemperatureAlarm
expr: '(node_hwmon_temp_crit_alarm_celsius == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))'
for: 0m
labels:
severity: critical
@ -220,35 +224,37 @@ groups:
summary: Host node overtemperature alarm (instance {{ $labels.instance }})
description: "Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostRaidArrayGotInactive
expr: '(node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
# Uses ignoring(state) to handle additional labels on node_md_disks.
- alert: HostSoftwareRaidInsufficientDrives
expr: '((node_md_disks_required - ignoring(state) node_md_disks{state="active"}) > 0)'
for: 0m
labels:
severity: critical
annotations:
summary: Host RAID array got inactive (instance {{ $labels.instance }})
description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host software RAID insufficient drives (instance {{ $labels.instance }})
description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} has insufficient drives remaining.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostRaidDiskFailure
expr: '(node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
- alert: HostSoftwareRaidDiskFailure
expr: '(node_md_disks{state="failed"} > 0)'
for: 2m
labels:
severity: warning
annotations:
summary: Host RAID disk failure (instance {{ $labels.instance }})
description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Host software RAID disk failure (instance {{ $labels.instance }})
description: "MD RAID array {{ $labels.device }} on {{ $labels.instance }} needs attention.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostKernelVersionDeviations
expr: '(count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 6h
expr: 'changes(node_uname_info[1h]) > 0'
for: 0m
labels:
severity: warning
severity: info
annotations:
summary: Host kernel version deviations (instance {{ $labels.instance }})
description: "Different kernel versions are running\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Kernel version for {{ $labels.instance }} has changed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# When a machine runs out of memory, the node exporter can become unresponsive for several minutes. Even if the system takes 15-20 minutes to recover, the alert should still trigger.
- alert: HostOomKillDetected
expr: '(increase(node_vmstat_oom_kill[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(delta(node_vmstat_oom_kill[30m]) > 0)'
for: 0m
labels:
severity: warning
@ -257,25 +263,25 @@ groups:
description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostEdacCorrectableErrorsDetected
expr: '(increase(node_edac_correctable_errors_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(increase(node_edac_correctable_errors_total[1m]) > 0)'
for: 0m
labels:
severity: info
annotations:
summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostEdacUncorrectableErrorsDetected
expr: '(node_edac_uncorrectable_errors_total > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_edac_uncorrectable_errors_total > 0)'
for: 0m
labels:
severity: warning
annotations:
summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostNetworkReceiveErrors
expr: '(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) and rate(node_network_receive_packets_total[2m]) > 0'
for: 2m
labels:
severity: warning
@ -284,7 +290,7 @@ groups:
description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostNetworkTransmitErrors
expr: '(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) and rate(node_network_transmit_packets_total[2m]) > 0'
for: 2m
labels:
severity: warning
@ -292,17 +298,8 @@ groups:
summary: Host Network Transmit Errors (instance {{ $labels.instance }})
description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostNetworkInterfaceSaturated
expr: '((rate(node_network_receive_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"} > 0.8 < 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 1m
labels:
severity: warning
annotations:
summary: Host Network Interface Saturated (instance {{ $labels.instance }})
description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostNetworkBondDegraded
expr: '((node_bonding_active - node_bonding_slaves) != 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '((node_bonding_active - node_bonding_slaves) != 0)'
for: 2m
labels:
severity: warning
@ -311,7 +308,7 @@ groups:
description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostConntrackLimit
expr: '(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) and node_nf_conntrack_entries_limit > 0'
for: 5m
labels:
severity: warning
@ -320,7 +317,7 @@ groups:
description: "The number of conntrack is approaching limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostClockSkew
expr: '((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0))'
for: 10m
labels:
severity: warning
@ -329,19 +326,10 @@ groups:
description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostClockNotSynchronising
expr: '(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
expr: '(min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16)'
for: 2m
labels:
severity: warning
annotations:
summary: Host clock not synchronising (instance {{ $labels.instance }})
description: "Clock not synchronising. Ensure NTP is configured on this host.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostRequiresReboot
expr: '(node_reboot_required > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
for: 4h
labels:
severity: info
annotations:
summary: Host requires reboot (instance {{ $labels.instance }})
description: "{{ $labels.instance }} requires a reboot.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
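
Several expressions in this file gain an `and <denominator> > 0` guard (swap, network errors, conntrack). A promtool unit test is one way to check how such a rule behaves on a host where the denominator is zero. The sketch below is hypothetical: the host names and sample values are invented, and it exercises only the new HostSwapIsFillingUp expression. On the swapless host the ratio is 0/0 = NaN, which never satisfies `> 80`, so only the host under real swap pressure is expected in the result.

```yaml
# Hypothetical promtool unit test (run with: promtool test rules <this-file>).
# Host names and values are illustrative, not taken from the repository.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # host-a: 1024 bytes of swap, 128 free -> 87.5% used
      - series: 'node_memory_SwapTotal_bytes{instance="host-a"}'
        values: '1024+0x5'
      - series: 'node_memory_SwapFree_bytes{instance="host-a"}'
        values: '128+0x5'
      # host-b: no swap configured at all
      - series: 'node_memory_SwapTotal_bytes{instance="host-b"}'
        values: '0+0x5'
      - series: 'node_memory_SwapFree_bytes{instance="host-b"}'
        values: '0+0x5'
    promql_expr_test:
      - expr: '((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) and node_memory_SwapTotal_bytes > 0'
        eval_time: 5m
        exp_samples:
          # Arithmetic drops the metric name, so only the instance label remains.
          - labels: '{instance="host-a"}'
            value: 87.5
```

The sample sizes are powers of two so that the expected value is exact in floating point.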

dist/rules/ipmi/ipmi-exporter.yml vendored Normal file

@ -0,0 +1,165 @@
groups:
- name: IpmiExporter
rules:
# The ipmi_up metric is per-collector. A value of 0 means the collector could not retrieve data from the BMC.
- alert: IpmiCollectorDown
expr: 'ipmi_up == 0'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI collector down (instance {{ $labels.instance }})
description: "IPMI collector {{ $labels.collector }} on {{ $labels.instance }} failed to scrape sensor data. Check FreeIPMI tools and BMC connectivity.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# State values: 0=nominal, 1=warning, 2=critical. Thresholds are defined in the BMC firmware.
- alert: IpmiTemperatureSensorWarning
expr: 'ipmi_temperature_state == 1'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI temperature sensor warning (instance {{ $labels.instance }})
description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiTemperatureSensorCritical
expr: 'ipmi_temperature_state == 2'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI temperature sensor critical (instance {{ $labels.instance }})
description: "IPMI temperature sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Immediate attention required to prevent hardware damage.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiFanSpeedSensorWarning
expr: 'ipmi_fan_speed_state == 1'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI fan speed sensor warning (instance {{ $labels.instance }})
description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiFanSpeedSensorCritical
expr: 'ipmi_fan_speed_state == 2'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI fan speed sensor critical (instance {{ $labels.instance }})
description: "IPMI fan sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. A fan may have failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiFanSpeedZero
expr: 'ipmi_fan_speed_rpm == 0'
for: 5m
labels:
severity: critical
annotations:
summary: IPMI fan speed zero (instance {{ $labels.instance }})
description: "IPMI fan {{ $labels.name }} on {{ $labels.instance }} reports 0 RPM. The fan may have failed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiVoltageSensorWarning
expr: 'ipmi_voltage_state == 1'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI voltage sensor warning (instance {{ $labels.instance }})
description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiVoltageSensorCritical
expr: 'ipmi_voltage_state == 2'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI voltage sensor critical (instance {{ $labels.instance }})
description: "IPMI voltage sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state. Power supply or motherboard issue possible.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiCurrentSensorWarning
expr: 'ipmi_current_state == 1'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI current sensor warning (instance {{ $labels.instance }})
description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiCurrentSensorCritical
expr: 'ipmi_current_state == 2'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI current sensor critical (instance {{ $labels.instance }})
description: "IPMI current sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiPowerSensorWarning
expr: 'ipmi_power_state == 1'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI power sensor warning (instance {{ $labels.instance }})
description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in warning state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiPowerSensorCritical
expr: 'ipmi_power_state == 2'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI power sensor critical (instance {{ $labels.instance }})
description: "IPMI power sensor {{ $labels.name }} on {{ $labels.instance }} is in critical state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Catches any sensor type not covered by the specific temperature/fan/voltage/current/power alerts.
- alert: IpmiGenericSensorCritical
expr: 'ipmi_sensor_state == 2'
for: 5m
labels:
severity: critical
annotations:
summary: IPMI generic sensor critical (instance {{ $labels.instance }})
description: "IPMI sensor {{ $labels.name }} (type={{ $labels.type }}) on {{ $labels.instance }} is in critical state.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IpmiChassisPowerOff
expr: 'ipmi_chassis_power_state == 0'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI chassis power off (instance {{ $labels.instance }})
description: "IPMI reports chassis power is off on {{ $labels.instance }}. The server may have shut down unexpectedly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisDriveFault
expr: 'ipmi_chassis_drive_fault_state == 0'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI chassis drive fault (instance {{ $labels.instance }})
description: "IPMI reports a drive fault on {{ $labels.instance }}. Check disk health.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The metric uses inverted logic: 1=no fault, 0=fault detected.
- alert: IpmiChassisCoolingFault
expr: 'ipmi_chassis_cooling_fault_state == 0'
for: 0m
labels:
severity: critical
annotations:
summary: IPMI chassis cooling fault (instance {{ $labels.instance }})
description: "IPMI reports a cooling/fan fault on {{ $labels.instance }}. Check fans and airflow.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# SEL storage is typically very limited (e.g., 16KB). When full, new events may be dropped.
- alert: IpmiSelAlmostFull
expr: 'ipmi_sel_free_space_bytes < 512'
for: 5m
labels:
severity: warning
annotations:
summary: IPMI SEL almost full (instance {{ $labels.instance }})
description: "IPMI System Event Log on {{ $labels.instance }} has only {{ printf \"%.0f\" $value }} bytes free. Clear the SEL to prevent loss of new events.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
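
The comments in this file describe two encodings: sensor states use 0=nominal, 1=warning, 2=critical, while the chassis fault metrics are inverted (1=no fault, 0=fault). A small promtool unit test can pin both down. This is a sketch with invented BMC and sensor names, not a test shipped with the rules.

```yaml
# Hypothetical promtool unit test; instance and sensor names are made up.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Sensor state: 0=nominal, 1=warning, 2=critical
      - series: 'ipmi_temperature_state{instance="bmc-1",name="CPU1 Temp"}'
        values: '0 0 1 2 2'
      - series: 'ipmi_temperature_state{instance="bmc-1",name="Inlet Temp"}'
        values: '0 0 0 0 0'
      # Chassis fault metrics are inverted: 1=no fault, 0=fault
      - series: 'ipmi_chassis_drive_fault_state{instance="bmc-1"}'
        values: '1 1 1 0 0'
    promql_expr_test:
      # Only the sensor that reached state 2 is returned.
      - expr: 'ipmi_temperature_state == 2'
        eval_time: 4m
        exp_samples:
          - labels: 'ipmi_temperature_state{instance="bmc-1",name="CPU1 Temp"}'
            value: 2
      # A value of 0 means a drive fault was reported.
      - expr: 'ipmi_chassis_drive_fault_state == 0'
        eval_time: 4m
        exp_samples:
          - labels: 'ipmi_chassis_drive_fault_state{instance="bmc-1"}'
            value: 0
```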


@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: IstioKubernetesGatewayAvailabilityDrop
@ -11,17 +12,18 @@ groups:
severity: warning
annotations:
summary: Istio Kubernetes gateway availability drop (instance {{ $labels.instance }})
description: "Gateway pods have dropped. Inbound traffic will likely be affected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Istio ingress gateway has only {{ $value }} available pod(s). Inbound traffic will likely be affected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioPilotHighTotalRequestRate
expr: 'sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5'
- alert: IstioPilotHighPushErrorRate
expr: 'sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5 and sum(rate(pilot_xds_pushes[1m])) > 0'
for: 1m
labels:
severity: warning
annotations:
summary: Istio Pilot high total request rate (instance {{ $labels.instance }})
summary: Istio Pilot high push error rate (instance {{ $labels.instance }})
description: "Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Mixer was deprecated in Istio 1.5 and removed in Istio 1.8+. This alert only applies to Istio < 1.8.
- alert: IstioMixerPrometheusDispatchesLow
expr: 'sum(rate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[1m])) < 180'
for: 1m
@ -31,6 +33,7 @@ groups:
summary: Istio Mixer Prometheus dispatches low (instance {{ $labels.instance }})
description: "Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be being exported properly.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 1000 req/s is a rough default. Adjust to your expected peak traffic.
- alert: IstioHighTotalRequestRate
expr: 'sum(rate(istio_requests_total{reporter="destination"}[5m])) > 1000'
for: 2m
@ -38,8 +41,9 @@ groups:
severity: warning
annotations:
summary: Istio high total request rate (instance {{ $labels.instance }})
description: "Global request rate in the service mesh is unusually high.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Global request rate in the service mesh is unusually high ({{ $value | printf \"%.2f\" }} req/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 100 req/s is a rough default. Adjust to your expected baseline traffic. This alert may fire on startup or low-traffic environments.
- alert: IstioLowTotalRequestRate
expr: 'sum(rate(istio_requests_total{reporter="destination"}[5m])) < 100'
for: 2m
@ -47,49 +51,49 @@ groups:
severity: warning
annotations:
summary: Istio low total request rate (instance {{ $labels.instance }})
description: "Global request rate in the service mesh is unusually low.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Global request rate in the service mesh is unusually low ({{ $value | printf \"%.2f\" }} req/s).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioHigh4xxErrorRate
expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5'
expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0'
for: 1m
labels:
severity: warning
annotations:
summary: Istio high 4xx error rate (instance {{ $labels.instance }})
description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "High percentage of HTTP 4xx responses in Istio ({{ $value | printf \"%.1f\" }}% > 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioHigh5xxErrorRate
expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5'
expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0'
for: 1m
labels:
severity: warning
annotations:
summary: Istio high 5xx error rate (instance {{ $labels.instance }})
description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "High percentage of HTTP 5xx responses in Istio ({{ $value | printf \"%.1f\" }}% > 5%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioHighRequestLatency
expr: 'rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 100'
expr: 'rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 100 and rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 0'
for: 1m
labels:
severity: warning
annotations:
summary: Istio high request latency (instance {{ $labels.instance }})
description: "Istio average requests execution is longer than 100ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Istio average request duration is {{ $value }}ms (> 100ms).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioLatency99Percentile
expr: 'histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000'
expr: 'histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, le)) > 1000'
for: 1m
labels:
severity: warning
annotations:
summary: Istio latency 99 percentile (instance {{ $labels.instance }})
description: "Istio 1% slowest requests are longer than 1000ms.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Istio p99 request latency is {{ $value }}ms (threshold: 1000ms).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioPilotDuplicateEntry
expr: 'sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0'
expr: 'sum(pilot_duplicate_envoy_clusters{}) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }})
description: "Istio pilot duplicate entry error.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Istio Pilot has detected {{ $value }} duplicate Envoy cluster(s), indicating misconfigured DestinationRules or ServiceEntries.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
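
The error-rate rules above now end in `and sum(rate(...)) > 0`. A promtool unit test along the following lines could confirm that IstioHigh5xxErrorRate returns a value under real traffic and returns nothing when the counters are flat. The counter values are invented, and this sketch assumes an empty label set can be written as `'{}'` in `exp_samples`.

```yaml
# Hypothetical promtool unit test; counter values are made up.
evaluation_interval: 1m
tests:
  # Case 1: half of all requests return 500, so the ratio is 50%.
  - interval: 1m
    input_series:
      - series: 'istio_requests_total{reporter="destination",response_code="200"}'
        values: '0+60x10'
      - series: 'istio_requests_total{reporter="destination",response_code="500"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0'
        eval_time: 10m
        exp_samples:
          # sum() without a by clause yields a single series with no labels.
          - labels: '{}'
            value: 50
  # Case 2: the counters exist but do not move (no traffic).
  - interval: 1m
    input_series:
      - series: 'istio_requests_total{reporter="destination",response_code="200"}'
        values: '100+0x10'
      - series: 'istio_requests_total{reporter="destination",response_code="500"}'
        values: '5+0x10'
    promql_expr_test:
      - expr: 'sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5 and sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0'
        eval_time: 10m
        # No exp_samples: the expression is expected to return nothing.
```

The two series in case 1 grow at the same rate on purpose, so the ratio is exactly 0.5 and the expected value does not depend on floating-point rounding.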


@ -0,0 +1,82 @@
groups:
- name: EmbeddedExporterLegacy
# These rules target Jaeger v1.x metrics (jaeger_* prefix).
# Jaeger v1 reached end-of-life on December 31, 2025.
# For Jaeger v2+, use the "Embedded exporter (v2+)" rules instead.
# Note: jaeger-agent was deprecated in v1.35 and removed in v2.0.
rules:
- alert: JaegerAgentHttpServerErrors
expr: '100 * sum(rate(jaeger_agent_http_server_errors_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_http_server_total[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger agent HTTP server errors (instance {{ $labels.instance }})
description: "Jaeger agent on {{ $labels.instance }} is experiencing {{ $value | humanize }}% HTTP server errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerClientRpcRequestErrors
expr: '100 * sum(rate(jaeger_client_jaeger_rpc_http_requests{status_code=~"4xx|5xx"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_client_jaeger_rpc_http_requests[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger client RPC request errors (instance {{ $labels.instance }})
description: "Jaeger client on {{ $labels.instance }} is experiencing {{ $value | humanize }}% RPC HTTP errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerClientSpansDropped
expr: '100 * sum(rate(jaeger_reporter_spans{result=~"dropped|err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_reporter_spans[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger client spans dropped (instance {{ $labels.instance }})
description: "Jaeger client on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerAgentSpansDropped
expr: '100 * sum(rate(jaeger_agent_reporter_batches_failures_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_agent_reporter_batches_submitted_total[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger agent spans dropped (instance {{ $labels.instance }})
description: "Jaeger agent on {{ $labels.instance }} is dropping {{ $value | humanize }}% of span batches.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerCollectorDroppingSpans
expr: '100 * sum(rate(jaeger_collector_spans_dropped_total[1m])) by (instance, job, namespace) / sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_collector_spans_received_total[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger collector dropping spans (instance {{ $labels.instance }})
description: "Jaeger collector on {{ $labels.instance }} is dropping {{ $value | humanize }}% of spans.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerSamplingUpdateFailing
expr: '100 * sum(rate(jaeger_sampler_queries{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_sampler_queries[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger sampling update failing (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of sampling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerThrottlingUpdateFailing
expr: '100 * sum(rate(jaeger_throttler_updates{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_throttler_updates[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger throttling update failing (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} is failing {{ $value | humanize }}% of throttling policy updates.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JaegerQueryRequestFailures
expr: '100 * sum(rate(jaeger_query_requests_total{result="err"}[1m])) by (instance, job, namespace) / sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 1 and sum(rate(jaeger_query_requests_total[1m])) by (instance, job, namespace) > 0'
for: 15m
labels:
severity: warning
annotations:
summary: Jaeger query request failures (instance {{ $labels.instance }})
description: "Jaeger query on {{ $labels.instance }} is failing {{ $value | humanize }}% of requests.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

dist/rules/jaeger/embedded-exporter.yml vendored Normal file

@ -0,0 +1,94 @@
groups:
- name: EmbeddedExporter
# Jaeger v2 is built on OpenTelemetry Collector and exposes metrics on port 8888 (/metrics).
# It emits standard otelcol_* pipeline metrics alongside Jaeger-specific storage and query metrics.
# For span ingestion pipeline alerts (refused spans, export failures, queue saturation),
# use the OpenTelemetry Collector rules instead.
rules:
- alert: JaegerHighStorageErrorRate
expr: '100 * sum(rate(jaeger_storage_requests_total{result="err"}[1m])) by (instance, job, namespace, operation) / sum(rate(jaeger_storage_requests_total[1m])) by (instance, job, namespace, operation) > 1 and sum(rate(jaeger_storage_requests_total[1m])) by (instance, job, namespace, operation) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger high storage error rate (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} is experiencing {{ $value | humanize }}% storage errors on {{ $labels.operation }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 1s is a rough default. Adjust based on your storage backend and data volume.
- alert: JaegerSlowStorageOperations
expr: 'histogram_quantile(0.99, sum(rate(jaeger_storage_latency_seconds_bucket[5m])) by (le, instance, job, namespace, operation)) > 1'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger slow storage operations (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} storage p99 latency for {{ $labels.operation }} is {{ $value | humanizeDuration }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Filters on http_route="/api/traces" (the trace search endpoint). The http_server_request_duration_seconds
# metric is emitted by the otelhttp middleware used by the Jaeger query service.
- alert: JaegerQueryServiceHighErrorRate
expr: '100 * sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces",http_response_status_code=~"5.."}[1m])) by (instance, job, namespace) / sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces"}[1m])) by (instance, job, namespace) > 1 and sum(rate(http_server_request_duration_seconds_count{http_route="/api/traces"}[1m])) by (instance, job, namespace) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger query service high error rate (instance {{ $labels.instance }})
description: "Jaeger query service on {{ $labels.instance }} is returning {{ $value | humanize }}% HTTP 5xx errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 2s is a rough default. Adjust based on your storage backend and data volume.
- alert: JaegerQueryServiceSlowResponses
expr: 'histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{http_route="/api/traces"}[5m])) by (le, instance, job, namespace)) > 2'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger query service slow responses (instance {{ $labels.instance }})
description: "Jaeger query service on {{ $labels.instance }} p99 response latency is {{ $value | humanizeDuration }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when all storage operations for a given type are failing and none are succeeding.
# Indicates the storage backend (Cassandra, Elasticsearch, etc.) is likely unreachable or misconfigured.
- alert: JaegerStorageCompletelyUnavailable
expr: 'sum(rate(jaeger_storage_requests_total{result="err"}[1m])) by (instance, job, namespace, operation) > 0 and sum(rate(jaeger_storage_requests_total{result="ok"}[1m])) by (instance, job, namespace, operation) == 0'
for: 2m
labels:
severity: critical
annotations:
summary: Jaeger storage completely unavailable (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} has 100% storage errors for {{ $labels.operation }} — storage backend may be down.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Single trace retrieval (/api/traces/{traceID}) can be slower than search, especially for large traces.
# Threshold of 5s is a rough default.
- alert: JaegerSlowSingleTraceRetrieval
expr: 'histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{http_route="/api/traces/{traceID}"}[5m])) by (le, instance, job, namespace)) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger slow single trace retrieval (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} p99 latency for single trace retrieval is {{ $value | humanizeDuration }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Errors on /api/services indicate the storage backend cannot return the list of instrumented services,
# which breaks the Jaeger UI service selector.
- alert: JaegerServiceDiscoveryErrors
expr: '100 * sum(rate(http_server_request_duration_seconds_count{http_route="/api/services",http_response_status_code=~"5.."}[1m])) by (instance, job, namespace) / sum(rate(http_server_request_duration_seconds_count{http_route="/api/services"}[1m])) by (instance, job, namespace) > 1 and sum(rate(http_server_request_duration_seconds_count{http_route="/api/services"}[1m])) by (instance, job, namespace) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger service discovery errors (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} is returning {{ $value | humanize }}% HTTP 5xx errors on the services endpoint.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Fires when an operation (e.g. find_traces, get_services) has received requests but none succeeded.
# May indicate a persistent storage error or a backend that is slow to recover.
- alert: JaegerNoStorageReadsSucceeding
expr: 'sum(increase(jaeger_storage_requests_total{result="ok"}[15m])) by (instance, job, namespace, operation) == 0 and sum(increase(jaeger_storage_requests_total[15m])) by (instance, job, namespace, operation) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Jaeger no storage reads succeeding (instance {{ $labels.instance }})
description: "Jaeger on {{ $labels.instance }} has no successful storage reads for {{ $labels.operation }} in the past 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,16 +2,26 @@ groups:
- name: MetricPlugin
rules:
- alert: JenkinsOffline
expr: 'jenkins_node_offline_value > 1'
- alert: JenkinsNodeOffline
expr: 'jenkins_node_offline_value > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Jenkins node offline (instance {{ $labels.instance }})
description: "At least one Jenkins node offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JenkinsNoNodeOnline
expr: 'jenkins_node_online_value == 0'
for: 0m
labels:
severity: critical
annotations:
summary: Jenkins offline (instance {{ $labels.instance }})
description: "Jenkins offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Jenkins no node online (instance {{ $labels.instance }})
description: "No Jenkins nodes are online: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JenkinsHealthcheck
expr: 'jenkins_health_check_score < 1'
@ -41,7 +51,7 @@ groups:
description: "Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JenkinsRunFailureTotal
expr: 'delta(jenkins_runs_failure_total[1h]) > 100'
expr: 'increase(jenkins_runs_failure_total[1h]) > 100'
for: 0m
labels:
severity: warning
@ -58,6 +68,12 @@ groups:
summary: Jenkins build tests failing (instance {{ $labels.instance }})
description: "Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# * RUNNING -1 true - The build is still running.
# * SUCCESS 0 true - The build had no errors.
# * UNSTABLE 1 true - The build had some errors but they were not fatal. For example, some tests failed.
# * FAILURE 2 false - The build had a fatal error.
# * NOT_BUILT 3 false - The module was not built.
# * ABORTED 4 false - The build was manually aborted.
- alert: JenkinsLastBuildFailed
expr: 'default_jenkins_builds_last_build_result_ordinal == 2'
for: 0m
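The change from `delta()` to `increase()` on `jenkins_runs_failure_total` matters because the metric is a counter, and a counter drops back to zero when Jenkins restarts. The sketch below is a simplified model of the two functions (it ignores the extrapolation Prometheus applies at the window edges), and the sample values are hypothetical:

```python
from typing import Sequence


def delta(samples: Sequence[float]) -> float:
    """Gauge-style difference: last sample minus first sample."""
    return samples[-1] - samples[0]


def increase(samples: Sequence[float]) -> float:
    """Counter-style growth: sum of rises, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the whole new value counts as growth.
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical failure counter over one hour; Jenkins restarts after the third sample.
samples = [900.0, 960.0, 990.0, 20.0, 45.0]

print(delta(samples))     # negative, so a "> 100" threshold can never match
print(increase(samples))  # 60 + 30 + 20 + 25 failures counted across the restart
```

With these samples the gauge-style difference is negative while the counter-style growth is 135, which is why the threshold of 100 only makes sense with `increase()`.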

@ -2,6 +2,7 @@ groups:
- name: CzerwonkJunosExporter
rules:
- alert: JuniperSwitchDown
@ -13,20 +14,20 @@ groups:
summary: Juniper switch down (instance {{ $labels.instance }})
description: "The switch appears to be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JuniperHighBandwidthUsage1gib
- alert: JuniperCriticalBandwidthUsage1gib
expr: 'rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90'
for: 1m
labels:
severity: critical
annotations:
summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
summary: Juniper critical Bandwidth Usage 1GiB (instance {{ $labels.instance }})
description: "Interface is highly saturated. (> 0.90 Gbit/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JuniperHighBandwidthUsage1gib
- alert: JuniperWarningBandwidthUsage1gib
expr: 'rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80'
for: 1m
labels:
severity: warning
annotations:
summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
summary: Juniper warning Bandwidth Usage 1GiB (instance {{ $labels.instance }})
description: "Interface is getting saturated. (> 0.80 Gbit/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
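Both Juniper bandwidth rules convert a byte rate into a bit rate (`* 8`) and compare it with a fraction of a 1 Gbit/s link (`1e+9 * 0.90` and `1e+9 * 0.80`). A minimal sketch of that arithmetic follows; the function names are illustrative only, and the link speed is the one the rules assume:

```python
from typing import Optional

LINK_SPEED_BPS = 1e9  # 1 Gbit/s, the link speed assumed by the rules


def utilization(tx_bytes_per_second: float, link_speed_bps: float = LINK_SPEED_BPS) -> float:
    """Fraction of the link in use, given a transmit rate in bytes per second."""
    return tx_bytes_per_second * 8 / link_speed_bps


def highest_severity(tx_bytes_per_second: float) -> Optional[str]:
    """Return the most severe threshold crossed, or None if neither is crossed."""
    used = utilization(tx_bytes_per_second)
    if used > 0.90:
        return "critical"
    if used > 0.80:
        return "warning"
    return None


print(highest_severity(115e6))  # 115 MB/s is 0.92 of the link
print(highest_severity(105e6))  # 105 MB/s is 0.84 of the link
print(highest_severity(50e6))   # 50 MB/s is 0.40 of the link
```

In Prometheus the two rules are evaluated independently, so an interface above 90% matches the warning rule as well as the critical one; the sketch reports only the higher severity.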

@ -2,13 +2,119 @@ groups:
- name: JvmExporter
rules:
- alert: JvmMemoryFillingUp
expr: '(sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80'
expr: '(sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80 and sum by (instance)(jvm_memory_max_bytes{area="heap"}) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: JVM memory filling up (instance {{ $labels.instance }})
description: "JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Many JVM configurations leave metaspace unbounded, in which case jvm_memory_max_bytes{area="nonheap"} is -1 and this alert will not fire.
# The query filters out max_bytes <= 0 so that a non-positive maximum produces no result instead of a meaningless ratio.
- alert: JvmNon-heapMemoryFillingUp
expr: '(sum by (instance)(jvm_memory_used_bytes{area="nonheap"}) / (sum by (instance)(jvm_memory_max_bytes{area="nonheap"}) > 0)) * 100 > 80'
for: 2m
labels:
severity: warning
annotations:
summary: JVM non-heap memory filling up (instance {{ $labels.instance }})
description: "JVM non-heap memory (metaspace/code cache) is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmGcTimeTooHigh
expr: 'sum by (instance)(rate(jvm_gc_collection_seconds_sum[5m])) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: JVM GC time too high (instance {{ $labels.instance }})
description: "JVM is spending too much time in garbage collection (> 5% of wall clock time)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmThreadsDeadlocked
expr: 'jvm_threads_deadlocked > 0'
for: 1m
labels:
severity: critical
annotations:
summary: JVM threads deadlocked (instance {{ $labels.instance }})
description: "JVM has deadlocked threads\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmThreadCountHigh
expr: 'jvm_threads_current > 300'
for: 5m
labels:
severity: warning
annotations:
summary: JVM thread count high (instance {{ $labels.instance }})
description: "JVM thread count is high (> 300), potential thread leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmThreadsBlocked
expr: 'jvm_threads_state{state="BLOCKED"} > 50'
for: 5m
labels:
severity: warning
annotations:
summary: JVM threads BLOCKED (instance {{ $labels.instance }})
description: "JVM has high number of BLOCKED threads, indicating lock contention\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This regex matches CMS, G1, and Parallel collector names. It will not match ZGC or Shenandoah cycle names.
# Adjust the gc label filter if you use a different collector.
- alert: JvmOldGenGcFrequency
expr: 'rate(jvm_gc_collection_seconds_count{gc=~".*old.*|.*major.*"}[5m]) > 0.3'
for: 5m
labels:
severity: warning
annotations:
summary: JVM old gen GC frequency (instance {{ $labels.instance }})
description: "Frequent old/major GC cycles, indicating memory pressure\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmDirectBufferPoolFillingUp
expr: '(jvm_buffer_pool_used_bytes / jvm_buffer_pool_capacity_bytes) * 100 > 90 and jvm_buffer_pool_capacity_bytes > 0'
for: 5m
labels:
severity: warning
annotations:
summary: JVM direct buffer pool filling up (instance {{ $labels.instance }})
description: "JVM direct buffer pool is filling up (> 90%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmObjectsPendingFinalization
expr: 'jvm_memory_objects_pending_finalization > 1000'
for: 5m
labels:
severity: warning
annotations:
summary: JVM objects pending finalization (instance {{ $labels.instance }})
description: "JVM has objects pending finalization, potential memory leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# process_open_fds and process_max_fds are generic metrics from the Prometheus client library, not JVM-specific.
# This alert will also fire for Go, Python, or any process exposing these metrics.
- alert: JvmFileDescriptorsExhaustion
expr: '(process_open_fds / process_max_fds) * 100 > 90 and process_max_fds > 0'
for: 5m
labels:
severity: warning
annotations:
summary: JVM file descriptors exhaustion (instance {{ $labels.instance }})
description: "JVM process is running out of file descriptors (> 90% used)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmClassLoadingAnomaly
expr: 'rate(jvm_classes_loaded_total[5m]) > 100'
for: 5m
labels:
severity: warning
annotations:
summary: JVM class loading anomaly (instance {{ $labels.instance }})
description: "Rapid class loading detected, potential classloader leak\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: JvmCompilationTimeSpike
expr: 'rate(jvm_compilation_time_seconds_total[5m]) > 0.1'
for: 5m
labels:
severity: warning
annotations:
summary: JVM compilation time spike (instance {{ $labels.instance }})
description: "Excessive JIT compilation time consuming CPU\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
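Several rules in this file add a `... > 0` guard on the denominator of a ratio. PromQL follows IEEE 754 floating-point division, so a positive value divided by zero is `+Inf`, which would satisfy `> 80`. The sketch below models that behaviour for the heap rule; the helper names and the byte values are hypothetical:

```python
import math


def promql_div(a: float, b: float) -> float:
    """IEEE 754 style division: x/0 is +/-Inf and 0/0 is NaN (Python would raise instead)."""
    if b == 0:
        if a == 0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a)
    return a / b


def heap_alert(used: float, max_bytes: float, threshold: float = 80.0, guarded: bool = True) -> bool:
    """Model of '(used / max) * 100 > threshold', optionally with 'and max > 0'."""
    fires = promql_div(used, max_bytes) * 100 > threshold
    if guarded:
        fires = fires and max_bytes > 0
    return fires


print(heap_alert(512.0, 0.0, guarded=False))  # ratio is +Inf, so the unguarded rule matches
print(heap_alert(512.0, 0.0))                 # the guard suppresses the match
print(heap_alert(900.0, 1000.0))              # 90% of a real limit still matches
print(heap_alert(512.0, -1.0))                # an unbounded pool (-1) gives a negative ratio
```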

@ -2,22 +2,23 @@ groups:
- name: DanielqsjKafkaExporter
rules:
- alert: KafkaTopicsReplicas
expr: 'sum(kafka_topic_partition_in_sync_replica) by (topic) < 3'
expr: 'min(kafka_topic_partition_in_sync_replica) by (topic) < 3'
for: 0m
labels:
severity: critical
annotations:
summary: Kafka topics replicas (instance {{ $labels.instance }})
description: "Kafka topic in-sync partition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Kafka topic {{ $labels.topic }} has fewer than 3 in-sync replicas ({{ $value }}), data durability is at risk.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KafkaConsumersGroup
expr: 'sum(kafka_consumergroup_lag) by (consumergroup) > 50'
- alert: KafkaConsumerGroupLag
expr: 'sum(kafka_consumergroup_lag) by (consumergroup) > 10000'
for: 1m
labels:
severity: critical
severity: warning
annotations:
summary: Kafka consumers group (instance {{ $labels.instance }})
description: "Kafka consumers group\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Kafka consumer group lag (instance {{ $labels.instance }})
description: "Kafka consumer group {{ $labels.consumergroup }} is lagging behind ({{ $value }} messages)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
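The `KafkaTopicsReplicas` expression changes from `sum(...) by (topic)` to `min(...) by (topic)`. Summing in-sync replicas across partitions can hide an under-replicated topic as soon as it has several partitions. A small sketch with hypothetical topic data shows the difference:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def aggregate_by_topic(
    isr_by_partition: Dict[Tuple[str, int], int],
    agg: Callable[[Iterable[int]], int],
) -> Dict[str, int]:
    """Group in-sync replica counts by topic and reduce each group with `agg`."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for (topic, _partition), isr in isr_by_partition.items():
        grouped[topic].append(isr)
    return {topic: agg(values) for topic, values in grouped.items()}


# Hypothetical cluster: "orders" has three partitions with a single in-sync replica each.
isr = {
    ("orders", 0): 1,
    ("orders", 1): 1,
    ("orders", 2): 1,
    ("audit", 0): 3,
}

sum_alerts = sorted(t for t, v in aggregate_by_topic(isr, sum).items() if v < 3)
min_alerts = sorted(t for t, v in aggregate_by_topic(isr, min).items() if v < 3)

print(sum_alerts)  # the sum for "orders" is 3, so the old rule stays silent
print(min_alerts)  # the minimum for "orders" is 1, so the new rule fires
```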

@ -2,6 +2,7 @@ groups:
- name: LinkedinKafkaExporter
rules:
- alert: KafkaTopicOffsetDecreased

@ -0,0 +1,67 @@
groups:
- name: AerogearKeycloakMetricsSpi
rules:
# Threshold of 5% is a rough default. Adjust based on your user base and expected error rates.
# A spike in failed logins may indicate a brute-force attack or misconfigured client.
- alert: KeycloakHighLoginFailureRate
expr: '(sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])) / (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m])))) * 100 > 5 and (sum by (realm) (rate(keycloak_logins_total[5m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[5m]))) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Keycloak high login failure rate (instance {{ $labels.instance }})
description: "More than 5% of login attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Only fires when login attempts exist but none succeed — may indicate an authentication outage.
- alert: KeycloakNoSuccessfulLogins
expr: 'sum by (realm) (rate(keycloak_logins_total[15m])) == 0 and (sum by (realm) (rate(keycloak_logins_total[15m])) + sum by (realm) (rate(keycloak_failed_login_attempts_total[15m]))) > 0'
for: 5m
labels:
severity: critical
annotations:
summary: Keycloak no successful logins (instance {{ $labels.instance }})
description: "No successful logins in realm {{ $labels.realm }} for the last 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 10% is a rough default. High refresh token errors may indicate expired sessions or token store issues.
- alert: KeycloakHighTokenRefreshErrorRate
expr: '(sum by (realm) (rate(keycloak_refresh_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_refresh_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_refresh_tokens_total[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Keycloak high token refresh error rate (instance {{ $labels.instance }})
description: "More than 10% of token refresh attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 10% is a rough default. Code-to-token failures may indicate misconfigured OAuth clients or replay attacks.
- alert: KeycloakHighCode-to-tokenExchangeErrorRate
expr: '(sum by (realm) (rate(keycloak_code_to_tokens_errors_total[5m])) / sum by (realm) (rate(keycloak_code_to_tokens_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_code_to_tokens_total[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Keycloak high code-to-token exchange error rate (instance {{ $labels.instance }})
description: "More than 10% of code-to-token exchanges are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 10% is a rough default.
- alert: KeycloakHighRegistrationFailureRate
expr: '(sum by (realm) (rate(keycloak_registrations_errors_total[5m])) / sum by (realm) (rate(keycloak_registrations_total[5m]))) * 100 > 10 and sum by (realm) (rate(keycloak_registrations_total[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Keycloak high registration failure rate (instance {{ $labels.instance }})
description: "More than 10% of registration attempts are failing in realm {{ $labels.realm }} (current value: {{ $value | printf \"%.1f\" }}%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# keycloak_request_duration is in milliseconds. Threshold of 2000ms (2 seconds) is a rough default.
- alert: KeycloakSlowRequestResponseTime
expr: 'sum by (method) (rate(keycloak_request_duration_sum[5m])) / sum by (method) (rate(keycloak_request_duration_count[5m])) > 2000 and sum by (method) (rate(keycloak_request_duration_count[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Keycloak slow request response time (instance {{ $labels.instance }})
description: "Keycloak {{ $labels.method }} requests are taking more than 2 seconds on average.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
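The login failure rule divides failed attempts by the sum of `keycloak_logins_total` and `keycloak_failed_login_attempts_total`, which suggests that the first counter tracks successful logins only; that reading is an assumption drawn from the rule itself. Under that assumption the percentage can be sketched as follows, with hypothetical per-second rates:

```python
from typing import Optional


def login_failure_percent(success_rate: float, failed_rate: float) -> Optional[float]:
    """Failed attempts as a percentage of all attempts, or None when there is no traffic."""
    attempts = success_rate + failed_rate
    if attempts <= 0:
        # Mirrors the 'and (...) > 0' guard: no attempts means no result, not 0% or NaN.
        return None
    return failed_rate / attempts * 100


print(login_failure_percent(3.0, 1.0))  # one failure in four attempts
print(login_failure_percent(0.0, 2.0))  # attempts exist but none succeed
print(login_failure_percent(0.0, 0.0))  # idle realm
```

The second case is also the situation `KeycloakNoSuccessfulLogins` targets: attempts exist, but the success rate is zero.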

@ -2,6 +2,7 @@ groups:
- name: KubestateExporter
rules:
- alert: KubernetesNodeNotReady
@ -10,16 +11,27 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes Node ready (node {{ $labels.node }})
summary: Kubernetes Node not ready (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Kubernetes nodes with scheduling disabled are fine.
# This alert can be useful as a warning when nodes stay unschedulable for a long time.
- alert: KubernetesNodeSchedulingDisabled
expr: 'kube_node_spec_taint{key="node.kubernetes.io/unschedulable"} == 1'
for: 30m
labels:
severity: warning
annotations:
summary: Kubernetes Node scheduling disabled (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has been marked as unschedulable for more than 30 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesNodeMemoryPressure
expr: 'kube_node_status_condition{condition="MemoryPressure",status="true"} == 1'
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes memory pressure (node {{ $labels.node }})
summary: Kubernetes Node memory pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesNodeDiskPressure
@ -28,7 +40,7 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes disk pressure (node {{ $labels.node }})
summary: Kubernetes Node disk pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesNodeNetworkUnavailable
@ -41,7 +53,7 @@ groups:
description: "Node {{ $labels.node }} has NetworkUnavailable condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesNodeOutOfPodCapacity
expr: 'sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90'
expr: 'sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid, instance) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90'
for: 2m
labels:
severity: warning
@ -55,7 +67,7 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes container oom killer ({{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.container }})
summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesJobFailed
@ -64,16 +76,34 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes Job failed ({{ $labels.namespace }}/{{ $labels.job_name }})
summary: Kubernetes Job failed (instance {{ $labels.instance }})
description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesJobNotStarting
expr: 'kube_job_status_active == 0 and kube_job_status_failed == 0 and kube_job_status_succeeded == 0 and (time() - kube_job_status_start_time) > 600'
for: 0m
labels:
severity: warning
annotations:
summary: Kubernetes Job not starting (instance {{ $labels.instance }})
description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} did not start for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesCronjobFailing
expr: '(kube_cronjob_status_last_schedule_time > kube_cronjob_status_last_successful_time) AND (kube_cronjob_status_active == 0) AND (kube_cronjob_spec_suspend == 0)'
for: 0m
labels:
severity: critical
annotations:
summary: Kubernetes CronJob failing (instance {{ $labels.instance }})
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is failing\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesCronjobSuspended
expr: 'kube_cronjob_spec_suspend != 0'
for: 0m
labels:
severity: warning
annotations:
summary: Kubernetes CronJob suspended ({{ $labels.namespace }}/{{ $labels.cronjob }})
summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesPersistentvolumeclaimPending
@ -82,11 +112,11 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes PersistentVolumeClaim pending ({{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }})
summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})
description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesVolumeOutOfDiskSpace
expr: 'kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10'
expr: 'kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10 and kubelet_volume_stats_capacity_bytes > 0'
for: 2m
labels:
severity: warning
@ -104,12 +134,12 @@ groups:
description: "Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesPersistentvolumeError
expr: 'kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0'
expr: 'kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Kubernetes PersistentVolumeClaim pending ({{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }})
summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
description: "Persistent volume {{ $labels.persistentvolume }} is in bad state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesStatefulsetDown
@ -118,7 +148,7 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes StatefulSet down ({{ $labels.namespace }}/{{ $labels.statefulset }})
summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesHpaScaleInability
@ -140,7 +170,7 @@ groups:
description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesHpaScaleMaximum
expr: 'kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas'
expr: '(kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas)'
for: 2m
labels:
severity: info
@ -163,7 +193,7 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})
summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesPodCrashLooping
@ -172,7 +202,7 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes pod crash looping ({{ $labels.namespace }}/{{ $labels.pod }})
summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesReplicasetReplicasMismatch
@ -181,7 +211,7 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes ReplicasSet mismatch ({{ $labels.namespace }}/{{ $labels.replicaset }})
summary: Kubernetes ReplicaSet replicas mismatch (instance {{ $labels.instance }})
description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesDeploymentReplicasMismatch
@ -190,7 +220,7 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes Deployment replicas mismatch ({{ $labels.namespace }}/{{ $labels.deployment }})
summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesStatefulsetReplicasMismatch
@ -208,7 +238,7 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes Deployment generation mismatch ({{ $labels.namespace }}/{{ $labels.deployment }})
summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesStatefulsetGenerationMismatch
@ -217,7 +247,7 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes StatefulSet generation mismatch ({{ $labels.namespace }}/{{ $labels.statefulset }})
summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})
description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesStatefulsetUpdateNotRolledOut
@ -226,16 +256,16 @@ groups:
labels:
severity: warning
annotations:
summary: Kubernetes StatefulSet update not rolled out ({{ $labels.namespace }}/{{ $labels.statefulset }})
summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }})
description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesDaemonsetRolloutStuck
expr: 'kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0'
expr: '(kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 and kube_daemonset_status_desired_number_scheduled > 0) or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0'
for: 10m
labels:
severity: warning
annotations:
summary: Kubernetes DaemonSet rollout stuck ({{ $labels.namespace }}/{{ $labels.daemonset }})
summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesDaemonsetMisscheduled
@ -244,16 +274,17 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes DaemonSet misscheduled ({{ $labels.namespace }}/{{ $labels.daemonset }})
summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})
description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold should be customized for each cronjob name.
- alert: KubernetesCronjobTooLong
expr: 'time() - kube_cronjob_next_schedule_time > 3600'
expr: 'kube_job_status_start_time > 0 and absent(kube_job_status_completion_time) and (time() - kube_job_status_start_time) > 3600'
for: 0m
labels:
severity: warning
annotations:
summary: Kubernetes CronJob too long ({{ $labels.namespace }}/{{ $labels.cronjob }})
summary: Kubernetes CronJob too long (instance {{ $labels.instance }})
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesJobSlowCompletion
@ -262,26 +293,26 @@ groups:
labels:
severity: critical
annotations:
summary: Kubernetes job slow completion ({{ $labels.namespace }}/{{ $labels.job_name }})
summary: Kubernetes Job slow completion (instance {{ $labels.instance }})
description: "Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesApiServerErrors
expr: 'sum(rate(apiserver_request_total{job="apiserver",code=~"^(?:5..)$"}[1m])) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) * 100 > 3'
expr: 'sum(rate(apiserver_request_total{job="apiserver",code=~"(?:5..)"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) * 100 > 3 and sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) > 0'
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes API server errors (instance {{ $labels.instance }})
description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Kubernetes API server is experiencing {{ $value | humanize }}% error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesApiClientErrors
expr: '(sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1'
expr: '(sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1 and sum(rate(rest_client_requests_total[1m])) by (instance, job) > 0'
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes API client errors (instance {{ $labels.instance }})
description: "Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Kubernetes API client is experiencing {{ $value | humanize }}% error rate\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesClientCertificateExpiresNextWeek
expr: 'apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60'
@ -302,7 +333,7 @@ groups:
description: "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesApiServerLatency
expr: 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!="log",verb!~"^(?:CONNECT|WATCHLIST|WATCH|PROXY)$"} [10m])) WITHOUT (instance, resource)) > 1'
expr: 'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"(?:CONNECT|WATCHLIST|WATCH|PROXY)"} [10m])) WITHOUT (subresource)) > 1'
for: 2m
labels:
severity: warning

@ -2,13 +2,15 @@ groups:
- name: EmbeddedExporter
rules:
# Linkerd does not expose request_errors_total. Errors are tracked via response_total{classification="failure"}.
- alert: LinkerdHighErrorRate
expr: 'sum(rate(request_errors_total[1m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10'
expr: 'sum(rate(response_total{classification="failure"}[1m])) by (deployment, statefulset, daemonset) / sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10 and sum(rate(response_total[1m])) by (deployment, statefulset, daemonset) > 0'
for: 1m
labels:
severity: warning
annotations:
summary: Linkerd high error rate (instance {{ $labels.instance }})
description: "Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Linkerd error rate for {{ $labels.deployment }}{{ $labels.statefulset }}{{ $labels.daemonset }} is over 10%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,36 @@
groups:
- name: EmbeddedExporter
rules:
# The threshold (1) is in USD. The `model` label carries the resolved model-name (post-routing).
# PromQL `increase()` needs at least two samples in the range, with the counter growing between them, to return a positive value.
# For a brand-new counter series this means at least two distinct request bursts, at least one scrape interval apart.
- alert: LitellmProviderSpendOverBudget
expr: 'sum(increase(litellm_spend_metric_total{model=~"(claude-|anthropic/).*"}[24h])) > 1'
for: 5m
labels:
severity: warning
annotations:
summary: LiteLLM provider spend over budget (instance {{ $labels.instance }})
description: "Cumulative spend for an LLM provider has exceeded the daily budget threshold. Replace the regex `(claude-|anthropic/).*` with your provider's model-name pattern. Useful as a soft-warning when `provider_budget_config` hard-cap is unavailable or disabled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: LitellmProxyFailedRequestsRateHigh
expr: 'sum(rate(litellm_proxy_failed_requests_metric_total[5m])) / sum(rate(litellm_proxy_total_requests_metric_total[5m])) > 0.05'
for: 10m
labels:
severity: warning
annotations:
summary: LiteLLM proxy failed requests rate high (instance {{ $labels.instance }})
description: "LiteLLM proxy is returning failed responses to clients (>5% error rate over 5min). Investigate downstream LLM provider availability or auth issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: LitellmRequestLatencyP95High
expr: 'histogram_quantile(0.95, sum(rate(litellm_request_total_latency_metric_bucket[5m])) by (le)) > 10'
for: 10m
labels:
severity: warning
annotations:
summary: LiteLLM request latency p95 high (instance {{ $labels.instance }})
description: "LiteLLM request total latency p95 exceeds 10 seconds over 5min. Check downstream LLM provider response-times and proxy queue-depth.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: LokiProcessTooManyRestarts
@ -14,28 +15,28 @@ groups:
description: "A loki process had too many restarts (target {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: LokiRequestErrors
expr: '100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10'
expr: '100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10 and sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 0'
for: 15m
labels:
severity: critical
annotations:
summary: Loki request errors (instance {{ $labels.instance }})
description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing {{ printf \"%.2f\" $value }}% errors.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: LokiRequestPanic
expr: 'sum(increase(loki_panic_total[10m])) by (namespace, job) > 0'
for: 5m
expr: 'sum(increase(loki_panic_total[5m])) by (namespace, job) > 0'
for: 0m
labels:
severity: critical
annotations:
summary: Loki request panic (instance {{ $labels.instance }})
description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "{{ $labels.job }} is experiencing {{ $value | humanize }} panic(s) in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: LokiRequestLatency
expr: '(histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1'
expr: 'histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (namespace, job, route, le)) > 1'
for: 5m
labels:
severity: critical
annotations:
summary: Loki request latency (instance {{ $labels.instance }})
description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,24 @@
groups:
- name: EmbeddedExporter
rules:
- alert: MeilisearchIndexIsEmpty
expr: 'meilisearch_index_docs_count == 0'
for: 0m
labels:
severity: warning
annotations:
summary: Meilisearch index is empty (instance {{ $labels.instance }})
description: "Meilisearch index {{ $labels.index }} has zero documents\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MeilisearchHttpResponseTime
expr: 'meilisearch_http_response_time_seconds > 0.5'
for: 0m
labels:
severity: warning
annotations:
summary: Meilisearch http response time (instance {{ $labels.instance }})
description: "Meilisearch http response time is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,91 @@
groups:
- name: MemcachedExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: MemcachedDown
expr: 'memcached_up == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Memcached down (instance {{ $labels.instance }})
description: "Memcached instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MemcachedConnectionLimitApproaching(>80%)
expr: '(memcached_current_connections / memcached_max_connections * 100) > 80 and memcached_max_connections > 0'
for: 2m
labels:
severity: warning
annotations:
summary: Memcached connection limit approaching (> 80%) (instance {{ $labels.instance }})
description: "Memcached connection usage is above 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MemcachedConnectionLimitApproaching(>95%)
expr: '(memcached_current_connections / memcached_max_connections * 100) > 95 and memcached_max_connections > 0'
for: 2m
labels:
severity: critical
annotations:
summary: Memcached connection limit approaching (> 95%) (instance {{ $labels.instance }})
description: "Memcached connection usage is above 95% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MemcachedOutOfMemoryErrors
expr: 'sum without (slab) (rate(memcached_slab_items_outofmemory_total[5m])) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: Memcached out of memory errors (instance {{ $labels.instance }})
description: "Memcached is returning out-of-memory errors on {{ $labels.instance }} ({{ $value }} errors/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# High memory usage is expected if the cache is well-utilized. This alert fires when it approaches the configured limit, which may cause evictions.
- alert: MemcachedMemoryUsageHigh(>90%)
expr: '(memcached_current_bytes / memcached_limit_bytes * 100) > 90 and memcached_limit_bytes > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Memcached memory usage high (> 90%) (instance {{ $labels.instance }})
description: "Memcached memory usage is above 90% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A sustained eviction rate indicates memory pressure. Consider increasing memcached memory limit or reducing cache usage. Threshold of 10 evictions/s is a rough default — adjust based on your workload.
- alert: MemcachedHighEvictionRate
expr: 'rate(memcached_items_evicted_total[5m]) > 10'
for: 5m
labels:
severity: warning
annotations:
summary: Memcached high eviction rate (instance {{ $labels.instance }})
description: "Memcached is evicting items at a high rate on {{ $labels.instance }} ({{ $value }} evictions/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A low hit rate may indicate poor cache utilization, incorrect cache keys, or TTLs that are too short. Threshold of 80% is a rough default — adjust based on your workload and access patterns.
- alert: MemcachedLowCacheHitRate(<80%)
expr: '(rate(memcached_commands_total{command="get", status="hit"}[5m]) / (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) * 100) < 80 and (rate(memcached_commands_total{command="get", status="hit"}[5m]) + rate(memcached_commands_total{command="get", status="miss"}[5m])) > 0'
for: 10m
labels:
severity: warning
annotations:
summary: Memcached low cache hit rate (< 80%) (instance {{ $labels.instance }})
description: "Memcached cache hit rate is below 80% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MemcachedConnectionsRejected
expr: 'increase(memcached_connections_rejected_total[5m]) > 3'
for: 5m
labels:
severity: warning
annotations:
summary: Memcached connections rejected (instance {{ $labels.instance }})
description: "Memcached is rejecting connections on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MemcachedItemsTooLarge
expr: 'increase(memcached_item_too_large_total[5m]) > 3'
for: 5m
labels:
severity: info
annotations:
summary: Memcached items too large (instance {{ $labels.instance }})
description: "Memcached is rejecting items exceeding max-item-size on {{ $labels.instance }} ({{ $value }} rejections in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,10 +2,11 @@ groups:
- name: EmbeddedExporter
rules:
- alert: MinioClusterDiskOffline
expr: 'minio_cluster_disk_offline_total > 0'
expr: 'minio_cluster_drive_offline_total > 0'
for: 0m
labels:
severity: critical
@ -23,7 +24,7 @@ groups:
description: "Minio cluster node disk is offline\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MinioDiskSpaceUsage
expr: 'disk_storage_available / disk_storage_total * 100 < 10'
expr: 'minio_cluster_capacity_raw_free_bytes / minio_cluster_capacity_raw_total_bytes * 100 < 10 and minio_cluster_capacity_raw_total_bytes > 0'
for: 0m
labels:
severity: warning

@ -2,15 +2,16 @@ groups:
- name: DcuMongodbExporter
rules:
- alert: MongodbReplicationLag
- alert: MongodbReplicationLag(dcu)
expr: 'avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10'
for: 0m
labels:
severity: critical
annotations:
summary: MongoDB replication lag (instance {{ $labels.instance }})
summary: MongoDB replication lag (DCU) (instance {{ $labels.instance }})
description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbReplicationStatus3
@ -58,38 +59,29 @@ groups:
summary: MongoDB replication Status 10 (instance {{ $labels.instance }})
description: "MongoDB Replication set member was once in a replica set but was subsequently removed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbNumberCursorsOpen
- alert: MongodbNumberCursorsOpen(dcu)
expr: 'mongodb_metrics_cursor_open{state="total_open"} > 10000'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB number cursors open (instance {{ $labels.instance }})
summary: MongoDB number cursors open (DCU) (instance {{ $labels.instance }})
description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbCursorsTimeouts
- alert: MongodbCursorsTimeouts(dcu)
expr: 'increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: MongoDB cursors timeouts (DCU) (instance {{ $labels.instance }})
description: "Too many cursors are timing out ({{ $value }} in the last minute)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbTooManyConnections
expr: 'avg by(instance) (rate(mongodb_connections{state="current"}[1m])) / avg by(instance) (sum (mongodb_connections) by (instance)) * 100 > 80'
- alert: MongodbTooManyConnections(dcu)
expr: 'mongodb_connections{state="current"} / (mongodb_connections{state="current"} + mongodb_connections{state="available"}) * 100 > 80 and (mongodb_connections{state="current"} + mongodb_connections{state="available"}) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB too many connections (instance {{ $labels.instance }})
summary: MongoDB too many connections (DCU) (instance {{ $labels.instance }})
description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbVirtualMemoryUsage
expr: '(sum(mongodb_memory{type="virtual"}) BY (instance) / sum(mongodb_memory{type="mapped"}) BY (instance)) > 3'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB virtual memory usage (instance {{ $labels.instance }})
description: "High memory usage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,35 +2,39 @@ groups:
- name: PerconaMongodbExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: MongodbDown
expr: 'mongodb_up == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
summary: MongoDB Down (instance {{ $labels.instance }})
description: "MongoDB instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# 1m delay allows a restart without triggering an alert.
- alert: MongodbReplicaMemberUnhealthy
expr: 'mongodb_rs_members_health == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
summary: Mongodb replica member unhealthy (instance {{ $labels.instance }})
description: "MongoDB replica member is not healthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbReplicationLag
- alert: MongodbReplicationLag(percona)
expr: '(mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"}) / 1000 > 10'
for: 0m
labels:
severity: critical
annotations:
summary: MongoDB replication lag (instance {{ $labels.instance }})
summary: MongoDB replication lag (Percona) (instance {{ $labels.instance }})
description: "MongoDB replication lag is more than 10s\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This query mixes old (mongodb_mongod_*) and new (mongodb_rs_*) metric names. It requires the Percona exporter to run with --compatible-mode to expose both.
- alert: MongodbReplicationHeadroom
expr: 'sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"})) <= 0'
for: 0m
@ -40,38 +44,29 @@ groups:
summary: MongoDB replication headroom (instance {{ $labels.instance }})
description: "MongoDB replication headroom is <= 0\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbNumberCursorsOpen
- alert: MongodbNumberCursorsOpen(percona)
expr: 'mongodb_ss_metrics_cursor_open{csr_type="total"} > 10 * 1000'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB number cursors open (instance {{ $labels.instance }})
summary: MongoDB number cursors open (Percona) (instance {{ $labels.instance }})
description: "Too many cursors opened by MongoDB for clients (> 10k)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbCursorsTimeouts
- alert: MongodbCursorsTimeouts(percona)
expr: 'increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
description: "Too many cursors are timing out\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: MongoDB cursors timeouts (Percona) (instance {{ $labels.instance }})
description: "Too many cursors are timing out ({{ $value }} in the last minute)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbTooManyConnections
expr: 'avg by(instance) (rate(mongodb_ss_connections{conn_type="current"}[1m])) / avg by(instance) (sum (mongodb_ss_connections) by (instance)) * 100 > 80'
- alert: MongodbTooManyConnections(percona)
expr: 'mongodb_ss_connections{conn_type="current"} / (mongodb_ss_connections{conn_type="current"} + mongodb_ss_connections{conn_type="available"}) * 100 > 80 and (mongodb_ss_connections{conn_type="current"} + mongodb_ss_connections{conn_type="available"}) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB too many connections (instance {{ $labels.instance }})
summary: MongoDB too many connections (Percona) (instance {{ $labels.instance }})
description: "Too many connections (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MongodbVirtualMemoryUsage
expr: '(sum(mongodb_ss_mem_virtual) BY (instance) / sum(mongodb_ss_mem_resident) BY (instance)) > 3'
for: 2m
labels:
severity: warning
annotations:
summary: MongoDB virtual memory usage (instance {{ $labels.instance }})
description: "High memory usage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: StefanprodanMgobExporter
rules:
- alert: MgobBackupFailed

@ -2,11 +2,13 @@ groups:
- name: MysqldExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: MysqlDown
expr: 'mysql_up == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -14,7 +16,7 @@ groups:
description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlTooManyConnections(>80%)
expr: 'max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80'
expr: 'max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80 and mysql_global_variables_max_connections > 0'
for: 2m
labels:
severity: warning
@ -23,7 +25,7 @@ groups:
description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlHighPreparedStatementsUtilization(>80%)
expr: 'max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80'
expr: 'max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80 and mysql_global_variables_max_prepared_stmt_count > 0'
for: 2m
labels:
severity: warning
@ -32,7 +34,7 @@ groups:
description: "High utilization of prepared statements (>80%) on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlHighThreadsRunning
expr: 'max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60'
expr: 'max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60 and mysql_global_variables_max_connections > 0'
for: 2m
labels:
severity: warning
@ -40,18 +42,20 @@ groups:
summary: MySQL high threads running (instance {{ $labels.instance }})
description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# 1m delay allows a restart without triggering an alert.
- alert: MysqlSlaveIoThreadNotRunning
expr: '( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# 1m delay allows a restart without triggering an alert.
- alert: MysqlSlaveSqlThreadNotRunning
expr: '( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -67,23 +71,25 @@ groups:
summary: MySQL Slave replication lag (instance {{ $labels.instance }})
description: "MySQL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# mysqld_exporter exposes SHOW GLOBAL STATUS variables as untyped/gauge, so delta() is used instead of increase().
- alert: MysqlSlowQueries
expr: 'increase(mysql_global_status_slow_queries[1m]) > 0'
expr: 'delta(mysql_global_status_slow_queries[1m]) > 0'
for: 2m
labels:
severity: warning
annotations:
summary: MySQL slow queries (instance {{ $labels.instance }})
description: "MySQL server mysql has some new slow query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "MySQL server has some new slow queries ({{ $value }} in the last minute).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# mysqld_exporter exposes SHOW GLOBAL STATUS variables as untyped/gauge, so deriv() is used instead of rate().
- alert: MysqlInnodbLogWaits
expr: 'rate(mysql_global_status_innodb_log_waits[15m]) > 10'
expr: 'deriv(mysql_global_status_innodb_log_waits[15m]) > 10'
for: 0m
labels:
severity: warning
annotations:
summary: MySQL InnoDB log waits (instance {{ $labels.instance }})
description: "MySQL innodb log writes stalling\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "MySQL innodb log writes stalling ({{ $value }} waits/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlRestarted
expr: 'mysql_global_status_uptime < 60'
@ -93,3 +99,40 @@ groups:
annotations:
summary: MySQL restarted (instance {{ $labels.instance }})
description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# mysqld_exporter exposes SHOW GLOBAL STATUS variables as untyped/gauge, so deriv() is used instead of irate().
- alert: MysqlHighQps
expr: 'deriv(mysql_global_status_questions[1m]) > 10000'
for: 2m
labels:
severity: info
annotations:
summary: MySQL High QPS (instance {{ $labels.instance }})
description: "MySQL is being overloaded with unusual QPS (> 10k QPS).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlTooManyOpenFiles
expr: 'mysql_global_status_innodb_num_open_files / mysql_global_variables_open_files_limit * 100 > 75 and mysql_global_variables_open_files_limit > 0'
for: 2m
labels:
severity: warning
annotations:
summary: MySQL too many open files (instance {{ $labels.instance }})
description: "MySQL has too many open files; consider increasing the open_files_limit variable on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlInnodbForceRecoveryIsEnabled
expr: 'mysql_global_variables_innodb_force_recovery != 0'
for: 2m
labels:
severity: warning
annotations:
summary: MySQL InnoDB Force Recovery is enabled (instance {{ $labels.instance }})
description: "MySQL InnoDB force recovery is enabled on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: MysqlInnodbHistory_lenTooLong
expr: 'mysql_info_schema_innodb_metrics_transaction_trx_rseg_history_len > 50000'
for: 2m
labels:
severity: warning
annotations:
summary: MySQL InnoDB history_len too long (instance {{ $labels.instance }})
description: "MySQL history_len (undo log) too long on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,40 +2,126 @@ groups:
- name: NatsExporter
rules:
- alert: NatsHighConnectionCount
expr: 'gnatsd_varz_connections > 100'
for: 3m
labels:
severity: warning
annotations:
summary: Nats high connection count (instance {{ $labels.instance }})
description: "High number of NATS connections ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighPendingBytes
expr: 'gnatsd_connz_pending_bytes > 100000'
for: 3m
labels:
severity: warning
annotations:
summary: Nats high pending bytes (instance {{ $labels.instance }})
description: "High number of NATS pending bytes ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighSubscriptionsCount
expr: 'gnatsd_connz_subscriptions > 50'
for: 3m
labels:
severity: warning
annotations:
summary: Nats high subscriptions count (instance {{ $labels.instance }})
description: "High number of NATS subscriptions ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighRoutesCount
expr: 'gnatsd_routez_num_routes > 10'
expr: 'gnatsd_varz_routes > 10'
for: 3m
labels:
severity: warning
annotations:
summary: Nats high routes count (instance {{ $labels.instance }})
description: "High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighMemoryUsage
expr: 'gnatsd_varz_mem > 200 * 1024 * 1024'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high memory usage (instance {{ $labels.instance }})
description: "NATS server memory usage is above 200MB for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsSlowConsumers
expr: 'gnatsd_varz_slow_consumers > 0'
for: 3m
labels:
severity: critical
annotations:
summary: Nats slow consumers (instance {{ $labels.instance }})
description: "There are slow consumers in NATS for {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Replace job="nats" with the actual job name in your Prometheus configuration.
- alert: NatsServerDown
expr: 'absent(up{job="nats"})'
for: 5m
labels:
severity: critical
annotations:
summary: Nats server down (instance {{ $labels.instance }})
description: "NATS server has been down for more than 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# gnatsd_varz_cpu is a gauge reporting CPU percentage (0-100 scale).
- alert: NatsHighCpuUsage
expr: 'gnatsd_varz_cpu > 80'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high CPU usage (instance {{ $labels.instance }})
description: "NATS server is using more than 80% CPU for the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighNumberOfConnections
expr: 'gnatsd_connz_num_connections > 1000'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high number of connections (instance {{ $labels.instance }})
description: "NATS server has more than 1000 active connections\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighJetstreamStoreUsage
expr: 'gnatsd_varz_jetstream_stats_storage / gnatsd_varz_jetstream_config_max_storage > 0.8 and gnatsd_varz_jetstream_config_max_storage > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high JetStream store usage (instance {{ $labels.instance }})
description: "JetStream store usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighJetstreamMemoryUsage
expr: 'gnatsd_varz_jetstream_stats_memory / gnatsd_varz_jetstream_config_max_memory > 0.8 and gnatsd_varz_jetstream_config_max_memory > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high JetStream memory usage (instance {{ $labels.instance }})
description: "JetStream memory usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighNumberOfSubscriptions
expr: 'gnatsd_varz_subscriptions > 1000'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high number of subscriptions (instance {{ $labels.instance }})
description: "NATS server has more than 1000 active subscriptions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsHighPendingBytes
expr: 'gnatsd_connz_pending_bytes > 100000'
for: 5m
labels:
severity: warning
annotations:
summary: Nats high pending bytes (instance {{ $labels.instance }})
description: "NATS server has more than 100,000 pending bytes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsTooManyErrors
expr: 'increase(gnatsd_varz_jetstream_stats_api_errors[5m]) > 5'
for: 5m
labels:
severity: warning
annotations:
summary: Nats too many errors (instance {{ $labels.instance }})
description: "NATS server has encountered {{ $value }} JetStream API errors in the last 5 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NatsJetstreamAccountsExceeded
expr: 'sum(gnatsd_varz_jetstream_stats_accounts) > 100'
for: 5m
labels:
severity: warning
annotations:
summary: Nats JetStream accounts exceeded (instance {{ $labels.instance }})
description: "JetStream has more than 100 active accounts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Only enable this alert if your deployment requires leaf node connections.
# This will fire spuriously if leaf nodes are not configured.
- alert: NatsLeafNodeConnectionIssue
expr: 'gnatsd_varz_leafnodes == 0'
for: 5m
labels:
severity: warning
annotations:
summary: Nats leaf node connection issue (instance {{ $labels.instance }})
description: "No leaf node connections on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,10 +2,12 @@ groups:
- name: EmbeddedExporter
rules:
# This is a gauge metric (not a counter). Checking idle < 20% means CPU usage > 80%.
- alert: NetdataHighCpuUsage
expr: 'rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80'
expr: 'netdata_cpu_cpu_percentage_average{dimension="idle"} < 20'
for: 5m
labels:
severity: warning
@ -13,17 +15,17 @@ groups:
summary: Netdata high cpu usage (instance {{ $labels.instance }})
description: "Netdata high CPU usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: HostCpuStealNoisyNeighbor
expr: 'rate(netdata_cpu_cpu_percentage_average{dimension="steal"}[1m]) > 10'
- alert: NetdataCpuStealNoisyNeighbor
expr: 'netdata_cpu_cpu_percentage_average{dimension="steal"} > 10'
for: 5m
labels:
severity: warning
annotations:
summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
summary: Netdata CPU steal noisy neighbor (instance {{ $labels.instance }})
description: "CPU steal is > 10%. A noisy neighbor is degrading VM performance, or a spot instance may be out of credit.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NetdataHighMemoryUsage
expr: '100 / netdata_system_ram_MB_average * netdata_system_ram_MB_average{dimension=~"free|cached"} < 20'
expr: '100 / netdata_system_ram_MiB_average * netdata_system_ram_MiB_average{dimension=~"free|cached"} < 20 and netdata_system_ram_MiB_average > 0'
for: 5m
labels:
severity: warning
@ -32,7 +34,7 @@ groups:
description: "Netdata high memory usage (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NetdataLowDiskSpace
expr: '100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~"avail|cached"} < 20'
expr: '100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~"avail|cached"} < 20 and netdata_disk_space_GB_average > 0'
for: 5m
labels:
severity: warning
@ -65,7 +67,7 @@ groups:
severity: info
annotations:
summary: Netdata disk reallocated sectors (instance {{ $labels.instance }})
description: "Reallocated sectors on disk\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Disk reallocated sectors detected ({{ $value }} sectors)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NetdataDiskCurrentPendingSector
expr: 'netdata_smartd_log_current_pending_sector_count_sectors_average > 0'
@ -83,4 +85,4 @@ groups:
severity: warning
annotations:
summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }})
description: "Reported uncorrectable disk sectors\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Reported uncorrectable disk sectors ({{ $value }} sectors)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,10 +2,11 @@ groups:
- name: KnyarNginxExporter
rules:
- alert: NginxHighHttp4xxErrorRate
expr: 'sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5'
expr: 'sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'
for: 1m
labels:
severity: critical
@ -14,7 +15,7 @@ groups:
description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NginxHighHttp5xxErrorRate
expr: 'sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5'
expr: 'sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 and sum(rate(nginx_http_requests_total[1m])) > 0'
for: 1m
labels:
severity: critical
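
The two Nginx hunks above append `and sum(rate(nginx_http_requests_total[1m])) > 0` so the error ratio is only evaluated when there is traffic. A minimal Python sketch of that guard follows; the function names are hypothetical and the code only imitates the floating-point behavior (0/0 is NaN, x/0 is +Inf), it is not how Prometheus evaluates rules.

```python
import math


def error_rate_percent(errors_per_s: float, total_per_s: float) -> float:
    """Share of requests that are errors, in percent, with IEEE 754 style division."""
    if total_per_s == 0:
        # Python raises on float division by zero, so reproduce the
        # IEEE 754 results by hand: 0/0 is NaN, x/0 is +Inf for x > 0.
        return math.nan if errors_per_s == 0 else math.inf
    return errors_per_s / total_per_s * 100


def should_alert(errors_per_s: float, total_per_s: float, threshold: float = 5.0) -> bool:
    """Rough equivalent of: ratio * 100 > threshold and total > 0."""
    return error_rate_percent(errors_per_s, total_per_s) > threshold and total_per_s > 0


if __name__ == "__main__":
    print(should_alert(10, 100))  # 10% of requests are errors
    print(should_alert(3, 0))     # no traffic recorded: the guard suppresses the alert
```

Without the `total_per_s > 0` clause, the second call would compare +Inf against the threshold and report an alert.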

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: NomadJobFailed
@ -11,7 +12,7 @@ groups:
severity: warning
annotations:
summary: Nomad job failed (instance {{ $labels.instance }})
description: "Nomad job failed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Nomad job {{ $labels.job }} has {{ $value }} failed allocations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NomadJobLost
expr: 'nomad_nomad_job_summary_lost > 0'
@ -20,7 +21,7 @@ groups:
severity: warning
annotations:
summary: Nomad job lost (instance {{ $labels.instance }})
description: "Nomad job lost\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Nomad job {{ $labels.job }} has {{ $value }} lost allocations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NomadJobQueued
expr: 'nomad_nomad_job_summary_queued > 0'
@ -29,7 +30,7 @@ groups:
severity: warning
annotations:
summary: Nomad job queued (instance {{ $labels.instance }})
description: "Nomad job queued\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Nomad job {{ $labels.job }} has {{ $value }} queued allocations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: NomadBlockedEvaluation
expr: 'nomad_nomad_blocked_evals_total_blocked > 0'
@ -38,4 +39,4 @@ groups:
severity: warning
annotations:
summary: Nomad blocked evaluation (instance {{ $labels.instance }})
description: "Nomad blocked evaluation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Nomad has {{ $value }} blocked evaluations. The cluster may lack resources to place allocations.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: OpenebsUsedPoolCapacity

@ -0,0 +1,60 @@
groups:
- name: OpensearchProjectOpensearchPrometheusExporter
rules:
- alert: OpensearchIsUnhealthy
expr: 'opensearch_cluster_status != 0'
for: 0m
labels:
severity: critical
annotations:
summary: OpenSearch is unhealthy (instance {{ $labels.instance }})
description: "OpenSearch cluster {{ $labels.cluster }} is unhealthy\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpensearchHighHeapUsage
expr: 'opensearch_jvm_mem_heap_used_percent > 90'
for: 5m
labels:
severity: warning
annotations:
summary: OpenSearch high heap usage (instance {{ $labels.instance }})
description: "OpenSearch heap usage on cluster {{ $labels.cluster }} is too high\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpensearchCircuitbreakerTripped
expr: 'opensearch_circuitbreaker_tripped_count > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenSearch circuitbreaker tripped (instance {{ $labels.instance }})
description: "The circuitbreaker on OpenSearch cluster {{ $labels.cluster }} has tripped to prevent Java OutOfMemoryError\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpensearchHasPendingTasks
expr: 'opensearch_cluster_pending_tasks_number > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenSearch has pending tasks (instance {{ $labels.instance }})
description: "OpenSearch cluster {{ $labels.cluster }} has pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpensearchIndexingIsThrottled
expr: 'opensearch_indices_indexing_is_throttled_bool > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenSearch indexing is throttled (instance {{ $labels.instance }})
description: "Indexing on OpenSearch cluster {{ $labels.cluster }} is throttled\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpensearchHasInactiveShards
expr: 'opensearch_cluster_shards_active_percent < 100.0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenSearch has inactive shards (instance {{ $labels.instance }})
description: "OpenSearch cluster {{ $labels.cluster }} has inactive shards\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,192 @@
groups:
- name: OpenstackExporter
rules:
# Adjust the job label regex to match the actual job name in your Prometheus scrape config.
- alert: OpenstackExporterDown
expr: 'up{job=~".*openstack.*"} == 0'
for: 2m
labels:
severity: critical
annotations:
summary: OpenStack exporter down (instance {{ $labels.instance }})
description: "The OpenStack exporter is down. OpenStack cloud metrics are no longer being collected.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNovaAgentDown
expr: 'openstack_nova_agent_state{adminState="enabled"} == 0'
for: 2m
labels:
severity: critical
annotations:
summary: OpenStack Nova agent down (instance {{ $labels.instance }})
description: "Nova agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNeutronAgentDown
expr: 'openstack_neutron_agent_state{adminState="up"} == 0'
for: 2m
labels:
severity: critical
annotations:
summary: OpenStack Neutron agent down (instance {{ $labels.instance }})
description: "Neutron agent {{ $labels.hostname }} ({{ $labels.service }}) is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackCinderAgentDown
expr: 'openstack_cinder_agent_state{adminState="enabled"} == 0'
for: 2m
labels:
severity: critical
annotations:
summary: OpenStack Cinder agent down (instance {{ $labels.instance }})
description: "Cinder agent {{ $labels.hostname }} ({{ $labels.service }}) is down in zone {{ $labels.zone }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns.
- alert: OpenstackHypervisorHighVcpuUsage
expr: 'openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.9 and openstack_nova_vcpus_available > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack hypervisor high vCPU usage (instance {{ $labels.instance }})
description: "Hypervisor {{ $labels.hostname }} vCPU usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The threshold of 90% is a rough default. Adjust based on your overcommit ratio and workload patterns.
- alert: OpenstackHypervisorHighMemoryUsage
expr: 'openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.9 and openstack_nova_memory_available_bytes > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack hypervisor high memory usage (instance {{ $labels.instance }})
description: "Hypervisor {{ $labels.hostname }} memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackHypervisorHighDiskUsage
expr: 'openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.9 and openstack_nova_local_storage_available_bytes > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack hypervisor high disk usage (instance {{ $labels.instance }})
description: "Hypervisor {{ $labels.hostname }} local disk usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A value of -1 for limits_vcpus_max means unlimited quota (no limit set).
- alert: OpenstackNovaTenantVcpuQuotaNearlyExhausted
expr: 'openstack_nova_limits_vcpus_used / openstack_nova_limits_vcpus_max > 0.9 and openstack_nova_limits_vcpus_max > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenStack Nova tenant vCPU quota nearly exhausted (instance {{ $labels.instance }})
description: "Tenant {{ $labels.tenant }} has used over 90% of its vCPU quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNovaTenantMemoryQuotaNearlyExhausted
expr: 'openstack_nova_limits_memory_used / openstack_nova_limits_memory_max > 0.9 and openstack_nova_limits_memory_max > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenStack Nova tenant memory quota nearly exhausted (instance {{ $labels.instance }})
description: "Tenant {{ $labels.tenant }} has used over 90% of its memory quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNovaTenantInstanceQuotaNearlyExhausted
expr: 'openstack_nova_limits_instances_used / openstack_nova_limits_instances_max > 0.9 and openstack_nova_limits_instances_max > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenStack Nova tenant instance quota nearly exhausted (instance {{ $labels.instance }})
description: "Tenant {{ $labels.tenant }} has used over 90% of its instance quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackCinderTenantVolumeQuotaNearlyExhausted
expr: 'openstack_cinder_limits_volume_used_gb / openstack_cinder_limits_volume_max_gb > 0.9 and openstack_cinder_limits_volume_max_gb > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenStack Cinder tenant volume quota nearly exhausted (instance {{ $labels.instance }})
description: "Tenant {{ $labels.tenant }} has used over 90% of its volume storage quota\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackCinderPoolLowFreeCapacity
expr: 'openstack_cinder_pool_capacity_free_gb / openstack_cinder_pool_capacity_total_gb < 0.1 and openstack_cinder_pool_capacity_total_gb > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Cinder pool low free capacity (instance {{ $labels.instance }})
description: "Cinder storage pool {{ $labels.name }} has less than 10% free capacity\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNeutronFloatingIpsAssociatedButNotActive
expr: 'openstack_neutron_floating_ips_associated_not_active > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Neutron floating IPs associated but not active (instance {{ $labels.instance }})
description: "{{ $value }} floating IPs are associated to a private IP but are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNeutronRoutersNotActive
expr: 'openstack_neutron_routers_not_active > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Neutron routers not active (instance {{ $labels.instance }})
description: "{{ $value }} Neutron routers are not in ACTIVE state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNeutronSubnetIpPoolExhaustion
expr: 'openstack_neutron_network_ip_availabilities_used / openstack_neutron_network_ip_availabilities_total > 0.9 and openstack_neutron_network_ip_availabilities_total > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenStack Neutron subnet IP pool exhaustion (instance {{ $labels.instance }})
description: "Subnet {{ $labels.subnet_name }} on network {{ $labels.network_name }} has used over 90% of its IP pool\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNeutronPortsWithoutIps
expr: 'openstack_neutron_ports_no_ips > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Neutron ports without IPs (instance {{ $labels.instance }})
description: "{{ $value }} active ports have no IP addresses assigned\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackLoadBalancerNotOnline
expr: 'openstack_loadbalancer_loadbalancer_status{operating_status!="ONLINE"} > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack load balancer not online (instance {{ $labels.instance }})
description: "Load balancer {{ $labels.name }} ({{ $labels.id }}) operating status is {{ $labels.operating_status }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackNovaInstancesInErrorState
expr: 'sum(openstack_nova_server_status{status="ERROR"}) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Nova instances in ERROR state (instance {{ $labels.instance }})
description: "{{ $value }} Nova instances are in ERROR state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpenstackCinderVolumesInErrorState
expr: 'openstack_cinder_volume_status_counter{status=~"error.*"} > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack Cinder volumes in error state (instance {{ $labels.instance }})
description: "{{ $value }} Cinder volumes are in an error state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# This alert factors in the allocation ratio to compute effective capacity.
# The threshold of 90% is a rough default. Adjust based on your allocation ratios and workload patterns.
- alert: OpenstackPlacementResourceHighUsage
expr: 'openstack_placement_resource_usage / (openstack_placement_resource_total * openstack_placement_resource_allocation_ratio) > 0.9 and openstack_placement_resource_total > 0'
for: 5m
labels:
severity: warning
annotations:
summary: OpenStack placement resource high usage (instance {{ $labels.instance }})
description: "Resource {{ $labels.resourcetype }} on host {{ $labels.hostname }} usage exceeds 90% of its allocation\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
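
The `OpenstackPlacementResourceHighUsage` expression divides usage by `total * allocation_ratio`, so the threshold applies to the overcommitted capacity rather than the physical one. A small worked example in Python, with a hypothetical helper name and made-up sample numbers:

```python
from typing import Optional


def placement_usage_ratio(usage: float, total: float, allocation_ratio: float) -> Optional[float]:
    """Usage as a fraction of effective capacity (total * allocation_ratio).

    Returns None when total is not positive, mirroring the
    `and openstack_placement_resource_total > 0` guard in the rule.
    """
    if total <= 0:
        return None
    return usage / (total * allocation_ratio)


if __name__ == "__main__":
    # Illustrative numbers only: 16 physical vCPUs, 30 allocated.
    for ratio in (2.0, 16.0):
        value = placement_usage_ratio(usage=30, total=16, allocation_ratio=ratio)
        print(f"allocation_ratio={ratio}: usage={value:.4f}, alert={value > 0.9}")
```

With an allocation ratio of 2.0 the host is at 30/32 of its effective capacity and crosses the 0.9 threshold; with a ratio of 16.0 the same allocation is far below it.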

@ -0,0 +1,128 @@
groups:
- name: EmbeddedExporter
# OpenTelemetry Collector self-monitoring metrics are exposed on port 8888 by default at the /metrics endpoint.
# These alerts monitor the collector's health when metrics are ingested via the Prometheus OTLP endpoint or scraped directly.
# All collector internal metrics are prefixed with 'otelcol_'.
rules:
# Adjust the job label regex to match the actual job name in your Prometheus scrape config.
- alert: OpentelemetryCollectorDown
expr: 'up{job=~".*otel.*collector.*"} == 0'
for: 1m
labels:
severity: critical
annotations:
summary: OpenTelemetry Collector down (instance {{ $labels.instance }})
description: "OpenTelemetry Collector instance has disappeared or is not being scraped\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorReceiverRefusedSpans
expr: 'rate(otelcol_receiver_refused_spans[5m]) > 0.05'
for: 5m
labels:
severity: critical
annotations:
summary: OpenTelemetry Collector receiver refused spans (instance {{ $labels.instance }})
description: "OpenTelemetry Collector is refusing {{ $value | humanize }}/s spans on {{ $labels.receiver }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorReceiverRefusedMetricPoints
expr: 'rate(otelcol_receiver_refused_metric_points[5m]) > 0.05'
for: 5m
labels:
severity: critical
annotations:
summary: OpenTelemetry Collector receiver refused metric points (instance {{ $labels.instance }})
description: "OpenTelemetry Collector is refusing {{ $value | humanize }}/s metric points on {{ $labels.receiver }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorReceiverRefusedLogRecords
expr: 'rate(otelcol_receiver_refused_log_records[5m]) > 0.05'
for: 5m
labels:
severity: critical
annotations:
summary: OpenTelemetry Collector receiver refused log records (instance {{ $labels.instance }})
description: "OpenTelemetry Collector is refusing {{ $value | humanize }}/s log records on {{ $labels.receiver }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorExporterFailedSpans
expr: 'rate(otelcol_exporter_send_failed_spans[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector exporter failed spans (instance {{ $labels.instance }})
description: "OpenTelemetry Collector failing to send {{ $value | humanize }}/s spans via {{ $labels.exporter }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorExporterFailedMetricPoints
expr: 'rate(otelcol_exporter_send_failed_metric_points[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector exporter failed metric points (instance {{ $labels.instance }})
description: "OpenTelemetry Collector failing to send {{ $value | humanize }}/s metric points via {{ $labels.exporter }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
- alert: OpentelemetryCollectorExporterFailedLogRecords
expr: 'rate(otelcol_exporter_send_failed_log_records[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector exporter failed log records (instance {{ $labels.instance }})
description: "OpenTelemetry Collector failing to send {{ $value | humanize }}/s log records via {{ $labels.exporter }}.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpentelemetryCollectorExporterQueueNearlyFull
expr: '(otelcol_exporter_queue_size / on(instance, job, exporter) otelcol_exporter_queue_capacity) > 0.8 and otelcol_exporter_queue_capacity > 0'
for: 0m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector exporter queue nearly full (instance {{ $labels.instance }})
description: "OpenTelemetry Collector exporter {{ $labels.exporter }} queue is over 80% full\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
# These processor metrics are deprecated since collector v0.110.0.
- alert: OpentelemetryCollectorProcessorRefusedSpans
expr: 'rate(otelcol_processor_refused_spans[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector processor refused spans (instance {{ $labels.instance }})
description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing spans ({{ $value | humanize }}/s), likely due to backpressure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 0.05/s avoids firing on transient single-event spikes.
# These processor metrics are deprecated since collector v0.110.0.
- alert: OpentelemetryCollectorProcessorRefusedMetricPoints
expr: 'rate(otelcol_processor_refused_metric_points[5m]) > 0.05'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector processor refused metric points (instance {{ $labels.instance }})
description: "OpenTelemetry Collector processor {{ $labels.processor }} is refusing metric points ({{ $value | humanize }}/s), likely due to backpressure.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpentelemetryCollectorHighMemoryUsage
expr: '(otelcol_process_runtime_heap_alloc_bytes / on(instance, job) otelcol_process_runtime_total_sys_memory_bytes) > 0.9'
for: 5m
labels:
severity: warning
annotations:
summary: OpenTelemetry Collector high memory usage (instance {{ $labels.instance }})
description: "OpenTelemetry Collector memory usage is above 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OpentelemetryCollectorOtlpReceiverErrors
expr: 'rate(otelcol_receiver_accepted_spans{receiver=~"otlp"}[5m]) == 0 and rate(otelcol_receiver_refused_spans{receiver=~"otlp"}[5m]) > 0'
for: 2m
labels:
severity: critical
annotations:
summary: OpenTelemetry Collector OTLP receiver errors (instance {{ $labels.instance }})
description: "OpenTelemetry Collector OTLP receiver is completely failing - all spans are being refused\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,84 @@
groups:
- name: IamsethOracledbExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: OracleDbDown
expr: 'oracledb_up == 0'
for: 1m
labels:
severity: critical
annotations:
summary: Oracle DB down (instance {{ $labels.instance }})
description: "Oracle Database instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is workload-dependent. Adjust 85% to suit your environment.
- alert: OracleDbSessionsReachingLimit(>85%)
expr: 'oracledb_resource_current_utilization{resource_name="sessions"} / oracledb_resource_limit_value{resource_name="sessions"} * 100 > 85 and oracledb_resource_limit_value{resource_name="sessions"} > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB sessions reaching limit (> 85%) (instance {{ $labels.instance }})
description: "Oracle Database session utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is workload-dependent. Adjust 85% to suit your environment.
- alert: OracleDbProcessesReachingLimit(>85%)
expr: 'oracledb_resource_current_utilization{resource_name="processes"} / oracledb_resource_limit_value{resource_name="processes"} * 100 > 85 and oracledb_resource_limit_value{resource_name="processes"} > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB processes reaching limit (> 85%) (instance {{ $labels.instance }})
description: "Oracle Database process utilization is above 85% on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OracleDbTablespaceReachingCapacity(>85%)
expr: 'oracledb_tablespace_used_percent > 85'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB tablespace reaching capacity (> 85%) (instance {{ $labels.instance }})
description: "Oracle Database tablespace {{ $labels.tablespace }} is above 85% usage on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: OracleDbTablespaceFull(>95%)
expr: 'oracledb_tablespace_used_percent > 95'
for: 5m
labels:
severity: critical
annotations:
summary: Oracle DB tablespace full (> 95%) (instance {{ $labels.instance }})
description: "Oracle Database tablespace {{ $labels.tablespace }} is critically full on {{ $labels.instance }} (current value: {{ $value }}%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# A high rollback rate (>20%) often indicates application-level issues such as deadlocks, constraint violations, or poorly designed transactions.
- alert: OracleDbHighUserRollbacks
expr: 'rate(oracledb_activity_user_rollbacks[5m]) / (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) * 100 > 20 and (rate(oracledb_activity_user_commits[5m]) + rate(oracledb_activity_user_rollbacks[5m])) > 0'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB high user rollbacks (instance {{ $labels.instance }})
description: "Oracle Database on {{ $labels.instance }} has a high rollback rate ({{ $value }}% of transactions are rolled back)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold is highly workload-dependent. Adjust 200 to suit your environment.
- alert: OracleDbTooManyActiveSessions
expr: 'oracledb_sessions_value{status="ACTIVE", type="USER"} > 200'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB too many active sessions (instance {{ $labels.instance }})
description: "Oracle Database on {{ $labels.instance }} has too many active user sessions (current value: {{ $value }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# The metric from v$waitclassmetric is already a normalized rate (centiseconds per second). Threshold 300 means 3 seconds of I/O wait per second of wall time.
- alert: OracleDbHighWaitTime(userI/o)
expr: 'oracledb_wait_time_user_io > 300'
for: 5m
labels:
severity: warning
annotations:
summary: Oracle DB high wait time (user I/O) (instance {{ $labels.instance }})
description: "Oracle Database on {{ $labels.instance }} is experiencing high user I/O wait time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -2,11 +2,13 @@ groups:
- name: EmbeddedExporterPatroni
rules:
# 1m delay allows a restart without triggering an alert.
- alert: PatroniHasNoLeader
expr: '(max by (scope) (patroni_master) < 1) and (max by (scope) (patroni_standby_leader) < 1)'
for: 0m
expr: '(max by (scope) (patroni_primary) < 1) and (max by (scope) (patroni_standby_leader) < 1)'
for: 1m
labels:
severity: critical
annotations:

@ -2,6 +2,7 @@ groups:
- name: SpreakerPgbouncerExporter
rules:
- alert: PgbouncerActiveConnections
@ -20,10 +21,10 @@ groups:
severity: warning
annotations:
summary: PGBouncer errors (instance {{ $labels.instance }})
description: "PGBouncer is logging errors. This may be due to a a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PgbouncerMaxConnections
expr: 'increase(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[30s]) > 0'
expr: 'increase(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[2m]) > 0'
for: 0m
labels:
severity: critical

@ -2,13 +2,14 @@ groups:
- name: BakinsFpmExporter
rules:
- alert: Php-fpmMax-childrenReached
expr: 'sum(phpfpm_max_children_reached_total) by (instance) > 0'
expr: 'sum(increase(phpfpm_max_children_reached_total[5m])) by (instance) > 3'
for: 0m
labels:
severity: warning
annotations:
summary: PHP-FPM max-children reached (instance {{ $labels.instance }})
description: "PHP-FPM reached max children - {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "PHP-FPM reached max children on {{ $labels.instance }} ({{ $value }} times in the last 5m)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
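
The PHP-FPM hunk above replaces a comparison on the raw counter with `increase(...[5m]) > 3`. The raw counter stays above zero forever once the limit has been hit a single time, while the windowed increase only reflects recent events. The Python sketch below illustrates the difference under simplifying assumptions: `increase` here is a hypothetical helper that sums deltas between consecutive samples and treats a drop as a counter reset; the extrapolation that Prometheus applies at window boundaries is left out.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Growth of a counter across the samples in one window.

    A sample lower than its predecessor is treated as a counter reset,
    so the new value is counted as growth from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def old_rule(samples: Sequence[float]) -> bool:
    """Rough equivalent of the removed expression: counter > 0."""
    return samples[-1] > 0


def new_rule(samples: Sequence[float]) -> bool:
    """Rough equivalent of the added expression: increase over the window > 3."""
    return increase(samples) > 3


if __name__ == "__main__":
    idle = [100, 100, 100]   # limit was hit long ago, nothing recent
    busy = [100, 102, 105]   # five new hits inside the window
    print(old_rule(idle), new_rule(idle))
    print(old_rule(busy), new_rule(busy))
```

The idle series satisfies the old condition but not the new one, which is the kind of stale alert the change appears intended to avoid.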

@ -2,11 +2,13 @@ groups:
- name: PostgresExporter
rules:
# 1m delay allows a restart without triggering an alert.
- alert: PostgresqlDown
expr: 'pg_up == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -32,7 +34,7 @@ groups:
description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlTableNotAutoVacuumed
expr: '(pg_stat_user_tables_last_autovacuum > 0) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10'
expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10'
for: 0m
labels:
severity: warning
@ -41,7 +43,7 @@ groups:
description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlTableNotAutoAnalyzed
expr: '(pg_stat_user_tables_last_autoanalyze > 0) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10'
expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10'
for: 0m
labels:
severity: warning
@ -62,22 +64,22 @@ groups:
expr: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5'
for: 2m
labels:
severity: warning
severity: critical
annotations:
summary: Postgresql not enough connections (instance {{ $labels.instance }})
description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlDeadLocks
expr: 'increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5'
expr: 'increase(pg_stat_database_deadlocks{datname!~"template.*|postgres",datid!="0"}[1m]) > 5'
for: 0m
labels:
severity: warning
annotations:
summary: Postgresql dead locks (instance {{ $labels.instance }})
description: "PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "PostgreSQL has dead-locks ({{ $value }} in the last minute)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlHighRollbackRate
expr: 'sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m])))) > 0.02'
expr: 'sum by (namespace,datname,instance) (rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / (sum by (namespace,datname,instance) (rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + sum by (namespace,datname,instance) (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m]))) > 0.02 and (sum by (namespace,datname,instance) (rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + sum by (namespace,datname,instance) (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m]))) > 0'
for: 0m
labels:
severity: warning
@ -86,7 +88,7 @@ groups:
description: "Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlCommitRateLow
expr: 'rate(pg_stat_database_xact_commit[1m]) < 10'
expr: 'increase(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[5m]) < 5'
for: 2m
labels:
severity: critical
@ -94,6 +96,7 @@ groups:
summary: Postgresql commit rate low (instance {{ $labels.instance }})
description: "Postgresql seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# pg_txid_current is not a default postgres_exporter metric. You need to define a custom query. See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlLowXidConsumption
expr: 'rate(pg_txid_current[1m]) < 5'
for: 2m
@ -103,26 +106,8 @@ groups:
summary: Postgresql low XID consumption (instance {{ $labels.instance }})
description: "Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlHighRateStatementTimeout
expr: 'rate(postgresql_errors_total{type="statement_timeout"}[1m]) > 3'
for: 0m
labels:
severity: critical
annotations:
summary: Postgresql high rate statement timeout (instance {{ $labels.instance }})
description: "Postgres transactions showing high rate of statement timeouts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlHighRateDeadlock
expr: 'increase(postgresql_errors_total{type="deadlock_detected"}[1m]) > 1'
for: 0m
labels:
severity: critical
annotations:
summary: Postgresql high rate deadlock (instance {{ $labels.instance }})
description: "Postgres detected deadlocks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlUnusedReplicationSlot
expr: 'pg_replication_slots_active == 0'
expr: '(pg_replication_slots_active == 0) and (pg_replication_is_replica == 0)'
for: 1m
labels:
severity: warning
@ -131,7 +116,7 @@ groups:
description: "Unused Replication Slots\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlTooManyDeadTuples
expr: '((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1'
expr: '((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1 and (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > 0'
for: 2m
labels:
severity: warning
@ -140,7 +125,7 @@ groups:
description: "PostgreSQL dead tuples is too large\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlConfigurationChanged
expr: '{__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m'
expr: '{__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} != ON(__name__, instance) {__name__=~"pg_settings_.*",__name__!="pg_settings_transaction_read_only"} OFFSET 5m'
for: 0m
labels:
severity: info
@ -148,17 +133,18 @@ groups:
summary: Postgresql configuration changed (instance {{ $labels.instance }})
description: "Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# pg_stat_ssl_compression is not a default postgres_exporter metric and is only available on PostgreSQL 9.5-13 (removed in PG 14). See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlSslCompressionActive
expr: 'sum(pg_stat_ssl_compression) > 0'
expr: 'sum by (instance) (pg_stat_ssl_compression) > 0'
for: 0m
labels:
severity: critical
severity: warning
annotations:
summary: Postgresql SSL compression active (instance {{ $labels.instance }})
description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Database allows connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlTooManyLocksAcquired
expr: '((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20'
expr: '((sum by (instance) (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20 and (pg_settings_max_locks_per_transaction * pg_settings_max_connections) > 0'
for: 2m
labels:
severity: critical
@ -166,6 +152,7 @@ groups:
summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlBloatIndexHigh(>80%)
expr: 'pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000)'
for: 1h
@ -175,6 +162,7 @@ groups:
summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }})
description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlBloatTableHigh(>80%)
expr: 'pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000)'
for: 1h
@ -184,11 +172,21 @@ groups:
summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }})
description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlInvalidIndex
expr: 'pg_genaral_index_info_pg_relation_size{indexrelname=~".*ccnew.*"}'
expr: 'pg_general_index_info_pg_relation_size{indexrelname=~".*ccnew.*"}'
for: 6h
labels:
severity: warning
annotations:
summary: Postgresql invalid index (instance {{ $labels.instance }})
description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PostgresqlReplicationLag
expr: 'pg_replication_lag_seconds > 5'
for: 30s
labels:
severity: warning
annotations:
summary: Postgresql replication lag (instance {{ $labels.instance }})
description: "The PostgreSQL replication lag is high (> 5s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

@ -0,0 +1,102 @@
groups:
- name: ProcessExporter
rules:
- alert: ProcessExporterGroupDown
expr: 'namedprocess_namegroup_num_procs == 0'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter group down (instance {{ $labels.instance }})
description: "No processes found for group {{ $labels.groupname }}. The service may have stopped. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 4GB is arbitrary and depends on the process being monitored. Adjust per group.
- alert: ProcessExporterHighMemoryUsage
expr: 'namedprocess_namegroup_memory_bytes{memtype="resident"} > 4e+09'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high memory usage (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of resident memory. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Value is core-equivalent %: 100% = 1 full core, 200% = 2 cores, etc. Threshold of 80% is per-core. Adjust based on expected workload.
- alert: ProcessExporterHighCpuUsage
expr: 'rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high CPU usage (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} is using {{ $value }}% CPU (core-equivalent). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ProcessExporterHighFileDescriptorUsage
expr: 'namedprocess_namegroup_worst_fd_ratio > 0.8'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high file descriptor usage (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} is using more than 80% of its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ProcessExporterFileDescriptorsExhausted
expr: 'namedprocess_namegroup_worst_fd_ratio > 0.95'
for: 2m
labels:
severity: critical
annotations:
summary: Process exporter file descriptors exhausted (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} has nearly exhausted its file descriptor limit. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 512MB is arbitrary. Adjust per group and environment.
- alert: ProcessExporterHighSwapUsage
expr: 'namedprocess_namegroup_memory_bytes{memtype="swapped"} > 512e+06'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high swap usage (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} is using {{ $value | humanize }}B of swap. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: ProcessExporterZombieProcesses
expr: 'namedprocess_namegroup_states{state="Zombie"} > 5'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter zombie processes (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} has {{ $value }} zombie processes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Filters to voluntary switches only — involuntary switches are normal under CPU contention. Threshold of 50000/s is a rough default. Adjust based on workload.
- alert: ProcessExporterHighContextSwitching
expr: 'rate(namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary"}[5m]) > 50000'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high context switching (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} has a high rate of context switches ({{ $value }}/s). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Threshold of 100MB/s is arbitrary. Adjust per group.
- alert: ProcessExporterHighDiskWriteIo
expr: 'rate(namedprocess_namegroup_write_bytes_total[5m]) > 100e+06'
for: 5m
labels:
severity: warning
annotations:
summary: Process exporter high disk write IO (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} is performing {{ $value | humanize }}B/s of disk writes. (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Detects restarts by watching for changes in the oldest process start time within the group.
- alert: ProcessExporterProcessRestarting
expr: 'changes(namedprocess_namegroup_oldest_start_time_seconds[5m]) > 0 and namedprocess_namegroup_num_procs > 0'
for: 0m
labels:
severity: info
annotations:
summary: Process exporter process restarting (instance {{ $labels.instance }})
description: "Process group {{ $labels.groupname }} has restarted (oldest process start time changed). (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
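
As an illustration of the "core-equivalent %" comment on ProcessExporterHighCpuUsage above, here is a minimal sketch of a promtool unit test for the expression. It is not part of this change. The file layout, the `groupname` and `mode` label values, and the synthetic counter are assumptions chosen for the example: a counter that grows by 60 CPU-seconds per minute corresponds to one fully used core, so the expression should evaluate to 100.

```yaml
# Hypothetical test file, e.g. run with: promtool test rules cpu_test.yml
# No rule_files entry is listed because only the raw expression is evaluated here.
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic counter: +60 CPU-seconds every minute, i.e. one full core.
      # Label values are illustrative only.
      - series: 'namedprocess_namegroup_cpu_seconds_total{groupname="nginx", mode="user"}'
        values: '0+60x10'
    promql_expr_test:
      # Same expression as the ProcessExporterHighCpuUsage rule.
      - expr: 'rate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 > 80'
        eval_time: 10m
        exp_samples:
          # rate() drops the metric name, so only the labels remain.
          - labels: '{groupname="nginx", mode="user"}'
            value: 100
```

A group spread across two busy cores would yield 200 here, which is why the 80 threshold is best read as "80% of one core" and adjusted per group.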

@ -2,6 +2,7 @@ groups:
- name: EmbeddedExporter
rules:
- alert: PrometheusJobMissing
@ -13,9 +14,11 @@ groups:
summary: Prometheus job missing (instance {{ $labels.instance }})
description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
# Only fire if at least one target in the job is still up.
# If all targets are down, PrometheusJobMissing or PrometheusAllTargetsMissing will fire instead.
- alert: PrometheusTargetMissing
expr: 'up == 0'
for: 0m
expr: 'up == 0 unless on(job) (sum by (job) (up) == 0)'
for: 1m
labels:
severity: critical
annotations:
@ -24,7 +27,7 @@ groups:
- alert: PrometheusAllTargetsMissing
expr: 'sum by (job) (up) == 0'
for: 0m
for: 1m
labels:
severity: critical
annotations:
@ -32,8 +35,8 @@ groups:
description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusTargetMissingWithWarmupTime
expr: 'sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))'
for: 0m
expr: 'sum by (instance, job) ((up == 0) * on (instance) group_left(__name__) (node_time_seconds - node_boot_time_seconds > 600))'
for: 1m
labels:
severity: critical
annotations:
@ -140,13 +143,13 @@ groups:
description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusAlertmanagerNotificationFailing
expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
expr: 'rate(alertmanager_notifications_failed_total[3m]) > 0.05'
for: 0m
labels:
severity: critical
annotations:
summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Alertmanager is failing sending notifications ({{ $value }} notifications/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusTargetEmpty
expr: 'prometheus_sd_discovered_targets == 0'
@ -173,16 +176,16 @@ groups:
severity: warning
annotations:
summary: Prometheus large scrape (instance {{ $labels.instance }})
description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Prometheus has many scrapes that exceed the sample limit ({{ $value }} scrapes)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusTargetScrapeDuplicate
expr: 'increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0'
expr: 'increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 3'
for: 0m
labels:
severity: warning
annotations:
summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
description: "Prometheus has many samples rejected due to duplicate timestamps but different values ({{ $value }} samples)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: PrometheusTsdbCheckpointCreationFailures
expr: 'increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0'

Some files were not shown because too many files have changed in this diff.