Commit graph

935 commits

Author SHA1 Message Date
Samuel Berthe
aa7d93ce95
chore: migrate assets/ to site/public/images/ (#549)
Remove legacy assets/ directory (pre-Astro era). Images were already
duplicated under site/public/images/; update README sponsor URLs to
point to the new location.
2026-04-10 21:28:38 +02:00
Samuel Berthe
a4d0b1370c
ci: add site build workflow (#548) 2026-04-10 21:21:04 +02:00
dependabot[bot]
d31b3f9ba0
build(deps): bump @iconify-json/lucide from 1.2.101 to 1.2.102 in /site (#545)
Bumps [@iconify-json/lucide](https://github.com/iconify/icon-sets) from 1.2.101 to 1.2.102.
- [Commits](https://github.com/iconify/icon-sets/commits)

---
updated-dependencies:
- dependency-name: "@iconify-json/lucide"
  dependency-version: 1.2.102
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:19:49 +02:00
Samuel Berthe
89d8423d93
build(deps): migrate from @astrojs/tailwind to @tailwindcss/vite for Tailwind v4 (#547)
@astrojs/tailwind v6 still requires tailwindcss@^3; replace it with the
official @tailwindcss/vite Vite plugin. Update global.css to v4 syntax
(@import "tailwindcss", @custom-variant dark, @theme tokens) and drop
the now-unused tailwind.config.mjs.
2026-04-10 21:18:13 +02:00
dependabot[bot]
814dd5d3fb
build(deps): bump @astrojs/tailwind from 5.1.5 to 6.0.2 in /site (#543)
Bumps [@astrojs/tailwind](https://github.com/withastro/astro/tree/HEAD/packages/integrations/tailwind) from 5.1.5 to 6.0.2.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/integrations/tailwind/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/@astrojs/tailwind@6.0.2/packages/integrations/tailwind)

---
updated-dependencies:
- dependency-name: "@astrojs/tailwind"
  dependency-version: 6.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:11:05 +02:00
dependabot[bot]
e6ea45aec1
build(deps): bump tailwindcss from 3.4.19 to 4.2.2 in /site (#544)
Bumps [tailwindcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/tailwindcss) from 3.4.19 to 4.2.2.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.2/packages/tailwindcss)

---
updated-dependencies:
- dependency-name: tailwindcss
  dependency-version: 4.2.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:56 +02:00
dependabot[bot]
bea2dc45b4
build(deps): bump actions/upload-pages-artifact from 3 to 4 (#540)
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](https://github.com/actions/upload-pages-artifact/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:38 +02:00
dependabot[bot]
dd0c8372f9
build(deps): bump actions/setup-node from 4 to 6 (#541)
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 4 to 6.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:31 +02:00
dependabot[bot]
132329abd8
build(deps): bump actions/deploy-pages from 4 to 5 (#542)
Bumps [actions/deploy-pages](https://github.com/actions/deploy-pages) from 4 to 5.
- [Release notes](https://github.com/actions/deploy-pages/releases)
- [Commits](https://github.com/actions/deploy-pages/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/deploy-pages
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:24 +02:00
dependabot[bot]
9e80bb910e
build(deps): bump actions/checkout from 4 to 6 (#539)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:09 +02:00
Samuel Berthe
79afa21610
feat/astro migration (#538)
* feat: migrate website from Jekyll to Astro

Rebuilds the site using Astro (SSG) with Tailwind CSS v4, replacing the
Jekyll/Cayman theme. Key changes:

- Splits the monolithic /rules page into 110 statically-generated pages
  (92 per-service + 13 group index + homepage + guide pages) for SEO
- URL structure: /rules/[group-slug]/[service-slug]/ with backward-
  compatibility redirect map for old anchor-based URLs (/rules#redis)
- Modern UI: Prometheus-orange accent, dark mode (system + toggle),
  sticky sidebar, responsive layout, copy-to-clipboard per rule/section
- SEO: per-page <title>, <meta description>, Open Graph, Twitter Card,
  canonical URLs, sitemap.xml via @astrojs/sitemap
- GEO: FAQPage JSON-LD schema on each service page (rules as Q&A pairs
  for AI search engines), TechArticle schema, BreadcrumbList
- Search: Pagefind (build-time index, lazy-loaded, ~200KB)
- Zero JS by default; copy buttons and theme toggle use inline scripts
- New CI: .github/workflows/deploy.yml builds Astro + Pagefind and
  deploys to GitHub Pages via actions/deploy-pages
- Existing dist.yml and test.yml workflows are untouched
- _data/rules.yml remains the single source of truth

Note: GitHub Pages source must be changed from "Build from branch"
(Jekyll) to "GitHub Actions" in repository settings.

* doc: new website based on astro

* refactor: remove previous website

* chore: add npm dependabot for Astro site + scope CI to _data changes

* Update site/astro.config.mjs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update site/src/components/CopyButton.astro

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* oops

* fix: strip trailing slash from BASE_URL to prevent double slashes in URLs

Agent-Logs-Url: https://github.com/samber/awesome-prometheus-alerts/sessions/c85937ba-1855-4b8a-a72b-847eab1c8639

Co-authored-by: samber <2951285+samber@users.noreply.github.com>

* fix: resolve Astro build errors in astro.config.mjs

- Remove assetsInclude yml which caused Vite to treat YAML files as static assets instead of running them through the custom YAML transform plugin; data.groups was undefined at runtime because the import resolved to a URL rather than parsed content
- Deduplicate old-path redirects: emit only the slash-less variant per service to avoid Astro router collision warnings (trailing-slash variant is handled automatically)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: samber <2951285+samber@users.noreply.github.com>
2026-04-10 21:08:06 +02:00
samber
0d148832d3 Publish 2026-04-06 19:14:46 +00:00
Samuel Berthe
c2615fae52
fix/promql rules review 2 (#534)
* fix(data): fix queries and thresholds across multiple exporters

- Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace
  ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add for: durations
- ZFS: improve description, remove incorrect ON() join on readonly check
- Thanos: filter gRPC errors to actual error codes only (drop NotFound, Cancelled, etc.)
- Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job, route, le)
- Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations
- OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments
- Tempo/Cortex: raise >0 thresholds to avoid transient spikes
- APC UPS: add division-by-zero guard on battery voltage ratio
- DigitalOcean: raise increase()>0 to >3
- Grafana Alloy: fix missing name: field on exporter
- Graph Node: add threshold comments

* fix(data): remove official mixin reference from Ceph OSD comment

* fix(data): remove official mixin references from comments
2026-04-06 21:14:15 +02:00
Samuel Berthe
72c9e922c0
docs: update CLAUDE.md with lessons from PromQL review
Add guidance on untyped metrics with counter semantics (node_vmstat_*,
MySQL SHOW GLOBAL STATUS), YAML duplicate key pitfall, permanent-firing
cumulative counter alerts, and updated category structure.
2026-04-06 21:08:48 +02:00
samber
ed1515015a Publish 2026-04-06 18:38:45 +00:00
Samuel Berthe
2258835c30
fix/promql rules review (#533)
* fix(data): comprehensive PromQL review across all ~937 rules

Query fixes:
- Replace rate()/increase() with deriv()/delta() on gauge metrics exposed
  as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_*,
  systemd_socket_refused_connections_total)
- Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds
  → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module)
- Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn)
  → gnatsd_varz_subscriptions (server total)
- Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0
- Fix RabbitMQ total connections metric: connectionsTotal → connections
- Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct
  gauge comparison (deriv > 0 misses stable non-zero failure states)
- Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count
  → certmanager_acme_client_request_count (renamed in v1.19+)
- Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes
- Fix Flink duplicate comments: field (YAML last-write-wins bug)
- Add datid!="0" filter to PostgreSQL dead locks query
- Fix PostgreSQL high rollback rate: restructure division-by-zero guard
  and move ratio calculation outside sum()
- Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager
  memory, Hadoop HBase heap, Vault cluster health
- Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/
  OSD Down/PG unavailable

Threshold fixes:
- Replace > 0 with meaningful thresholds on rate()/increase() queries
  across: Alertmanager, eBPF decoder errors, systemd refused connections,
  Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed
  inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ
  unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager,
  OTel Collector receiver refused metrics
- Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s
  (0.5ms was below normal operating range; 10ms is more realistic)

Description fixes:
- Fix MySQL slow queries: remove duplicate "mysql" word
- Fix SMART device description: remove trailing stray ")" (6 rules)
- Fix host disk IO description: remove duplicate "Check storage for issues."
- Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute"
- Fix EDAC uncorrectable errors: remove time-window claim (raw counter)
- Fix Mimir store-gateway sync description: said "10 minutes" but
  threshold is 1800s (30 minutes)
- Fix Vault description false "%" suffix on count values
- Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy,
  Istio rules to include {{ $labels }} and {{ $value }} template vars
- Downgrade Cassandra key cache hit rate: critical → warning

Comments:
- Add note on node_vmstat_oom_kill gauge type (delta vs increase)
- Add note on systemd_socket_refused_connections_total gauge type
- Add note on mysql_global_status_* gauge type (delta/deriv vs rate)
- Add note on pg_txid_current requiring a custom postgres_exporter query
- Add note on pg_stat_ssl_compression availability (PG 9.5-13 only)
- Add note on cert-manager legacy metric name for users on v1.18 and older
- Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules
- Add note on NATS leaf node spurious fires when leaf nodes not configured

* fix(data): PromQL type fixes, job filter cleanup, query correctness review

- Replace rate()/increase() with deriv()/delta() on gauge metrics:
  node_vmstat_pgmajfault, cassandra_stats (criteo exporter),
  gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn
- Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay
- Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause
- Fix Thanos query latency: use _count instead of _bucket for guard clause
- Restore job filter in Thanos objstore guard clauses (compact + store)
- Remove redundant job= filters from unique metrics: ~30 Thanos rules,
  kube_persistentvolume_status_phase, otelcol_process_runtime_*
- Fix high-cardinality Istio latency grouping (drop source labels from by())
- Add division-by-zero guard to host context switch ratio
- Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10
- Remove redundant for: 1m from HAProxy check failure rules
- Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel)
- Remove external mixin references from comments
- Fix Tempo dropped spans metric name: add missing _total suffix
- Fix Thanos bucket replicate run latency: add missing le label in by()
2026-04-06 20:38:12 +02:00
Samuel Berthe
b8fd051a55
Update README.md 2026-03-31 16:41:19 +02:00
samber
87d0610246 Publish 2026-03-31 14:40:08 +00:00
Emil Bostijancic
7ba6b2d367
feat: add OpenSearch alerting rules (OpenSearch exporter plugin) (#532) 2026-03-31 16:39:38 +02:00
dependabot[bot]
b13d59bce6
build(deps-dev): bump activesupport from 7.2.3 to 7.2.3.1 (#531)
Bumps [activesupport](https://github.com/rails/rails) from 7.2.3 to 7.2.3.1.
- [Release notes](https://github.com/rails/rails/releases)
- [Changelog](https://github.com/rails/rails/blob/v8.1.2.1/activesupport/CHANGELOG.md)
- [Commits](https://github.com/rails/rails/compare/v7.2.3...v7.2.3.1)

---
updated-dependencies:
- dependency-name: activesupport
  dependency-version: 7.2.3.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:24:41 +01:00
dependabot[bot]
9d9c648cdd
build(deps-dev): bump json from 2.18.1 to 2.19.2 (#530)
Bumps [json](https://github.com/ruby/json) from 2.18.1 to 2.19.2.
- [Release notes](https://github.com/ruby/json/releases)
- [Changelog](https://github.com/ruby/json/blob/master/CHANGES.md)
- [Commits](https://github.com/ruby/json/compare/v2.18.1...v2.19.2)

---
updated-dependencies:
- dependency-name: json
  dependency-version: 2.19.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-20 18:22:18 +01:00
samber
af2f277830 Publish 2026-03-18 20:41:01 +00:00
Samuel Berthe
e3a7165a65
fix(data): remove malformed summary fields, replace increase() by rate(), remove redundant avg_over_time 2026-03-18 21:40:30 +01:00
samber
c0e1f7a5f5 Publish 2026-03-18 17:06:34 +00:00
Samuel Berthe
1aafa40913
fix(data): prevent division by 0 2026-03-18 18:06:00 +01:00
samber
4fb1aa9ae4 Publish 2026-03-18 11:23:25 +00:00
Samuel Berthe
a4581ed322
fix(data): fix tresholds, comments, intervals, units... (#529) 2026-03-18 12:22:55 +01:00
samber
f36c23e393 Publish 2026-03-17 12:30:42 +00:00
Samuel Berthe
03963ef6f9
refactor(categories): change categories and move some exporters (#528) 2026-03-17 13:30:13 +01:00
Samuel Berthe
06f8b048a3
fix ci 2026-03-16 19:17:05 +01:00
Samuel Berthe
5d099fcae1
fix ci 2026-03-16 17:44:00 +01:00
samber
9d00396bc8 Publish 2026-03-16 16:11:31 +00:00
Samuel Berthe
2b99cf1f76
Feat/cilium alerting rules (#526)
* Add .worktrees/ to .gitignore

* feat: add Cilium alerting rules (32 rules across agent, operator, ClusterMesh, KVStoreMesh, Hubble)

* fix: use job label instead of k8s_app, switch to single-quoted YAML strings

* remove Cilium agent high restart rate alert
2026-03-16 17:10:59 +01:00
samber
e8eb75c2e2 Publish 2026-03-16 15:53:03 +00:00
Samuel Berthe
5071e01ad9
Feature/spinnaker alerts (#527)
* Add .worktrees/ to .gitignore

* feat: add Spinnaker alerting rules (12 rules)

Add Prometheus alerting rules for Spinnaker built-in exporter
covering Orca queue health, circuit breakers, Igor polling monitors,
Gate API throttling, Clouddriver errors, and AWS rate limiting.

Metric names validated against uneeq-oss/spinnaker-mixin dashboards.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 16:52:31 +01:00
samber
6423f93ba7 Publish 2026-03-16 15:40:08 +00:00
Samuel Berthe
1455e0fd77
feat: add Oracle Database alerting rules (8 rules) (#525)
Add Prometheus alerting rules for Oracle Database using iamseth/oracledb_exporter.
Rules based on Grafana oracledb-mixin and exporter default metrics:
- DB down, session/process limit, tablespace capacity (warning+critical),
  high rollbacks, active sessions, user I/O wait time.
2026-03-16 16:39:35 +01:00
samber
ba5c9a3280 Publish 2026-03-16 14:01:45 +00:00
Samuel Berthe
d8315eb3bc
Feature/cert manager rules (#524)
* Add .worktrees/ to .gitignore

* feat: add cert-manager alerting rules (4 rules)

Add Prometheus alerting rules for cert-manager under the
"Network, security and storage" category:
- Cert-Manager absent (service down detection)
- Certificate expiring soon (21-day threshold)
- Certificate not ready (readiness check)
- Hitting ACME rate limits (rate limit detection)

Based on imusmanmalik/cert-manager-mixin and official
cert-manager metrics documentation.

* docs: add cert-manager to README
2026-03-16 15:01:07 +01:00
samber
7f346ede99 Publish 2026-03-16 13:37:19 +00:00
Samuel Berthe
b58b498bbb
feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules) (#523)
* feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules)

Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins.
Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more.

* fix: address PR review comments on Tempo/Mimir rules

- Fix Tempo no tenant index builders: add on() for cross-label-set and
- Fix Tempo block list rising: output percentage instead of ratio
- Fix Mimir memory map areas: multiply by 100 to match % description
- Fix all instance limit rules: multiply by 100 to match % descriptions
- Fix distributor inflight requests: add % to description
2026-03-16 14:36:50 +01:00
samber
ff17e9c69b Publish 2026-03-16 13:20:46 +00:00
Samuel Berthe
7ee16641ac
feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter) (#520)
* feat: add WireGuard alerting rules (3 rules, MindFlavor/prometheus_wireguard_exporter)

* fix: grammar in WireGuard rule comment
2026-03-16 14:20:17 +01:00
samber
4da60669d0 Publish 2026-03-16 13:09:31 +00:00
Samuel Berthe
f974552ef1
Feat/jaeger alerting rules (#521)
* Add .worktrees/ to .gitignore

* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)

Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin

* fix: rename Jaeger agent RPC alert to Jaeger client RPC

The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
2026-03-16 14:09:03 +01:00
samber
eeba1ebbaa Publish 2026-03-16 13:07:45 +00:00
Samuel Berthe
8b443be6d2
feat: add systemd_exporter alerting rules (7 rules) (#522)
* feat: add systemd_exporter alerting rules (7 rules)

Add new Systemd service under Basic resource monitoring with rules for:
- Unit failed/inactive state detection
- Service crash loop detection
- Task limit exhaustion
- Socket refused/high connections
- Timer missed trigger

* fix: narrow systemd unit inactive query to reduce noise

Add type="service" and name filter to the inactive unit alert
to avoid false positives from legitimately inactive units.
2026-03-16 14:07:14 +01:00
Samuel Berthe
30bbedbc79
feat: add Cloud providers alerting rules (33 rules across 4 exporters) (#519)
* feat: add Cloud providers alerting rules (33 rules across 4 exporters)

New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance

* fix: address PR review - move Cloud providers before Other, fix service name

- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
  awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
2026-03-16 14:06:59 +01:00
samber
577c36d9ae Publish 2026-03-16 03:50:27 +00:00
Samuel Berthe
fd3bfb02c0
Some fix (#516)
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)

Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.

* feat: add alerting rules for prometheus/memcached_exporter

* fix: add division-by-zero guards and improve quoting in memcached rules (#512)

- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability

* fix: fix SNMP interface down query and add job scoping (#507)

- Fix ifOperStatus query to use vector matching instead of label filter
  since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
  and interface down rules to prevent matching non-SNMP series

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-16 04:50:01 +01:00