Commit graph

952 commits

Author SHA1 Message Date
Samuel Berthe
6b2a5af9f9
oops 2026-04-17 20:01:55 +02:00
Samuel Berthe
b58c180dcb
improve seo 2026-04-17 19:59:24 +02:00
Samuel Berthe
6070e81097
improve seo 2026-04-17 19:53:36 +02:00
Samuel Berthe
4481bb3276
oops 2026-04-17 13:40:37 +02:00
Samuel Berthe
b4324742be
feat: replace Tinybird tracking with PostHog
- Remove Tinybird fetch pipeline from pipe.ts, keep only session/lifetime copy counters
- Wire session_copy_count and lifetime_copy_count into posthog.capture calls
- Remove Tinybird calls from sponsor click tracking, use posthog only
- Hardcode PostHog project ID and reverse proxy host (hogpost.samber.dev)
2026-04-17 12:07:50 +02:00
Samuel Berthe
5a5976c9a3
feat: track sponsor clicks with blocking event before navigation
- Add recordCopy() for copy events (bumps session/lifetime counters)
- Add recordAndWait() for blocking events (1500ms timeout, errors swallowed)
- Extract shared sponsor click handler into site/src/scripts/sponsor.ts
- Plain left-click blocks navigation until HTTP response; modifier/middle
  clicks track fire-and-forget and let the browser navigate natively
- Distinguish header vs footer placement via data-sponsor-slot attribute
2026-04-15 16:28:25 +02:00
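A minimal sketch of the blocking-event idea described in this commit, assuming a `send` callback that performs the HTTP request. The names `recordAndWait` and the 1500ms default come from the message above; `send`, `isPlainLeftClick`, and the exact signatures are hypothetical, not the repository's actual code.

```typescript
// Resolves when `send` settles or after `timeoutMs`, whichever comes first.
// Errors from `send` are swallowed so tracking can never break navigation.
async function recordAndWait(
  send: () => Promise<unknown>, // hypothetical: performs the tracking request
  timeoutMs = 1500,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, timeoutMs);
  });
  try {
    await Promise.race([
      // Promise.resolve().then(send) also captures synchronous throws from `send`.
      Promise.resolve().then(send).then(() => undefined, () => undefined),
      timeout,
    ]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Only a plain left-click should block navigation; modifier and middle
// clicks are tracked fire-and-forget and handled natively by the browser.
function isPlainLeftClick(e: {
  button: number;
  metaKey: boolean;
  ctrlKey: boolean;
  shiftKey: boolean;
  altKey: boolean;
}): boolean {
  return e.button === 0 && !e.metaKey && !e.ctrlKey && !e.shiftKey && !e.altKey;
}
```

In a click handler, one plausible use is to call `preventDefault()` only when `isPlainLeftClick(event)` is true, await `recordAndWait(...)`, then navigate manually.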
Samuel Berthe
1c5f626046
feat: add first-party copy event pipe to Tinybird
Sends rule_copy, wget_copy events on clipboard interactions,
bypassing ad blockers. Tracks user_id (localStorage apa_uid),
session_id (sessionStorage apa_sid), session/lifetime copy counts,
full rule coordinates (group/service/exporter/rule slugs + indices),
page context, and browser environment. Event name is the Tinybird
data source name, scoped to "rule" or "exporter" per copy type.
2026-04-15 11:34:12 +02:00
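The identifier and counter handling described above could look roughly like this. The storage keys `apa_uid` and `apa_sid` are taken from the commit message; the counter keys, the `KeyValueStore` interface, and the `newId` generator are assumptions made for illustration.

```typescript
// Structural subset of the browser Storage interface, so localStorage and
// sessionStorage can be passed in directly and tests can use a stub.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Return the stored identifier, creating and persisting one if absent.
function getOrCreateId(store: KeyValueStore, key: string, newId: () => string): string {
  const existing = store.getItem(key);
  if (existing) return existing;
  const id = newId();
  store.setItem(key, id);
  return id;
}

// Increment a numeric counter kept as a string; unreadable values restart at 0.
function bumpCounter(store: KeyValueStore, key: string): number {
  const next = (Number.parseInt(store.getItem(key) ?? "0", 10) || 0) + 1;
  store.setItem(key, String(next));
  return next;
}

// Assemble the identity part of a copy event.
function buildCopyEvent(local: KeyValueStore, session: KeyValueStore, newId: () => string) {
  return {
    user_id: getOrCreateId(local, "apa_uid", newId),
    session_id: getOrCreateId(session, "apa_sid", newId),
    session_copy_count: bumpCounter(session, "apa_session_copies"), // hypothetical key
    lifetime_copy_count: bumpCounter(local, "apa_lifetime_copies"), // hypothetical key
  };
}
```

In a browser, `newId` could be `() => crypto.randomUUID()`; it is injected here so the sketch does not depend on a particular runtime.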
Samuel Berthe
bb055773b4
feat: add GitHub star nudges across the site
- Prepend attribution comment to "Copy all" exporter clipboard
- Show inline Star nudge on individual rule copy (3s, dismisses automatically)
- Change StatsBar stars label to "engineers starred" for social proof
- Add milestone progress bar toward 10k stars in StatsBar
- Fix header/StatsBar showing "0" when SSR GitHub API fetch fails (use "—" placeholder)
2026-04-14 21:52:27 +02:00
Samuel Berthe
d38511d7cb
chore: generate pagefind index at build time, not committed to git
- Add pagefind run step to build script in site/package.json
- Add site/public/pagefind/ to .gitignore (generated at deploy time)
2026-04-14 20:33:29 +02:00
Samuel Berthe
a56d8cf2a4
feat: refine star toast — brand orange, idle trigger, 15s auto-hide
- Style: brand orange background with white text (visible on any bg)
- Trigger: every 5 copies OR after 10 minutes of inactivity on page
- Auto-hide: 15s (reset if toast re-triggers before expiry)
- Idle timer resets on each copy
2026-04-14 20:30:08 +02:00
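The trigger rules listed in this commit can be sketched as a small state machine with an injected clock. The numbers (5 copies, 10 minutes, 15s) come from the message; the class name, method names, and the choice to restart the idle timer after an idle trigger are assumptions.

```typescript
const COPIES_PER_NUDGE = 5;
const IDLE_MS = 10 * 60 * 1000; // 10 minutes of inactivity
const AUTO_HIDE_MS = 15 * 1000; // toast hides after 15s

// Hypothetical model of the toast logic; timestamps are passed in as milliseconds.
class StarNudgeState {
  private copies = 0;
  private idleSince: number;
  private hideAt: number | null = null;

  constructor(now: number) {
    this.idleSince = now;
  }

  // Call on each successful copy. Returns true when the toast should (re)appear.
  onCopy(now: number): boolean {
    this.copies += 1;
    this.idleSince = now; // the idle timer resets on each copy
    if (this.copies % COPIES_PER_NUDGE === 0) {
      this.show(now);
      return true;
    }
    return false;
  }

  // Call periodically. Returns true when the idle trigger fires.
  onTick(now: number): boolean {
    if (now - this.idleSince >= IDLE_MS) {
      this.idleSince = now; // assumption: wait another full idle period before re-firing
      this.show(now);
      return true;
    }
    return false;
  }

  isVisible(now: number): boolean {
    return this.hideAt !== null && now < this.hideAt;
  }

  // Re-triggering before expiry pushes the hide deadline out again.
  private show(now: number): void {
    this.hideAt = now + AUTO_HIDE_MS;
  }
}
```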
Samuel Berthe
25418c5db2
feat: add star nudge toast after every 5 rule copies
Show a dismissible toast (bottom-right, 20s auto-hide) nudging users
to star the GitHub repo. Fires every 5 copies via a sessionStorage
counter. CopyButton dispatches a copy-success custom event; StarToast
listens for it and manages display logic.
2026-04-14 20:09:30 +02:00
Samuel Berthe
5366d4b9ae
fix: replace invalid top-level return with isFresh flag in star scripts
Top-level return is a syntax error in ES modules. Replace the early
return pattern with an isFresh boolean guard. Also revert the hero
"Star on GitHub" button change.
2026-04-14 19:59:36 +02:00
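For illustration, the guard pattern this fix describes might look like the following. `readCachedAt` and `fetchStars` are hypothetical stand-ins for the real cache lookup and API call.

```typescript
// A bare `return` at module top level is rejected as a SyntaxError:
//   if (cacheIsFresh) return;
//   fetchStars();

const FRESH_FOR_MS = 60 * 60 * 1000;

// Hypothetical stand-in: timestamp of the cached value, or null if none.
function readCachedAt(): number | null {
  return null;
}

let fetchCalls = 0;
// Hypothetical stand-in for the client-side star count refresh.
function fetchStars(): void {
  fetchCalls += 1;
}

// Compute the condition once, then wrap the work in a block instead of returning early.
const cachedAt = readCachedAt();
const isFresh = cachedAt !== null && Date.now() - cachedAt < FRESH_FOR_MS;
if (!isFresh) {
  fetchStars();
}
```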
Samuel Berthe
1f8bcca779
feat: add GitHub stars to StatsBar and fix cache early-return
Add a 4th stat (GitHub stars) to StatsBar with build-time fallback
and live client-side fetch. Both Header and StatsBar share the same
sessionStorage cache key and skip the API call when the cache is fresh
(1h TTL), reducing fetches to at most one per session.
2026-04-14 19:51:12 +02:00
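A shared cache with a 1h TTL, as described here, could be sketched as below. The cache key, the stored JSON shape, and the `StringStore` interface are assumptions; the real components may store the value differently.

```typescript
// Structural subset of sessionStorage, so a stub can be used outside the browser.
interface StringStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const CACHE_KEY = "gh-stars"; // hypothetical key shared by Header and StatsBar
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

// Return the cached count if present, well-formed, and younger than the TTL.
function readFreshCount(store: StringStore, now: number): number | null {
  const raw = store.getItem(CACHE_KEY);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as { count?: unknown; at?: unknown };
    if (typeof parsed.count !== "number" || typeof parsed.at !== "number") return null;
    return now - parsed.at < CACHE_TTL_MS ? parsed.count : null;
  } catch {
    return null; // corrupted entry: treat as a cache miss
  }
}

// Store the count together with the time it was fetched.
function writeCount(store: StringStore, count: number, now: number): void {
  store.setItem(CACHE_KEY, JSON.stringify({ count, at: now }));
}
```

Each component would call `readFreshCount(sessionStorage, Date.now())` first and hit the GitHub API only on a `null` result, which is what limits fetches to at most one per fresh cache window.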
Samuel Berthe
954999dfa9
feat: replace GitHub icon with Star button and live star count
Replace the plain GitHub icon+count in the header with a proper two-zone
star button (★ Star | 8.4k). The count is seeded at build time from the
GitHub API and refreshed client-side on page load with a 1-hour
sessionStorage cache.
2026-04-14 19:47:49 +02:00
Samuel Berthe
297fd9864c
fix: use https in CC BY URL and trigger site build on _data changes 2026-04-14 16:27:01 +02:00
Samuel Berthe
5c166e8403
docs: update tagline and clean up README 2026-04-10 21:45:27 +02:00
Samuel Berthe
ab87fdcf30
feat/dual license (#550)
* ci: remove node version pin in site build workflow

* docs: clarify dual license (CC BY 4.0 for content, MIT for site code)

Alert rules and content (_data/rules.yml, dist/) are licensed under
Creative Commons CC BY 4.0. The site source code (site/) is licensed
under MIT. Both are now documented in LICENSE, site/LICENSE, the footer,
and the FAQ.
2026-04-10 21:36:57 +02:00
Samuel Berthe
aa7d93ce95
chore: migrate assets/ to site/public/images/ (#549)
Remove legacy assets/ directory (pre-Astro era). Images were already
duplicated under site/public/images/; update README sponsor URLs to
point to the new location.
2026-04-10 21:28:38 +02:00
Samuel Berthe
a4d0b1370c
ci: add site build workflow (#548) 2026-04-10 21:21:04 +02:00
dependabot[bot]
d31b3f9ba0
build(deps): bump @iconify-json/lucide from 1.2.101 to 1.2.102 in /site (#545)
Bumps [@iconify-json/lucide](https://github.com/iconify/icon-sets) from 1.2.101 to 1.2.102.
- [Commits](https://github.com/iconify/icon-sets/commits)

---
updated-dependencies:
- dependency-name: "@iconify-json/lucide"
  dependency-version: 1.2.102
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:19:49 +02:00
Samuel Berthe
89d8423d93
build(deps): migrate from @astrojs/tailwind to @tailwindcss/vite for Tailwind v4 (#547)
@astrojs/tailwind v6 still requires tailwindcss@^3; replace it with the
official @tailwindcss/vite Vite plugin. Update global.css to v4 syntax
(@import "tailwindcss", @custom-variant dark, @theme tokens) and drop
the now-unused tailwind.config.mjs.
2026-04-10 21:18:13 +02:00
dependabot[bot]
814dd5d3fb
build(deps): bump @astrojs/tailwind from 5.1.5 to 6.0.2 in /site (#543)
Bumps [@astrojs/tailwind](https://github.com/withastro/astro/tree/HEAD/packages/integrations/tailwind) from 5.1.5 to 6.0.2.
- [Release notes](https://github.com/withastro/astro/releases)
- [Changelog](https://github.com/withastro/astro/blob/main/packages/integrations/tailwind/CHANGELOG.md)
- [Commits](https://github.com/withastro/astro/commits/@astrojs/tailwind@6.0.2/packages/integrations/tailwind)

---
updated-dependencies:
- dependency-name: "@astrojs/tailwind"
  dependency-version: 6.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:11:05 +02:00
dependabot[bot]
e6ea45aec1
build(deps): bump tailwindcss from 3.4.19 to 4.2.2 in /site (#544)
Bumps [tailwindcss](https://github.com/tailwindlabs/tailwindcss/tree/HEAD/packages/tailwindcss) from 3.4.19 to 4.2.2.
- [Release notes](https://github.com/tailwindlabs/tailwindcss/releases)
- [Changelog](https://github.com/tailwindlabs/tailwindcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/tailwindlabs/tailwindcss/commits/v4.2.2/packages/tailwindcss)

---
updated-dependencies:
- dependency-name: tailwindcss
  dependency-version: 4.2.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:56 +02:00
dependabot[bot]
bea2dc45b4
build(deps): bump actions/upload-pages-artifact from 3 to 4 (#540)
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](https://github.com/actions/upload-pages-artifact/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:38 +02:00
dependabot[bot]
dd0c8372f9
build(deps): bump actions/setup-node from 4 to 6 (#541)
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 4 to 6.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:31 +02:00
dependabot[bot]
132329abd8
build(deps): bump actions/deploy-pages from 4 to 5 (#542)
Bumps [actions/deploy-pages](https://github.com/actions/deploy-pages) from 4 to 5.
- [Release notes](https://github.com/actions/deploy-pages/releases)
- [Commits](https://github.com/actions/deploy-pages/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/deploy-pages
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:24 +02:00
dependabot[bot]
9e80bb910e
build(deps): bump actions/checkout from 4 to 6 (#539)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-10 21:10:09 +02:00
Samuel Berthe
79afa21610
feat/astro migration (#538)
* feat: migrate website from Jekyll to Astro

Rebuilds the site using Astro (SSG) with Tailwind CSS v4, replacing the
Jekyll/Cayman theme. Key changes:

- Splits the monolithic /rules page into 110 statically-generated pages
  (92 per-service + 13 group index + homepage + guide pages) for SEO
- URL structure: /rules/[group-slug]/[service-slug]/ with backward-
  compatibility redirect map for old anchor-based URLs (/rules#redis)
- Modern UI: Prometheus-orange accent, dark mode (system + toggle),
  sticky sidebar, responsive layout, copy-to-clipboard per rule/section
- SEO: per-page <title>, <meta description>, Open Graph, Twitter Card,
  canonical URLs, sitemap.xml via @astrojs/sitemap
- GEO: FAQPage JSON-LD schema on each service page (rules as Q&A pairs
  for AI search engines), TechArticle schema, BreadcrumbList
- Search: Pagefind (build-time index, lazy-loaded, ~200KB)
- Zero JS by default; copy buttons and theme toggle use inline scripts
- New CI: .github/workflows/deploy.yml builds Astro + Pagefind and
  deploys to GitHub Pages via actions/deploy-pages
- Existing dist.yml and test.yml workflows are untouched
- _data/rules.yml remains the single source of truth

Note: GitHub Pages source must be changed from "Build from branch"
(Jekyll) to "GitHub Actions" in repository settings.

* doc: new website based on astro

* refactor: remove previous website

* chore: add npm dependabot for Astro site + scope CI to _data changes

* Update site/astro.config.mjs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update site/src/components/CopyButton.astro

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* oops

* fix: strip trailing slash from BASE_URL to prevent double slashes in URLs

Agent-Logs-Url: https://github.com/samber/awesome-prometheus-alerts/sessions/c85937ba-1855-4b8a-a72b-847eab1c8639

Co-authored-by: samber <2951285+samber@users.noreply.github.com>

* fix: resolve Astro build errors in astro.config.mjs

- Remove assetsInclude yml which caused Vite to treat YAML files as static assets instead of running them through the custom YAML transform plugin; data.groups was undefined at runtime because the import resolved to a URL rather than parsed content
- Deduplicate old-path redirects: emit only the slash-less variant per service to avoid Astro router collision warnings (trailing-slash variant is handled automatically)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: samber <2951285+samber@users.noreply.github.com>
2026-04-10 21:08:06 +02:00
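The BASE_URL fix mentioned in this commit amounts to normalizing the base before joining paths. A minimal sketch, with hypothetical helper names:

```typescript
// Remove any trailing slashes so that joining never produces "//".
function stripTrailingSlash(base: string): string {
  return base.replace(/\/+$/, "");
}

// Join a base path and a site path with exactly one slash between them.
function withBase(base: string, path: string): string {
  return `${stripTrailingSlash(base)}/${path.replace(/^\/+/, "")}`;
}
```

With this, `withBase("/awesome-prometheus-alerts/", "/rules/")` and `withBase("/awesome-prometheus-alerts", "rules/")` both yield `/awesome-prometheus-alerts/rules/`.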
samber
0d148832d3 Publish 2026-04-06 19:14:46 +00:00
Samuel Berthe
c2615fae52
fix/promql rules review 2 (#534)
* fix(data): fix queries and thresholds across multiple exporters

- Ceph: fix OSD latency metric name (ceph_osd_apply_latency_ms), replace
  ceph_osd_utilization with ceph_health_detail{name="OSD_NEARFULL"}, add for: durations
- ZFS: improve description, remove incorrect ON() join on readonly check
- Thanos: filter gRPC errors to actual error codes only (drop NotFound, Cancelled, etc.)
- Loki/Promtail: fix histogram_quantile to aggregate by (namespace, job, route, le)
- Mimir: raise rate()>0 thresholds to >0.05, add missing for: durations
- OTel Collector: raise rate()>0 thresholds to >0.05, add deprecation comments
- Tempo/Cortex: raise >0 thresholds to avoid transient spikes
- APC UPS: add division-by-zero guard on battery voltage ratio
- DigitalOcean: raise increase()>0 to >3
- Grafana Alloy: fix missing name: field on exporter
- Graph Node: add threshold comments

* fix(data): remove official mixin reference from Ceph OSD comment

* fix(data): remove official mixin references from comments
2026-04-06 21:14:15 +02:00
Samuel Berthe
72c9e922c0
docs: update CLAUDE.md with lessons from PromQL review
Add guidance on untyped metrics with counter semantics (node_vmstat_*,
MySQL SHOW GLOBAL STATUS), YAML duplicate key pitfall, permanent-firing
cumulative counter alerts, and updated category structure.
2026-04-06 21:08:48 +02:00
samber
ed1515015a Publish 2026-04-06 18:38:45 +00:00
Samuel Berthe
2258835c30
fix/promql rules review (#533)
* fix(data): comprehensive PromQL review across all ~937 rules

Query fixes:
- Replace rate()/increase() with deriv()/delta() on gauge metrics exposed
  as untyped by exporters (node_vmstat_oom_kill, mysql_global_status_*,
  systemd_socket_refused_connections_total)
- Fix Ceph OSD latency metric name: ceph_osd_perf_apply_latency_seconds
  → ceph_osd_apply_latency_ms (Ceph MGR Prometheus module)
- Fix NATS subscriptions metric: gnatsd_connz_subscriptions (per-conn)
  → gnatsd_varz_subscriptions (server total)
- Fix Caddy reverse proxy down query: count()==0 → direct gauge == 0
- Fix RabbitMQ total connections metric: connectionsTotal → connections
- Fix Cilium ClusterMesh/KVStoreMesh: deriv() on failure gauge → direct
  gauge comparison (deriv > 0 misses stable non-zero failure states)
- Fix cert-manager ACME metric name: certmanager_http_acme_client_request_count
  → certmanager_acme_client_request_count (renamed in v1.19+)
- Fix Thanos Query gRPC filter: grpc_code!="OK" → explicit error codes
- Fix Flink duplicate comments: field (YAML last-write-wins bug)
- Add datid!="0" filter to PostgreSQL dead locks query
- Fix PostgreSQL high rollback rate: restructure division-by-zero guard
  and move ratio calculation outside sum()
- Add division-by-zero guards: Container Low CPU, Hadoop ResourceManager
  memory, Hadoop HBase heap, Vault cluster health
- Add for: 1m to Blackbox probe failed/HTTP failure and Ceph State/
  OSD Down/PG unavailable

Threshold fixes:
- Replace > 0 with meaningful thresholds on rate()/increase() queries
  across: Alertmanager, eBPF decoder errors, systemd refused connections,
  Memcached, Cassandra (Instaclustr + Criteo), ClickHouse distributed
  inserts, CouchDB log entries, HAProxy healthcheck failures, RabbitMQ
  unroutable messages, Spinnaker, Cilium, Mimir TSDB/alertmanager,
  OTel Collector receiver refused metrics
- Fix Elasticsearch High Indexing Latency threshold: 0.0005s → 0.01s
  (0.5ms was below normal operating range; 10ms is more realistic)

Description fixes:
- Fix MySQL slow queries: remove duplicate "mysql" word
- Fix SMART device description: remove trailing stray ")" (6 rules)
- Fix host disk IO description: remove duplicate "Check storage for issues."
- Fix EDAC correctable errors: "last 5 minutes" → "last 1 minute"
- Fix EDAC uncorrectable errors: remove time-window claim (raw counter)
- Fix Mimir store-gateway sync description: said "10 minutes" but
  threshold is 1800s (30 minutes)
- Fix Vault description false "%" suffix on count values
- Improve descriptions across RabbitMQ, Zookeeper, Kafka, Pulsar, Envoy,
  Istio rules to include {{ $labels }} and {{ $value }} template vars
- Downgrade Cassandra key cache hit rate: critical → warning

Comments:
- Add note on node_vmstat_oom_kill gauge type (delta vs increase)
- Add note on systemd_socket_refused_connections_total gauge type
- Add note on mysql_global_status_* gauge type (delta/deriv vs rate)
- Add note on pg_txid_current requiring a custom postgres_exporter query
- Add note on pg_stat_ssl_compression availability (PG 9.5-13 only)
- Add note on cert-manager legacy metric name for users on v1.18 and older
- Add threshold rationale for Elasticsearch, Cassandra, CouchDB rules
- Add note on NATS leaf node spurious fires when leaf nodes not configured

* fix(data): PromQL type fixes, job filter cleanup, query correctness review

- Replace rate()/increase() with deriv()/delta() on gauge metrics:
  node_vmstat_pgmajfault, cassandra_stats (criteo exporter),
  gitlab_ci_pipeline_failure_reasons, flink_taskmanager_job_task_numRecordsIn
- Fix histogram_quantile on non-_bucket metric: cilium_policy_implementation_delay
- Fix Thanos bucket replicate latency: use _count instead of _bucket for guard clause
- Fix Thanos query latency: use _count instead of _bucket for guard clause
- Restore job filter in Thanos objstore guard clauses (compact + store)
- Remove redundant job= filters from unique metrics: ~30 Thanos rules,
  kube_persistentvolume_status_phase, otelcol_process_runtime_*
- Fix high-cardinality Istio latency grouping (drop source labels from by())
- Add division-by-zero guard to host context switch ratio
- Raise noisy ClickHouse thresholds: RejectedInserts > 2, DelayedInserts > 10
- Remove redundant for: 1m from HAProxy check failure rules
- Add job rename comments to up{job=...} rules (Hadoop, OpenStack, SNMP, OTel)
- Remove external mixin references from comments
- Fix Tempo dropped spans metric name: add missing _total suffix
- Fix Thanos bucket replicate run latency: add missing le label in by()
2026-04-06 20:38:12 +02:00
Samuel Berthe
b8fd051a55
Update README.md 2026-03-31 16:41:19 +02:00
samber
87d0610246 Publish 2026-03-31 14:40:08 +00:00
Emil Bostijancic
7ba6b2d367
feat: add OpenSearch alerting rules (OpenSearch exporter plugin) (#532) 2026-03-31 16:39:38 +02:00
dependabot[bot]
b13d59bce6
build(deps-dev): bump activesupport from 7.2.3 to 7.2.3.1 (#531)
Bumps [activesupport](https://github.com/rails/rails) from 7.2.3 to 7.2.3.1.
- [Release notes](https://github.com/rails/rails/releases)
- [Changelog](https://github.com/rails/rails/blob/v8.1.2.1/activesupport/CHANGELOG.md)
- [Commits](https://github.com/rails/rails/compare/v7.2.3...v7.2.3.1)

---
updated-dependencies:
- dependency-name: activesupport
  dependency-version: 7.2.3.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-24 08:24:41 +01:00
dependabot[bot]
9d9c648cdd
build(deps-dev): bump json from 2.18.1 to 2.19.2 (#530)
Bumps [json](https://github.com/ruby/json) from 2.18.1 to 2.19.2.
- [Release notes](https://github.com/ruby/json/releases)
- [Changelog](https://github.com/ruby/json/blob/master/CHANGES.md)
- [Commits](https://github.com/ruby/json/compare/v2.18.1...v2.19.2)

---
updated-dependencies:
- dependency-name: json
  dependency-version: 2.19.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-20 18:22:18 +01:00
samber
af2f277830 Publish 2026-03-18 20:41:01 +00:00
Samuel Berthe
e3a7165a65
fix(data): remove malformed summary fields, replace increase() with rate(), remove redundant avg_over_time 2026-03-18 21:40:30 +01:00
samber
c0e1f7a5f5 Publish 2026-03-18 17:06:34 +00:00
Samuel Berthe
1aafa40913
fix(data): prevent division by 0 2026-03-18 18:06:00 +01:00
samber
4fb1aa9ae4 Publish 2026-03-18 11:23:25 +00:00
Samuel Berthe
a4581ed322
fix(data): fix thresholds, comments, intervals, units... (#529) 2026-03-18 12:22:55 +01:00
samber
f36c23e393 Publish 2026-03-17 12:30:42 +00:00
Samuel Berthe
03963ef6f9
refactor(categories): change categories and move some exporters (#528) 2026-03-17 13:30:13 +01:00
Samuel Berthe
06f8b048a3
fix ci 2026-03-16 19:17:05 +01:00
Samuel Berthe
5d099fcae1
fix ci 2026-03-16 17:44:00 +01:00
samber
9d00396bc8 Publish 2026-03-16 16:11:31 +00:00
Samuel Berthe
2b99cf1f76
Feat/cilium alerting rules (#526)
* Add .worktrees/ to .gitignore

* feat: add Cilium alerting rules (32 rules across agent, operator, ClusterMesh, KVStoreMesh, Hubble)

* fix: use job label instead of k8s_app, switch to single-quoted YAML strings

* remove Cilium agent high restart rate alert
2026-03-16 17:10:59 +01:00