Samuel Berthe
89e703d763
feat: add alerting rules for cloudflare/ebpf_exporter (#508)
...
* feat: add alerting rules for cloudflare/ebpf_exporter
* docs: add eBPF to README service list
2026-03-16 02:56:04 +01:00
samber
9f6d4fd2a2
Publish
2026-03-16 01:34:59 +00:00
Samuel Berthe
3db9281508
feat: add SNMP exporter alerting rules (#507)
...
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
2026-03-16 02:34:34 +01:00
dependabot[bot]
b039066277
build(deps-dev): bump nokogiri from 1.18.10 to 1.19.1 (#506)
...
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.10 to 1.19.1.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.10...v1.19.1)
---
updated-dependencies:
- dependency-name: nokogiri
dependency-version: 1.19.1
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-16 01:51:52 +01:00
Samuel Berthe
01a5791376
fix: fix GitHub Actions workflow issues (#505)
...
- Replace deprecated ::set-output with $GITHUB_OUTPUT
- Pin mikefarah/yq from @master to @v4
- Add explicit permissions: contents: write to publish workflow
- Limit test workflow push trigger to master branch only
2026-03-16 01:47:09 +01:00
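The four workflow fixes listed in #505 can be illustrated with a minimal sketch. This is not the repository's actual workflow: the job name, step names, step `id`, and the `date` output are hypothetical, and the test and publish workflows are folded into one fragment for brevity.

```yaml
# Hypothetical workflow fragment illustrating the fixes described in #505.
name: example

on:
  push:
    branches: [master]      # push trigger limited to the master branch

permissions:
  contents: write           # explicit write permission, as needed by a publish job

jobs:
  publish:                  # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Compute a value
        id: meta            # hypothetical step id
        # Deprecated form: echo "::set-output name=date::$(date +%F)"
        run: echo "date=$(date +%F)" >> "$GITHUB_OUTPUT"

      - name: Use the value
        run: echo "Built on ${{ steps.meta.outputs.date }}"

      - name: Run yq from a pinned major version instead of @master
        uses: mikefarah/yq@v4
        with:
          cmd: yq --version
```

Pinning to `@v4` still follows minor releases; pinning to a full commit SHA would be stricter, at the cost of manual updates.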
samber
e2af1325c6
Publish
2026-03-16 00:27:40 +00:00
Samuel Berthe
c37ef8f50c
fix: review and fix 74 database & broker alert rules (#504)
...
* fix: review and fix 74 database & broker alert rules
Comprehensive review of all database and broker alerts covering 16 services.
Typos & descriptions (8 fixes):
- PGBouncer: "a a server" → "a server"
- RabbitMQ: "instace" → "instance", "RabbmitMQ" → "RabbitMQ",
"unactive" → "inactive"
- Cassandra: write failure said "Read failures", "bad hacker" →
"authentication failures"
- Solr: replication errors said "failed updates"
- Meilisearch: "index is empty" said "instance is down"
Duplicates removed (5 fixes):
- PostgreSQL: 2 rules using wrong exporter metric (postgresql_errors_total)
- ClickHouse: "High Network Traffic" (thread counts) duplicated byte-rate rule
- NATS: 2 rules with low thresholds duplicated better rules
Broken queries (20 fixes):
- Patroni: patroni_master → patroni_primary (renamed in v3)
- MongoDB: rate() on gauge → direct ratio for connection queries
- MongoDB: removed WiredTiger-incompatible virtual memory rule
- Cassandra instaclustr: avg() on counter → rate()[5m]
- Cassandra criteo: increase() on JMX rate metric → direct threshold
- ClickHouse: increase() on gauge → direct threshold
- NATS: rate() on gauge → direct comparison, removed 4 config-value rules
- SQL Server: increase() on gauge → direct threshold
- Pulsar: moved comparison outside sum() (4 rules)
- Hadoop: inverted comparison < 0.2 → > 0.8, counters → increase()[1h]
Severity adjustments (7 fixes):
- Redis: backup threshold 24h → 48h, rejected connections → warning > 5
- RabbitMQ: no consumer for: 5m with comment
- Elasticsearch: unassigned shards added for: 2m
- CouchDB: process restarted critical → info
- Kafka: consumer group lag → warning, threshold 10000, better description
- Hadoop: HBase heap low critical → warning
Missing for duration (18 fixes):
- Added for: 1m to service-down alerts across MySQL, PostgreSQL,
SQL Server, Patroni, Redis, MongoDB, RabbitMQ, Elasticsearch,
Cassandra, Zookeeper with restart-tolerance comments
Division by zero guards (9 fixes):
- Added denominator > 0 guards to ratio queries in PostgreSQL,
RabbitMQ, Elasticsearch, ClickHouse, CouchDB, NATS
Query design improvements (5 fixes):
- Cassandra: removed unnecessary sum() and redundant avg_over_time()
- ClickHouse: ZooKeeper avg() → per-instance check
- PostgreSQL: sum() → sum by (instance) for SSL and locks
- PGBouncer: 30s range window → 2m
Hardcoded labels (2 fixes):
- ClickHouse: added comment about job="clickhouse"
- Cassandra criteo: removed hardcoded service="cas"
* fix: address PR review comments
- Cassandra connection timeouts: wrap rate() in sum by() (rate() by() is invalid PromQL)
- Elasticsearch query latency: add division-by-zero guard
- Redis backup: "backuped" → "backed up"
2026-03-16 01:27:18 +01:00
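Several of the query patterns named in #504 (division-by-zero guards, gauges versus counters, comparison outside `sum()`, and `for:` on service-down alerts) can be sketched as follows. All metric names, alert names, and thresholds below are hypothetical placeholders, not rules taken from the repository.

```yaml
groups:
  - name: example-504-patterns          # hypothetical group name
    rules:
      # Division-by-zero guard: the "> 0" filter drops series whose
      # denominator is zero, so the ratio is never +Inf or NaN.
      - alert: ExampleHighErrorRatio
        expr: |
          (
            rate(example_errors_total[5m])
            /
            (rate(example_requests_total[5m]) > 0)
          ) > 0.05
        for: 2m
        labels:
          severity: warning

      # Gauge: compare the value directly instead of wrapping it in rate()
      # or increase(), which are only meaningful for counters.
      - alert: ExampleTooManyConnections
        expr: example_connections_current / (example_connections_max > 0) > 0.8
        for: 2m
        labels:
          severity: warning

      # Comparison outside the aggregation: this checks the per-instance
      # total, whereas sum(example_backlog > 1000) would only add up the
      # individual series that already exceed 1000.
      - alert: ExampleBacklogTooHigh
        expr: sum by (instance) (example_backlog) > 1000
        for: 5m
        labels:
          severity: warning

      # Service-down alert with a short "for" so a quick restart
      # does not page anyone.
      - alert: ExampleServiceDown
        expr: up{job="example"} == 0
        for: 1m
        labels:
          severity: critical
```

Using `sum by (instance)` rather than a bare `sum()` keeps the `instance` label on the resulting alert, which is the motivation given for the PostgreSQL SSL and locks change.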
Samuel Berthe
89842beb5c
fix: fix favicon path
2026-03-15 23:54:05 +01:00
Samuel Berthe
8f462ce962
adding claude.md
2026-03-15 19:59:01 +01:00
samber
879436f440
Publish
2026-03-15 18:47:04 +00:00
Samuel Berthe
080a792777
data: adding python/ruby/golang (#502)
...
* data: adding python/ruby/golang
* fix: address review feedback on runtime alerts
- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
2026-03-15 19:46:39 +01:00
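Three of the review points in #502 can be sketched as rules. This is an illustration under assumptions, not the rules that were merged: the alert names and thresholds are hypothetical, and the JVM metric names follow one common client-library naming scheme that may differ in a given setup.

```yaml
groups:
  - name: example-runtime-patterns      # hypothetical group name
    rules:
      # go_goroutines is a gauge, so deriv() (per-second slope over the
      # window) is used instead of rate(), which assumes a counter.
      - alert: ExampleGoGoroutineSpike
        expr: deriv(go_goroutines[5m]) > 10      # threshold is workload-dependent
        for: 5m
        labels:
          severity: warning

      # Counter: alert on growth over a window rather than on the raw,
      # ever-increasing value.
      - alert: ExamplePythonUncollectableObjects
        expr: increase(python_gc_objects_uncollectable_total[5m]) > 0
        for: 5m
        labels:
          severity: warning

      # Unbounded metaspace reports a max of -1; the "> 0" filter drops
      # those series so the ratio is only computed for bounded pools.
      - alert: ExampleJvmNonHeapHigh
        expr: |
          (
            jvm_memory_used_bytes{area="nonheap"}
            /
            (jvm_memory_max_bytes{area="nonheap"} > 0)
          ) > 0.8
        for: 5m
        labels:
          severity: warning
```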
samber
1e4e3d17bc
Publish
2026-03-15 17:08:32 +00:00
Samuel Berthe
9ae17eca97
Fix broken and misleading alert rules (#503)
...
- Remove 7 meaningless `for: 0m` (ClickHouse, Caddy, Thanos)
- Fix Minio obsolete metrics (disk_storage_* -> minio_cluster_capacity_*)
- Rename duplicate Blackbox SSL cert rule to disambiguate warning/critical
- Simplify PostgreSQL config change query (giant regex -> negative matcher)
- Downgrade PostgreSQL SSL compression severity from critical to warning
- Fix misleading "Host unusual disk read rate" name and description
2026-03-15 18:08:06 +01:00
Mattias Bengtsson
bc41215c8f
Website: Support dark mode (#501)
...
* Update Gemfile.lock
Running Jekyll according to `CONTRIBUTING.md` fails complaining about
missing a `nokogiri` dependency. Updating `Gemfile.lock` seems to solve
this issue.
Fixes: #500
* Website: Support dark mode
Support `prefers-color-scheme: dark` by employing some more or less
hacky CSS overrides.
One should perhaps just use a different off-the-shelf Jekyll theme that
does this properly from the start.
2026-03-01 22:54:42 +01:00
samber
80400e9a56
Publish
2026-03-01 19:15:42 +00:00
Marcin Morawski
eeebb90e6f
Add systemd service name to HostSystemdServiceCrashed summary (#499)
...
* Add systemd service name to HostSystemdServiceCrashed summary
* Modify systemd service crash rule description
Updated the description for the systemd service crash rule to include the service name.
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-03-01 20:15:17 +01:00
samber
0693ed168e
Publish
2026-02-21 18:40:35 +00:00
dxrayz
e60601fdcd
tune Targets Missing rules (#497)
...
* tune Targets Missing rules
* reworked query logic
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-02-21 19:40:10 +01:00
dependabot[bot]
9998e22145
build(deps-dev): bump nokogiri from 1.18.9 to 1.19.1 (#498)
...
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.18.9 to 1.19.1.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sparklemotion/nokogiri/compare/v1.18.9...v1.19.1)
---
updated-dependencies:
- dependency-name: nokogiri
dependency-version: 1.19.1
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-20 01:58:02 +01:00
dependabot[bot]
52cc00fc4c
build(deps-dev): bump faraday from 2.12.0 to 2.14.1 (#496)
...
Bumps [faraday](https://github.com/lostisland/faraday) from 2.12.0 to 2.14.1.
- [Release notes](https://github.com/lostisland/faraday/releases)
- [Changelog](https://github.com/lostisland/faraday/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lostisland/faraday/compare/v2.12.0...v2.14.1)
---
updated-dependencies:
- dependency-name: faraday
dependency-version: 2.14.1
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-02-10 00:26:42 +01:00
samber
dd10c7ef05
Publish
2026-01-30 11:15:52 +00:00
Per Lundberg
51aea96ba7
Adjust OOM kill detected rule (#495)
...
* Adjust OOM kill detected rule
When a machine runs out of memory, the node exporter can stop
responding for several minutes. The rule now takes this into
account: even if the machine takes 15-20 minutes to become
responsive again, the alert should still fire.
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
2026-01-30 12:15:27 +01:00
Andreyev Dias de Melo
1d69457017
fix: corrects download URL for rules files (#494)
2026-01-30 01:40:38 +01:00
Samuel Berthe
f0107caf9e
Update README.md
2026-01-15 12:33:35 +01:00
Samuel Berthe
34cc80ffea
Update app.css
2026-01-15 02:48:16 +01:00
Samuel Berthe
a5d1c04955
Update default.html
2026-01-15 02:43:57 +01:00
Samuel Berthe
65551ae19f
Update README.md
2026-01-15 02:42:42 +01:00
Samuel Berthe
570521429e
Update default.html
2026-01-15 02:42:00 +01:00
Samuel Berthe
55f16705eb
Add files via upload
2026-01-15 02:40:58 +01:00
Samuel Berthe
2b5c8b0ec7
Update README.md
2026-01-15 02:39:24 +01:00
samber
81081bdda5
Publish
2026-01-07 12:58:08 +00:00
Samuel Berthe
d400e3e64d
feat(k8s): cronjob rule (#491)
2026-01-07 13:57:42 +01:00
Samuel Berthe
1136aa3a87
remove file
2026-01-07 13:29:12 +01:00
Simon Matic Langford
f810ff531d
Node exporter rules to preserve instance labels (#488)
...
* Jenkins node offline for clause (#2)
* Convert cpu alert expressions to without() rather than on()
* Remove on() expression from network throughput alerts as labels fully match
---------
Co-authored-by: Simon Matic Langford <simon@longshotsystems.co.uk>
2026-01-06 16:24:18 +01:00
dependabot[bot]
74ba870f05
build(deps-dev): bump uri from 0.13.2 to 0.13.3 (#489)
...
Bumps [uri](https://github.com/ruby/uri) from 0.13.2 to 0.13.3.
- [Release notes](https://github.com/ruby/uri/releases)
- [Commits](https://github.com/ruby/uri/compare/v0.13.2...v0.13.3)
---
updated-dependencies:
- dependency-name: uri
dependency-version: 0.13.3
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-01-06 00:55:03 +01:00
5bentz
ffa260b39d
Update sleep-peacefully.md (#487)
...
Fix business hours (9:00 to 18:00)
2025-12-08 15:19:11 +01:00
dependabot[bot]
766b224c67
build(deps): bump actions/checkout from 5 to 6 (#485)
...
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)
---
updated-dependencies:
- dependency-name: actions/checkout
dependency-version: '6'
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-01 21:34:15 +01:00
Simon Matic Langford
79f2858037
Improve Jenkins node alerts to better handle servers with multiple nodes (#484)
2025-11-17 14:56:04 +01:00
Samuel Berthe
d6589237e1
Update CONTRIBUTING.md
2025-11-13 16:24:49 +01:00
Samuel Berthe
d0d1b00a7b
Fix typo in OpenTelemetry Collector link
2025-11-05 17:15:10 +01:00
Samuel Berthe
e617c07179
Update README.md
2025-11-05 17:14:47 +01:00
Samuel Berthe
48f2dde80c
feat: use /ref/head/ instead of /master/ for yaml url (#482)
2025-11-05 17:12:50 +01:00
samber
cea78d7fd6
Publish
2025-11-05 16:08:52 +00:00
Arve Knudsen
d58bc324ad
Add OpenTelemetry Collector monitoring alerts (#480)
...
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-11-05 17:08:26 +01:00
samber
4acbddb21a
Publish
2025-11-05 16:04:56 +00:00
Samuel Berthe
6e2db98590
feat: add support for exporter-level comments (#481)
2025-11-05 17:04:30 +01:00
samber
ae8cfb0366
Publish
2025-10-13 12:24:59 +00:00
andrii.k
9edef74e73
update kafka alerts (#478)
2025-10-13 14:24:37 +02:00
dependabot[bot]
2f9279d707
build(deps-dev): bump rexml from 3.3.9 to 3.4.2 (#476)
...
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.9 to 3.4.2.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.9...v3.4.2)
---
updated-dependencies:
- dependency-name: rexml
dependency-version: 3.4.2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-19 04:17:09 +02:00
samber
606d6fc592
Publish
2025-09-15 13:04:10 +00:00