* feat: add Cloud providers alerting rules (33 rules across 4 exporters)
New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance
* fix: address PR review - move Cloud providers before Other, fix service name
- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
* feat: add alerting rules for prometheus/memcached_exporter
* fix: add division-by-zero guards and improve quoting in memcached rules (#512)
- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability
* fix: fix SNMP interface down query and add job scoping (#507)
- Fix ifOperStatus query to use vector matching instead of label filter
since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
and interface down rules to prevent matching non-SNMP series
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat: add OpenStack alerting rules (openstack-exporter)
Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.
* docs: add OpenStack to README services list
* fix: align OpenStack load balancer alert name with operating_status semantics
The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)
* docs: add Process to README services list
* fix: address PR review feedback for process-exporter rules
- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
* feat: add IPMI exporter alerting rules
Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.
* docs: add IPMI to README service list
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
- Replace deprecated ::set-output with $GITHUB_OUTPUT
- Pin mikefarah/yq from @master to @v4
- Add explicit permissions: contents: write to publish workflow
- Limit test workflow push trigger to master branch only
* data: adding python/ruby/golang
* fix: address review feedback on runtime alerts
- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
* Update Gemfile.lock
Running Jekyll according to `CONTRIBUTING.md` fails complaining about
missing a `nokogiri` dependency. Updating `Gemfile.lock` seems to solve
this issue.
Fixes: #500
* Website: Support dark mode
Support `prefers-color-scheme: dark` by employing some more or less
hacky CSS overrides.
One should perhaps just use a different off-the-shelf Jekyll theme that
does this properly from the start.
* Add systemd service name to HostSystemdServiceCrashed summary
* Modify systemd service crash rule description
Updated the description for the systemd service crash rule to include the service name.
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
* Adjust OOM kill detected rule
When a machine runs out of memory, it happens that the node
exporter stops responding for multiple minutes. I've adjusted
the rule now to take this into account: even if it takes 15-20
minutes before the machine becomes responsive again, the
alert should still fire.
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>