* feat: add Grafana Tempo and Grafana Mimir alerting rules (67 rules)
Add 18 Tempo rules and 49 Mimir rules based on official upstream mixins.
Covers ring health, compaction, TSDB, instance limits, ruler, alertmanager, and more.
* fix: address PR review comments on Tempo/Mimir rules
- Fix Tempo no tenant index builders: add on() for cross-label-set and
- Fix Tempo block list rising: output percentage instead of ratio
- Fix Mimir memory map areas: multiply by 100 to match % description
- Fix all instance limit rules: multiply by 100 to match % descriptions
- Fix distributor inflight requests: add % to description
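The ratio-versus-percentage fixes above can be sketched as follows. This is only an illustration: the metric names follow the upstream Mimir mixin as far as I recall them, and the threshold of 80 is a hypothetical value, not necessarily what the rules use.

```promql
# Before (ratio): the expression yields 0.0-1.0, but a description that prints
# "{{ $value }}%" would render e.g. "0.85%" instead of "85%".
#   ... > 0.8

# After (percentage): scale by 100 so the value matches the "%" in the description.
# The "> 0" filter skips instances where the limit is 0, i.e. disabled.
(
  (cortex_ingester_memory_series / ignoring (limit) cortex_ingester_instance_limits{limit="max_series"})
  and ignoring (limit)
  (cortex_ingester_instance_limits{limit="max_series"} > 0)
) * 100 > 80
```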
* Add .worktrees/ to .gitignore
* feat: add Jaeger alerting rules (8 rules from official jaeger-mixin)
Rules cover agent HTTP errors, RPC errors, client/agent/collector span drops,
sampling update failures, throttling update failures, and query request failures.
All rules sourced from https://github.com/jaegertracing/jaeger/tree/main/monitoring/jaeger-mixin
* fix: rename Jaeger agent RPC alert to Jaeger client RPC
The jaeger_client_jaeger_rpc_http_requests metric is client-side,
not agent-side. Rename alert to match the actual metric source.
* feat: add systemd_exporter alerting rules (7 rules)
Add new Systemd service under Basic resource monitoring with rules for:
- Unit failed/inactive state detection
- Service crash loop detection
- Task limit exhaustion
- Socket refused/high connections
- Timer missed trigger
* fix: narrow systemd unit inactive query to reduce noise
Add type="service" and name filter to the inactive unit alert
to avoid false positives from legitimately inactive units.
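A minimal sketch of what the narrowed query might look like. The unit names in the `name` matcher are hypothetical placeholders; the actual rule may use a different filter.

```promql
# Only alert on service units from an explicit allow-list, so that timers,
# sockets, and one-shot units that are legitimately inactive do not fire.
systemd_unit_state{state="inactive", type="service", name=~"sshd.service|nginx.service"} == 1
```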
* feat: add Cloud providers alerting rules (33 rules across 4 exporters)
New "Cloud providers" category with rules for:
- AWS CloudWatch (13 rules): exporter health + EC2, RDS, SQS, ALB, Lambda
- Google Cloud / Stackdriver (5 rules): scrape health, API quotas, staleness
- DigitalOcean (10 rules): droplets, databases, k8s, load balancers, incidents
- Azure (5 rules): API errors, rate limits, collection performance
* fix: address PR review - move Cloud providers before Other, fix service name
- Move "Cloud providers" group before "Other" in rules.yml for consistent ordering
- Rename "Google Cloud / Stackdriver" to "Google Cloud Stackdriver" to avoid
awkward /-/ in generated anchors and dist/rules/ paths
- Fix README anchor link to match the new service name
* fix: use proper zero-traffic guard in Envoy ratio alerts (#511)
Replace `+ 1` denominator hack with `and ... > 0` filter in upstream
timeout rate and upstream 5xx error rate queries for mathematical
correctness and repo consistency.
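For illustration, the change has roughly the following shape. The metric and label names are the usual Envoy Prometheus stats, and the 5m window and 5% threshold are assumptions for this sketch, not necessarily the values in the rules.

```promql
# Before: "+ 1" avoids a zero denominator but skews the ratio at low traffic
# (0.5 req/s that all time out would report 0.5 / 1.5 = 33% instead of 100%).
sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_timeout[5m]))
  / (sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m])) + 1) * 100 > 5

# After: compute the true ratio and keep only clusters that received traffic.
(
  sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_timeout[5m]))
    / sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m])) * 100 > 5
)
and
sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m])) > 0
```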
* feat: add alerting rules for prometheus/memcached_exporter
* fix: add division-by-zero guards and improve quoting in memcached rules (#512)
- Add `and memcached_max_connections > 0` to connection limit queries
- Add `and memcached_limit_bytes > 0` to memory usage query
- Switch hit-rate query to single quotes for cleaner PromQL readability
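A sketch of the guarded connection-limit query, assuming the standard memcached_exporter gauge `memcached_current_connections` and a hypothetical 80% threshold:

```promql
# The "and ... > 0" clause drops instances that report a zero limit,
# so the division cannot produce +Inf and trigger a spurious alert.
(memcached_current_connections / memcached_max_connections * 100 > 80)
  and memcached_max_connections > 0
```

The memory usage query would follow the same pattern with `memcached_limit_bytes` as the denominator and guard.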
* fix: fix SNMP interface down query and add job scoping (#507)
- Fix ifOperStatus query to use vector matching instead of label filter
since ifAdminStatus is a separate metric in snmp_exporter output
- Add job=~"snmp.*" matcher to interface error rate, bandwidth usage,
and interface down rules to prevent matching non-SNMP series
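The vector-matching form might look like the sketch below. It assumes the IF-MIB convention that a status value of 1 means up and 2 means down, and that both series carry `instance` and `ifIndex` labels; the exact label set depends on the snmp_exporter generator configuration.

```promql
# Interface is operationally down (2) while administratively up (1).
# ifAdminStatus is a separate series, so it is joined with "and on (...)"
# rather than being referenced as a label on ifOperStatus.
ifOperStatus{job=~"snmp.*"} == 2
  and on (instance, ifIndex)
ifAdminStatus{job=~"snmp.*"} == 1
```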
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* feat: add OpenStack alerting rules (openstack-exporter)
Add 20 alerting rules for openstack-exporter/openstack-exporter covering
Nova, Neutron, Cinder, Octavia, and Placement services.
* docs: add OpenStack to README services list
* fix: align OpenStack load balancer alert name with operating_status semantics
The operating_status label uses ONLINE/OFFLINE/DEGRADED/ERROR values,
not ACTIVE. Rename alert to "not online" and use the label in the
description for clarity.
* feat: add process-exporter alerting rules (ncabatoff/process-exporter)
* docs: add Process to README services list
* fix: address PR review feedback for process-exporter rules
- Rename service from "Process" to "Process Exporter" for clarity
- Fix grammar: "file descriptors usage" → "file descriptor usage"
- Clarify CPU alert description as core-equivalent percentage
- Rename "high disk IO" to "high disk write IO" for accuracy
* feat: add IPMI exporter alerting rules
Add 17 alerting rules for prometheus-community/ipmi_exporter covering
temperature, fan, voltage, current, power sensors, chassis status,
and system event log monitoring.
* docs: add IPMI to README service list
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add 7 alerting rules for prometheus/snmp_exporter covering device
availability, interface status, error rates, bandwidth utilization,
and device restarts. Rules use standard IF-MIB and SNMPv2-MIB metrics.
- Replace deprecated ::set-output with $GITHUB_OUTPUT
- Pin mikefarah/yq from @master to @v4
- Add explicit permissions: contents: write to publish workflow
- Limit test workflow push trigger to master branch only
* data: add Python, Ruby, and Go runtime alerting rules
* fix: address review feedback on runtime alerts
- JVM non-heap: guard against unbounded metaspace (max_bytes = -1)
- JVM old gen GC: note regex only matches CMS/G1/Parallel collectors
- JVM/Python file descriptors: note process_* metrics are generic
- Go memory usage: fix description (sys_bytes is runtime memory, not host)
- Go goroutine spike: use deriv() instead of rate() on gauge
- Go GC CPU fraction: note deprecation since Go 1.20
- Go GC duration: clarify quantile="1" is max, not p99
- Python uncollectable: use increase() on counter instead of raw threshold
- Add threshold comments for workload-dependent defaults
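A few of the fixes above, sketched as queries. The thresholds and windows are hypothetical, and the JVM metric names assume a Micrometer-style exporter; other exporters name these series differently.

```promql
# go_goroutines is a gauge, so use deriv() (per-second slope over the window)
# rather than rate(), which is only meaningful for counters.
deriv(go_goroutines[10m]) > 10

# python_gc_objects_uncollectable_total is a counter: alert on recent growth
# with increase() instead of comparing the raw cumulative value.
increase(python_gc_objects_uncollectable_total[5m]) > 0

# JVM non-heap: an unbounded pool reports max_bytes = -1, so filter to
# positive maxima before dividing.
jvm_memory_used_bytes{area="nonheap"}
  / (jvm_memory_max_bytes{area="nonheap"} > 0) * 100 > 80
```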
* Update Gemfile.lock
Running Jekyll as described in `CONTRIBUTING.md` fails with an error about
a missing `nokogiri` dependency. Updating `Gemfile.lock` seems to resolve
the issue.
Fixes: #500
* Website: Support dark mode
Support `prefers-color-scheme: dark` by employing some more or less
hacky CSS overrides.
One should perhaps just use a different off-the-shelf Jekyll theme that
does this properly from the start.
* Add systemd service name to HostSystemdServiceCrashed summary
* Modify systemd service crash rule description
Updated the description for the systemd service crash rule to include the service name.
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>