* fix: hpa alerts are using label but the queries remove it
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
* fix: hpa alert is using label but the query removes it
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
* feat: hpa scale max should not alert when min and max are the same
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
---------
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
* fix: apiserver regexp matchers are automatically fully anchored
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
* fix: apiserver errors alert is using label but the query removes it
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
* fix: apiserver latency alert is using label but the query removes it
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
---------
Signed-off-by: R.Sicart <roger.sicart@gmail.com>
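Note on the anchoring fix above: Prometheus treats a regex matcher such as `code=~"5.."` as `^5..$`, so explicit anchors are redundant and a pattern meant to match only part of a value needs `.*` around it. A minimal Python sketch of that behavior follows. It uses `re.fullmatch` as a stand-in for Prometheus' RE2 matching, and the helper name and label values are made up for illustration.

```python
import re


def prom_regex_match(pattern: str, value: str) -> bool:
    """Approximate a Prometheus `=~` matcher: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical label values, chosen only for illustration.
anchored_explicitly = prom_regex_match("^5..$", "503")   # anchors add nothing
anchored_implicitly = prom_regex_match("5..", "503")     # same result
substring_only = prom_regex_match("api", "apiserver")    # no partial matches
with_wildcard = prom_regex_match("api.*", "apiserver")   # needs an explicit .*

print(anchored_explicitly, anchored_implicitly, substring_only, with_wildcard)
```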
haproxy_backend_max_sessions is the maximum number of sessions ever encountered during the lifetime of the HAProxy process. That is, it will never go down until HAProxy is restarted, so the alert continues to fire even though the situation has cleared!
Alerting on a lifetime maximum doesn't make sense. Look at the currently active sessions instead.
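To illustrate why the lifetime maximum keeps the alert firing, here is a small Python simulation. It is only a sketch: the session counts, the limit of 100, the 80% threshold, and the helper names are hypothetical values, not taken from the actual rule.

```python
from typing import List


def running_max(values: List[int]) -> List[int]:
    """Mimic a high-water-mark gauge: it only ever goes up until restart."""
    peak = 0
    out = []
    for v in values:
        peak = max(peak, v)
        out.append(peak)
    return out


def fires(sessions: int, limit: int, threshold: float = 0.8) -> bool:
    """Hypothetical alert condition: sessions above 80% of the limit."""
    return sessions / limit > threshold


# Hypothetical scrape values: a short spike, then traffic returns to normal.
current_sessions = [10, 95, 20, 5]
limit = 100

max_based = [fires(v, limit) for v in running_max(current_sessions)]
current_based = [fires(v, limit) for v in current_sessions]

print(max_based)      # stays True after the spike
print(current_based)  # clears once the spike is over
```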
* Added Clickhouse
* Update rules.yml
Added reasonable time periods for each query to avoid false positives and, in some cases, give the system a short window to resolve the issue itself.
Also changed the severity level of authentication alerts from critical to info, which seems more appropriate.
* Modified time period for alerts embedded-exporter.yml
I made a few adjustments to the time periods.
Please check whether they seem reasonable.
* Replication alerts time periods were adjusted
IMHO, replication alerts must be sent right away.
This commit adds new Prometheus alert definitions to monitor indexing and query metrics in Elasticsearch clusters. These alerts are essential for detecting performance issues related to indexing and querying activities.
* Rework kube-state-metrics alerts:
- provide meaningful labels in summary, as the 'instance' label hardly makes sense in most of them
- rename some alerts to describe the problem more accurately
- adjust descriptions to roughly follow the message schema found in other alerts
* move changes to _data/rules.yml
* Update rules.yml
---------
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>
The query was changed to use `mongodb_oplog_stats_start` and
`mongodb_oplog_stats_end` in #291, but these metrics do not represent
the start and end of the oplog. The original head and tail metrics are
calculated from the oplog and are consistent with the output of
`db.getReplicationInfo()`.
Use `elasticsearch_indices_indexing_index_total` instead of
`elasticsearch_indices_docs` because `elasticsearch_indices_docs` might
not update without an index refresh [1]. Refreshes happen every second
by default, *but* only if there have been search requests within the
last 30 seconds [2]. If there are no search requests for a sufficiently
long duration, the alert based on `elasticsearch_indices_docs` will fire
mistakenly.
Apart from that, `elasticsearch_indices_docs` has the gauge metric type
(while `elasticsearch_indices_indexing_index_total` is of the counter
type) and the `increase` function is not intended to be used with
gauges. Drops in the document count would be treated as a reset to 0,
thus showing an increase by all remaining documents.
[1]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html#index-stats-api-path-params
[2]: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
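The gauge problem described above can be reproduced with a simplified model of how `increase` handles drops. This is a sketch only: it ignores the extrapolation that Prometheus applies at the window boundaries, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def simplified_increase(samples: Sequence[float]) -> float:
    """Sum the per-step growth, treating any drop as a counter reset to 0.

    Simplified model of PromQL `increase`; boundary extrapolation is omitted.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # A drop is assumed to be a reset, so the whole current value
            # is counted as new growth since the reset.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# Hypothetical gauge: 100 documents are deleted, nothing is indexed.
docs_gauge = [1000, 1000, 900, 900]
# Hypothetical counter over the same window: no indexing operations.
indexing_counter = [5000, 5000, 5000, 5000]

print(simplified_increase(docs_gauge))        # reports the 900 remaining docs
print(simplified_increase(indexing_counter))  # reports no indexing
```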
This alert fires when an HPA spends more than half of the time scaled to its minReplicas, which indicates possible cost savings.
In addition, it is assumed that a minimum number of replicas should still be running for redundancy.
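The condition can be sketched in Python as below. This is an illustration under assumptions, not the rule itself: the function name, the sample replica counts, and the redundancy floor of 3 are hypothetical.

```python
from typing import Sequence


def mostly_at_min(desired: Sequence[int], min_replicas: int,
                  redundancy_floor: int = 3) -> bool:
    """Return True if the HPA sat at minReplicas for more than half the samples.

    `redundancy_floor` is a hypothetical cutoff: HPAs whose minReplicas is at
    or below it are ignored, since those replicas are assumed to be kept for
    redundancy rather than load.
    """
    if not desired:
        return False
    at_min = sum(1 for d in desired if d == min_replicas)
    return at_min / len(desired) > 0.5 and min_replicas > redundancy_floor


# Hypothetical replica counts sampled over the evaluation window.
print(mostly_at_min([5, 5, 5, 8], min_replicas=5))  # at min 75% of the time
print(mostly_at_min([5, 8, 8, 8], min_replicas=5))  # at min 25% of the time
print(mostly_at_min([2, 2, 2, 2], min_replicas=2))  # below the redundancy floor
```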
* add comment for BlackboxSslCertificateExpired rule
* use last_over_time to make certificate rules less prone to flapping
* add lower bound thresholds on BlackboxSslCertificateWillExpireSoon rules to avoid overlap
* change upper bound threshold for BlackboxSslCertificateWillExpireSoon to 20 days
* make BlackboxSslCertificateWillExpireSoon description clearer
* use days in certificate rules queries to improve notification values
Co-authored-by: Panos Rontogiannis <pronto@admin.grnet.gr>
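The effect of the lower bounds and the conversion to days can be sketched in Python. Only the 20-day upper bound comes from the changes listed above; the 3-day boundary, the band names, and the function names are assumptions made for this illustration.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_ts: float, now_ts: float) -> float:
    """Convert the remaining certificate lifetime from seconds to days."""
    return round((expiry_ts - now_ts) / SECONDS_PER_DAY, 1)


def classify(days: float, soon_lower: float = 3, soon_upper: float = 20) -> str:
    """Place a certificate in exactly one band so the rules never overlap.

    `soon_lower` is a hypothetical boundary; `soon_upper` is the 20-day bound.
    """
    if days <= 0:
        return "expired"
    if days < soon_lower:
        return "critical"
    if days < soon_upper:
        return "warning"
    return "ok"


now = 1_700_000_000  # arbitrary reference timestamp for the example
for offset_days in (-1, 2, 10, 25):
    remaining = days_until_expiry(now + offset_days * SECONDS_PER_DAY, now)
    print(remaining, classify(remaining))
```

Because each band has both a lower and an upper bound, a certificate with 2 days left triggers only the most urgent rule instead of every rule whose upper bound it is under.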
* Changed alert names to match new alert names.
* Added MongodbReplicaMemberHealth to check the health of replica members, which is exposed by the new metrics
Co-authored-by: Pooya Dowlatabadi <pooya.dowlatabadi@arvancloud.com>
Co-authored-by: Samuel Berthe <dev@samuel-berthe.fr>