prometheus self-monitoring: alertmanager monitoring and dead man switch

Closed Issue created 10 months ago by anarcat

Prometheus should monitor itself and its Alertmanager for outages by scraping their metrics endpoints and checking the `up` metric. For Alertmanager, it could possibly also check `alertmanager_config_last_reload_successful` and `alertmanager_notifications_failed_total` (source).
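To illustrate what checking those two Alertmanager metrics could look like, here is a minimal Python sketch that inspects a scraped `/metrics` payload. It is only an illustration under assumptions: the function names `parse_samples` and `alertmanager_problems` are hypothetical, the parser is deliberately simplified (no timestamps, no spaces inside label values), and treating any non-zero failure counter as a problem is a crude stand-in for what would normally be a rate-based alerting rule in Prometheus itself.

```python
"""Sketch: sanity checks on a scraped Alertmanager /metrics payload."""


def parse_samples(text):
    """Parse text-exposition lines into (name, labels, value) tuples.

    Simplified on purpose: assumes no timestamps and no spaces in label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def alertmanager_problems(text):
    """Return human-readable problems found in the payload (hypothetical helper)."""
    problems = []
    for name, labels, value in parse_samples(text):
        # The gauge is 1 when the last configuration reload succeeded.
        if name == "alertmanager_config_last_reload_successful" and value != 1:
            problems.append("config reload failed")
        # Assumption: any non-zero counter is flagged; a real rule would
        # look at the increase over a time window instead.
        if name == "alertmanager_notifications_failed_total" and value > 0:
            problems.append(f"failed notifications ({labels}): {value:g}")
    return problems
```

In practice these conditions would more likely live in Prometheus alerting rules than in a script; the sketch only shows which values matter.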

Prometheus calls this metamonitoring, which also covers the "monitoring server is up, but your configuration is empty" scenario. For example, they suggest a blackbox test that checks that a metric pushed to the Pushgateway triggers an outgoing alert.

Some mechanism may be set up to make sure alerts can and do get delivered, probably a "dead man's switch" that continuously sends alerts and checks that they get delivered. Karma has support for such alerts, for example, and prommsd is a standalone daemon designed to act as a webhook receiver for Alertmanager: it raises an alert back into the Alertmanager if it stops receiving alerts.
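The core logic of a dead man's switch can be sketched in a few lines of Python. This is not how Karma or prommsd implement it; the class name `DeadMansSwitch`, its methods, and the timeout are all hypothetical, and the sketch leaves out the webhook receiver and the alert it would need to raise.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat alert has arrived within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        # The clock is injectable so the logic can be tested without sleeping.
        self._clock = clock
        self._last_seen = clock()

    def heartbeat(self):
        """Record that the always-firing alert was received just now."""
        self._last_seen = self._clock()

    def is_tripped(self):
        """True once the heartbeat has been silent for longer than the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds
```

A receiver would call `heartbeat()` each time the continuously firing alert arrives, and some independent process would poll `is_tripped()` and notify through a channel that does not depend on the monitored Alertmanager.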

The MVP is alerts on the Alertmanager itself, but that doesn't make much sense without HA, since an Alertmanager that is down cannot deliver alerts about its own outage. A dead man's switch on Karma might therefore be higher priority (requires #41640 (closed) and of course #41630 (closed)).

Also investigate the other options above.

Edited 10 months ago by anarcat
