prometheus self-monitoring: alertmanager monitoring and dead man switch
Quote from TPA-RFC-33:
Prometheus should monitor itself and its [Alertmanager][] for outages, by scraping their metrics endpoints and checking for `up` metrics, but, for Alertmanager, possibly also `alertmanager_config_last_reload_successful` and `alertmanager_notifications_failed_total` (source).
Prometheus calls this metamonitoring, which also includes the "monitoring server is up, but your configuration is empty" scenario. For example, they suggest a blackbox test that a metric pushed to the Pushgateway will trigger an outgoing alert.
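As a rough illustration of the Alertmanager metric checks mentioned in the quote, here is a minimal Python sketch that inspects the text of a scraped Alertmanager `/metrics` page. The function names `parse_samples` and `alertmanager_problems` are hypothetical, the parser ignores sample timestamps, and treating any non-zero failure counter as a problem is a simplifying assumption (a real alerting rule would more likely look at the rate of increase).

```python
def parse_samples(text):
    """Parse Prometheus text exposition lines into (metric_name, value) pairs.

    Simplification: assumes no trailing timestamps, so the value is the
    last space-separated field on each sample line.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        samples.append((name, float(value)))
    return samples


def alertmanager_problems(text):
    """Return a list of human-readable problems found in the scraped metrics."""
    reload_ok = None
    failed = 0.0
    for name, value in parse_samples(text):
        if name == "alertmanager_config_last_reload_successful":
            reload_ok = value
        elif name == "alertmanager_notifications_failed_total":
            failed += value  # sum over all integrations

    problems = []
    if reload_ok is None:
        problems.append("reload metric missing")
    elif reload_ok != 1:
        problems.append("last config reload failed")
    if failed > 0:
        # Assumption: any failure since start counts as a problem.
        problems.append(f"{failed:g} notifications failed")
    return problems
```

A missing reload metric is reported separately from a failed reload, since an empty scrape is itself a symptom worth alerting on.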
Some mechanism may be set to make sure alerts can and do get delivered, probably through a "dead man's switch" that continuously sends alerts and makes sure they get delivered. Karma has support for such alerts, for example, and prommsd is a standalone daemon that's designed to act as a webhook receiver for Alertmanager that will raise an alert back into the Alertmanager if it doesn't receive alerts.
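The dead man's switch idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not how Karma or prommsd are implemented: the class name `DeadMansSwitch`, its methods, and the injectable `clock` parameter are all hypothetical. The receiver records each heartbeat alert it gets and reports itself as tripped once no heartbeat has arrived within the timeout.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has been received within `timeout_seconds`."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_seen = clock()  # treat startup as the first heartbeat

    def heartbeat(self):
        """Call this whenever the always-firing alert is received."""
        self.last_seen = self.clock()

    def is_tripped(self):
        """True if the heartbeat is overdue, meaning the alert pipeline may be broken."""
        return self.clock() - self.last_seen > self.timeout
```

In a deployment along these lines, `heartbeat()` would be called from a webhook handler that receives the continuously firing alert, and something outside the monitored pipeline would poll `is_tripped()` and page through an independent channel.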
The MVP is alerts on the Alertmanager itself, but that doesn't make much sense without HA: an Alertmanager that is down cannot notify anyone about its own outage. A dead man's switch on Karma might therefore be higher priority (requires #41640 (closed) and of course #41630 (closed)).
Also investigate the other options above.