From 9460db41fb0080dcf3d465eac877b4eb42da063c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Wed, 8 May 2024 17:50:13 -0400
Subject: [PATCH] styling (tpo/tpa/team#40755)

---
 policy/tpa-rfc-33-monitoring.md | 34 +++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index c6e34f52..3cc4b081 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -510,30 +510,40 @@ servers](tpa-rfc-33-monitoring/architecture-after.png)
 The above shows a diagram of a highly available Prometheus server
 setup. Each server has its own set of services running:
 
- * Prometheus: the primary pulls metrics from exporters including a
+ * **Prometheus**: the primary pulls metrics from exporters including a
    node exporter on every machine but also other exporters defined by
    service admins, for which configuration is a mix of Puppet and a
-   GitLab repository pulled by Puppet. The secondary server keeps long
-   term metrics and pulls all the metrics from the primary server
-   using a longer scrape interval. Bother Prometheus server monitor
-   each other.
+   GitLab repository pulled by Puppet.
+
+   The secondary server keeps long term metrics and pulls all the
+   metrics from the primary server using a longer scrape
+   interval. Bother Prometheus server monitor each other.
 
- * blackbox exporter: this exporter runs on the primary Prometheus
+ * **blackbox exporter**: this exporter runs on the primary Prometheus
    server and is scraped by the primary Prometheus server for
    arbitrary metrics like ICMP, HTTP or TLS response times
 
- * Grafana: the primary server runs a Grafana service which should be
+ * **Grafana**: the primary server runs a Grafana service which should be
    fully configured in Puppet, with some dashboards being pulled from
    a GitLab repository. Local configuration is completely ephemeral
-   and discouraged. It pulls metrics from the local Prometheus server
-   which has a "remote read" interface to pull backlog from the
-   secondary server.
-
- * Alertmanager: each server also runs its own Alertmanager which
+   and discouraged.
+
+   It pulls metrics from the local Prometheus server which has a
+   "remote read" interface to pull backlog from the secondary
+   server.
+
+   In the above diagram, it is shown as pulling directly from Prom2,
+   but that's a symbolic shortcut, it would only use `localhost` as an
+   actual data source.
+
+ * **Alertmanager**: each server also runs its own Alertmanager which
    fires off notifications to IRC, email, or (eventually) GitLab,
    deduplicating alerts between the two servers using its gossip
    protocol.
 
+ * **Karma**: alerting dashboard which pulls alerts from Alertmanager
+   and can issue silences.
+
 The current prometheus1/prometheus2 server will actually be retired in
 favor of two *new* servers which will be rebuilt from scratch,
 entirely from Puppet, LDAP, and GitLab repository, ensuring they are
-- 
GitLab
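
The Prometheus bullet in the patched text says the secondary server pulls all metrics from the primary using a longer scrape interval. A minimal sketch of what such a scrape job could look like follows, assuming Prometheus federation (the `/federate` endpoint) is the mechanism; the patch does not say so explicitly. The host name, the `5m` interval, and the function names are hypothetical, chosen only for illustration. The keys (`scrape_interval`, `honor_labels`, `metrics_path`, `params`, `static_configs`) are standard Prometheus scrape configuration fields.

```python
import json

# Hypothetical address: the patch only speaks of a "primary" and a
# "secondary" server and gives no host names.
PRIMARY = "prometheus-primary.example.org:9090"


def federation_job(target, scrape_interval="5m"):
    """Build a scrape job that pulls series from another Prometheus server.

    The 5m default is an assumption; the text only says the secondary
    uses "a longer scrape interval" than the primary.
    """
    return {
        "job_name": "federate",
        "scrape_interval": scrape_interval,
        # Keep the labels exposed by the primary instead of overwriting them.
        "honor_labels": True,
        "metrics_path": "/federate",
        # Select every series that carries a non-empty "job" label.
        "params": {"match[]": ['{job=~".+"}']},
        "static_configs": [{"targets": [target]}],
    }


def secondary_config(primary=PRIMARY, scrape_interval="5m"):
    """Assemble the scrape section of the secondary server's configuration."""
    return {"scrape_configs": [federation_job(primary, scrape_interval)]}


if __name__ == "__main__":
    # Printed as JSON for inspection only.
    print(json.dumps(secondary_config(), indent=2))
```

In the setup described by the patch this configuration would be managed by Puppet rather than generated by hand; the sketch only makes the primary/secondary relationship concrete.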