policy/tpa-rfc-33-monitoring.md +22 −12

@@ -510,30 +510,40 @@ servers](tpa-rfc-33-monitoring/architecture-after.png)

The above shows a diagram of a highly available Prometheus server
setup. Each server has its own set of services running:

 * **Prometheus**: the primary pulls metrics from exporters including a
   node exporter on every machine but also other exporters defined by
   service admins, for which configuration is a mix of Puppet and a
   GitLab repository pulled by Puppet.

   The secondary server keeps long term metrics and pulls all the
   metrics from the primary server using a longer scrape
   interval. Both Prometheus servers monitor each other.

 * **blackbox exporter**: this exporter runs on the primary Prometheus
   server and is scraped by the primary Prometheus server for
   arbitrary metrics like ICMP, HTTP or TLS response times.

 * **Grafana**: the primary server runs a Grafana service which should be
   fully configured in Puppet, with some dashboards being pulled from
   a GitLab repository. Local configuration is completely ephemeral
   and discouraged.

   It pulls metrics from the local Prometheus server which has a
   "remote read" interface to pull backlog from the secondary server.
   In the above diagram, it is shown as pulling directly from Prom2,
   but that is a symbolic shortcut: Grafana would only use `localhost`
   as its actual data source.

 * **Alertmanager**: each server also runs its own Alertmanager which
   fires off notifications to IRC, email, or (eventually) GitLab,
   deduplicating alerts between the two servers using its gossip
   protocol.

 * **Karma**: alerting dashboard which pulls alerts from Alertmanager
   and can issue silences.

The current prometheus1/prometheus2 servers will actually be retired in
favor of two *new* servers which will be rebuilt from scratch,
entirely from Puppet, LDAP, and GitLab repository, ensuring they are
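
As a rough sketch of how the primary/secondary relationship described above might be expressed in Prometheus configuration, the fragments below show the secondary scraping the primary through the `/federate` endpoint, and the primary reading older samples back from the secondary through remote read. Federation is an assumption here: the text only says the secondary pulls all metrics using a longer scrape interval. The host names, port, job name, selector, and the `5m` interval are hypothetical placeholders, not values taken from the RFC.

```yaml
# prometheus.yml on the secondary (hypothetical host: prometheus2.example.org)
scrape_configs:
  - job_name: federate-primary        # hypothetical job name
    scrape_interval: 5m               # assumed "longer" interval
    honor_labels: true                # keep the labels exposed by the primary
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'               # assumed selector: every series with a job label
    static_configs:
      - targets:
          - prometheus1.example.org:9090   # hypothetical primary address
---
# prometheus.yml on the primary (hypothetical host: prometheus1.example.org)
remote_read:
  - url: http://prometheus2.example.org:9090/api/v1/read   # hypothetical secondary address
    read_recent: false                # skip the remote for ranges local storage fully covers
```

With a layout like this, Grafana only needs the primary's local Prometheus as a data source, which matches the `localhost` remark in the Grafana item: queries that reach past local retention would be answered with samples fetched from the secondary.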