styling (tpo/tpa/team#40755) (9460db41) · Commits · The Tor Project / TPA / Wiki Replica

policy/tpa-rfc-33-monitoring.md

+22 −12

		@@ -510,30 +510,40 @@ servers](tpa-rfc-33-monitoring/architecture-after.png)
		The above shows a diagram of a highly available Prometheus server
		setup. Each server has its own set of services running:

		-* Prometheus: the primary pulls metrics from exporters including a
		+* Prometheus: the primary pulls metrics from exporters including a
		 node exporter on every machine but also other exporters defined by
		 service admins, for which configuration is a mix of Puppet and a
		-GitLab repository pulled by Puppet. The secondary server keeps long
		-term metrics and pulls all the metrics from the primary server
		-using a longer scrape interval. Both Prometheus servers monitor
		-each other.
		+GitLab repository pulled by Puppet.

		-* blackbox exporter: this exporter runs on the primary Prometheus
		+The secondary server keeps long term metrics and pulls all the
		+metrics from the primary server using a longer scrape
		+interval. Both Prometheus servers monitor each other.
		+
		+* blackbox exporter: this exporter runs on the primary Prometheus
		 server and is scraped by the primary Prometheus server for
		 arbitrary metrics like ICMP, HTTP or TLS response times

		-* Grafana: the primary server runs a Grafana service which should be
		+* Grafana: the primary server runs a Grafana service which should be
		 fully configured in Puppet, with some dashboards being pulled from
		 a GitLab repository. Local configuration is completely ephemeral
		-and discouraged. It pulls metrics from the local Prometheus server
		-which has a "remote read" interface to pull backlog from the
		-secondary server.
		+and discouraged.

		-* Alertmanager: each server also runs its own Alertmanager which
		+It pulls metrics from the local Prometheus server which has a
		+"remote read" interface to pull backlog from the secondary
		+server.
		+
		+In the above diagram, it is shown as pulling directly from Prom2,
		+but that's a symbolic shortcut, it would only use `localhost` as an
		+actual data source.
		+
		+* Alertmanager: each server also runs its own Alertmanager which
		 fires off notifications to IRC, email, or (eventually) GitLab,
		 deduplicating alerts between the two servers using its gossip
		 protocol.

		+* Karma: alerting dashboard which pulls alerts from Alertmanager
		+and can issue silences.
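
The pull relationships described in the list above can be sketched as configuration fragments. The Python below is only an illustration and is not taken from the TPA Puppet code: it assumes the secondary pulls from the primary through Prometheus federation (the text only says it "pulls all the metrics" at a longer interval), and the hostnames, job name, match selector and 5 minute interval are hypothetical placeholders. The dictionary keys mirror the `scrape_configs` and `remote_read` sections of a Prometheus configuration file.

```python
import json


def federation_job(primary: str, interval: str = "5m") -> dict:
    """Scrape job for the secondary server (assumed to use federation).

    `primary` is a hypothetical host:port; `interval` is a placeholder
    for the "longer scrape interval" mentioned in the text.
    """
    return {
        "job_name": "federate-primary",  # hypothetical job name
        "scrape_interval": interval,
        "honor_labels": True,  # keep the labels set by the primary
        "metrics_path": "/federate",
        # Assumption: select every series that carries a job label.
        "params": {"match[]": ['{job=~".+"}']},
        "static_configs": [{"targets": [primary]}],
    }


def remote_read(secondary: str) -> dict:
    """Remote read entry letting the primary fetch backlog from the secondary."""
    return {"url": f"http://{secondary}/api/v1/read", "read_recent": False}


def build_configs(primary: str, secondary: str) -> tuple[dict, dict]:
    """Return (primary_config, secondary_config) fragments as plain dicts."""
    primary_cfg = {"remote_read": [remote_read(secondary)]}
    secondary_cfg = {"scrape_configs": [federation_job(primary)]}
    return primary_cfg, secondary_cfg


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    p, s = build_configs("prometheus1.example.org:9090",
                         "prometheus2.example.org:9090")
    print(json.dumps({"primary": p, "secondary": s}, indent=2))
```

In this sketch Grafana would keep a single data source pointing at the local Prometheus server, which matches the note that the diagram's direct arrow to Prom2 is only a shortcut.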

		The current prometheus1/prometheus2 server will actually be retired in
		favor of two new servers which will be rebuilt from scratch,
		entirely from Puppet, LDAP, and GitLab repository, ensuring they are