Unverified Commit 1d7e5aef authored by anarcat

throw together a quick ref of what is monitored by prometheus

parent e95df3f6
+33 −0
@@ -249,6 +249,39 @@ Then TPA needs to hook those as part of a new node `job` in the
`scrape_configs`, in `prometheus.yml`, from Puppet, in
`profile::prometheus::server`.
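For illustration, a `job` in `scrape_configs` is essentially a name plus a list of `host:port` targets. The snippet below is a minimal sketch, not what `profile::prometheus::server` actually generates: it builds one such entry in Python and prints it as JSON, which YAML parsers also accept. The `node_job` helper and the `host*.example.org` names are hypothetical; `9100` is the default `node_exporter` port.

```python
import json


def node_job(job_name, hosts, port=9100):
    """Build one scrape_configs entry pointing at node_exporter on each host.

    Hypothetical helper: the real entries are generated by Puppet.
    """
    return {
        "job_name": job_name,
        "static_configs": [
            # Each target is "host:port"; Prometheus scrapes /metrics by default.
            {"targets": [f"{host}:{port}" for host in hosts]},
        ],
    }


# Example host names only, not TPA's real inventory.
config = {"scrape_configs": [node_job("node", ["host1.example.org", "host2.example.org"])]}

# JSON output doubles as a valid YAML fragment for prometheus.yml.
print(json.dumps(config, indent=2))
```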

## Monitored services

These are the services currently monitored by Prometheus.

### Internal server (prometheus1)

The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`](https://github.com/prometheus/node_exporter) on *all* servers, which
collects metrics like CPU, memory and disk usage, time accuracy, and
so on. Additional exporters may be enabled for specific services,
such as email or web servers.
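What the server actually fetches from each exporter is a plain-text page in the Prometheus text exposition format, one sample per line. As a rough sketch of that format, the hypothetical `parse_metrics` helper below splits such a page into `(name, labels, value)` tuples. It is deliberately simplified (it ignores timestamps, does not unescape label values, and assumes no `}` inside label values), and the sample values are made up; the metric names are ones `node_exporter` exposes.

```python
import re

# One sample line: metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples from an exposition-format page."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Made-up values in the shape node_exporter produces.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Prometheus itself uses a complete parser; this only shows roughly what a scrape returns.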

Access to the internal server is fairly open: the metrics there are
not considered security-sensitive, and authentication is in place
only to keep bots away.

### External server (prometheus2)

The "external" server, on the other hand, is more restrictive and does
not allow public access. This is because specific metrics could enable
timing attacks against the network and/or leak sensitive
information. The external server also explicitly does *not* scrape TPA
servers automatically: it only scrapes certain services that are
manually configured by TPA.

These are the services currently monitored by the external server:

 * [bridgestrap](https://bridges.torproject.org/bridgestrap-metrics)
 * [rdsys](https://bridges.torproject.org/rdsys-backend-metrics)
 * OnionPerf external nodes' `node_exporter`s
 * connectivity test on (some?) bridges (using the
   [`blackbox_exporter`](https://github.com/prometheus/blackbox_exporter/))
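Since these targets are configured by hand, each one ends up as its own `scrape_configs` job. Below is a hedged sketch of what such jobs could look like, built in Python and printed as JSON (also valid YAML); it is not the actual prometheus2 configuration. The `/bridgestrap-metrics` and `/rdsys-backend-metrics` paths come from the links above. The bridge probe follows the relabeling pattern documented for the `blackbox_exporter`; the `tcp_connect` module name, the `localhost:9115` exporter address, the `192.0.2.10:443` bridge and both helper functions are assumptions for illustration.

```python
import json


def https_job(job_name, host, metrics_path):
    """Hypothetical helper: scrape one HTTPS endpoint on a non-default path."""
    return {
        "job_name": job_name,
        "scheme": "https",
        "metrics_path": metrics_path,
        "static_configs": [{"targets": [host]}],
    }


def blackbox_job(job_name, module, targets, exporter="localhost:9115"):
    """Hypothetical helper: probe each target through blackbox_exporter's /probe.

    The relabeling copies each target into the ?target= URL parameter, keeps
    it as the "instance" label, then points the real scrape at the exporter.
    """
    return {
        "job_name": job_name,
        "metrics_path": "/probe",
        "params": {"module": [module]},
        "static_configs": [{"targets": list(targets)}],
        "relabel_configs": [
            {"source_labels": ["__address__"], "target_label": "__param_target"},
            {"source_labels": ["__param_target"], "target_label": "instance"},
            {"target_label": "__address__", "replacement": exporter},
        ],
    }


config = {
    "scrape_configs": [
        https_job("bridgestrap", "bridges.torproject.org", "/bridgestrap-metrics"),
        https_job("rdsys", "bridges.torproject.org", "/rdsys-backend-metrics"),
        # 192.0.2.10 is a documentation address standing in for a real bridge.
        blackbox_job("bridges", "tcp_connect", ["192.0.2.10:443"]),
    ]
}

print(json.dumps(config, indent=2))
```

The relabeling is what lets a single exporter probe many bridges: Prometheus always talks to the exporter, and the bridge address only travels as a query parameter and a label.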

## SLA

Prometheus is currently not doing alerting so it doesn't have any sort