prom: merge with template (#41655), authored by anarcat
@@ -1096,7 +1096,7 @@ This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still in operation, but it is
planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([issue 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][] but it can be lacking at times. [This tutorial][]
@@ -1111,6 +1111,7 @@ TPA-RFC-33 proposal][].
[This tutorial]: https://ashish.one/blogs/setup-alertmanager/
[alerting system]: https://grafana.torproject.org/alerting/
[Grafana for alerting section of the TPA-RFC-33 proposal]: policy/tpa-rfc-33-monitoring#grafana-for-alerting
[issue 29864]: https://bugs.torproject.org/29864
### Diagnosing alerting failures
@@ -1934,7 +1935,7 @@ changed.
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([issue 29864][]).
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
@@ -2007,76 +2008,11 @@ See also [Adding metrics to applications][], above.
[Adding metrics to applications]: #adding-metrics-to-applications
## Upgrades
<!-- TODO: how upgrades are performed. preferably automated through Debian -->
<!-- packages, otherwise document how upgrades are performed. see also -->
<!-- the Testing section below -->
## SLA
@@ -2171,7 +2107,7 @@ using the `matchers` list. Here's an example for the TPA IRC route:
- 'team = "TPA"'
- 'severity =~ "critical|warning"'
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
@@ -2179,7 +2115,7 @@ would otherwise not be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
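One detail to keep in mind when pointing Prometheus at a Pushgateway is
`honor_labels: true`, so that the `job` and `instance` labels pushed by
the ephemeral jobs survive the scrape instead of being overwritten. A
minimal sketch, with a hypothetical job name and target (the real
configuration is generated by Puppet):

```yaml
# Hypothetical scrape job for a Pushgateway; the target is a
# placeholder, the actual definition lives in Puppet.
scrape_configs:
  - job_name: pushgateway
    # keep the job/instance labels pushed by the batch jobs instead of
    # overwriting them with the Pushgateway's own address
    honor_labels: true
    static_configs:
      - targets:
          - 'pushgateway.example.torproject.org:9091'
```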
### Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
@@ -2199,7 +2135,7 @@ things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
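For orientation, a blackbox exporter "module" is just a named probe
definition in the exporter's own configuration file, which can then be
exercised by hand through the exporter's `/probe` endpoint before a
target is added to Prometheus. A minimal, hypothetical HTTP module
sketch (the module name and options are illustrative; the modules we
actually deploy are managed through Puppet):

```yaml
# Hypothetical blackbox exporter module: an HTTP probe that treats any
# 2xx response as a success.
modules:
  http_2xx_example:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: ip4
      valid_status_codes: []   # defaults to 2xx
```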
### Alertmanager
The [Alertmanager][] is a separate program that receives notifications
generated by Prometheus servers through an API, groups, and
@@ -2261,7 +2197,7 @@ compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].
#### Alerting philosophy
In general, when working on alerting, keeping [the "My Philosophy on
Alerting" paper from a Google engineer][] (now the [Monitoring
@@ -2311,7 +2247,7 @@ again. The [kthxbye bot][] works around that issue.
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
#### Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting,
because there are many components associated with it, and Prometheus
@@ -2429,6 +2365,106 @@ notification in a particularly flappy alert][].
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
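As a quick orientation to the details discussed above: the main timing
knobs live in two places, the `for:` clause on the alerting rule
(Prometheus side) and the `group_wait`, `group_interval` and
`repeat_interval` settings on the route (Alertmanager side). The values
below are a sketch only, not our production settings:

```yaml
# In a Prometheus rule file (illustrative): the alert must keep firing
# for 15 minutes ("pending") before it is sent to the Alertmanager.
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up == 0
        for: 15m
        labels:
          severity: warning
---
# In the Alertmanager configuration (also illustrative): how long to
# wait before the first notification for a new group, between
# notifications when the group changes, and before repeating an
# unresolved notification.
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```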
## Services
<!-- TODO: open ports, daemons, cron jobs -->
### Monitored services
Those are the actual services monitored by Prometheus.
### Internal server (`prometheus1`)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
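The net effect of that automation is, roughly, a scrape job like the
following sketch; the host names are placeholders, and the real target
list is generated from Puppet exported resources rather than written by
hand:

```yaml
# Simplified sketch of the node_exporter scrape job on the internal
# server; targets are placeholders, the actual list is exported by
# Puppet for every TPA-managed host.
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 'host1.torproject.org:9100'   # 9100 is the node_exporter default port
          - 'host2.torproject.org:9100'
```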
Access to the internal server is fairly public: the metrics there are
not considered security-sensitive, and are protected by
authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (`prometheus2`)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
might lead to timing attacks against the network and/or leak sensitive
information. The external server also explicitly does *not* scrape TPA
servers automatically: it only scrapes certain services that are
manually configured by TPA.
Those are the services currently monitored by the external server:
* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might become out of sync with the actual
implementation; see the `profile::prometheus::server::external` class
in [Puppet][] for the actual deployment.
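To illustrate what "manually configured" means in practice, such
targets end up as ordinary, hand-written scrape jobs, roughly like the
sketch below. The job names, paths and addresses here are illustrative;
the authoritative definitions live in the Puppet profile mentioned
above:

```yaml
# Hypothetical scrape jobs on the external server; the real definitions
# live in profile::prometheus::server::external in Puppet.
scrape_configs:
  # An application exposing its own metrics endpoint over HTTPS.
  - job_name: bridgestrap
    scheme: https
    metrics_path: /bridgestrap-metrics
    static_configs:
      - targets: ['bridges.torproject.org']

  # Connectivity checks go through the blackbox exporter: Prometheus
  # scrapes the exporter, and relabeling turns the configured target
  # into the probe parameter.
  - job_name: bridge_connectivity
    metrics_path: /probe
    params:
      module: [tcp_connect]   # an illustrative blackbox module name
    static_configs:
      - targets: ['192.0.2.1:443']   # placeholder bridge address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # the blackbox exporter itself
```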
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we can document
more such exporters as we find them along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [`haproxy_exporter`][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## Storage
<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
## Queues
<!-- TODO email queues, job queues, schedulers -->
## Interfaces
<!-- TODO e.g. web APIs, commandline clients, etc -->
## Authentication
<!-- TODO SSH? LDAP? standalone? -->
## Implementation
<!-- TODO programming languages, frameworks, versions, license -->
## Related services
<!-- TODO dependent services (e.g. authenticates against LDAP, or requires -->
<!-- git pushes) -->
## Issues
There is no issue tracker specifically for this project. [File][new-ticket] or
@@ -2475,6 +2511,14 @@ inside TPA. The internal Prometheus server is mostly used by TPA staff
to diagnose issues. The external Prometheus server is used by various
TPO teams for their own monitoring needs.
## Users
<!-- TODO who the main users are, how they use the service. possibly reuse -->
<!-- the Personas section in the RFC, if available. -->
<!-- see overlap with above -->
## Upstream
The upstream Prometheus projects are diverse and generally active as
of early 2021. Since Prometheus is used as an ad-hoc standard in the
new "cloud native" communities like Kubernetes, it has seen an upsurge
@@ -2503,21 +2547,12 @@ details.
[Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and metrics
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, both for system-level and for
application-specific metrics.
Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
might still be possible to deduce some activity patterns from the metrics
@@ -2533,6 +2568,19 @@ policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
## Tests
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.
TODO: merge with alertmanager test stuff
## Logs
Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.
## Backups
Prometheus servers should be fully configured through Puppet and
@@ -2590,7 +2638,7 @@ publicly.
[ticket 31159]: https://bugs.torproject.org/31159
It was originally thought Prometheus could completely replace
[Nagios][] as well ([issue 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, while Prometheus
metrics are just that: metrics, without thresholds. This makes it
@@ -2600,31 +2648,40 @@ functionality built-in to Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
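To make the threshold problem concrete: every Nagios check threshold
has to be restated explicitly as an alerting rule over the raw
metrics. A hedged sketch of what that looks like for a classic "disk
almost full" check (the expression, threshold and labels are
illustrative, not an actual TPA rule):

```yaml
# Hypothetical alerting rule restating a Nagios-style "less than 10%
# disk space left" threshold on top of raw node_exporter metrics.
groups:
  - name: disk
    rules:
      - alert: FilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}) < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} is over 90% full"
```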
## Security and risk assessment
<!-- TODO: risk assessment
5. When was the last security review done on the project? What was
the outcome? Are there any security issues currently? Should it
have another security review?
6. When was the last risk assessment done? Something that would cover
risks from the data stored, the access required, etc.
This section didn't exist when the project was launched, so this is
really just second-guessing...
-->
## Technical debt and next steps
<!-- TODO: tech debt
7. Are there any in-progress projects? Technical debt cleanup?
Migrations? What state are they in? What's the urgency? What's the
next steps?
8. What urgent things need to be done on this project?
-->
## Proposed Solutions
### TPA-RFC-33
TODO: document the TPA-RFC-33 history here. see overlap with above
### Munin replacement
The primary Prometheus server was decided [in the Brussels 2019
developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
@@ -2635,15 +2692,7 @@ developer meeting][], before anarcat joined the team ([ticket
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Other alternatives
We considered retaining Nagios/Icinga as an alerting system, separate
from Prometheus, but ultimately decided against it in [TPA-RFC-33][].