prom: merge with template (#41655), authored by anarcat
@@ -1096,7 +1096,7 @@ This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still in operation, but it is
planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([issue 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][] but it can be lacking at times. [This tutorial][]
@@ -1111,6 +1111,7 @@ TPA-RFC-33 proposal][].
[This tutorial]: https://ashish.one/blogs/setup-alertmanager/
[alerting system]: https://grafana.torproject.org/alerting/
[Grafana for alerting section of the TPA-RFC-33 proposal]: policy/tpa-rfc-33-monitoring#grafana-for-alerting
[issue 29864]: https://bugs.torproject.org/29864
### Diagnosing alerting failures
@@ -1934,7 +1935,7 @@ changed.
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([issue 29864][]).
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
@@ -2007,76 +2008,11 @@ See also [Adding metrics to applications][], above.
[Adding metrics to applications]: #adding-metrics-to-applications
## Upgrades
<!-- TODO: how upgrades are performed. preferably automated through Debian -->
<!-- packages, otherwise document how upgrades are performed. see also -->
<!-- the Testing section below -->
## SLA
@@ -2171,7 +2107,7 @@ using the `matchers` list. Here's an example for the TPA IRC route:
- 'team = "TPA"'
- 'severity =~ "critical|warning"'
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
@@ -2179,7 +2115,7 @@ would otherwise not be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
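One detail to keep in mind when pointing Prometheus at a Pushgateway is
`honor_labels: true`, so that the `job` and `instance` labels pushed by
the ephemeral jobs survive the scrape instead of being overwritten. A
minimal sketch, with a hypothetical job name and target (the real
configuration is generated by Puppet):

```yaml
# Hypothetical scrape job for a Pushgateway; the target is a
# placeholder, the actual definition lives in Puppet.
scrape_configs:
  - job_name: pushgateway
    # keep the job/instance labels pushed by the batch jobs instead of
    # overwriting them with the Pushgateway's own address
    honor_labels: true
    static_configs:
      - targets:
          - 'pushgateway.example.torproject.org:9091'
```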
### Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
@@ -2199,7 +2135,7 @@ things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
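For orientation, a blackbox exporter "module" is just a named probe
definition in the exporter's own configuration file, which can then be
exercised by hand through the exporter's `/probe` endpoint before a
target is added to Prometheus. A minimal, hypothetical HTTP module
sketch (the module name and options are illustrative; the modules we
actually deploy are managed through Puppet):

```yaml
# Hypothetical blackbox exporter module: an HTTP probe that treats any
# 2xx response as a success.
modules:
  http_2xx_example:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: ip4
      valid_status_codes: []   # defaults to 2xx
```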
### Alertmanager
The [Alertmanager][] is a separate program that receives notifications
generated by Prometheus servers through an API, groups, and
@@ -2261,7 +2197,7 @@ compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].
#### Alerting philosophy
In general, when working on alerting, keeping [the "My Philosophy on
Alerting" paper from a Google engineer][] (now the [Monitoring
@@ -2311,7 +2247,7 @@ again. The [kthxbye bot][] works around that issue.
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
#### Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting,
because there are many components associated with it, and Prometheus
@@ -2429,6 +2365,106 @@ notification in a particularly flappy alert][].
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
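As a quick orientation to the details discussed above: the main timing
knobs live in two places, the `for:` clause on the alerting rule
(Prometheus side) and the `group_wait`, `group_interval` and
`repeat_interval` settings on the route (Alertmanager side). The values
below are a sketch only, not our production settings:

```yaml
# In a Prometheus rule file (illustrative): the alert must keep firing
# for 15 minutes ("pending") before it is sent to the Alertmanager.
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up == 0
        for: 15m
        labels:
          severity: warning
---
# In the Alertmanager configuration (also illustrative): how long to
# wait before the first notification for a new group, between
# notifications when the group changes, and before repeating an
# unresolved notification.
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```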
## Services
<!-- TODO: open ports, daemons, cron jobs -->
### Monitored services
Those are the actual services monitored by Prometheus.
### Internal server (`prometheus1`)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
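The net effect of that automation is, roughly, a scrape job like the
following sketch; the host names are placeholders, and the real target
list is generated from Puppet exported resources rather than written by
hand:

```yaml
# Simplified sketch of the node_exporter scrape job on the internal
# server; targets are placeholders, the actual list is exported by
# Puppet for every TPA-managed host.
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 'host1.torproject.org:9100'   # 9100 is the node_exporter default port
          - 'host2.torproject.org:9100'
```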
Access to the internal server is fairly public: the metrics there are
not considered security-sensitive, and are protected by
authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (`prometheus2`)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
might lead to timing attacks against the network and/or leak sensitive
information. The external server also explicitly does *not* scrape TPA
servers automatically: it only scrapes certain services that are
manually configured by TPA.
Those are the services currently monitored by the external server:
* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might become out of sync with the actual
implementation; see the `profile::prometheus::server::external` class
in [Puppet][] for the actual deployment.
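To illustrate what "manually configured" means in practice, such
targets end up as ordinary, hand-written scrape jobs, roughly like the
sketch below. The job names, paths and addresses here are illustrative;
the authoritative definitions live in the Puppet profile mentioned
above:

```yaml
# Hypothetical scrape jobs on the external server; the real definitions
# live in profile::prometheus::server::external in Puppet.
scrape_configs:
  # An application exposing its own metrics endpoint over HTTPS.
  - job_name: bridgestrap
    scheme: https
    metrics_path: /bridgestrap-metrics
    static_configs:
      - targets: ['bridges.torproject.org']

  # Connectivity checks go through the blackbox exporter: Prometheus
  # scrapes the exporter, and relabeling turns the configured target
  # into the probe parameter.
  - job_name: bridge_connectivity
    metrics_path: /probe
    params:
      module: [tcp_connect]   # an illustrative blackbox module name
    static_configs:
      - targets: ['192.0.2.1:443']   # placeholder bridge address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # the blackbox exporter itself
```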
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we can document
more such exporters as we find them along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [`haproxy_exporter`][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## Storage
<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
## Queues
<!-- TODO email queues, job queues, schedulers -->
## Interfaces
<!-- TODO e.g. web APIs, commandline clients, etc -->
## Authentication
<!-- TODO SSH? LDAP? standalone? -->
## Implementation
<!-- TODO programming languages, frameworks, versions, license -->
## Related services
<!-- TODO dependent services (e.g. authenticates against LDAP, or requires -->
<!-- git pushes) -->
## Issues
There is no issue tracker specifically for this project. [File][new-ticket] or
@@ -2475,6 +2511,14 @@ inside TPA. The internal Prometheus server is mostly used by TPA staff
to diagnose issues. The external Prometheus server is used by various
TPO teams for their own monitoring needs.
## Users
<!-- TODO who the main users are, how they use the service. possibly reuse -->
<!-- the Personas section in the RFC, if available. -->
<!-- see overlap with above -->
## Upstream
The upstream Prometheus projects are diverse and generally active as
of early 2021. Since Prometheus is used as an ad-hoc standard in the
new "cloud native" communities like Kubernetes, it has seen an upsurge
@@ -2503,21 +2547,12 @@ details.
[Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and metrics
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, both for system-level and for
application-specific metrics.
Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
might still be possible to deduce some activity patterns from the metrics
@@ -2533,6 +2568,19 @@ policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
## Tests
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.
TODO: merge with alertmanager test stuff
## Logs
Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.
## Backups
Prometheus servers should be fully configured through Puppet and
@@ -2590,7 +2638,7 @@ publicly.
[ticket 31159]: https://bugs.torproject.org/31159
It was originally thought Prometheus could completely replace
[Nagios][] as well ([issue 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, while Prometheus
metrics are just that: metrics, without thresholds. This makes it
@@ -2600,31 +2648,40 @@ functionality built-in to Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
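To make the threshold problem concrete: every Nagios check threshold
has to be restated explicitly as an alerting rule over the raw
metrics. A hedged sketch of what that looks like for a classic "disk
almost full" check (the expression, threshold and labels are
illustrative, not an actual TPA rule):

```yaml
# Hypothetical alerting rule restating a Nagios-style "less than 10%
# disk space left" threshold on top of raw node_exporter metrics.
groups:
  - name: disk
    rules:
      - alert: FilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}) < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} is over 90% full"
```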
## Security and risk assessment
<!-- TODO: risk assessment
5. When was the last security review done on the project? What was
the outcome? Are there any security issues currently? Should it
have another security review?
6. When was the last risk assessment done? Something that would cover
risks from the data stored, the access required, etc.
This section didn't exist when the project was launched, so this is
really just second-guessing...
-->
## Technical debt and next steps
<!-- TODO: tech debt
7. Are there any in-progress projects? Technical debt cleanup?
Migrations? What state are they in? What's the urgency? What's the
next steps?
8. What urgent things need to be done on this project?
-->
## Proposed Solutions
### TPA-RFC-33
TODO: document the TPA-RFC-33 history here. see overlap with above
### Munin replacement
The primary Prometheus server was decided [in the Brussels 2019
developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
@@ -2635,15 +2692,7 @@ developer meeting][], before anarcat joined the team ([ticket
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Other alternatives
We considered retaining Nagios/Icinga as an alerting system, separate
from Prometheus, but ultimately decided against it in [TPA-RFC-33][].