From fc7dfb0ab0079f8b3828efc7c4734904ffbec5da Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org> Date: Mon, 7 Oct 2024 13:26:47 -0400 Subject: [PATCH] prom: merge with template (tpo/tpa/team#41655) --- service/prometheus.md | 271 +++++++++++++++++++++++++----------------- 1 file changed, 160 insertions(+), 111 deletions(-) diff --git a/service/prometheus.md b/service/prometheus.md index 60cba5ef..30cf288b 100644 --- a/service/prometheus.md +++ b/service/prometheus.md @@ -1096,7 +1096,7 @@ This section details how the alerting setup mentioned above works. Note that the [Icinga][] service is still in service, but it is planned to eventually be shut down and replaced by the Prometheus + -Alertmanager setup ([ticket 29864][]). +Alertmanager setup ([issue 29864][]). In general, the upstream documentation for alerting starts from [the Alerting Overview][] but it can be lacking at times. [This tutorial][] @@ -1111,6 +1111,7 @@ TPA-RFC-33 proposal][]. [This tutorial]: https://ashish.one/blogs/setup-alertmanager/ [alerting system]: https://grafana.torproject.org/alerting/ [Grafana for alerting section of the TPA-RFC-33 proposal]: policy/tpa-rfc-33-monitoring#grafana-for-alerting +[issue 29864]: https://bugs.torproject.org/29864 ### Diagnosing alerting failures @@ -1934,7 +1935,7 @@ changed. The [Alertmanager][] is configured on the external Prometheus server for the metrics and anti-censorship teams to monitor the health of the network. It may eventually also be used to replace or enhance -[Nagios][] ([ticket 29864][]). +[Nagios][] ([issue 29864][]). It is installed through Puppet, in `profile::prometheus::server::external`, but could be moved to its own @@ -2007,76 +2008,11 @@ See also [Adding metrics to applications][], above. [Adding metrics to applications]: #adding-metrics-to-applications -## Monitored services +## Upgrades -Those are the actual services monitored by Prometheus. - -### Internal server (`prometheus1`) - -The "internal" server scrapes all hosts managed by Puppet for -TPA. Puppet installs a [`node_exporter`][] on *all* servers, which -takes care of metrics like CPU, memory, disk usage, time accuracy, and -so on. Then other exporters might be enabled on specific services, -like email or web servers. - -Access to the internal server is fairly public: the metrics there are -not considered to be security sensitive and protected by -authentication only to keep bots away. - -[`node_exporter`]: https://github.com/prometheus/node_exporter - -### External server (`prometheus2`) - -The "external" server, on the other hand, is more restrictive and does -not allow public access. This is out of concern that specific metrics -might lead to timing attacks against the network and/or leak sensitive -information. The external server also explicitly does *not* scrape TPA -servers automatically: it only scrapes certain services that are -manually configured by TPA. - -Those are the services currently monitored by the external server: - - * [`bridgestrap`][] - * [`rdsys`][] - * OnionPerf external nodes' `node_exporter` - * Connectivity test on (some?) bridges (using the - [`blackbox_exporter`][]) - -Note that this list might become out of sync with the actual -implementation, look into [Puppet][] in -`profile::prometheus::server::external` for the actual deployment. - -This separate server was actually provisioned for the anti-censorship -team (see [this comment for background][]). The server was setup in -July 2019 following [#31159][]. 
- -[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics -[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics -[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/ -[Puppet]: howto/puppet -[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114 -[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159 -[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159 - -### Other possible services to monitor - -Many more exporters could be configured. A non-exhaustive list was -built in [ticket #30028][] around launch time. Here we -can document more such exporters we find along the way: - - * [Prometheus Onion Service Exporter][] - "Export the status and - latency of an onion service" - * [`hsprober`][] - similar, but also with histogram buckets, multiple - attempts, warm-up and error counts - * [`haproxy_exporter`][] - -There's also a [list of third-party exporters][] in the Prometheus documentation. - -[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028 -[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/ -[`hsprober`]: https://git.autistici.org/ale/hsprober -[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter -[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/ +<!-- TODO: how upgrades are performed. preferably automated through Debian --> +<!-- packages, otherwise document how upgrades are performed. see also --> +<!-- the Testing section below --> ## SLA @@ -2171,7 +2107,7 @@ using the `matchers` list. Here's an example for the TPA IRC route: - 'team = "TPA"' - 'severity =~ "critical|warning"' -## Pushgateway +### Pushgateway The [Pushgateway][] is a separate server from the main Prometheus server that is designed to "hold" onto metrics for ephemeral jobs that @@ -2179,7 +2115,7 @@ would otherwise be around long enough for Prometheus to scrape their metrics. We use it as a workaround to bridge Metrics data with Prometheus/Grafana. -## Debugging the blackbox exporter +### Debugging the blackbox exporter The [upstream documentation][] has some details that can help. We also have examples [above][] for how to configure it in our setup. @@ -2199,7 +2135,7 @@ things before creating the final configuration for the target. [upstream documentation]: https://github.com/prometheus/blackbox_exporter [above]: #adding-alert-rules -## Alertmanager +### Alertmanager The [Alertmanager][] is a separate program that receives notifications generated by Prometheus servers through an API, groups, and @@ -2261,7 +2197,7 @@ compiler][] which is [not in Debian][]. It can be built by hand using the `debian/generate-ui.sh` script, but only in newer, post buster versions. Another alternative to consider is [Crochet][]. -### Alerting philosophy +#### Alerting philosophy In general, when working on alerting, keeping [the "My Philosophy on Alerting" paper from a Google engineer][] (now the [Monitoring @@ -2311,7 +2247,7 @@ again. The [kthxbye bot][] works around that issue. 
 [Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
 [kthxbye bot]: https://github.com/prymitive/kthxbye
 
-### Alert timing details
+#### Alert timing details
 
 Alert timing can be a hard topic to understand in Prometheus alerting,
 because there are many components associated with it, and Prometheus
@@ -2429,6 +2365,106 @@ notification in a particularly flappy alert][].
 
 [in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
 [mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
+## Services
+
+<!-- TODO: open ports, daemons, cron jobs -->
+
+### Monitored services
+
+These are the actual services monitored by Prometheus.
+
+#### Internal server (`prometheus1`)
+
+The "internal" server scrapes all hosts managed by Puppet for
+TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
+takes care of metrics like CPU, memory, disk usage, time accuracy, and
+so on. Then other exporters might be enabled on specific services,
+like email or web servers.
+
+Access to the internal server is fairly public: the metrics there are
+not considered security-sensitive and are protected by
+authentication only to keep bots away.
+
+[`node_exporter`]: https://github.com/prometheus/node_exporter
+
+#### External server (`prometheus2`)
+
+The "external" server, on the other hand, is more restrictive and does
+not allow public access. This is out of concern that specific metrics
+might lead to timing attacks against the network and/or leak sensitive
+information. The external server also explicitly does *not* scrape TPA
+servers automatically: it only scrapes certain services that are
+manually configured by TPA.
+
+These are the services currently monitored by the external server:
+
+ * [`bridgestrap`][]
+ * [`rdsys`][]
+ * OnionPerf external nodes' `node_exporter`
+ * Connectivity test on (some?) bridges (using the
+   [`blackbox_exporter`][])
+
+Note that this list might become out of sync with the actual
+implementation; look into [Puppet][] in
+`profile::prometheus::server::external` for the actual deployment.
+
+This separate server was actually provisioned for the anti-censorship
+team (see [this comment for background][]). The server was set up in
+July 2019 following [#31159][].
+
+[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
+[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
+[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
+[Puppet]: howto/puppet
+[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
+[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
+[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
+
+#### Other possible services to monitor
+
+Many more exporters could be configured. A non-exhaustive list was
+built in [ticket #30028][] around launch time. Here we
+can document more such exporters we find along the way:
+
+ * [Prometheus Onion Service Exporter][] - "Export the status and
+   latency of an onion service"
+ * [`hsprober`][] - similar, but also with histogram buckets, multiple
+   attempts, warm-up and error counts
+ * [`haproxy_exporter`][]
+
+There's also a [list of third-party exporters][] in the Prometheus documentation.
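+
+As a rough sketch, wiring up one of these exporters boils down to a new
+scrape job in the Prometheus configuration (the job name and target
+below are hypothetical; in practice these scrape jobs are managed
+through Puppet, in profiles like
+`profile::prometheus::server::external`, rather than edited by hand):
+
+    scrape_configs:
+      # hypothetical job for the onion service exporter
+      - job_name: onion-service-exporter
+        static_configs:
+          # hypothetical host:port where the exporter listens
+          - targets: ['localhost:9999']
+
+The same pattern applies to the other exporters listed above, with only
+the job name and target changing.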
+ +[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028 +[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/ +[`hsprober`]: https://git.autistici.org/ale/hsprober +[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter +[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/ + +## Storage + +<!-- TODO databases? plain text file? the frigging blockchain? memory? --> + +## Queues + +<!-- TODO email queues, job queues, schedulers --> + +## Interfaces + +<!-- TODO e.g. web APIs, commandline clients, etc --> + +## Authentication + +<!-- TODO SSH? LDAP? standalone? --> + +## Implementation + +<!-- TODO programming languages, frameworks, versions, license --> + +## Related services + +<!-- TODO dependent services (e.g. authenticates against LDAP, or requires --> +<!-- git pushes) --> + ## Issues There is no issue tracker specifically for this project, [File][new-ticket] or @@ -2475,6 +2511,14 @@ inside TPA. The internal Prometheus server is mostly used by TPA staff to diagnose issues. The external Prometheus server is used by various TPO teams for their own monitoring needs. +## Users + +<!-- TODO who the main users are, how they use the service. possibly reuse --> +<!-- the Personas section in the RFC, if available. --> +<!-- see overlap with above --> + +## Upstream + The upstream Prometheus projects are diverse and generally active as of early 2021. Since Prometheus is used as an ad-hoc standard in the new "cloud native" communities like Kubernetes, it has seen an upsurge @@ -2503,21 +2547,12 @@ details. [Voxpupuli collective]: https://github.com/voxpupuli [upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32 -## Monitoring and testing - -Prometheus doesn't have specific tests, but there *is* a test suite in -the upstream Prometheus Puppet module. +## Monitoring and metrics The server is monitored for basic system-level metrics by Nagios. It also monitors itself for system-level metrics but also application-specific metrics. -## Logs and metrics - -Prometheus servers typically do not generate many logs, except when -errors and warnings occur. They should hold very little PII. The web -frontends collect logs in accordance with our regular policy. - Actual metrics *may* contain PII, although it's quite unlikely: typically, data is anonymized and aggregated at collection time. It would still be able to deduce some activity patterns from the metrics @@ -2533,6 +2568,19 @@ policies. [TPA-RFC-33]: policy/tpa-rfc-33-monitoring +## Tests + +Prometheus doesn't have specific tests, but there *is* a test suite in +the upstream Prometheus Puppet module. + +TODO: merge with alertmanager test stuff + +## Logs + +Prometheus servers typically do not generate many logs, except when +errors and warnings occur. They should hold very little PII. The web +frontends collect logs in accordance with our regular policy. + ## Backups Prometheus servers should be fully configured through Puppet and @@ -2590,7 +2638,7 @@ publicly. [ticket 31159]: https://bugs.torproject.org/31159 It was originally thought Prometheus could completely replace -[Nagios][] as well [ticket 29864][], but this turned out to be more +[Nagios][] as well [issue 29864][], but this turned out to be more difficult than planned. The main difficulty is that Nagios checks come with builtin threshold of acceptable performance. But Prometheus metrics are just that: metrics, without thresholds... 
This makes it @@ -2600,31 +2648,40 @@ functionality built-in to Nagios, like availability reports, acknowledgments and other reports, would need to be re-implemented as well. -## Goals +## Security and risk assessment + +<!-- TODO: risk assessment + + 5. When was the last security review done on the project? What was + the outcome? Are there any security issues currently? Should it + have another security review? + + 6. When was the last risk assessment done? Something that would cover + risks from the data stored, the access required, etc. -This section didn't exist when the project was launched, so this is -really just second-guessing... +--> -### Must have +## Technical debt and next steps - * Munin replacement: long-term trending metrics to predict resource - allocation, with graphing - * Free software, self-hosted - * Puppet automation +<!-- TODO: tech debt -### Nice to have + 7. Are there any in-progress projects? Technical debt cleanup? + Migrations? What state are they in? What's the urgency? What's the + next steps? - * Possibility of eventual Nagios phase-out ([ticket 29864][]) + 8. What urgent things need to be done on this project? - [ticket 29864]: https://bugs.torproject.org/29864 +--> -### Non-Goals +## Proposed Solutions - * Data retention beyond one year +### TPA-RFC-33 -## Approvals required +TODO: document the TPA-RFC-33 history here. see overlap with above -Primary Prometheus server was decided [in the Brussels 2019 +### Munin replacement + +The primary Prometheus server was decided [in the Brussels 2019 developer meeting][], before anarcat joined the team ([ticket 29389][]). Secondary Prometheus server was approved in [meeting/2019-04-08][]. Storage expansion was approved in @@ -2635,15 +2692,7 @@ developer meeting][], before anarcat joined the team ([ticket [meeting/2019-04-08]: meeting/2019-04-08 [meeting/2019-11-25]: meeting/2019-11-25 -## Proposed Solution - -Prometheus was chosen, see also [Grafana][]. - -## Cost - -N/A - -## Alternatives considered +## Other alternatives We considered retaining Nagios/Icinga as an alerting system, separate from Prometheus, but ultimately decided against it in [TPA-RFC-33][]. -- GitLab