Verified Commit 553139e7 authored by lelutin

prometheus: Fill in the TODO left in the page.

refs: team#41655
parent 71b91270
@@ -2696,7 +2696,19 @@ retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) fo
## Queues
There are a few places where things happen automatically on a schedule in
the monitoring infrastructure:
- Prometheus schedules scrape jobs (pulling metrics) according to rules that can
differ for each scrape job. Each job can define its own `scrape_interval`. The
default is to scrape every 15 seconds, but some jobs are currently configured
to scrape once every minute (see the sketch after this list).
- Each alerting rule group can define its own evaluation interval, and each
rule a delay before it starts firing. See [Adding alerts](#writing-an-alert).
- Prometheus can automatically discover scrape targets through various service
discovery mechanisms. We currently don't make full use of this feature since
our targets are listed in files generated by Puppet, so the discovery refresh
interval does not meaningfully affect our setup.
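
As an illustration, here is a minimal sketch of a per-job `scrape_interval`
override combined with file-based target discovery in `prometheus.yml`. The
job name and file path are hypothetical, not taken from our Puppet-managed
configuration:

```yaml
global:
  scrape_interval: 15s    # default for jobs that do not override it

scrape_configs:
  - job_name: node        # hypothetical job name
    scrape_interval: 1m   # per-job override, as some of our jobs use
    file_sd_configs:
      - files:
          # hypothetical path; our real target files are written by Puppet
          - /etc/prometheus/targets.d/node/*.yaml
```

And a similar sketch of an alerting rule group with its own evaluation
interval and firing delay; the group name, rule and threshold are made up for
the example:

```yaml
groups:
  - name: example         # hypothetical group name
    interval: 1m          # per-group evaluation interval
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m           # how long the condition must hold before firing
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```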
## Interfaces
@@ -3002,16 +3014,12 @@ This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
## Security and risk assessment
No security review has been done yet. The shared password for accessing the
web interface is a challenge; we intend to replace it soon with individual
users.

No risk assessment has been done yet either.
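
One possible shape for those individual users, if we use the built-in
authentication support Prometheus ships since version 2.24, is a web
configuration file passed with `--web.config.file`. This is only a sketch
under that assumption; the user names and hashes below are placeholders, and
our actual deployment may instead terminate authentication in a reverse proxy:

```yaml
# Hypothetical web.yml; passwords are bcrypt hashes, which can be
# generated with: htpasswd -nBC 10 "" | tr -d ':\n'
basic_auth_users:
  alice: "$2y$10$...placeholder-hash..."
  bob: "$2y$10$...placeholder-hash..."
```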
## Technical debt and next steps
@@ -3024,7 +3032,31 @@ In progress projects:
### TPA-RFC-33
TPA's monitoring infrastructure was originally set up with
[Nagios](https://en.wikipedia.org/wiki/Nagios) and [Munin][]. Nagios was
eventually [removed from Debian in 2016][] and replaced with Icinga 1. Munin
somehow "died in a fire" some time before anarcat joined TPA in 2019.
At that point, the lack of trending infrastructure was seen as a serious
problem, so [Prometheus][] and [Grafana][] were [deployed in 2019][] as
a stopgap measure.
A secondary Prometheus server (`prometheus2`) was set up with stronger
authentication for service admins. The rationale was that those
services were more privacy-sensitive and the primary TPA setup
(`prometheus1`) was too open to the public, which could allow for
side-channel attacks.
Those tools have been used for trending ever since, while Icinga was
kept for monitoring.
During the March 2021 hack week, Prometheus' [Alertmanager][] was
deployed on the secondary Prometheus server to provide alerting to the
Metrics and Anti-Censorship teams.
[Munin]: https://en.wikipedia.org/wiki/Munin_(software)
[removed from Debian in 2016]: https://tracker.debian.org/news/818363/removed-351dfsg-22-from-unstable/
[deployed in 2019]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29681
### Munin replacement