Changes

refs: #41655
lelutin · 71b91270
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -2771,16 +2771,24 @@ inspect alerts, and issue silences. It's used in our test suite.

 ## Authentication

-<!-- TODO SSH? LDAP? standalone? -->
+The web interface is accessed via HTTP Basic Authentication. Currently all
+access is done through a single user. We plan to set up one user per person
+before merging the external monitoring server into the main setup.
+
+Polling of the exporters on servers is permitted by IP address, and only
+from the Prometheus server IPs.

 ## Implementation

-<!-- TODO programming languages, frameworks, versions, license -->
+Prometheus and Alertmanager are written in Go and released under the Apache 2.0
+license. We use the versions provided by the Debian package archives in the
+current stable release.

 ## Related services

-<!-- TODO dependent services (e.g. authenticates against LDAP, or requires -->
-<!-- git pushes)  -->
+By design, no other service is required. Some notifications are sent by
+email, and their delivery might rely on Tor email servers, depending on which
+addresses receive the notifications.

 ## Issues

@@ -3007,15 +3015,10 @@ This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.

 ## Technical debt and next steps

-<!-- TODO: tech debt
-
- 7. Are there any in-progress projects? Technical debt cleanup?
-    Migrations? What state are they in? What's the urgency? What's the
-    next steps?
-
- 8. What urgent things need to be done on this project?
+In-progress projects:

-->
+- merging external and internal monitoring servers
+- reimplementing some of the alerts that were in Icinga

 ## Proposed Solutions

@@ -3132,9 +3135,7 @@ Basically, Prometheus is similar to Munin in many ways:
 Near the end of 2024, Icinga was replaced by Prometheus and
 Alertmanager, as part of [TPA-RFC-33][].

-TODO: document a little bit how the actual migration went, along with
-the three stages and milestones. see overlap with Proposed solutions
-above.
+The project was split into three phases, A through C.

 Before Icinga was retired, we performed an audit of the notifications
 sent from Icinga about our services ([#41791][]) to see if we're
@@ -3146,6 +3147,14 @@ by monitoring.

 [#41791]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41791

+In phase B we implemented more alerts, integrated additional metrics that
+some of the new alerts needed, and did a lot of work to ensure that we wouldn't
+get duplicate alerts for the same problem. We also plan to merge the
+external monitoring server in this phase.
+
+Phase C covers setting up high availability between two Prometheus servers,
+each with its own Alertmanager instance, and finishing the implementation of alerts.
+
 #### Prometheus equivalence for Icinga/Nagios checks

 This is an equivalence table between Nagios checks and their