prometheus: Fill in some missing parts of documentation
authored Feb 27, 2025 by lelutin
refs: #41655
service/prometheus.md
View page @ 71b91270
...
...
@@ -2771,16 +2771,24 @@ inspect alerts, and issue silences. It's used in our test suite.
## Authentication
<!-- TODO SSH? LDAP? standalone? -->
The web interface is accessed via HTTP Basic Authentication. Currently, all
access goes through a single shared user. We plan to set up one user per person
before merging the external monitoring server into the main setup.
Scraping of the exporters on each server is restricted by IP address: only the
Prometheus server IPs are allowed to poll them.
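As a rough illustration of what HTTP Basic Authentication means for a client, here is a minimal Python sketch that builds (but does not send) an authenticated request to Prometheus' instant-query endpoint, `/api/v1/query`. The host name, user name and password are placeholders, and the sketch assumes the query API sits behind the same Basic Authentication as the web interface, which this page does not state.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(base_url, username, password, promql):
    """Build (without sending) an instant-query request using HTTP Basic Authentication."""
    # /api/v1/query is the Prometheus instant-query HTTP API endpoint.
    query_string = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + query_string
    # Basic auth sends "username:password", base64-encoded, in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    request = build_query_request(
        "https://prometheus.example.org", "shared-user", "secret", "up"
    )
    print(request.full_url)
    # Sending it requires network access and real credentials:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read())
```

Because the credentials are only base64-encoded, not encrypted, a request like this should only ever be sent over HTTPS.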
## Implementation
<!-- TODO programming languages, frameworks, versions, license -->
Prometheus and Alertmanager are written in Go and released under the Apache 2.0
license. We use the versions provided by the Debian package archives for the
current stable release.
## Related services
<!-- TODO dependent services (e.g. authenticates against LDAP, or requires -->
<!-- git pushes) -->
By design, Prometheus requires no other service. Some notifications are sent by
email, so their delivery may rely on Tor email servers, depending on the
recipient addresses.
## Issues
...
...
@@ -3007,15 +3015,10 @@ This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
## Technical debt and next steps
<!-- TODO: tech debt
7. Are there any in-progress projects? Technical debt cleanup?
   Migrations? What state are they in? What's the urgency? What's the
   next steps?
8. What urgent things need to be done on this project?
In progress projects:
-->
- merging external and internal monitoring servers
- reimplementing some of the alerts that were in icinga
## Proposed Solutions
...
...
@@ -3132,9 +3135,7 @@ Basically, Prometheus is similar to Munin in many ways:
Near the end of 2024, Icinga was replaced by Prometheus and
Alertmanager, as part of [TPA-RFC-33][].
TODO: document a little bit how the actual migration went, along with
the three stages and milestones. see overlap with Proposed solutions
above.
The project was split into three phases, A through C.
Before Icinga was retired, we performed an audit of the notifications
sent from Icinga about our services ([#41791][]) to see if we're
...
...
@@ -3146,6 +3147,14 @@ by monitoring.
[#41791]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41791
In phase B, we implemented more alerts, integrated the additional metrics that
some of the new alerts required, and worked to ensure that the same problem
would not trigger duplicate alerts. Merging the external monitoring server is
also planned for this phase.
Phase C covers setting up high availability between two Prometheus servers,
each with its own Alertmanager instance, and finalizing the implementation of
alerts.
#### Prometheus equivalence for Icinga/Nagios checks
This is an equivalence table between Nagios checks and their
...
...
...
...