Verified Commit 553139e7 authored by lelutin

prometheus: Fill in the TODO left in the page.

refs: team#41655
parent 71b91270
@@ -2696,7 +2696,19 @@ retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) fo
## Queues
There are a few places where things happen automatically on a schedule in
the monitoring infrastructure:
- Prometheus schedules scrape jobs (pulling metrics) according to rules that can
differ for each scrape job. Each job can define its own `scrape_interval`. The
default is to scrape every 15 seconds, but some jobs are currently configured
to scrape once every minute (see the sketch after this list).
- Each alerting rule group can define its own evaluation interval, and each
rule a delay before it starts firing. See [Adding alerts](#writing-an-alert).
- Prometheus can automatically discover scrape targets through various service
discovery mechanisms. We currently don't make full use of this feature since
our targets are listed in files generated by Puppet, so the discovery refresh
interval does not meaningfully affect our setup.
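
As an illustration, here is a minimal sketch of a per-job `scrape_interval`
override combined with file-based target discovery in `prometheus.yml`. The
job name and file path are hypothetical, not taken from our Puppet-managed
configuration:

```yaml
global:
  scrape_interval: 15s    # default for jobs that do not override it

scrape_configs:
  - job_name: node        # hypothetical job name
    scrape_interval: 1m   # per-job override, as some of our jobs use
    file_sd_configs:
      - files:
          # hypothetical path; our real target files are written by Puppet
          - /etc/prometheus/targets.d/node/*.yaml
```

And a similar sketch of an alerting rule group with its own evaluation
interval and firing delay; the group name, rule and threshold are made up for
the example:

```yaml
groups:
  - name: example         # hypothetical group name
    interval: 1m          # per-group evaluation interval
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m           # how long the condition must hold before firing
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```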
## Interfaces
@@ -3002,16 +3014,12 @@ This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
## Security and risk assessment
No security review has been done yet. The shared password for accessing the
web interface is a challenge; we intend to replace it soon with individual
users.

No risk assessment has been done yet either.
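
One possible shape for those individual users, if we use the built-in
authentication support Prometheus ships since version 2.24, is a web
configuration file passed with `--web.config.file`. This is only a sketch
under that assumption; the user names and hashes below are placeholders, and
our actual deployment may instead terminate authentication in a reverse proxy:

```yaml
# Hypothetical web.yml; passwords are bcrypt hashes, which can be
# generated with: htpasswd -nBC 10 "" | tr -d ':\n'
basic_auth_users:
  alice: "$2y$10$...placeholder-hash..."
  bob: "$2y$10$...placeholder-hash..."
```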
## Technical debt and next steps
@@ -3024,7 +3032,31 @@ In progress projects:
### TPA-RFC-33
TPA's monitoring infrastructure was originally set up with
[Nagios](https://en.wikipedia.org/wiki/Nagios) and [Munin][]. Nagios was
eventually [removed from Debian in 2016][] and replaced with Icinga 1. Munin
somehow "died in a fire" some time before anarcat joined TPA in 2019.
At that point, the lack of trending infrastructure was seen as a serious
problem, so [Prometheus][] and [Grafana][] were [deployed in 2019][] as
a stopgap measure.
A secondary Prometheus server (`prometheus2`) was set up with stronger
authentication for service admins. The rationale was that those
services were more privacy-sensitive and the primary TPA setup
(`prometheus1`) was too open to the public, which could allow for
side-channel attacks.
Those tools have been used for trending ever since, while Icinga was
kept for monitoring.
During the March 2021 hack week, Prometheus' [Alertmanager][] was
deployed on the secondary Prometheus server to provide alerting to the
Metrics and Anti-Censorship teams.
[Munin]: https://en.wikipedia.org/wiki/Munin_(software)
[removed from Debian in 2016]: https://tracker.debian.org/news/818363/removed-351dfsg-22-from-unstable/
[deployed in 2019]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29681
### Munin replacement