[Kubernetes]: https://kubernetes.io/
It does not show that Prometheus can federate to multiple instances
and the Alertmanager can be configured for high availability. We have
a monolithic server setup right now; changing that is planned as part
of [TPA-RFC-33-C][].
### Metrics types
In [monitoring distributed systems][], Google defines 4 "golden
signals", categories of metrics that need to be monitored:
* **Latency**: time to service a request
* **Traffic**: transactions per second or bandwidth
* **Errors**: failure rates, e.g. 500 errors in web servers
* **Saturation**: full disks, memory, CPU utilization, etc.
In the book, they argue all four should issue pager alerts, but we
believe warnings for saturation, except in extreme cases ("disk
actually full"), might be sufficient.
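To make those categories concrete, here is how each signal could be
probed with ad hoc queries against our server. This is only a sketch:
the metric names assume the node exporter and Prometheus' own
self-metrics are scraped, and passing credentials in the URL is an
assumption borrowed from the `curl` example further down.

    # Sketch only: metric names depend on which exporters are actually deployed.
    SERVER="https://$HTTP_USER@prometheus.torproject.org"

    # Latency: 90th percentile of Prometheus' own HTTP request durations
    promtool query instant "$SERVER" \
        'histogram_quantile(0.9, rate(prometheus_http_request_duration_seconds_bucket[5m]))'

    # Traffic: inbound network bandwidth per host
    promtool query instant "$SERVER" 'rate(node_network_receive_bytes_total[5m])'

    # Errors: rate of HTTP 5xx responses served by Prometheus itself
    promtool query instant "$SERVER" 'rate(prometheus_http_requests_total{code=~"5.."}[5m])'

    # Saturation: fraction of disk space still available per filesystem
    promtool query instant "$SERVER" 'node_filesystem_avail_bytes / node_filesystem_size_bytes'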
[monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
### Alertmanager
but it's not deployed in our configuration, we use [Karma][]
(previously Cloudflare's [unsee][]) instead.
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
## Configuration
The Prometheus server is currently configured mostly through Puppet,
where modules define exporters and "exported resources" that get
collected on the central server, which then scrapes those targets.
The [`prometheus-alerts.git` repository][] contains all alerts and
some non-TPA targets, specified in the `targets.d` directory for all
teams.
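As a rough illustration of what adding such a target involves, a file
in `targets.d` can be expected to follow Prometheus' usual static
target format; the file name, job and labels below are made up, not
the actual repository layout:

    # Hypothetical example: declare a scrape target for a non-TPA team.
    # Exact file naming and label conventions are assumptions.
    cat > targets.d/example-team.yaml <<'EOF'
    - targets:
        - example-host.torproject.org:9100
      labels:
        team: example
        job: node
    EOF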
## Services
Prometheus is made of multiple components:
There's also a [list of third-party exporters][] in the Prometheus documentation.
## Interfaces
<!-- TODO e.g. web APIs, commandline clients, etc -->
This system has multiple interfaces. Let's take them one by one.
### Trending: Grafana
Long-term trends are visible in the [Grafana][] dashboards, which tap
into the Prometheus API to graph history. Documentation on that is in
the [Grafana][] wiki page.
### Alerting: Karma
The main alerting dashboard is the [Karma dashboard][], which shows
the currently firing alerts and allows users to silence them.
Technically, alerts are generated by the Prometheus server and relayed
through the Alertmanager server; Karma then taps into the Alertmanager
API to show those alerts. Karma provides the following features:
* Silencing alerts
* Showing alert inhibitions
* Aggregating alerts from multiple Alertmanagers
* Alert groups
* Alert history
* Dead man's switch (an alert that is always firing and signals an
  error when it *stops* firing; see the sketch after this list)
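Such a dead man's switch boils down to an alerting rule that always
evaluates to true. A minimal sketch follows, with an alert name and
labels that are assumptions rather than what `prometheus-alerts.git`
actually contains:

    # Sketch of a "dead man's switch" rule; promtool can validate the syntax.
    cat > deadman.yml <<'EOF'
    groups:
      - name: meta
        rules:
          - alert: PrometheusAlwaysFiring
            expr: vector(1)
            labels:
              severity: warning
            annotations:
              summary: "Dead man's switch: this alert should never stop firing"
    EOF
    promtool check rules deadman.yml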
### Notifications: Alertmanager
We aggressively restrict the kind and number of alerts that will
actually send notifications. This was done mainly by creating two
different alerting levels ("warning" and "critical", above), and
drastically limiting the number of critical alerts.
The basic idea is that the dashboard (Karma) has "everything": alerts
with both "warning" and "critical" levels show up there, and it's
expected to be "noisy". Operators are expected to look at the
dashboard while on rotation for tasks to do. A typical example is
pending reboots, but anomalies like high load on a server or a
partition that will need expanding in a few weeks are also expected.
All notifications are also sent over IRC (`#tor-alerts` on OFTC) and
logged through `tpa_http_post_dump.service`. Operators are expected to
check their email or the IRC channel regularly and act upon those
notifications promptly.
IRC notifications are handled by the [`alertmanager-irc-relay`][].

[`alertmanager-irc-relay`]: https://github.com/google/alertmanager-irc-relay
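To make the severity split concrete, here is a minimal sketch of what
such routing could look like in an Alertmanager configuration; the
receiver names, addresses and webhook URL are assumptions, not our
actual setup:

    # Illustration only: "critical" alerts go to email and IRC, everything
    # else (e.g. "warning") only to the IRC relay webhook.
    cat > alertmanager-sketch.yml <<'EOF'
    global:
      smtp_smarthost: localhost:25
      smtp_from: alertmanager@example.torproject.org
    route:
      receiver: irc
      routes:
        - match:
            severity: critical
          receiver: email-and-irc
    receivers:
      - name: irc
        webhook_configs:
          - url: http://localhost:8000/alerts   # alertmanager-irc-relay endpoint (URL assumed)
      - name: email-and-irc
        email_configs:
          - to: example-admins@torproject.org
        webhook_configs:
          - url: http://localhost:8000/alerts
    EOF
    amtool check-config alertmanager-sketch.yml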
### Command-line
Prometheus has a [`promtool`][] that allows you to query the server
from the command-line, but there's also an
[HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/)
that we can use with `curl`. For example, this shows the hosts with
pending upgrades:
    curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
        "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
        | jq -r .data.result[].metric.alias \
        | grep -v '^null$' | paste -sd,
The output can be passed to a tool like [Cumin][], for example. This
is actually used in the `fleet.pending-upgrades` task to show an
inventory of the pending upgrades across the fleet.
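As a sketch of that idea, the host list could be captured in a
variable and handed to Cumin directly; the exact Cumin query syntax
depends on the configured backend, and the command run on the hosts is
only an example:

    # Hypothetical: run a command on every host reporting pending upgrades.
    hosts=$(curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
        "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
        | jq -r .data.result[].metric.alias | grep -v '^null$' | paste -sd,)
    cumin "$hosts" 'apt list --upgradable'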
[`promtool`]: https://manpages.debian.org/promtool.1
Alertmanager also has an [`amtool`](https://manpages.debian.org/amtool.1)
tool which can be used to inspect alerts and issue silences. It's used
in our test suite.
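For example, something along these lines lists what is currently
firing and silences an alert for a day; the alert name is made up, and
the `--alertmanager.url` value assumes the Alertmanager answers on its
default port on the local host:

    # List currently firing alerts (Alertmanager URL is an assumption).
    amtool --alertmanager.url=http://localhost:9093 alert query

    # Silence a hypothetical alert on one host for 24 hours.
    amtool --alertmanager.url=http://localhost:9093 silence add \
        --comment='reboot scheduled' --duration=24h \
        alertname=NeedsReboot instance=example-host.torproject.org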
## Authentication
The server monitors itself for system-level metrics but also
application-specific metrics. There's a long-term plan for
high-availability in [TPA-RFC-33-C][].

[TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
Metrics are held for about a year or less, depending on the server;
see [ticket 29388][] for storage requirements and possible
* [Prometheus developer blog][]
* [Awesome Prometheus](https://github.com/roaldnefs/awesome-prometheus) list
* [Blue book](https://lyz-code.github.io/blue-book/devops/prometheus/prometheus/) - interesting guide
* [Robust Perception consulting](https://www.robustperception.io/) has a
  [series of blog posts on Prometheus](https://www.robustperception.io/tag/prometheus/)

[Prometheus home page]: https://prometheus.io/
[Prometheus documentation]: https://prometheus.io/docs/introduction/overview/
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet
### Mobile notifications
Like [others][], we do not intend to have an on-call rotation yet, and
will not ring people on their mobile devices at first. After all
exporters have been deployed (priority "C", "nice to have") and alerts
properly configured, we will evaluate the number of notifications that
get sent out. If levels are acceptable (say, once a month or so), we
might implement push notifications during business hours for
consenting staff.
[others]: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlertsAsNotificationsFreedom
We have been advised to avoid Signal notifications as that setup is
often brittle: `signal.org` frequently changes its API, which leads to
silent failures. We might implement [alerts over Matrix][], depending
on what messaging platform gets standardized in the Tor Project.
[alerts over Matrix]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216