[Prometheus][] is a monitoring system that is designed to process a
large number of metrics, centralize them on one (or multiple) servers
and serve them with a well-defined API. That API is queried through a
domain-specific language (DSL) called "PromQL" or "Prometheus Query
Language". Prometheus also supports basic graphing capabilities
although those are limited enough that we use a separate graphing
layer on top (see [Grafana][]).
[Prometheus]: https://prometheus.io/
[Grafana]: howto/grafana
[[_TOC_]]
# Tutorial
## Web dashboards
The main Prometheus web interface is available at:
<https://prometheus.torproject.org>
A simple query you can try is to pick any metric in the list and click
`Execute`. For example, [this link][] will show the 5-minute load
over the last two weeks for the known servers.
[this link]: https://prometheus1.torproject.org/graph?g0.range_input=2w&g0.expr=node_load5&g0.tab=0
The Prometheus web interface is crude: it's better to use [Grafana][]
dashboards for most purposes other than debugging.
It also shows alerts, but for that, there are better dashboards, see
below.
### Alerting dashboards
There are a couple of web interfaces to see alerts in our setup:
* [Karma dashboard][] - our primary view on
  currently firing alerts. The alerts are grouped by labels.
  * This web interface only shows what's current, not some form of
    alert history.
  * Shows links to "run books" related to alerts
* [Grafana availability dashboard][] - drills down into alerts and,
  more importantly, shows their past values.
* [Prometheus' Alerts dashboard][] - shows all alerting rules and which
  file they are from
  * Also contains links to graphs based on alerts' PromQL expressions
Normally, all rules are defined in the [`prometheus-alerts.git`
repository][]. Another view of this is the [rules configuration
dump][] which also shows when the rule was last evaluated and how long
it took.
Each alert should have a URL to a "run book" in its annotations, typically a link
to this very wiki, in the "Pager playbook" section, which shows how to handle
any particular outage. If it's not present, it's a bug and can be filed as such.
[Karma dashboard]: https://karma.torproject.org
[Grafana availability dashboard]: https://grafana.torproject.org/d/adwbl8mxnaneoc/availability
[Prometheus' Alerts dashboard]: https://prometheus.torproject.org/classic/alerts
[`prometheus-alerts.git` repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
[rules configuration dump]: https://prometheus.torproject.org/classic/rules
## Adding metrics to applications
If you want your service to be monitored by Prometheus, you need to
[write][] or [reuse an existing exporter][]. [Writing an
exporter][] is more involved, but still fairly easy and might be
necessary if you are the maintainer of an application not already
instrumented for Prometheus.
[Writing an exporter]: https://prometheus.io/docs/instrumenting/writing_exporters/
The [actual documentation][Writing an exporter] is fairly good, but basically: a
Prometheus exporter is a simple HTTP server which responds to a
specific HTTP URL (`/metrics`, by convention, but it can be
anything). It responds with a key/value list of entries, one on each
line, in a simple text format more or less following the
[OpenMetrics][] standard.
Each "key" is a simple string with an arbitrary list of "[labels][]"
enclosed in curly braces. The [value][] is a float or integer.
For example, here's how the "node exporter" exports CPU usage:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74
Note that the `HELP` and `TYPE` lines look like comments, but they are
actually important, and misusing them will lead to the metric being
ignored by Prometheus.
Also note that Prometheus's [actual support for OpenMetrics][] varies
across the ecosystem. It's better to rely on Prometheus' documentation
than OpenMetrics when writing metrics for Prometheus.
Obviously, you don't necessarily have to write all that logic
yourself, however: there are [client libraries][] (see the [Golang
guide][], [Python demo][] or [C documentation][] for examples) that
do most of the job for you.
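For illustration, here is a minimal sketch of an exporter using the Python client library (assuming the `python3-prometheus-client` package is installed; the metric name, file path and port are made up for the example):

```
cat > /tmp/demo_exporter.py <<'EOF'
from prometheus_client import Counter, start_http_server
import time

# a counter metric, following the *_total naming convention
DEMO = Counter('demo_requests_total', 'Example counter incremented every second')

start_http_server(9999)  # serve /metrics on this (arbitrary) port
while True:
    DEMO.inc()
    time.sleep(1)
EOF
python3 /tmp/demo_exporter.py &
sleep 1
curl -s http://localhost:9999/metrics | grep demo_requests_total
```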
In any case, you should be careful about the names and labels of the
metrics. See the [metric and label naming best practices][].
Once you have an exporter endpoint (say at
`http://example.com:9090/metrics`), make sure it works:
curl http://example.com:9090/metrics
This should return a number of metrics that change (or not) at each
call. Note that there's a [registry of official Prometheus export port
numbers][] that should be respected, but [it's full][] (oops).
From there on, provide that endpoint to the sysadmins (or someone with
access to the external monitoring server), who will follow the
procedure below to add the metric to Prometheus.
Once the exporter is hooked into Prometheus, you can browse the
metrics directly at: <https://prometheus.torproject.org>. Graphs
should be available at <https://grafana.torproject.org>, although
those need to be created and committed into git by sysadmins to
persist, see the [`grafana-dashboards.git` repository][] for more
information.
[write]: https://prometheus.io/docs/instrumenting/writing_exporters/
[reuse an existing exporter]: https://prometheus.io/docs/instrumenting/exporters/
[labels]: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#label
[value]: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#values
[actual support for OpenMetrics]: https://github.com/prometheus/prometheus/issues/14762
[client libraries]: https://prometheus.io/docs/instrumenting/clientlibs/
[Golang guide]: https://prometheus.io/docs/guides/go-application/
[Python demo]: https://github.com/prometheus/client_python#three-step-demo
[C documentation]: https://digitalocean.github.io/prometheus-client-c/
[metric and label naming best practices]: https://prometheus.io/docs/practices/naming/
[registry of official Prometheus export port numbers]: https://github.com/prometheus/prometheus/wiki/Default-port-allocations
[it's full]: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusExportersFixedPorts
[`grafana-dashboards.git` repository]: https://gitlab.torproject.org/tpo/tpa/grafana-dashboards
## Adding scrape targets
"Scrape targets" are remote endpoints that Prometheus "scrapes" (or
fetches content from) to get metrics.
There are two ways of adding scrape targets, depending on whether or not you
have access to the Puppet server.
### Adding metrics through the git repository
People outside of TPA without access to the Puppet server can
contribute targets through a repo called
[`prometheus-alerts.git`][]. To add a scrape target:
1. Clone the repository, if not done already:
git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
cd prometheus-alerts
2. Assuming you're adding a node exporter, to add the target:
cat > targets.d/node_myproject.yaml <<EOF
# scrape the external node exporters for project Foo
---
- targets:
    - targetone.example.com
    - targettwo.example.com
EOF
3. Add, commit, and push:
git checkout -b myproject
git add targets.d
git commit -m"add node exporter targets for my project"
git push origin -u myproject
The last push command should show you the URL where you can submit
your merge request.
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus automatically reloads those rules when they are
deployed.
See also the [`targets.d` documentation in the git repository][].
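Once the change is deployed, you can confirm the new targets were picked up by querying the Prometheus targets API (a sketch reusing the example hostnames above, and assuming you have HTTP authentication credentials for the server in the `HTTP_USER` environment variable):

```
curl -s "https://$HTTP_USER@prometheus.torproject.org/api/v1/targets?state=active" \
  | jq -r '.data.activeTargets[].labels.instance' \
  | grep targetone.example.com
```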
[4 to 6 hours]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/puppet/#cron-and-scheduling
[`targets.d` documentation in the git repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/targets.d
[`prometheus-alerts.git`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
### Adding metrics through Puppet
TPA-managed services should define their scrape jobs, and thus targets, via
puppet profiles.
To add a scrape job in a puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.
Here is, for example, how the gitlab runners are scraped:
```
# tell Prometheus to scrape the exporter
@@prometheus::scrape_job { "gitlab-runner_${facts['networking']['fqdn']}_9252":
  job_name => 'gitlab_runner',
  targets => [ "${facts['networking']['fqdn']}:9252" ],
  labels => {
    'alias' => $facts['networking']['fqdn'],
    'team' => 'TPA',
  },
}
```
The `job_name` (`gitlab_runner` above) needs to be added to the
`profile::prometheus::server::internal::collect_scrape_jobs` list in
`hiera/common/prometheus.yaml`, for example:
```
profile::prometheus::server::internal::collect_scrape_jobs:
  # [...]
  - job_name: 'gitlab_runner'
  # [...]
```
Note that you will likely need a firewall rule to poke a hole for the
exporter:
# grant Prometheus access to the exporter, activated with the
# listen_address parameter above
Ferm::Rule <<| tag == 'profile::prometheus::server-gitlab-runner-exporter' |>>
That rule, in turn, is defined with the
`profile::prometheus::server::rule` define, in
`profile::prometheus::server::internal`, like so:
profile::prometheus::server::rule {
  # [...]
  'gitlab-runner': port => 9252;
  # [...]
}
In another example, to configure the ssh scrape jobs (in
`modules/profile/manifests/ssh.pp`), the scrape job is created with:
@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
job_name => 'blackbox_ssh_banner',
targets => [ "${facts['networking']['fqdn']}:22" ],
labels => {
'alias' => $facts['networking']['fqdn'],
'team' => 'TPA',
},
}
But because this is a blackbox exporter, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the blackbox exporter work:
- job_name: 'blackbox_ssh_banner'
  metrics_path: '/probe'
  params:
    module:
      - 'ssh_banner'
  relabel_configs:
    - source_labels:
        - '__address__'
      target_label: '__param_target'
    - source_labels:
        - '__param_target'
      target_label: 'instance'
    - target_label: '__address__'
      replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:
profile::prometheus::server::external::scrape_configs:
  # generic blackbox exporters from any team
  - job_name: blackbox
    metrics_path: "/probe"
    params:
      module:
        - http_2xx
    file_sd_configs:
      - files:
          - "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
Some scrape jobs can be simpler and not require the relabeling part. In the
above case, the relabeling is done since the exporter runs on the Prometheus
server itself instead of the actual target.
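To check what such a probe returns, you can query the blackbox exporter directly on the Prometheus server (a sketch: the `ssh_banner` module comes from the configuration above, and the target host here is hypothetical):

```
curl -s 'http://localhost:9115/probe?module=ssh_banner&target=example.torproject.org:22' \
  | grep probe_success
```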
Targets for scrape jobs defined in Hiera are however not managed by
puppet. They are defined through files in the [`prometheus-alerts.git`
repository][]. See the section below for more details on how things
are maintained there. In the above example, we can see that targets
are obtained via files on disk. The [`prometheus-alerts.git`
repository][] is cloned in `/etc/prometheus-alerts` on the Prometheus
servers.
Note: we currently have a handful of `blackbox-exporter`-related targets for TPA
services, namely for the HTTP checks. We intend to move those into puppet
profiles whenever possible.
#### Manually adding targets in Puppet
Normally, services configured in Puppet SHOULD automatically be
scraped by Prometheus (see above). If, however, you need to manually
configure a service, you *may* define extra jobs in the
`$scrape_configs` array, in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is not fully managed by Puppet
(e.g. [tpo/tpa/gitlab#20][], and other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:
$scrape_configs =
  [
    {
      'job_name' => 'gitaly',
      'static_configs' => [
        {
          'targets' => [
            'gitlab-02.torproject.org:9236',
          ],
          'labels' => {
            'alias' => 'Gitaly-Exporter',
          },
        },
      ],
    },
    [...]
  ]
Ideally, though, those would be configured as automatic targets, as described below.
Metrics for the internal server are scraped automatically if the
exporter is configured by the [`puppet-prometheus`][] module. This is
done almost automatically, apart from the need to open a firewall port
in our configuration.
Take the `apache_exporter` as an example: in
`profile::prometheus::apache_exporter`, we include the
`prometheus::apache_exporter` class from the upstream Puppet module,
then open the exporter's port to the Prometheus server with:
Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>
Those rules are declared on the server, in `profile::prometheus::server::internal`.
[tpo/tpa/gitlab#20]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
[`puppet-prometheus`]: https://github.com/voxpupuli/puppet-prometheus/
# How-to
## Queries cheat sheet
Some handy queries I often find myself looking for and forgetting.
### Availability
Those are almost all visible from the [availability dashboard][].
[Currently firing alerts][]:
ALERTS{alertstate="firing"}
[Unreachable hosts][] (technically, unavailable node exporters):
up{job="node"} != 1
[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)][]:
avg(avg_over_time(up{job="node"}[30d]))
[How many hosts are online at any given point in time][]:
sum(count(up==1))/sum(count(up)) by (alias)
[How long did an alert fire over a given period of time][], in seconds per
day:
sum_over_time(ALERTS{alertname="MemFullSoon"}[1d:1s])
[availability dashboard]: https://grafana.torproject.org/d/adwbl8mxnaneoc/availability?var-alertstate=All
[Currently firing alerts]: https://prometheus.torproject.org/graph?g0.expr=ALERTS{alertstate%3D"firing"}
[Unreachable hosts]: https://prometheus.torproject.org/graph?g0.expr=up{job%3D"node"}+!%3D+1
[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)]: https://prometheus.torproject.org/graph?g0.expr=avg(avg_over_time(up{job%3D"node"}[30d]))
[How many hosts are online at any given point in time]: https://prometheus.torproject.org/graph?g0.expr=sum(count(up%3D=1))/sum(count(up))+by+(alias)
[How long did an alert fire over a given period of time]: https://prometheus.torproject.org/graph?g0.expr=sum_over_time(ALERTS{alertname%3D"MemFullSoon"}[1d:1s])
### Disk usage
This is a less strict version of the [`DiskWillFillSoon` alert][],
see also the [disk usage dashboard][].
[Find disks that will be full in 6 hours][]:
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
[Find disks that will be full in 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
### Inventory
Those are visible in the [main Grafana dashboard][].
[Number of machines][]:
count(up{job="node"})
[Number of machines per OS version][]:
count(node_os_info) by (version_id, version_codename)
[Number of machines per exporter, or technically, number of machines per job][]:
sort_desc(sum(up{job=~"$job"}) by (job))
[Number of CPU cores, memory size, filesystem and LVM sizes][]:
count(node_cpu_seconds_total{classes=~"$class",mode="system"})
sum(node_memory_MemTotal_bytes{classes=~"$class"}) by (alias)
sum(node_filesystem_size_bytes{classes=~"$class"}) by (alias)
sum(node_volume_group_size{classes=~"$class"}) by (alias)
See also the [CPU][], [memory][], and [disk][] dashboards.
[Uptime, in days][]:
round((time() - node_boot_time_seconds) / (24*60*60))
[Number of machines]: https://prometheus.torproject.org/graph?g0.expr=count(up{job%3D"node"})
[Number of machines per OS version]: https://prometheus.torproject.org/graph?g0.expr=count(node_os_info)+by+(version_id,+version_codename)
[Number of machines per exporter, or technically, number of machines per job]: https://prometheus.torproject.org/graph?g0.expr=sort_desc(sum(up{job%3D~\"$job\"})+by+(job)
[Number of CPU cores, memory size, filesystem and LVM sizes]: https://prometheus.torproject.org/graph?g0.expr=count(node_cpu_seconds_total{classes%3D~\"$class\",mode%3D\"system\"})
[Uptime, in days]: https://prometheus.torproject.org/graph?g0.expr=round((time()+-+node_boot_time_seconds)+/+(24*60*60))
[main Grafana dashboard]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview
[CPU]: https://grafana.torproject.org/d/gex9eLcWz/cpu-usage
[memory]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage
[disk]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-24h&to=now&var-class=All&var-node=All
### Running commands on hosts matching a PromQL query
Say you have an alert or situation (e.g. high load) affecting multiple
servers. Say, for example, that you have some issue that you fixed in
Puppet that will clear such an alert, and want to run Puppet on all
affected servers.
You can use the [Prometheus JSON API][] to return the list of hosts
matching the query (in this case `up < 1`) and run commands (in
this case, Puppet, or `patc`) with [Cumin][]:
cumin "$(curl -sSL --data-urlencode='up < 1' 'https://$HTTP_USER@prometheus.torproject.org/api/v1/query | jq -r .data.result[].metric.alias | grep -v '^null$' | paste -sd,)" 'patc'
Make sure to populate the `HTTP_USER` environment variable to authenticate with
the Prometheus server.
[Prometheus JSON API]: https://prometheus.io/docs/prometheus/latest/querying/api/
[Cumin]: howto/cumin
## Alerting
We are now using Prometheus for alerting for TPA services. Here's a basic
overview of how things interact around alerting:
1. Prometheus is configured to create alerts on certain conditions on metrics.
   * When the PromQL expression produces a result, an alert is created in state
     `pending`.
   * If the PromQL expression keeps producing a result for the whole `for`
     duration configured in the alert, the alert changes to state `firing` and
     Prometheus then sends the alert to one or more Alertmanager instances.
2. Alertmanager receives alerts from Prometheus and is responsible for routing
   the alert to the appropriate channels. For example:
   * A team's or service operator's email address
   * TPA's IRC channel for alerts, `#tor-alerts`
3. Karma and Grafana read alert data from Alertmanager and display it in a
   way that can be used by humans.
Currently, the secondary Prometheus server (`prometheus2`) reproduces this setup
specifically for sending out alerts to other teams with metrics that are not
made public.
This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still running, but it
is planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([ticket 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][] but it can be lacking at times. [This tutorial][]
can be quite helpful in better understanding how things are working.
Note that Grafana also has its own [alerting system][] but we are
_not_ using that, see the [Grafana for alerting section of the
TPA-RFC-33 proposal][].
[Icinga]: howto/nagios
[ticket 29864]: https://bugs.torproject.org/29864
[the Alerting Overview]: https://prometheus.io/docs/alerting/latest/overview/
[This tutorial]: https://ashish.one/blogs/setup-alertmanager/
[alerting system]: https://grafana.torproject.org/alerting/
[Grafana for alerting section of the TPA-RFC-33 proposal]: policy/tpa-rfc-33-monitoring#grafana-for-alerting
### Writing alerting rules
TODO
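In the meantime, here is a minimal, hypothetical sketch of what a rule file in `rules.d` could look like, modeled on the `JobDown` rule shown elsewhere on this page (the alert name, expression, labels and annotations are made up; check the `rules.d` documentation in the `prometheus-alerts.git` repository for the actual conventions):

```
cat > rules.d/example_team.rules <<'EOF'
groups:
  - name: example
    rules:
      - alert: ExampleJobDown
        expr: up{team="example"} < 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
          description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
          playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus'
EOF
# check the syntax before committing
promtool check rules rules.d/example_team.rules
```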
### Writing a playbook
Every alert in Prometheus *must* have a playbook annotation. This is,
if done well, a URL pointing at a service page like this one,
typically in the `Pager playbook` section, which explains how to deal
with the alert.
The playbook *must* include those things:
1. the actual code name of the alert (e.g. `JobDown` or
`DiskWillFillSoon`)
2. an example of the alert output (e.g. `Exporter job gitlab_runner
on tb-build-02.torproject.org:9252 is down`)
3. why this alert triggered, what is its impact
4. optionally, how to reproduce the issue
5. how to fix it
How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
broken:
- Where do you think the error will be visible?
- Can we `curl` something to see it happening?
- Is there a dashboard where you can see trends?
- Is there a specific Prometheus query to run live?
- Which log file can we inspect?
- Which systemd service is running it?
The "how to fix it" can be a simple one line, or it can go into a
multiple case example of scenarios that were found in the wild. It's
the hard part: sometimes, when you make an alert, you don't actually
*know* how to handle the situation. If so, explicitly state that
problem in the playbook, and say you're sorry, and that it should be
fixed.
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.
A good example of a proper playbook is the [Textfile collector errors
playbook here][]. It has all of the above points, including actual
fixes for different actual scenarios.
Here's a template to get started:
```
### Foo errors
The `FooAlert` looks like this:
Service Foo has too many errors on test.torproject.org
It means that the service Foo is having some kind of trouble. [Explain
why this happened, what the impact is, and what it means for which
users. Are we losing money, data, exposing users, etc.]
[Optional] You can tell this is a real issue by going to place X and
trying Y.
[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].
[Optional] We do not yet exactly know how to fix this issue, sorry. Please
document here how you fixed it next time.
```
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors
### Adding alerting rules
Adding an alert mostly consists of defining an alerting rule that
matches on a PromQL expression, in a Git repository.
But it already assumes some metrics are available and scraped by
Prometheus. For this, ensure you have followed the tutorials [Adding
metrics to applications][] and [Adding scrape targets][].
[Adding scrape targets]: #adding-scrape-targets
The Prometheus servers regularly pull the [`prometheus-alerts.git`
repository][] for alerting rule and target definitions. Alert rules
can be added through the repository by adding a file in the `rules.d`
directory; see the [`rules.d`][] directory for more documentation on that.
[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
consequence of the file changes. TPA members can accelerate this by
running Puppet on the Prometheus servers, or pulling the code and
reloading the Prometheus server with:
git -C /etc/prometheus-alerts/ pull
systemctl reload prometheus
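After the reload, you can confirm that the new rule was actually loaded by querying the rules API on the Prometheus server (a sketch; `ExampleJobDown` is a hypothetical rule name):

```
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[].name' \
  | grep ExampleJobDown
```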
### Diagnosing alerting failures
Normally, alerts should fire on the Prometheus server and be sent out
to the Alertmanager server, and be visible in Karma. See also the
[alert routing details reference][].
If you're not sure alerts are working, head to the Prometheus
dashboard and look at the `/alerts` and `/rules` pages. For example:

* <https://prometheus.torproject.org/alerts> - should show the configured alerts,
and if they are firing
* <https://prometheus.torproject.org/rules> - should show the configured rules,
and whether they match
Typically, the Alertmanager address (currently
<http://localhost:9093>, but to be [exposed][]) should also be useful
to manage the Alertmanager, but in practice the Debian package does
not ship the web interface, so it's of limited use in that
regard. See the `amtool` section below for more information.
Note that the [`/targets`][] URL is also useful to diagnose problems
with exporters, in general, see also the [troubleshooting section][]
below.
If you can't access the dashboard at all or if the above seems too
complicated, [Grafana][] can be used as a debugging tool for metrics
as well. In the [Explore](https://grafana.torproject.org/explore) section, you can input Prometheus
metrics, with auto-completion, and inspect the output directly.
There's also the [Grafana availability dashboard][], see the [Alerting
dashboards][] section for details.
[troubleshooting section]: #troubleshooting-missing-metrics
[alert routing details reference]: #alert-routing-details
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool][] command. A few useful commands:
* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
--comment="working on it" ALERTNAME`: silence alert ALERTNAME for
an hour, with some comments
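For example, to review and then remove an existing silence (a sketch; the silence ID below is made up, take it from the output of the first command):

```
# list current silences and their IDs
amtool silence query
# expire (remove) a given silence, by ID
amtool silence expire 7d8eb77e-00f9-4e0c-9f9e-f1778d0a12c3
```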
[amtool]: https://manpages.debian.org/amtool.1
### Checking alert history
Note that all alerts sent through the Alertmanager are dumped in
system logs, through a first "fall through" web hook route:
```
routes:
  # dump *all* alerts to the debug logger
  - receiver: 'tpa_http_post_dump'
    continue: true
```
The receiver is configured below:
```
- name: 'tpa_http_post_dump'
  webhook_configs:
    - url: 'http://localhost:8098/'
```
This URL, in turn, runs a simple Python script that just dumps to
standard output all POST requests it receives, which provides us with,
basically, a JSON log of all notifications sent through the
Alertmanager. All logged entries since last boot can be seen with:
journalctl -u tpa_http_post_dump.service -b
You can see a prettier version of recent entries with the `jq`
command, for example:
journalctl -u tpa_http_post_dump.service -o cat -e | grep '^{' | jq -s .[].alerts
Note that the `grep` is required because `journalctl` insists on
bundling supervisor messages in its output, so we filter for JSON
objects, basically.
### Testing alerts
Prometheus can run unit tests for your defined alerts. See [upstream unit test
documentation][].
We managed to build a minimal unit test for an alert. Note that for a unit test
to succeed, the test must match _all_ the labels and annotations of the expected
alerts, including ones that are added by `rewrite` in Prometheus:
```yaml
root@hetzner-nbg1-02:~/tests# cat tpa_system.yml
rule_files:
  - /etc/prometheus-alerts/rules.d/tpa_system.rules
evaluation_interval: 1m
tests:
  # NOTE: interval is *necessary* here. contrary to what the documentation
  # shows, leaving it out will not default to the evaluation_interval set
  # above
  - interval: 1m
    # Set of fixtures for the tests below
    input_series:
      - series: 'node_reboot_required{alias="NetworkHealthNodeRelay",instance="akka.0x90.dk:9100",job="relay",team="network"}'
        # that's "one" for 60 samples, or 60 minutes
        values: '1x60'
    alert_rule_test:
      # NOTE: eval_time is the offset from 0s at which the alert should be
      # evaluated. if it is shorter than the alert's `for` setting, you will
      # have some missing values for a while (which might be something you
      # need to test?). You can play with the eval_time in other test
      # entries to evaluate the same alert at different offsets in the
      # timeseries above.
      - eval_time: 60m
        alertname: NeedsReboot
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: akka.0x90.dk:9100
              job: relay
              team: network
              alias: "NetworkHealthNodeRelay"
            exp_annotations:
              description: "Found pending kernel upgrades for host NetworkHealthNodeRelay"
              playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades#reboots"
              summary: "Host NetworkHealthNodeRelay needs to reboot"
```
The success result:
```
root@hetzner-nbg1-01:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
SUCCESS
```
A failing test will show you what alerts were obtained and how they compare to
what your failing test was expecting:
```
root@hetzner-nbg1-02:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
FAILED:
alertname: NeedsReboot, time: 10m,
exp:[
0:
Labels:{alertname="NeedsReboot", instance="akka.0x90.dk:9100", job="relay", severity="warning", team="network"}
Annotations:{}
],
got:[]
```
The above allows us to confirm that, under a specific set of circumstances (the
defined series), a specific query will generate a specific alert with a given
set of labels and annotations.
Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the alertmanager
configuration with:
amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
Or really, what matters in most cases are `severity` and `team`, so
this also works, and gives out the proper route:
amtool config routes test severity="warning" team="network" ; echo $?
Example:
root@hetzner-nbg1-02:~/tests# amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
network team
Ignore the warning, it's the difference between testing the live
server and the local configuration. Naturally, you can test what
happens if the `team` label is missing or incorrect, to confirm
[default route errors][]:
root@hetzner-nbg1-02:~/tests# amtool config routes test severity="warning" team="networking"
fallback
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).
Note that you can also deliver an alert to a webhook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:
curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
[upstream unit test documentation]: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
[default route errors]: #default-route-errors
## Advanced metrics ingestion
This section documents more advanced metrics injection topics that we
rarely need or use.
### Backfilling
Starting with version 2.24, Prometheus [now supports][]
[backfilling][]. This is untested, but [this guide][] might provide a
good tutorial.
[now supports]: https://github.com/prometheus/prometheus/issues/535
[backfilling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling
### Push metrics to the Pushgateway
The [Pushgateway][] is set up on the secondary Prometheus server
(`prometheus2`). Note that you might not need to use the Pushgateway;
see the [article about pushing metrics][] before going down this
route.
The Pushgateway is fairly particular: it listens on port 9091 and gets
data through a fairly simple [curl-friendly commandline][] [API][]. We
have found that, once installed, this command just "does the right
thing", more or less:
echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest
To confirm the data was injected into the Pushgateway, run:
curl localhost:9091/metrics | head
The Pushgateway is scraped, like other Prometheus jobs, every minute,
with metrics kept for a year, at the time of writing. This is
configured, inside Puppet, in `profile::prometheus::server::external`.
Note that it's [not possible to push timestamps][] into the
Pushgateway, so it's not useful to ingest past historical data.
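Metrics pushed by mistake stay in the Pushgateway until they are deleted or the service is restarted without persistence. As a sketch, this deletes the example group pushed above (same `job` and `instance` labels); note that samples already scraped remain in Prometheus:

```
curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest
```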
[article about pushing metrics]: https://prometheus.io/docs/practices/pushing/
[curl-friendly commandline]: https://github.com/prometheus/pushgateway#command-line
[API]: https://github.com/prometheus/pushgateway#api
[not possible to push timestamps]: https://github.com/prometheus/pushgateway#about-timestamps
### Deleting metrics
Deleting metrics can be done through the Admin API. That first needs
to be enabled in `/etc/default/prometheus`, by adding
`--web.enable-admin-api` to the `ARGS` list, then Prometheus needs to
be restarted:
service prometheus restart
WARNING: make sure there is authentication in front of Prometheus
because this could expose the server to more destruction.
Then you need to issue a special query through the API. This, for
example, will wipe all metrics associated with the given instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}'
The same, but only for about an hour, good for testing that only the
wanted metrics are destroyed:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&start=2021-10-25T19:00:00Z&end=2021-10-25T20:00:00Z'
To match only a job on a specific instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&match[]={job="gitlab"}'
Deleted metrics are not necessarily immediately removed from disk but
are "eligible for compaction". Changes *should* show up immediately in
queries, however. The "Clean Tombstones" endpoint should be used to remove
samples from disk, if that's absolutely necessary:
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
Make sure to disable the Admin API when done.
## Pager playbook
This section documents alerts and issues with the Prometheus service
itself. Do *NOT* document *all* alerts possibly generated by
Prometheus here! Document those in the individual service pages, and
link to them in the alert's `playbook` annotation.

What belongs here are only alerts that truly don't have any other place
to go, or that are completely generic to any service (e.g. `JobDown`
belongs here). Generic operating system issues like "disk
full" or else *must* be documented elsewhere.
### Troubleshooting missing metrics
If metrics do not correctly show up in Grafana, it might be worth
checking in the [Prometheus dashboard][] itself for the same
metrics. Typically, if they do not show up in Grafana, they won't show
up in Prometheus either, but it's worth a try, even if only to see the
raw data.
Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the [`/targets`][]
listing. If the target is "unhealthy", it will be marked in red and an
error message will show up.
[`/targets`]: https://prometheus.torproject.org/targets
If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
from the host `gayi`:
curl -s http://gayi.torproject.org:9117/metrics | grep apache
In the case of [this bug][], the metrics were not showing up at all:
root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0
Notice, however, the `apache_exporter_scrape_failures_total`, which
was incrementing. From there, we reproduced the work the exporter was
doing manually and fixed the issue, which involved passing the correct
argument to the exporter.
[Prometheus dashboard]: https://prometheus.torproject.org/
[this bug]: https://github.com/voxpupuli/puppet-prometheus/pull/541
### Slow startup times
If Prometheus takes a long time to start, and floods logs with lines
like this every second:
Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196
... it's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:
Nov 01 19:43:04 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:04.533Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=314.859946ms wal_replay_duration=1m16.079474672s total_replay_duration=1m16.396139067s
The solution for this is to use the [memory-snapshot-on-shutdown
feature flag][], but that is available only from 2.30.0 onward (not
in Debian bullseye), and there are critical bugs in the feature flag
before 2.34 (see [PR 10348][]), so tread carefully.
In other words, this is frustrating, but expected for older releases
of Prometheus. Newer releases may have optimizations for this, but
they need a restart to apply.
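As a sketch, on a release that does support the feature flag, enabling it would mean adding it to the daemon arguments in `/etc/default/prometheus` (preserving whatever arguments are already there) and restarting:

```
# in /etc/default/prometheus, append the flag to the existing ARGS, for example:
ARGS="--enable-feature=memory-snapshot-on-shutdown"
# then restart so the flag takes effect on the next (slow) startup
systemctl restart prometheus
```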
[memory-snapshot-on-shutdown feature flag]: https://prometheus.io/docs/prometheus/latest/feature_flags/#memory-snapshot-on-shutdown
[PR 10348]: https://github.com/prometheus/prometheus/pull/10348
### Pushgateway errors
The Pushgateway web interface provides some basic information about
the metrics it collects, and allows you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
An error has occurred while serving metrics:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
That's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem][known problems]
in earlier versions that was [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).
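A sketch of that workaround, assuming the Debian service name:

```
systemctl restart prometheus-pushgateway
```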
[known problems]: https://github.com/prometheus/pushgateway/issues/232
[fixed in 0.10]: https://github.com/prometheus/pushgateway/pull/290
### Running out of disk space
In [tpo/tpa/team#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even if the
number of targets didn't change. This is a typical problem in time
series like this where the "cardinality" of metrics grows without
bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions
happening. This information is also available in the normal [disk
usage graphic][].
We then headed for the self-diagnostics Prometheus provides at:
<https://prometheus.torproject.org/classic/status>
The "Most Common Label Pairs" section will show us which `job` is
responsible for the most number of metrics. It should be `job=node`,
as that collects a lot of information for *all* the machines managed
by TPA. About 100k pairs is expected there.
It's also expected to see the "Highest Cardinality Labels" to be
`__name__` at around 1600 entries.
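The same numbers can be pulled from the TSDB status API on the server itself, which can be easier to copy into a ticket (a sketch):

```
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName'
```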
We haven't implemented it yet, but the [upstream Storage
documentation][] has some interesting tips, including [advice on
long-term storage][] which suggests tweaking the
`storage.local.series-file-shrink-ratio`.
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
[tpo/tpa/team#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
### Default route errors
If you get an email like:
```
Subject: Configuration error - Default route: [FIRING:1] JobDown
```
It's because an alerting rule fired with an incorrect
configuration. Instead of being routed to the proper team, it fell
through the default route.
This is not an emergency in the sense that it's a normal alert, but it
just got routed improperly. It should be fixed, in time. If in a rush,
open a ticket for the team likely responsible for the alerting
rule.
#### Finding the responsible party
So the first step, even if just filing a ticket, is to find the
responsible party.
Let's take this email for example:
```
Date: Wed, 03 Jul 2024 13:34:47 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: Configuration error - Default route: [FIRING:1] JobDown
CONFIGURATION ERROR: The following notifications were sent via the default route node, meaning
that they had no team label matching one of the per-team routes.
This should not be happening and it should be fixed. See:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus#reference
Total firing alerts: 1
## Firing Alerts
-----
Time: 2024-07-03 13:34:17.366 +0000 UTC
Summary: Job mtail@rdsys-test-01.torproject.org is down
Description: Job mtail on rdsys-test-01.torproject.org has been down for more than 5 minutes.
-----
```
In the above, the `mtail` job on `rdsys-test-01` "has been down for
more than 5 minutes" and the alert was routed to `root@localhost`.
The more likely target for that rule would probably be TPA, which
manages the `mtail` service and jobs, even though the services on that
host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.
In this case, [tpo/tpa/team#41667][] was filed.
[tpo/tpa/team#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667
#### Fixing routing
To *fix* this issue, you must first reproduce the query that triggered
the alert. This can be found in the [Prometheus alerts dashboard][],
if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|----------------------------------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |
In this case, we can see there's no `team` label on that metric, which
is the root cause.
If we *can't* find the alert anymore (say it fixed itself), we can
still try to look for the matching alerting rule. Grep for the
`alertname` above in `prometheus-alerts.git`. In this case, we find:
```
anarcat@angela:prometheus-alerts$ git grep JobDown
rules.d/tpa_system.rules: - alert: JobDown
```
and the following rule:
```
- alert: JobDown
  expr: up < 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
    description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
    playbook: "TODO"
```
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't do the exact same query and expect to
find the same host; instead, we need to broaden the query by dropping
the conditional (so just `up`) *and* adding the right labels. In this
case, this should do the trick:
up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
which, when we query Prometheus directly, gives us the following
metric:
up{alias="rdsys-test-01.torproject.org",classes="role::rdsys::backend",instance="rdsys-test-01.torproject.org:3903",job="mtail"}
0
There you can see *all* the labels associated with the metric. Those
match the alerting rule labels, but that may not always be the case,
so that step can be helpful to confirm root cause.
So, in this case, the `mtail` job doesn't have the right team
label. The fix was to add the team label to the scrape job:
```
commit 68e9b463e10481745e2fd854aa657f804ab3d365
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Wed Jul 3 10:18:03 2024 -0400

    properly pass team label to postfix mtail job

    Closes: tpo/tpa/team#41667

diff --git a/modules/mtail/manifests/postfix.pp b/modules/mtail/manifests/postfix.pp
index 542782a33..4c30bf563 100644
--- a/modules/mtail/manifests/postfix.pp
+++ b/modules/mtail/manifests/postfix.pp
@@ -8,6 +8,11 @@ class mtail::postfix (
   class { 'mtail':
     logs => '/var/log/mail.log',
     scrape_job => $scrape_job,
+    scrape_job_labels => {
+      'alias' => $::fqdn,
+      'classes' => "role::${pick($::role, 'undefined')}",
+      'team' => 'TPA',
+    },
   }
   mtail::program { 'postfix':
     source => 'puppet:///modules/mtail/postfix.mtail',
```
See also [testing alerts][] to drill down into queries and alert
routing, in case the above doesn't work.
[Prometheus alerts dashboard]: https://prometheus.torproject.org/classic/alerts
[testing alerts]: #testing-alerts
### Exporter job down warnings
If you see an error like:
Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down
That is because Prometheus cannot reach the exporter at the given
address. The right way forward is to look at the [targets listing][]
and see why Prometheus is failing to scrape the target.
[targets listing]: https://prometheus.torproject.org/classic/targets
#### Service down
The simplest and most obvious case is that the service is just
down. For example, Prometheus has this to say about the above
`gitlab_runner` job:
Get "http://tb-build-02.torproject.org:9252/metrics": dial tcp [2620:7:6002:0:3eec:efff:fed5:6c40]:9252: connect: connection refused
In this case, the `gitlab-runner` service was just not running (yet). It
was being configured and had been added to Puppet, but wasn't yet
correctly set up.
In another scenario, however, it might just be that the service is
down. Use `curl` to confirm Prometheus' view, testing over both IPv4
and IPv6:
curl -4 http://tb-build-02.torproject.org:9252/metrics
curl -6 http://tb-build-02.torproject.org:9252/metrics
Try this from the server itself as well.
If you know which service it is (and the job name should be a good
hint), check the service on the server, in this case:
systemctl status gitlab-runner
#### Invalid exporter output
In another case:
Exporter job civicrm@crm.torproject.org:443 is down
Prometheus was failing with this error:
expected value after metric, got "INVALID"
That means there's a syntax error in the metrics output, in this case
no value was provided for a metric, like this:
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up
See [tpo/web/civicrm#149][] for further details on this
outage.
[tpo/web/civicrm#149]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149
#### Forbidden errors
Another example might be:
server returned HTTP status 403 Forbidden
... in which case there's a permission issue on the exporter
endpoint. Try to reproduce the issue by pulling the endpoint directly,
on the Prometheus server, with, for example:
curl -sSL https://donate.torproject.org:443/metrics
... or whatever URL is visible in the targets listing above. This
could be a web server configuration or lack of matching credentials in
the exporter configuration. Look in `tor-puppet.git`, the
`profile::prometheus::server::internal::collect_scrape` in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
### Apache exporter scraping failed
If you get the error `Apache Exporter cannot monitor web server on
test.example.com` (`ApacheScrapingFailed`), Apache is up, but the
[Apache exporter][] cannot pull its metrics from there.
That means the exporter cannot pull the URL
`http://localhost/server-status/?auto`. To reproduce, pull the URL
with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default vhost" was disabled (`apache2::default_vhost` in
Hiera).
There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` vhost to answer properly on this address. Verify that it's
present, and consider using `apache2ctl -S` to see the vhost
configuration.
See also the [Apache web server diagnostics][] in the incident
response docs for broader issues with web servers.
[Apache exporter]: https://github.com/Lusitaniae/apache_exporter/
[Apache web server diagnostics]: #apache-web-server-diagnostics
### Textfile collector errors
The `NodeTextfileCollectorErrors` alert looks like this:
Node exporter textfile collector errors on test.torproject.org
It means that the [textfile collector][] is having trouble parsing one
or many of the files in its `--collector.textfile.directory` (defaults
to `/var/lib/prometheus/node-exporter`).
[textfile collector]: https://github.com/prometheus/node_exporter#textfile-collector
The error should be visible in the node exporter logs; run the
following command to see it:
journalctl -u prometheus-node-exporter -e
Here's a list of issues found in the wild, but your particular issue
might be different.
#### Wrong permissions
```
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```
In this case, the file was created as a tempfile and moved into place
without fixing the permission. The fix was to simply create the file
without the `tempfile` Python library, with a `.tmp` suffix, and just
move it into place.
#### Garbage in a text file
```
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```
This was an experimental metric designed in [tpo/tpa/team#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:
```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind=reboot} 1725545703.588789
```
It was missing quotes around `reboot`, the proper output would have
been:
```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789
```
But the file was simply removed in this case.
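A quick way to catch this kind of formatting error before (or after) dropping a file in place is to run it through `promtool`, which lints the exposition format (a sketch, using the file from the error above):

```
promtool check metrics < /var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom
```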
[tpo/tpa/team#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery
If a Prometheus or Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backups, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even the backups are destroyed, history will be
lost, but the server should still recover and start tracking new
metrics.
# Reference
## Installation
### Puppet implementation
Every TPA server is configured with a `node-exporter` through the
`roles::monitored` role that is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes the
`profile::prometheus::client` profile which configures each client
correctly with the right firewall rules.
The firewall rules are exported from the server, defined in
`profile::prometheus::server`. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
`roles::monitoring`.
The [Prometheus Puppet module][] was heavily patched to [allow scrape
job collection][] and [use of Debian packages for
installation][], among [many other patches sent by anarcat][].
Much of the initial Prometheus configuration was also documented in
[ticket 29681][] and especially [ticket 29388][] which investigates
storage requirements and possible alternatives for data retention
policies.
[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
[Prometheus Puppet module]: https://github.com/voxpupuli/puppet-prometheus/
[many other patches sent by anarcat]: https://github.com/voxpupuli/puppet-prometheus/pulls?q=author%3Aanarcat+
### Pushgateway
The [Pushgateway][] was configured on the external Prometheus server
to allow for the metrics people to push their data inside Prometheus
without having to write a Prometheus exporter inside Collector.
[Pushgateway]: https://github.com/prometheus/pushgateway
This was done directly inside the
`profile::prometheus::server::external` class, but could be moved to a
separate profile if it needs to be deployed internally. It is assumed
that the gateway script will run directly on `prometheus2` to avoid
setting up authentication and/or firewall rules, but this could be
changed.
### Alertmanager
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([ticket 29864][]).
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
profile if it is deployed on more than one server.
Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the `prometheus.yml` file:
alerting:
  alert_relabel_configs: []
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
[Nagios]: howto/nagios
### Manual node configuration
External services can be monitored by Prometheus, as long as they
comply with the [OpenMetrics][] protocol, which is simply to expose
metrics such as this over HTTP:
metric{label=label_val} value
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node alberti has the device `/dev/sda` mounted
on `/`, formatted as an `ext4` filesystem which has 16160059392 bytes
(~16GB) free.
[OpenMetrics]: https://openmetrics.io/
System-level metrics can easily be monitored by the secondary
Prometheus server. This is usually done by installing the "node
exporter", with the following steps:
* On Debian Buster and later:
apt install prometheus-node-exporter
* On Debian stretch:
apt install -t stretch-backports prometheus-node-exporter
... assuming that backports is already configured. If it isn't, a line like the following in `/etc/apt/sources.list.d/backports.debian.org.list` should suffice:
deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
... followed by an `apt update`, naturally.
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:
* the hostname for the exporter
* the port of the exporter (varies according to the exporter, 9100
for the node exporter)
* how often to scrape the target, if non-default (default: 15s)
TPA then needs to hook the target into a new node `job` under
`scrape_configs` in `prometheus.yml`, which is managed by Puppet in
`profile::prometheus::server`.
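As an illustration, such a stanza would look roughly like the
following (the job name and target host are made up; the real
configuration is generated by Puppet):

    scrape_configs:
      - job_name: 'example-node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - 'example.torproject.org:9100'
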
See also [Adding metrics to applications][], above.
[Adding metrics to applications]: #adding-metrics-to-applications
## Monitored services
Those are the actual services monitored by Prometheus.
### Internal server (prometheus1)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
Access to the internal server is fairly public: the metrics there are
not considered security sensitive, and the authentication is only
there to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (prometheus2)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
might lead to timing attacks against the network and/or leak sensitive
information. The external server also explicitly does *not* scrape TPA
servers automatically: it only scrapes certain services that are
manually configured by TPA.
Those are the services currently monitored by the external server:
* [bridgestrap][]
* [rdsys][]
* OnionPerf external nodes' `node_exporter`s
* connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might fall out of sync with the actual
implementation; check [Puppet][] in
`profile::prometheus::server::external` for the actual deployment.
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[bridgestrap]: https://bridges.torproject.org/bridgestrap-metrics
[rdsys]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket tpo/tpa/team#30028][] around launch time. Here we
can document more exporters we find along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [hsprober][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [haproxy_exporter][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket tpo/tpa/team#30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[hsprober]: https://git.autistici.org/ale/hsprober
[haproxy_exporter]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## SLA
Prometheus is currently not doing alerting so it doesn't have any sort
of guaranteed availability. It should, hopefully, not lose too many
metrics over time so we can do proper long-term resource planning.
## Design
Here is, from the [Prometheus overview documentation][], the
basic architecture of a Prometheus site:
[Prometheus overview documentation]: https://prometheus.io/docs/introduction/overview/
<img src="https://prometheus.io/assets/architecture.png" alt="A
drawing of Prometheus' architecture, showing the push gateway and
exporters adding metrics, service discovery through file_sd and
Kubernetes, alerts pushed to the Alertmanager and the various UIs
pulling from Prometheus" />
As you can see, Prometheus is somewhat tailored towards
[Kubernetes][] but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager which could be used to (eventually) replace our
Nagios deployment.
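As a rough sketch, a `file_sd`-based scrape job looks like this (the
file path is illustrative; in our case Puppet writes out the target
files):

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/node/*.yaml'

Each target file then lists entries like `- targets:
['example.torproject.org:9100']`, optionally with extra labels
attached.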
[Kubernetes]: https://kubernetes.io/
The diagram does not show that Prometheus can federate across multiple
instances, nor that the Alertmanager can be configured for high
availability.
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the
`alerting` section of `prometheus.yml`; see the [Installation][]
section.
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts
through a number of other communication channels. For example, to send
IRC notifications, we have a daemon binding to `localhost` on the
Prometheus server waiting for webhook calls, and the corresponding
receiver has a `webhook_configs` section instead of `email_configs`.
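For illustration, such a webhook receiver might look like this (the
URL and port are made up, not the actual configuration):

    - name: 'irc-tor-admin'
      webhook_configs:
        - url: 'http://localhost:8099/prometheus/alerts'
          send_resolved: true
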
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and defines default options that the other routes inherit.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default receiver uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
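Putting it together, the value of `prometheus::alertmanager::route`
then looks roughly like this (the receiver names match the examples
above, but the exact timing values and matchers are illustrative):

    receiver: 'fallback'
    group_by: ['alertname', 'team']
    group_wait: 5s      # delay before the first notification for a new group
    group_interval: 5m  # delay before notifying about new alerts in an existing group
    routes:
      - receiver: 'TPA-email'
        matchers:
          - 'team = "TPA"'
          - 'severity = "critical"'
        continue: true  # keep matching, so IRC is also notified
      - receiver: 'irc-tor-admin'
        matchers:
          - 'team = "TPA"'
          - 'severity =~ "critical|warning"'

The `continue: true` flag is what allows a single critical alert to
produce both an email and an IRC notification, as described above.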
## Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
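On the Prometheus side, the Pushgateway is scraped like any other
exporter, except that the job should set `honor_labels: true` so the
pushed `job` and `instance` labels are preserved. A minimal sketch
(9091 is the Pushgateway's default port; the job name is
illustrative):

    scrape_configs:
      - job_name: 'pushgateway'
        honor_labels: true
        static_configs:
          - targets:
              - 'localhost:9091'
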
## Blackbox exporter
Most exporters are pretty straightforward: a service binds to a port and exposes
metrics through HTTP requests on that port, generally at the `/metrics` URL.

The blackbox exporter, however, is a little more involved. The exporter can
be configured to run a bunch of different tests (e.g. TCP connections, HTTP
requests, ICMP pings) against a list of targets of its own. So the Prometheus
server has a single target, the host and port of the blackbox exporter, but
that exporter is in turn configured to check other hosts.
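As a sketch, this is typically done with a scrape job that relabels
each probed host into a `target` parameter for the exporter (the
module name and probed host are illustrative):

    scrape_configs:
      - job_name: 'blackbox-icmp'
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets:
              - 'pauli.torproject.org'
        relabel_configs:
          # pass the probed host as the ?target= parameter
          - source_labels: [__address__]
            target_label: __param_target
          # keep the probed host as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # actually scrape the blackbox exporter itself
          - target_label: __address__
            replacement: localhost:9115
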
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
One thing that's useful to know, in addition to how it's configured, is how to
debug it. You can query the exporter from localhost to get more
information. If you are using this method for debugging, you'll most probably
want to include debugging output. For example, to run an ICMP test on host
pauli.torproject.org:
    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
## Alertmanager
The [Alertmanager][] is a separate program that receives alerts
generated by Prometheus servers through an API, then groups and
deduplicates them before sending notifications by email or other
mechanisms.
[Alertmanager]: https://github.com/prometheus/alertmanager
Here's what the internal design of the Alertmanager looks like:
<img src="https://raw.githubusercontent.com/prometheus/alertmanager/master/doc/arch.svg" alt="Internal architecture of the Alertmanager, showing how it receives alerts from Prometheus through an API and internally pushes them through various storage queues and deduplicating notification pipelines, along with a clustered gossip protocol" />
The first deployments of the Alertmanager at TPO do not feature
a "cluster", or high availability (HA) setup.
Alerts are typically sent over email, but Alertmanager also has
builtin support for:
* Email
* Slack
* [Victorops][] (now Splunk)
* [Pagerduty][]
* [Opsgenie][] (now Atlassian)
* Wechat
There's also a [generic webhook receiver][] which can be used to send
notifications to arbitrary HTTP endpoints. Many other integrations are
implemented through that webhook, for example:
* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [matrix-alertmanager][] (JS) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
* [Phabricator][]
* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
Telegram, Sipgate, etc)
* [Sentry][]
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [nopp/alertmanager-webhook-telegram-python][] or [metalmatze/alertmanager-bot][]
* [Twilio][]
* [Wechat][]
* Zabbix: [alertmanager-zabbix-webhook][] or [zabbix-alertmanager][]
And that is only what was available at the time of writing; the
[alertmanager-webhook][] and [alertmanager tags][] topics on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
Cloudflare's [unsee][]). The web interface is
not shipped with the Debian package, because it depends on the [Elm
compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].
In general, when working on alerting, it's worth keeping in mind [the
"My Philosophy on Alerting" paper from a Google engineer][] (now the
[Monitoring distributed systems][] chapter of the [Site Reliability
Engineering][] O'Reilly book).
Another issue with alerting in Prometheus is that you can only silence
alerts for a certain amount of time; after that, you get notified
again. The [kthxbye bot][] works around that issue.
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic webhook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[matrix-alertmanager]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
[Phabricator]: https://github.com/knyar/phalerts
[Sachet]: https://github.com/messagebird/sachet
[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[nopp/alertmanager-webhook-telegram-python]: https://github.com/nopp/alertmanager-webhook-telegram-python
[metalmatze/alertmanager-bot]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[alertmanager-zabbix-webhook]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[zabbix-alertmanager]: https://github.com/devopyio/zabbix-alertmanager
[alertmanager-webhook]: https://github.com/topics/alertmanager-webhook
[alertmanager tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
### Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting,
because there are many components associated with it, and Prometheus
documentation is not great at explaining how things work clearly. This
is an attempt at explaining various parts of it as I (anarcat)
understand it as of 2024-09-19, based on the latest documentation
available on <https://prometheus.io> and the current [Alertmanager git
HEAD][].
First, there might be a time vector involved in the Prometheus
query. For example, take the query:
increase(django_http_exceptions_total_by_type_total[5m]) > 0
Here, the "vector range" is `5m` or five minutes. You might think this
will fire only after 5 minutes have passed. I'm not actually sure. In
my observations, I have found this fires as soon as an increase is
detected, but will *stop* after the vector range has passed.
Second, there's the `for:` parameter in the alerting rule. Say this
was set to 5 minutes again:
    - alert: DjangoExceptions
      expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
      for: 5m
This means that the alert will be considered only `pending` for that
period. Prometheus will *not* send an alert to the Alertmanager at all
unless `increase()` was sustained for the period. If *that* happens,
then the alert is marked as `firing` and Alertmanager will start
getting the alert.
(Alertmanager *might* be getting the alert in the `pending` state, but
that makes no difference to our discussion: it will not send alerts
before that period has passed.)
Third, there's another setting, `keep_firing_for`, that will make
Prometheus keep firing the alert even after the query evaluates to
false. We're ignoring this for now.
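For reference, a complete rule carries more than `expr` and `for`: it
also sets the labels that the Alertmanager routes match on, and
annotations such as a summary or a link to a run book. A rough sketch,
with made-up label values and a placeholder annotation key and URL:

    - alert: DjangoExceptions
      expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
      for: 5m
      labels:
        team: TPA
        severity: warning
      annotations:
        summary: 'Django exceptions on {{ $labels.instance }}'
        runbook: 'https://example.com/wiki/service/foo#pager-playbook'
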
At this point, the alert has reached Alertmanager and it needs to
decide what to do with it. More timers are involved.
Alerts will be evaluated against the alert routes, thus aggregated
into a new group or added to an existing group according to that
route's `group_by` setting, and then Alertmanager will evaluate the
timers set on the particular route that was matched. An alert group is
created when an alert is received and no other alerts already match
the same values for the `group_by` criteria. An alert group is removed
when all alerts in a group are in state `inactive` (e.g. resolved).
Fourth, there's the `group_wait` setting (defaults to 5 seconds, can
be [customized by route][]). This keeps Alertmanager from
routing any alerts for a while, thus allowing it to batch the _first_
notification for all alerts in the same group. It
implies that you will not receive a notification for a new alert
before that timer has elapsed. See also the (rather terse)
[documentation on grouping][].
(The `group_wait` timer is initialized when the alerting group is
created, see [`dispatch/dispatch.go`, line 415, function
`newAggrGroup`][].)
Now, *more* alerts might be sent by Prometheus if more metrics match
the above expression. They are *different* alerts because they have
different labels (say, another host might have exceptions, above, or,
more commonly, other hosts require a reboot). Prometheus will then
relay that alert to the Alertmanager, and another timer comes in.
Fifth, when a new alert joins a group that is already firing,
Alertmanager waits `group_interval` (defaults to 5m) before
resending a notification for that group.
When Alertmanager first creates an alert group, a thread is started
for that group and the _route_'s `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
So new alerts merged in a group will wait _up to_ `group_interval` before
being relayed.
(The `group_interval` timer is also initialized [in `dispatch.go`, line
460, function `aggrGroup.run()`][]. It's done *after* that function
waits for the previous timer which is normally based on the
`group_wait` value, but can be switched to `group_interval` after that
very iteration, of course.)
So, conclusions:
- If an alert flaps because it pops in and out of existence, consider
tweaking the query to cover a longer vector, by increasing the time
range (e.g. switch from `5m` to `1h`), or by comparing against a
moving average
- If an alert triggers too quickly due to a transient event (say
network noise, or someone messing up a deployment but you want to
give them a chance to fix it), increase the `for:` timer.
- Conversely, if you *fail* to detect transient outages, *reduce* the
  `for:` timer, but be aware this might pick up more noise.
- If alerts come too soon and you get a flood of alerts
  when an outage *starts*, increase `group_wait`.
- If alerts come in slowly but fail to be grouped because they don't
  arrive at the same time, increase `group_interval`.
This analysis was done in response to a [mysterious failure to send
notification in a particularly flappy alert][].
[Alertmanager git HEAD]: https://github.com/prometheus/alertmanager/tree/e9904f93a7efa063bac628ed0b74184acf1c7401
[customized by route]: https://prometheus.io/docs/alerting/latest/configuration/#route
[documentation on grouping]: https://prometheus.io/docs/alerting/latest/alertmanager/#grouping
[`dispatch/dispatch.go`, line 415, function `newAggrGroup`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L415
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
## Issues
There is no issue tracker specifically for this project, [File][new-ticket] or
[search][] for issues in the [team issue tracker][search] with the
~Prometheus label.
[new-ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Prometheus
### Known issues
Those are major issues that are worth knowing about Prometheus in
general, and our setup in particular:
- bind mounts generate duplicate metrics, upstream issue: [Way to
distinguish bind mounted path ?][], possible workaround: manually
specify known bind mount points
(e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
but that can hide actual, real mountpoints, possible fix: the
`node_filesystem_mount_info` metric, [added in PR 2970 from
2024-07-14][], unreleased as of 2024-08-28
- high cardinality metrics from exporters we do not control can fill
the disk
- no long-term metrics storage, issue: [multi-year metrics storage][]
In general, the service is still being launched, see [TPA-RFC-33][]
for the full deployment plan.
[Way to distinguish bind mounted path ?]: https://github.com/prometheus/node_exporter/issues/600
[added in PR 2970 from 2024-07-14]: https://github.com/prometheus/node_exporter/pull/2970
[multi-year metrics storage]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
### Resolved issues
No major issue resolved so far is worth mentioning here.
## Maintainer, users, and upstream
The Prometheus services were set up and are managed by anarcat
inside TPA. The internal Prometheus server is mostly used by TPA staff
to diagnose issues. The external Prometheus server is used by various
TPO teams for their own monitoring needs.
The upstream Prometheus projects are diverse and generally active as
of early 2021. Since Prometheus is used as a de facto standard in the
new "cloud native" communities like Kubernetes, it has seen an upsurge
of development and interest from various developers and
companies. The future of Prometheus should therefore be fairly bright.
The individual exporters, however, can be hit and miss. Some exporters
are "code dumps" from companies and not very well maintained. For
example, [Digital Ocean][] dumped the [bind_exporter][] on GitHub,
but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [voxpupuli
collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by
anarcat, but work still remains, see [upstream issue 32][] for
details.
[`puppet-prometheus`]: https://github.com/voxpupuli/puppet-prometheus/
[Digital Ocean]: https://github.com/digitalocean/
[bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15
[voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, both for system-level and application-specific
metrics.
## Logs and metrics
Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.
Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
would still be possible to deduce some activity patterns from the
metrics generated by Prometheus and use them in side-channel attacks,
which is why access to the external Prometheus server is restricted.
Metrics are held for about a year or less, depending on the server,
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
Note that [TPA-RFC-33][] discusses alternative metrics retention
policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
## Backups
Prometheus servers should be fully configured through Puppet and
require few backups. The metrics themselves are kept in
`/var/lib/prometheus2` and should be backed up along with our regular
[backup procedures][].
WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [tpo/tpa/team#41627][] for the
discussion. This should eventually be mitigated by a high availability
setup ([tpo/tpa/team#41643][]).
[backup procedures]: service/backup
[tpo/tpa/team#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[tpo/tpa/team#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation
* [Prometheus home page][]
* [Prometheus documentation][]
* [Prometheus developer blog][]
[Prometheus home page]: https://prometheus.io/
[Prometheus documentation]: https://prometheus.io/docs/introduction/overview/
[Prometheus developer blog]: https://www.robustperception.io/tag/prometheus/
# Discussion
## Overview
The Prometheus and [Grafana][] services were set up after anarcat
realized that there was no "trending" service inside TPA after
Munin had died ([ticket 29681][]). The "node exporter" was deployed on
all TPA hosts in mid-March 2019 ([ticket 29683][]) and remaining
traces of Munin were removed in early April 2019 ([ticket 29682][]).
[ticket 29683]: https://bugs.torproject.org/29683
[ticket 29682]: https://bugs.torproject.org/29682
Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a downsampling server in the
future.
[ticket 31244]: https://bugs.torproject.org/31244
Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.
[ticket 31159]: https://bugs.torproject.org/31159
It was originally thought Prometheus could completely replace
[Nagios][] as well ([ticket 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, while Prometheus
metrics are just that: metrics, without thresholds. This makes
replacing Nagios harder, because a large number of alerts need to be
written to replace the existing checks. A lot of reports and
functionality built into Nagios, like availability reports and
acknowledgements, would also need to be reimplemented.
## Goals
This section didn't exist when the project was launched, so this is
really just second-guessing...
### Must have
* Munin replacement: long-term trending metrics to predict resource
allocation, with graphing
* Free software, self-hosted
* Puppet automation
### Nice to have
* Possibility of eventual Nagios phase-out ([ticket 29864][])
[ticket 29864]: https://bugs.torproject.org/29864
### Non-Goals
* Data retention beyond one year
## Approvals required
The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket
29389][]). The secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][].
[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Proposed Solution
Prometheus was chosen, see also [Grafana][].
## Cost
N/A.
## Alternatives considered
We considered retaining Nagios/Icinga as an alerting system, separate
from Prometheus, but ultimately decided against it in [TPA-RFC-33][].
### Alerting rules in Puppet
Alerting rules are currently stored in an external
[`prometheus-alerts.git` repository][] that holds not only TPA's
alerts, but also those of other teams. So the rules
are _not_ directly managed by Puppet, although Puppet ensures
that the repository is checked out at the most recent commit on the
Prometheus servers.
The rationale is that rule definitions should appear only once and we
already had the above-mentioned repository that could be used to
configure alerting rules.
We were concerned we would potentially have multiple sources of truth
for alerting rules. We already have that for scrape targets, but that
doesn't seem to be an issue. It did feel, however, critical for the
more important alerting rules to have a single source of truth.
### Migrating from Munin
Here's a quick cheat sheet from people used to Munin and switching to
Prometheus:
| What | Munin | Prometheus |
|-------------------|-----------------|----------------------------------------|
| Scraper | `munin-update` | Prometheus |
| Agent | `munin-node` | Prometheus, `node-exporter` and others |
| Graphing | `munin-graph` | Prometheus or Grafana |
| Alerting | `munin-limits` | Prometheus, Alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, [text-based][] |
| Storage format | RRD | Custom time series database |
| Down-sampling | Yes | No |
| Default interval | 5 minutes | 15 seconds |
| Authentication | No | No |
| Federation | No | Yes (can fetch from other servers) |
| High availability | No | Yes (alert-manager gossip protocol) |
[text-based]: https://prometheus.io/docs/instrumenting/exposition_formats/
Basically, Prometheus is similar to Munin in many ways:
* It "pulls" metrics from the nodes, although it does it over HTTP
(to <http://host:9100/metrics>) instead of a custom TCP protocol
like Munin
* The agent running on the nodes is called `prometheus-node-exporter`
  instead of `munin-node`. It scrapes only a set of built-in
  parameters like CPU, disk space and so on; different exporters are
  necessary for different applications (like
  `prometheus-apache-exporter`), and any application can easily
  implement an exporter by exposing a Prometheus-compatible
  `/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of
  authentication built in. We rely on IP-level firewalls to avoid
  leakage
* The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron`
* Graphs are generated on the fly through the crude Prometheus web
  interface or by frontends like Grafana, instead of being constantly
  regenerated by `munin-graph`
* Samples are stored in a custom "time series database" (TSDB) in
  Prometheus instead of the (ad-hoc) RRD standard
* Unlike RRD, Prometheus performs *no* down-sampling; it relies on
  smart compression to spare disk space, but still uses more than Munin
* Prometheus scrapes samples much more aggressively than Munin by
default, but that interval is configurable
* Prometheus can scale horizontally (by sharding different services
to different servers) and vertically (by aggregating different
servers to a central one with a different sampling frequency)
natively - `munin-update` and `munin-graph` can only run on a
single (and same) server
* Prometheus can act as a high availability alerting system thanks
to its `alertmanager` that can run multiple copies in parallel
without sending duplicate alerts - `munin-limits` can only run on a
single server