[Prometheus][] is a monitoring system that is designed to process a
large number of metrics, centralize them on one (or multiple) servers
and serve them with a well-defined API. That API is queried through a
domain-specific language (DSL) called "PromQL" or "Prometheus Query
Language". Prometheus also supports basic graphing capabilities
although those are limited enough that we use a separate graphing
layer on top (see [Grafana][]).
[Prometheus]: https://prometheus.io/
[Grafana]: howto/grafana
[[_TOC_]]
# Tutorial
## Web dashboards
The main Prometheus web interface is available at:
<https://prometheus.torproject.org>
A simple query you can try is to pick any metric in the list and click
`Execute`. For example, [this link][] will show the 5-minute load
over the last two weeks for the known servers.
[this link]: https://prometheus1.torproject.org/graph?g0.range_input=2w&g0.expr=node_load5&g0.tab=0
The Prometheus web interface is crude: it's better to use [Grafana][]
dashboards for most purposes other than debugging.
It also shows alerts, but for that, there are better dashboards, see
below.
### Alerting dashboards
There are a couple of web interfaces to see alerts in our setup:
* [Karma dashboard][] - our primary view on
  currently firing alerts. The alerts are grouped by labels.
  * This web interface only shows what's current, not some form of
    alert history.
  * Shows links to "run books" related to alerts
* [Grafana availability dashboard][] - drills down into alerts and,
  more importantly, shows their past values.
* [Prometheus' Alerts dashboard][] - shows all alerting rules and which
  file they are from
  * Also contains links to graphs based on alerts' PromQL expressions
Normally, all rules are defined in the [`prometheus-alerts.git`
repository][]. Another view of this is the [rules configuration
dump][] which also shows when the rule was last evaluated and how long
it took.
Each alert should have a URL to a "run book" in its annotations, typically a link
to this very wiki, in the "Pager playbook" section, which shows how to handle
any particular outage. If it's not present, it's a bug and can be filed as such.
[Karma dashboard]: https://karma.torproject.org
[Grafana availability dashboard]: https://grafana.torproject.org/d/adwbl8mxnaneoc/availability
[Prometheus' Alerts dashboard]: https://prometheus.torproject.org/classic/alerts
[`prometheus-alerts.git` repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
[rules configuration dump]: https://prometheus.torproject.org/classic/rules
## Adding metrics to applications
If you want your service to be monitored by Prometheus, you need to
[write][] or [reuse an existing exporter][]. [Writing an
exporter][] is more involved, but still fairly easy and might be
necessary if you are the maintainer of an application not already
instrumented for Prometheus.
[Writing an exporter]: https://prometheus.io/docs/instrumenting/writing_exporters/
The [actual documentation][Writing an exporter] is fairly good, but basically: a
Prometheus exporter is a simple HTTP server which responds to a
specific HTTP URL (`/metrics`, by convention, but it can be
anything). It responds with a key/value list of entries, one on each
line, in a simple text format more or less following the
[OpenMetrics][] standard.
Each "key" is a simple string with an arbitrary list of "[labels][]"
enclosed in curly braces. The [value][] is a float or integer.
For example, here's how the "node exporter" exports CPU usage:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74
Note that the `HELP` and `TYPE` lines look like comments, but they are
actually important, and misusing them will lead to the metric being
ignored by Prometheus.
Also note that Prometheus's [actual support for OpenMetrics][] varies
across the ecosystem. It's better to rely on Prometheus' documentation
than OpenMetrics when writing metrics for Prometheus.
Obviously, you don't necessarily have to write all that logic
yourself, however: there are [client libraries][] (see the [Golang
guide][], [Python demo][] or [C documentation][] for examples) that
do most of the job for you.
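For illustration, here is a minimal sketch of an exporter using the Python client library (assuming the `python3-prometheus-client` package is installed; the metric name, file path and port are made up for the example):

```
cat > /tmp/demo_exporter.py <<'EOF'
from prometheus_client import Counter, start_http_server
import time

# a counter metric, following the *_total naming convention
DEMO = Counter('demo_requests_total', 'Example counter incremented every second')

start_http_server(9999)  # serve /metrics on this (arbitrary) port
while True:
    DEMO.inc()
    time.sleep(1)
EOF
python3 /tmp/demo_exporter.py &
sleep 1
curl -s http://localhost:9999/metrics | grep demo_requests_total
```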
In any case, you should be careful about the names and labels of the
metrics. See the [metric and label naming best practices][].
Once you have an exporter endpoint (say at
`http://example.com:9090/metrics`), make sure it works:
curl http://example.com:9090/metrics
This should return a number of metrics that change (or not) at each
call. Note that there's a [registry of official Prometheus export port
numbers][] that should be respected, but [it's full][] (oops).
From there on, provide that endpoint to the sysadmins (or someone with
access to the external monitoring server), who will follow the
procedure below to add the metric to Prometheus.
Once the exporter is hooked into Prometheus, you can browse the
metrics directly at: <https://prometheus.torproject.org>. Graphs
should be available at <https://grafana.torproject.org>, although
those need to be created and committed into git by sysadmins to
persist, see the [`grafana-dashboards.git` repository][] for more
information.
[write]: https://prometheus.io/docs/instrumenting/writing_exporters/
[reuse an existing exporter]: https://prometheus.io/docs/instrumenting/exporters/
[labels]: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#label
[value]: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#values
[actual support for OpenMetrics]: https://github.com/prometheus/prometheus/issues/14762
[client libraries]: https://prometheus.io/docs/instrumenting/clientlibs/
[Golang guide]: https://prometheus.io/docs/guides/go-application/
[Python demo]: https://github.com/prometheus/client_python#three-step-demo
[C documentation]: https://digitalocean.github.io/prometheus-client-c/
[metric and label naming best practices]: https://prometheus.io/docs/practices/naming/
[registry of official Prometheus export port numbers]: https://github.com/prometheus/prometheus/wiki/Default-port-allocations
[it's full]: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusExportersFixedPorts
[`grafana-dashboards.git` repository]: https://gitlab.torproject.org/tpo/tpa/grafana-dashboards
## Adding scrape targets
"Scrape targets" are remote endpoints that Prometheus "scrapes" (or
fetches content from) to get metrics.
There are two ways of adding scrape targets, depending on whether or not you
have access to the Puppet server.
### Adding metrics through the git repository
People outside of TPA without access to the Puppet server can
contribute targets through a repo called
[`prometheus-alerts.git`][]. To add a scrape target:
1. Clone the repository, if not done already:
git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
cd prometheus-alerts
2. Assuming you're adding a node exporter, to add the target:
cat > targets.d/node_myproject.yaml <<EOF
# scrape the external node exporters for project Foo
---
- targets:
    - targetone.example.com
    - targettwo.example.com
EOF
3. Add, commit, and push:
git checkout -b myproject
git add targets.d
git commit -m"add node exporter targets for my project"
git push origin -u myproject
The last push command should show you the URL where you can submit
your merge request.
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus automatically reloads those rules when they are
deployed.
See also the [`targets.d` documentation in the git repository][].
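Once the change is deployed, you can confirm the new targets were picked up by querying the Prometheus targets API (a sketch reusing the example hostnames above, and assuming you have HTTP authentication credentials for the server in the `HTTP_USER` environment variable):

```
curl -s "https://$HTTP_USER@prometheus.torproject.org/api/v1/targets?state=active" \
  | jq -r '.data.activeTargets[].labels.instance' \
  | grep targetone.example.com
```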
[4 to 6 hours]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/puppet/#cron-and-scheduling
[`targets.d` documentation in the git repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/targets.d
[`prometheus-alerts.git`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
### Adding metrics through Puppet
TPA-managed services should define their scrape jobs, and thus targets, via
puppet profiles.
To add a scrape job in a puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.
Here is, for example, how the gitlab runners are scraped:
```
# tell Prometheus to scrape the exporter
@@prometheus::scrape_job { "gitlab-runner_${facts['networking']['fqdn']}_9252":
  job_name => 'gitlab_runner',
  targets => [ "${facts['networking']['fqdn']}:9252" ],
  labels => {
    'alias' => $facts['networking']['fqdn'],
    'team' => 'TPA',
  },
}
```
The `job_name` (`gitlab_runner` above) needs to be added to the
`profile::prometheus::server::internal::collect_scrape_jobs` list in
`hiera/common/prometheus.yaml`, for example:
```
profile::prometheus::server::internal::collect_scrape_jobs:
  # [...]
  - job_name: 'gitlab_runner'
  # [...]
```
Note that you will likely need a firewall rule to poke a hole for the
exporter:
# grant Prometheus access to the exporter, activated with the
# listen_address parameter above
Ferm::Rule <<| tag == 'profile::prometheus::server-gitlab-runner-exporter' |>>
That rule, in turn, is defined with the
`profile::prometheus::server::rule` define, in
`profile::prometheus::server::internal`, like so:
profile::prometheus::server::rule {
  # [...]
  'gitlab-runner': port => 9252;
  # [...]
}
In another example, to configure the ssh scrape jobs (in
`modules/profile/manifests/ssh.pp`), the scrape job is created with:
@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
job_name => 'blackbox_ssh_banner',
targets => [ "${facts['networking']['fqdn']}:22" ],
labels => {
'alias' => $facts['networking']['fqdn'],
'team' => 'TPA',
},
}
But because this is a blackbox exporter, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the blackbox exporter work:
- job_name: 'blackbox_ssh_banner'
  metrics_path: '/probe'
  params:
    module:
      - 'ssh_banner'
  relabel_configs:
    - source_labels:
        - '__address__'
      target_label: '__param_target'
    - source_labels:
        - '__param_target'
      target_label: 'instance'
    - target_label: '__address__'
      replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:
profile::prometheus::server::external::scrape_configs:
  # generic blackbox exporters from any team
  - job_name: blackbox
    metrics_path: "/probe"
    params:
      module:
        - http_2xx
    file_sd_configs:
      - files:
          - "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
Some scrape jobs can be simpler and not require the relabeling part. In the
above case, the relabeling is done since the exporter runs on the Prometheus
server itself instead of the actual target.
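To check what such a probe returns, you can query the blackbox exporter directly on the Prometheus server (a sketch: the `ssh_banner` module comes from the configuration above, and the target host here is hypothetical):

```
curl -s 'http://localhost:9115/probe?module=ssh_banner&target=example.torproject.org:22' \
  | grep probe_success
```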
Targets for scrape jobs defined in Hiera are however not managed by
puppet. They are defined through files in the [`prometheus-alerts.git`
repository][]. See the section below for more details on how things
are maintained there. In the above example, we can see that targets
are obtained via files on disk. The [`prometheus-alerts.git`
repository][] is cloned in `/etc/prometheus-alerts` on the Prometheus
servers.
Note: we currently have a handful of `blackbox-exporter`-related targets for TPA
services, namely for the HTTP checks. We intend to move those into puppet
profiles whenever possible.
#### Manually adding targets in Puppet
Normally, services configured in Puppet SHOULD automatically be
scraped by Prometheus (see above). If, however, you need to manually
configure a service, you *may* define extra jobs in the
`$scrape_configs` array, in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is not fully managed by Puppet
(e.g. [tpo/tpa/gitlab#20][], and other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:
$scrape_configs =
  [
    {
      'job_name' => 'gitaly',
      'static_configs' => [
        {
          'targets' => [
            'gitlab-02.torproject.org:9236',
          ],
          'labels' => {
            'alias' => 'Gitaly-Exporter',
          },
        },
      ],
    },
    [...]
  ]
Ideally, though, those would be configured as automatic targets, as described below.
Metrics for the internal server are scraped automatically if the
exporter is configured by the [`puppet-prometheus`][] module. This is
done almost automatically, apart from the need to open a firewall port
in our configuration.
Take the `apache_exporter` as an example: in
`profile::prometheus::apache_exporter`, we include the
`prometheus::apache_exporter` class from the upstream Puppet module,
then open the exporter's port to the Prometheus server with:
Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>
Those rules are declared on the server, in `profile::prometheus::server::internal`.
[tpo/tpa/gitlab#20]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
[`puppet-prometheus`]: https://github.com/voxpupuli/puppet-prometheus/
# How-to
## Queries cheat sheet
Some handy queries I often find myself looking for and forgetting.
### Availability
Those are almost all visible from the [availability dashboard][].
[Currently firing alerts][]:
ALERTS{alertstate="firing"}
[Unreachable hosts][] (technically, unavailable node exporters):
up{job="node"} != 1
[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)][]:
avg(avg_over_time(up{job="node"}[30d]))
[How many hosts are online at any given point in time][]:
sum(count(up==1))/sum(count(up)) by (alias)
[How long did an alert fire over a given period of time][], in seconds per
day:
sum_over_time(ALERTS{alertname="MemFullSoon"}[1d:1s])
[availability dashboard]: https://grafana.torproject.org/d/adwbl8mxnaneoc/availability?var-alertstate=All
[Currently firing alerts]: https://prometheus.torproject.org/graph?g0.expr=ALERTS{alertstate%3D"firing"}
[Unreachable hosts]: https://prometheus.torproject.org/graph?g0.expr=up{job%3D"node"}+!%3D+1
[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)]: https://prometheus.torproject.org/graph?g0.expr=avg(avg_over_time(up{job%3D"node"}[30d]))
[How many hosts are online at any given point in time]: https://prometheus.torproject.org/graph?g0.expr=sum(count(up%3D=1))/sum(count(up))+by+(alias)
[How long did an alert fire over a given period of time]: https://prometheus.torproject.org/graph?g0.expr=sum_over_time(ALERTS{alertname%3D"MemFullSoon"}[1d:1s])
### Disk usage
This is a less strict version of the [`DiskWillFillSoon` alert][],
see also the [disk usage dashboard][].
[Find disks that will be full in 6 hours][]:
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
[Find disks that will be full in 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
### Inventory
Those are visible in the [main Grafana dashboard][].
[Number of machines][]:
count(up{job="node"})
[Number of machines per OS version][]:
count(node_os_info) by (version_id, version_codename)
[Number of machines per exporter, or technically, number of machines per job][]:
sort_desc(sum(up{job=~"$job"}) by (job))
[Number of CPU cores, memory size, filesystem and LVM sizes][]:
count(node_cpu_seconds_total{classes=~"$class",mode="system"})
sum(node_memory_MemTotal_bytes{classes=~"$class"}) by (alias)
sum(node_filesystem_size_bytes{classes=~"$class"}) by (alias)
sum(node_volume_group_size{classes=~"$class"}) by (alias)
See also the [CPU][], [memory][], and [disk][] dashboards.
[Uptime, in days][]:
round((time() - node_boot_time_seconds) / (24*60*60))
[Number of machines]: https://prometheus.torproject.org/graph?g0.expr=count(up{job%3D"node"})
[Number of machines per OS version]: https://prometheus.torproject.org/graph?g0.expr=count(node_os_info)+by+(version_id,+version_codename)
[Number of machines per exporter, or technically, number of machines per job]: https://prometheus.torproject.org/graph?g0.expr=sort_desc(sum(up{job%3D~\"$job\"})+by+(job)
[Number of CPU cores, memory size, filesystem and LVM sizes]: https://prometheus.torproject.org/graph?g0.expr=count(node_cpu_seconds_total{classes%3D~\"$class\",mode%3D\"system\"})
[Uptime, in days]: https://prometheus.torproject.org/graph?g0.expr=round((time()+-+node_boot_time_seconds)+/+(24*60*60))
[main Grafana dashboard]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview
[CPU]: https://grafana.torproject.org/d/gex9eLcWz/cpu-usage
[memory]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage
[disk]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-24h&to=now&var-class=All&var-node=All
### Running commands on hosts matching a PromQL query
Say you have an alert or situation (e.g. high load) affecting multiple
servers. Say, for example, that you have some issue that you fixed in
Puppet that will clear such an alert, and want to run Puppet on all
affected servers.
You can use the [Prometheus JSON API][] to return the list of hosts
matching the query (in this case `up < 1`) and run commands (in
this case, Puppet, or `patc`) with [Cumin][]:
cumin "$(curl -sSL --data-urlencode='up < 1' 'https://$HTTP_USER@prometheus.torproject.org/api/v1/query | jq -r .data.result[].metric.alias | grep -v '^null$' | paste -sd,)" 'patc'
Make sure to populate the `HTTP_USER` environment variable to authenticate with
the Prometheus server.
[Prometheus JSON API]: https://prometheus.io/docs/prometheus/latest/querying/api/
[Cumin]: howto/cumin
## Alerting
We are now using Prometheus for alerting for TPA services. Here's a basic
overview of how things interact around alerting:
1. Prometheus is configured to create alerts on certain conditions on metrics.
   * When the PromQL expression produces a result, an alert is created in state
     `pending`.
   * If the PromQL expression keeps producing a result for the whole `for`
     duration configured in the alert, the alert changes to state `firing` and
     Prometheus then sends the alert to one or more Alertmanager instances.
2. Alertmanager receives alerts from Prometheus and is responsible for routing
   the alert to the appropriate channels. For example:
   * A team's or service operator's email address
   * TPA's IRC channel for alerts, `#tor-alerts`
3. Karma and Grafana read alert data from Alertmanager and display it in a
   way that can be used by humans.
Currently, the secondary Prometheus server (`prometheus2`) reproduces this setup
specifically for sending out alerts to other teams with metrics that are not
made public.
This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still running, but it
is planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([ticket 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][] but it can be lacking at times. [This tutorial][]
can be quite helpful in better understanding how things are working.
Note that Grafana also has its own [alerting system][] but we are
_not_ using that, see the [Grafana for alerting section of the
TPA-RFC-33 proposal][].
[Icinga]: howto/nagios
[ticket 29864]: https://bugs.torproject.org/29864
[the Alerting Overview]: https://prometheus.io/docs/alerting/latest/overview/
[This tutorial]: https://ashish.one/blogs/setup-alertmanager/
[alerting system]: https://grafana.torproject.org/alerting/
[Grafana for alerting section of the TPA-RFC-33 proposal]: policy/tpa-rfc-33-monitoring#grafana-for-alerting
### Writing alerting rules
TODO
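In the meantime, here is a minimal, hypothetical sketch of what a rule file in `rules.d` could look like, modeled on the `JobDown` rule shown elsewhere on this page (the alert name, expression, labels and annotations are made up; check the `rules.d` documentation in the `prometheus-alerts.git` repository for the actual conventions):

```
cat > rules.d/example_team.rules <<'EOF'
groups:
  - name: example
    rules:
      - alert: ExampleJobDown
        expr: up{team="example"} < 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
          description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
          playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus'
EOF
# check the syntax before committing
promtool check rules rules.d/example_team.rules
```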
### Writing a playbook
Every alert in Prometheus *must* have a playbook annotation. This is,
if done well, a URL pointing at a service page like this one,
typically in the `Pager playbook` section, which explains how to deal
with the alert.
The playbook *must* include those things:
1. the actual code name of the alert (e.g. `JobDown` or
`DiskWillFillSoon`)
2. an example of the alert output (e.g. `Exporter job gitlab_runner
on tb-build-02.torproject.org:9252 is down`)
3. why this alert triggered, what is its impact
4. optionally, how to reproduce the issue
5. how to fix it
How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
broken:
- Where do you think the error will be visible?
- Can we `curl` something to see it happening?
- Is there a dashboard where you can see trends?
- Is there a specific Prometheus query to run live?
- Which log file can we inspect?
- Which systemd service is running it?
The "how to fix it" can be a simple one line, or it can go into a
multiple case example of scenarios that were found in the wild. It's
the hard part: sometimes, when you make an alert, you don't actually
*know* how to handle the situation. If so, explicitly state that
problem in the playbook, and say you're sorry, and that it should be
fixed.
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.
A good example of a proper playbook is the [Textfile collector errors
playbook here][]. It has all of the above points, including actual
fixes for different actual scenarios.
Here's a template to get started:
```
### Foo errors
The `FooAlert` looks like this:
Service Foo has too many errors on test.torproject.org
It means that the service Foo is having some kind of trouble. [Explain
why this happened, what the impact is, and what it means for which
users. Are we losing money, data, exposing users, etc.]
[Optional] You can tell this is a real issue by going to place X and
trying Y.
[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].
[Optional] We do not yet exactly know how to fix this issue, sorry. Please
document here how you fixed it next time.
```
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors
### Adding alerting rules
Adding an alert mostly consists of defining an alerting rule that
matches on a PromQL expression, in a Git repository.
But it already assumes some metrics are available and scraped by
Prometheus. For this, ensure you have followed the tutorials [Adding
metrics to applications][] and [Adding scrape targets][].
[Adding scrape targets]: #adding-scrape-targets
The Prometheus servers regularly pull the [`prometheus-alerts.git`
repository][] for alerting rule and target definitions. Alert rules
can be added through the repository by adding a file in the `rules.d`
directory; see the [`rules.d`][] directory for more documentation on that.
[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
consequence of the file changes. TPA members can accelerate this by
running Puppet on the Prometheus servers, or pulling the code and
reloading the Prometheus server with:
git -C /etc/prometheus-alerts/ pull
systemctl reload prometheus
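After the reload, you can confirm that the new rule was actually loaded by querying the rules API on the Prometheus server (a sketch; `ExampleJobDown` is a hypothetical rule name):

```
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[].name' \
  | grep ExampleJobDown
```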
### Diagnosing alerting failures
Normally, alerts should fire on the Prometheus server and be sent out
to the Alertmanager server, and be visible in Karma. See also the
[alert routing details reference][].
If you're not sure alerts are working, head to the Prometheus
dashboard and look at the `/alerts` and `/rules` pages. For example:

* <https://prometheus.torproject.org/alerts> - should show the configured alerts,
and if they are firing
* <https://prometheus.torproject.org/rules> - should show the configured rules,
and whether they match
Typically, the Alertmanager address (currently
<http://localhost:9093>, but to be [exposed][]) should also be useful
to manage the Alertmanager, but in practice the Debian package does
not ship the web interface, so it's of limited use in that
regard. See the `amtool` section below for more information.
Note that the [`/targets`][] URL is also useful to diagnose problems
with exporters, in general, see also the [troubleshooting section][]
below.
If you can't access the dashboard at all or if the above seems too
complicated, [Grafana][] can be used as a debugging tool for metrics
as well. In the [Explore](https://grafana.torproject.org/explore) section, you can input Prometheus
metrics, with auto-completion, and inspect the output directly.
There's also the [Grafana availability dashboard][], see the [Alerting
dashboards][] section for details.
[troubleshooting section]: #troubleshooting-missing-metrics
[alert routing details reference]: #alert-routing-details
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool][] command. A few useful commands:
* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
--comment="working on it" ALERTNAME`: silence alert ALERTNAME for
an hour, with some comments
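For example, to review and then remove an existing silence (a sketch; the silence ID below is made up, take it from the output of the first command):

```
# list current silences and their IDs
amtool silence query
# expire (remove) a given silence, by ID
amtool silence expire 7d8eb77e-00f9-4e0c-9f9e-f1778d0a12c3
```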
[amtool]: https://manpages.debian.org/amtool.1
### Checking alert history
Note that all alerts sent through the Alertmanager are dumped in
system logs, through a first "fall through" web hook route:
```
routes:
  # dump *all* alerts to the debug logger
  - receiver: 'tpa_http_post_dump'
    continue: true
```
The receiver is configured below:
```
- name: 'tpa_http_post_dump'
  webhook_configs:
    - url: 'http://localhost:8098/'
```
This URL, in turn, runs a simple Python script that just dumps to
standard output all POST requests it receives, which provides us with,
basically, a JSON log of all notifications sent through the
Alertmanager. All logged entries since last boot can be seen with:
journalctl -u tpa_http_post_dump.service -b
You can see a prettier version of recent entries with the `jq`
command, for example:
journalctl -u tpa_http_post_dump.service -o cat -e | grep '^{' | jq -s .[].alerts
Note that the `grep` is required because `journalctl` insists on
bundling supervisor messages in its output, so we filter for JSON
objects, basically.
### Testing alerts
Prometheus can run unit tests for your defined alerts. See [upstream unit test
documentation][].
We managed to build a minimal unit test for an alert. Note that for a unit test
to succeed, the test must match _all_ the labels and annotations of the expected
alerts, including ones that are added by `rewrite` in Prometheus:
```yaml
root@hetzner-nbg1-02:~/tests# cat tpa_system.yml
rule_files:
  - /etc/prometheus-alerts/rules.d/tpa_system.rules
evaluation_interval: 1m
tests:
  # NOTE: interval is *necessary* here. contrary to what the documentation
  # shows, leaving it out will not default to the evaluation_interval set
  # above
  - interval: 1m
    # Set of fixtures for the tests below
    input_series:
      - series: 'node_reboot_required{alias="NetworkHealthNodeRelay",instance="akka.0x90.dk:9100",job="relay",team="network"}'
        # that's "one" for 60 samples, or 60 minutes
        values: '1x60'
    alert_rule_test:
      # NOTE: eval_time is the offset from 0s at which the alert should be
      # evaluated. if it is shorter than the alert's `for` setting, you will
      # have some missing values for a while (which might be something you
      # need to test?). You can play with the eval_time in other test
      # entries to evaluate the same alert at different offsets in the
      # timeseries above.
      - eval_time: 60m
        alertname: NeedsReboot
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: akka.0x90.dk:9100
              job: relay
              team: network
              alias: "NetworkHealthNodeRelay"
            exp_annotations:
              description: "Found pending kernel upgrades for host NetworkHealthNodeRelay"
              playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/upgrades#reboots"
              summary: "Host NetworkHealthNodeRelay needs to reboot"
```
The success result:
```
root@hetzner-nbg1-01:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
SUCCESS
```
A failing test will show you what alerts were obtained and how they compare to
what your failing test was expecting:
```
root@hetzner-nbg1-02:~/tests# promtool test rules tpa_system.yml
Unit Testing: tpa_system.yml
FAILED:
alertname: NeedsReboot, time: 10m,
exp:[
0:
Labels:{alertname="NeedsReboot", instance="akka.0x90.dk:9100", job="relay", severity="warning", team="network"}
Annotations:{}
],
got:[]
```
The above allows us to confirm that, under a specific set of circumstances (the
defined series), a specific query will generate a specific alert with a given
set of labels and annotations.
Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the alertmanager
configuration with:
amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
Or really, what matters in most cases are `severity` and `team`, so
this also works, and gives out the proper route:
amtool config routes test severity="warning" team="network" ; echo $?
Example:
root@hetzner-nbg1-02:~/tests# amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
network team
Ignore the warning, it's the difference between testing the live
server and the local configuration. Naturally, you can test what
happens if the `team` label is missing or incorrect, to confirm
[default route errors][]:
root@hetzner-nbg1-02:~/tests# amtool config routes test severity="warning" team="networking"
fallback
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).
Note that you can also deliver an alert to a webhook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:
curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
[upstream unit test documentation]: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
[default route errors]: #default-route-errors
## Advanced metrics ingestion
This section documents more advanced metrics injection topics that we
rarely need or use.
### Backfilling
Starting with version 2.24, Prometheus [now supports][]
[backfilling][]. This is untested, but [this guide][] might provide a
good tutorial.
[now supports]: https://github.com/prometheus/prometheus/issues/535
[backfilling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling
### Push metrics to the Pushgateway
The [Pushgateway][] is set up on the secondary Prometheus server
(`prometheus2`). Note that you might not need to use the Pushgateway;
see the [article about pushing metrics][] before going down this
route.
The Pushgateway is fairly particular: it listens on port 9091 and gets
data through a fairly simple [curl-friendly commandline][] [API][]. We
have found that, once installed, this command just "does the right
thing", more or less:
echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest
To confirm the data was injected into the Pushgateway, run:
curl localhost:9091/metrics | head
The Pushgateway is scraped, like other Prometheus jobs, every minute,
with metrics kept for a year, at the time of writing. This is
configured, inside Puppet, in `profile::prometheus::server::external`.
Note that it's [not possible to push timestamps][] into the
Pushgateway, so it's not useful to ingest past historical data.
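Metrics pushed by mistake stay in the Pushgateway until they are deleted or the service is restarted without persistence. As a sketch, this deletes the example group pushed above (same `job` and `instance` labels); note that samples already scraped remain in Prometheus:

```
curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest
```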
[article about pushing metrics]: https://prometheus.io/docs/practices/pushing/
[curl-friendly commandline]: https://github.com/prometheus/pushgateway#command-line
[API]: https://github.com/prometheus/pushgateway#api
[not possible to push timestamps]: https://github.com/prometheus/pushgateway#about-timestamps
### Deleting metrics
Deleting metrics can be done through the Admin API. That first needs
to be enabled in `/etc/default/prometheus`, by adding
`--web.enable-admin-api` to the `ARGS` list, then Prometheus needs to
be restarted:
service prometheus restart
WARNING: make sure there is authentication in front of Prometheus
because this could expose the server to more destruction.
Then you need to issue a special query through the API. This, for
example, will wipe all metrics associated with the given instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}'
The same, but only for about an hour, good for testing that only the
wanted metrics are destroyed:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&start=2021-10-25T19:00:00Z&end=2021-10-25T20:00:00Z'
To match only a job on a specific instance:
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&match[]={job="gitlab"}'
Deleted metrics are not necessarily immediately removed from disk but
are "eligible for compaction". Changes *should* show up immediately in
queries, however. The "Clean Tombstones" endpoint should be used to remove
samples from disk, if that's absolutely necessary:
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
Make sure to disable the Admin API when done.
## Pager playbook
This section documents alerts and issues with the Prometheus service
itself. Do *NOT* document *all* alerts possibly generated by
Prometheus here! Document those in the individual service pages, and
link to them in the alert's `playbook` annotation.

What belongs here are only alerts that truly don't have any other place
to go, or that are completely generic to any service (e.g. `JobDown`
belongs here). Generic operating system issues like "disk
full" or else *must* be documented elsewhere.
### Troubleshooting missing metrics
If metrics do not correctly show up in Grafana, it might be worth
checking in the [Prometheus dashboard][] itself for the same
metrics. Typically, if they do not show up in Grafana, they won't show
up in Prometheus either, but it's worth a try, even if only to see the
raw data.
Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the [`/targets`][]
listing. If the target is "unhealthy", it will be marked in red and an
error message will show up.
[`/targets`]: https://prometheus.torproject.org/targets
If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
from the host `gayi`:
curl -s http://gayi.torproject.org:9117/metrics | grep apache
In the case of [this bug][], the metrics were not showing up at all:
root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0
Notice, however, the `apache_exporter_scrape_failures_total`, which
was incrementing. From there, we reproduced the work the exporter was
doing manually and fixed the issue, which involved passing the correct
argument to the exporter.
[Prometheus dashboard]: https://prometheus.torproject.org/
[this bug]: https://github.com/voxpupuli/puppet-prometheus/pull/541
### Slow startup times
If Prometheus takes a long time to start, and floods logs with lines
like this every second:
Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196
... it's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:
Nov 01 19:43:04 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:04.533Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=314.859946ms wal_replay_duration=1m16.079474672s total_replay_duration=1m16.396139067s
The solution for this is to use the [memory-snapshot-on-shutdown
feature flag][], but that is available only from 2.30.0 onward (not
in Debian bullseye), and there are critical bugs in the feature flag
before 2.34 (see [PR 10348][]), so tread carefully.
In other words, this is frustrating, but expected for older releases
of Prometheus. Newer releases may have optimizations for this, but
they need a restart to apply.
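As a sketch, on a release that does support the feature flag, enabling it would mean adding it to the daemon arguments in `/etc/default/prometheus` (preserving whatever arguments are already there) and restarting:

```
# in /etc/default/prometheus, append the flag to the existing ARGS, for example:
ARGS="--enable-feature=memory-snapshot-on-shutdown"
# then restart so the flag takes effect on the next (slow) startup
systemctl restart prometheus
```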
[memory-snapshot-on-shutdown feature flag]: https://prometheus.io/docs/prometheus/latest/feature_flags/#memory-snapshot-on-shutdown
[PR 10348]: https://github.com/prometheus/prometheus/pull/10348
### Pushgateway errors
The Pushgateway web interface provides some basic information about
the metrics it collects, and allows you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
An error has occurred while serving metrics:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
That's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem][known problems]
in earlier versions that was [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).
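A sketch of that workaround, assuming the Debian service name:

```
systemctl restart prometheus-pushgateway
```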
[known problems]: https://github.com/prometheus/pushgateway/issues/232
[fixed in 0.10]: https://github.com/prometheus/pushgateway/pull/290
### Running out of disk space
In [tpo/tpa/team#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even if the
number of targets didn't change. This is a typical problem in time
series like this where the "cardinality" of metrics grows without
bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions
happening. This information is also available in the normal [disk
usage graphic][].
We then headed for the self-diagnostics Prometheus provides at:
<https://prometheus.torproject.org/classic/status>
The "Most Common Label Pairs" section will show us which `job` is
responsible for the most number of metrics. It should be `job=node`,
as that collects a lot of information for *all* the machines managed
by TPA. About 100k pairs is expected there.
It's also expected to see the "Highest Cardinality Labels" to be
`__name__` at around 1600 entries.
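The same numbers can be pulled from the TSDB status API on the server itself, which can be easier to copy into a ticket (a sketch):

```
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName'
```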
We haven't implemented it yet, but the [upstream Storage
documentation][] has some interesting tips, including [advice on
long-term storage][] which suggests tweaking the
`storage.local.series-file-shrink-ratio`.
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
[tpo/tpa/team#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
### Default route errors
If you get an email like:
```
Subject: Configuration error - Default route: [FIRING:1] JobDown
```
It's because an alerting rule fired with an incorrect
configuration. Instead of being routed to the proper team, it fell
through the default route.
This is not an emergency in the sense that it's a normal alert, but it
just got routed improperly. It should be fixed, in time. If in a rush,
open a ticket for the team likely responsible for the alerting
rule.
#### Finding the responsible party
So the first step, even if just filing a ticket, is to find the
responsible party.
Let's take this email for example:
```
Date: Wed, 03 Jul 2024 13:34:47 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: Configuration error - Default route: [FIRING:1] JobDown
CONFIGURATION ERROR: The following notifications were sent via the default route node, meaning
that they had no team label matching one of the per-team routes.
This should not be happening and it should be fixed. See:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus#reference
Total firing alerts: 1
## Firing Alerts
-----
Time: 2024-07-03 13:34:17.366 +0000 UTC
Summary: Job mtail@rdsys-test-01.torproject.org is down
Description: Job mtail on rdsys-test-01.torproject.org has been down for more than 5 minutes.
-----
```
In the above, the `mtail` job on `rdsys-test-01` "has been down for
more than 5 minutes" and the alert was routed to `root@localhost`.
The more likely target for that rule would probably be TPA, which
manages the `mtail` service and jobs, even though the services on that
host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.
In this case, [tpo/tpa/team#41667][] was filed.
[tpo/tpa/team#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667
#### Fixing routing
To *fix* this issue, you must first reproduce the query that triggered
the alert. This can be found in the [Prometheus alerts dashboard][],
if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|----------------------------------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |
In this case, we can see there's no `team` label on that metric, which
is the root cause.
If we *can't* find the alert anymore (say it fixed itself), we can
still try to look for the matching alerting rule. Grep for the
`alertname` above in `prometheus-alerts.git`. In this case, we find:
```
anarcat@angela:prometheus-alerts$ git grep JobDown
rules.d/tpa_system.rules: - alert: JobDown
```
and the following rule:
```
- alert: JobDown
  expr: up < 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'Job {{ $labels.job }}@{{ $labels.alias }} is down'
    description: 'Job {{ $labels.job }} on {{ $labels.alias }} has been down for more than 5 minutes.'
    playbook: "TODO"
```
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't do the exact same query and expect to
find the same host; instead, we need to broaden the query by dropping
the conditional (so just `up`) *and* adding the right labels. In this
case, this should do the trick:
up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
which, when we query Prometheus directly, gives us the following
metric:
up{alias="rdsys-test-01.torproject.org",classes="role::rdsys::backend",instance="rdsys-test-01.torproject.org:3903",job="mtail"}
0
There you can see *all* the labels associated with the metric. Those
match the alerting rule labels, but that may not always be the case,
so that step can be helpful to confirm root cause.
So, in this case, the `mtail` job doesn't have the right team
label. The fix was to add the team label to the scrape job:
```
commit 68e9b463e10481745e2fd854aa657f804ab3d365
Author: Antoine Beaupré <anarcat@debian.org>
Date:   Wed Jul 3 10:18:03 2024 -0400

    properly pass team label to postfix mtail job

    Closes: tpo/tpa/team#41667

diff --git a/modules/mtail/manifests/postfix.pp b/modules/mtail/manifests/postfix.pp
index 542782a33..4c30bf563 100644
--- a/modules/mtail/manifests/postfix.pp
+++ b/modules/mtail/manifests/postfix.pp
@@ -8,6 +8,11 @@ class mtail::postfix (
   class { 'mtail':
     logs => '/var/log/mail.log',
     scrape_job => $scrape_job,
+    scrape_job_labels => {
+      'alias' => $::fqdn,
+      'classes' => "role::${pick($::role, 'undefined')}",
+      'team' => 'TPA',
+    },
   }
   mtail::program { 'postfix':
     source => 'puppet:///modules/mtail/postfix.mtail',
```
See also [testing alerts][] to drill down into queries and alert
routing, in case the above doesn't work.
[Prometheus alerts dashboard]: https://prometheus.torproject.org/classic/alerts
[testing alerts]: #testing-alerts
### Exporter job down warnings
If you see an error like:
Exporter job gitlab_runner on tb-build-02.torproject.org:9252 is down
That is because Prometheus cannot reach the exporter at the given
address. The right way forward is to look at the [targets listing][]
and see why Prometheus is failing to scrape the target.
[targets listing]: https://prometheus.torproject.org/classic/targets
#### Service down
The simplest and most obvious case is that the service is just
down. For example, Prometheus has this to say about the above
`gitlab_runner` job:
Get "http://tb-build-02.torproject.org:9252/metrics": dial tcp [2620:7:6002:0:3eec:efff:fed5:6c40]:9252: connect: connection refused
In this case, the `gitlab-runner` service was just not running (yet). It
was being configured and had been added to Puppet, but wasn't yet
correctly set up.
In another scenario, however, it might just be that the service is
down. Use `curl` to confirm Prometheus' view, testing over both IPv4
and IPv6:
curl -4 http://tb-build-02.torproject.org:9252/metrics
curl -6 http://tb-build-02.torproject.org:9252/metrics
Try this from the server itself as well.
If you know which service it is (and the job name should be a good
hint), check the service on the server, in this case:
systemctl status gitlab-runner
#### Invalid exporter output
In another case:
Exporter job civicrm@crm.torproject.org:443 is down
Prometheus was failing with this error:
expected value after metric, got "INVALID"
That means there's a syntax error in the metrics output, in this case
no value was provided for a metric, like this:
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up
See [tpo/web/civicrm#149][] for further details on this
outage.
[tpo/web/civicrm#149]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149
#### Forbidden errors
Another example might be:
server returned HTTP status 403 Forbidden
... in which case there's a permission issue on the exporter
endpoint. Try to reproduce the issue by pulling the endpoint directly,
on the Prometheus server, with, for example:
curl -sSL https://donate.torproject.org:443/metrics
... or whatever URL is visible in the targets listing above. This
could be a web server configuration or lack of matching credentials in
the exporter configuration. Look in `tor-puppet.git`, the
`profile::prometheus::server::internal::collect_scrape` in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
### Apache exporter scraping failed
If you get the error `Apache Exporter cannot monitor web server on
test.example.com` (`ApacheScrapingFailed`), Apache is up, but the
[Apache exporter][] cannot pull its metrics from there.
That means the exporter cannot pull the URL
`http://localhost/server-status/?auto`. To reproduce, pull the URL
with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default vhost" was disabled (`apache2::default_vhost` in
Hiera).
There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` vhost to answer properly on this address. Verify that it's
present, and consider using `apache2ctl -S` to see the vhost
configuration.
See also the [Apache web server diagnostics][] in the incident
response docs for broader issues with web servers.
[Apache exporter]: https://github.com/Lusitaniae/apache_exporter/
[Apache web server diagnostics]: #apache-web-server-diagnostics
### Textfile collector errors
The `NodeTextfileCollectorErrors` alert looks like this:
Node exporter textfile collector errors on test.torproject.org
It means that the [textfile collector][] is having trouble parsing one
or many of the files in its `--collector.textfile.directory` (defaults
to `/var/lib/prometheus/node-exporter`).
[textfile collector]: https://github.com/prometheus/node_exporter#textfile-collector
The error should be visible in the node exporter logs; run the
following command to see it:
journalctl -u prometheus-node-exporter -e
Here's a list of issues found in the wild, but your particular issue
might be different.
#### Wrong permissions
```
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```
In this case, the file was created as a tempfile and moved into place
without fixing the permission. The fix was to simply create the file
without the `tempfile` Python library, with a `.tmp` suffix, and just
move it into place.
#### Garbage in a text file
```
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```
This was an experimental metric designed in [tpo/tpa/team#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:
```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind=reboot} 1725545703.588789
```
It was missing quotes around `reboot`, the proper output would have
been:
```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789
```
But the file was simply removed in this case.
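A quick way to catch this kind of formatting error before (or after) dropping a file in place is to run it through `promtool`, which lints the exposition format (a sketch, using the file from the error above):

```
promtool check metrics < /var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom
```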
[tpo/tpa/team#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery
If a Prometheus or Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backups, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even the backups are destroyed, history will be
lost, but the server should still recover and start tracking new
metrics.
# Reference
## Installation
### Puppet implementation
Every TPA server is configured with a `node-exporter` through the
`roles::monitored` role that is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes the
`profile::prometheus::client` profile which configures each client
correctly with the right firewall rules.
The firewall rules are exported from the server, defined in
`profile::prometheus::server`. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
`roles::monitoring`.
The [Prometheus Puppet module][] was heavily patched to [allow scrape
job collection][] and [use of Debian packages for
installation][], among [many other patches sent by anarcat][].
Much of the initial Prometheus configuration was also documented in
[ticket 29681][] and especially [ticket 29388][] which investigates
storage requirements and possible alternatives for data retention
policies.
[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
[Prometheus Puppet module]: https://github.com/voxpupuli/puppet-prometheus/
[many other patches sent by anarcat]: https://github.com/voxpupuli/puppet-prometheus/pulls?q=author%3Aanarcat+
### Pushgateway
The [Pushgateway][] was configured on the external Prometheus server
to allow for the metrics people to push their data inside Prometheus
without having to write a Prometheus exporter inside Collector.
[Pushgateway]: https://github.com/prometheus/pushgateway
This was done directly inside the
`profile::prometheus::server::external` class, but could be moved to a
separate profile if it needs to be deployed internally. It is assumed
that the gateway script will run directly on `prometheus2` to avoid
setting up authentication and/or firewall rules, but this could be
changed.
### Alertmanager
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([ticket 29864][]).
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
profile if it is deployed on more than one server.
Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the `prometheus.yml` file:
alerting:
  alert_relabel_configs: []
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
[Nagios]: howto/nagios
### Manual node configuration
External services can be monitored by Prometheus, as long as they
comply with the [OpenMetrics][] protocol, which is simply to expose
metrics such as this over HTTP:
metric{label=label_val} value
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node alberti has the device `/dev/sda` mounted
on `/`, formatted as an `ext4` filesystem which has 16160059392 bytes
(~16GB) free.
[OpenMetrics]: https://openmetrics.io/
System-level metrics can easily be monitored by the secondary
Prometheus server. This is usually done by installing the "node
exporter", with the following steps:
* On Debian Buster and later:
apt install prometheus-node-exporter
* On Debian stretch:
apt install -t stretch-backports prometheus-node-exporter
... assuming that backports is already configured. If it isn't, a line like the following in `/etc/apt/sources.list.d/backports.debian.org.list` should suffice:
deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
... followed by an `apt update`, naturally.
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:
* the hostname for the exporter
* the port of the exporter (varies according to the exporter, 9100
for the node exporter)
* how often to scrape the target, if non-default (default: 15s)
TPA then needs to hook the target into a new node `job` under
`scrape_configs` in `prometheus.yml`, which is managed by Puppet in
`profile::prometheus::server`.
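As an illustration, such a stanza would look roughly like the
following (the job name and target host are made up; the real
configuration is generated by Puppet):

    scrape_configs:
      - job_name: 'example-node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - 'example.torproject.org:9100'
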
See also [Adding metrics to applications][], above.
[Adding metrics to applications]: #adding-metrics-to-applications
## Monitored services
Those are the actual services monitored by Prometheus.
### Internal server (prometheus1)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
Access to the internal server is fairly public: the metrics there are
not considered security sensitive, and the authentication is only
there to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (prometheus2)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
might lead to timing attacks against the network and/or leak sensitive
information. The external server also explicitly does *not* scrape TPA
servers automatically: it only scrapes certain services that are
manually configured by TPA.
Those are the services currently monitored by the external server:
* [bridgestrap][]
* [rdsys][]
* OnionPerf external nodes' `node_exporter`s
* connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might fall out of sync with the actual
implementation; check [Puppet][] in
`profile::prometheus::server::external` for the actual deployment.
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[bridgestrap]: https://bridges.torproject.org/bridgestrap-metrics
[rdsys]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket tpo/tpa/team#30028][] around launch time. Here we
can document more exporters we find along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [hsprober][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [haproxy_exporter][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket tpo/tpa/team#30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[hsprober]: https://git.autistici.org/ale/hsprober
[haproxy_exporter]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## SLA
Prometheus is currently not doing alerting so it doesn't have any sort
of guaranteed availability. It should, hopefully, not lose too many
metrics over time so we can do proper long-term resource planning.
## Design
Here is, from the [Prometheus overview documentation][], the
basic architecture of a Prometheus site:
[Prometheus overview documentation]: https://prometheus.io/docs/introduction/overview/
<img src="https://prometheus.io/assets/architecture.png" alt="A
drawing of Prometheus' architecture, showing the push gateway and
exporters adding metrics, service discovery through file_sd and
Kubernetes, alerts pushed to the Alertmanager and the various UIs
pulling from Prometheus" />
As you can see, Prometheus is somewhat tailored towards
[Kubernetes][] but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager which could be used to (eventually) replace our
Nagios deployment.
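As a rough sketch, a `file_sd`-based scrape job looks like this (the
file path is illustrative; in our case Puppet writes out the target
files):

    scrape_configs:
      - job_name: 'node'
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/node/*.yaml'

Each target file then lists entries like `- targets:
['example.torproject.org:9100']`, optionally with extra labels
attached.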
[Kubernetes]: https://kubernetes.io/
The diagram does not show that Prometheus can federate across multiple
instances, nor that the Alertmanager can be configured for high
availability.
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the
`alerting` section of `prometheus.yml`; see the [Installation][]
section.
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts
through a number of other communication channels. For example, to send
IRC notifications, we have a daemon binding to `localhost` on the
Prometheus server waiting for webhook calls, and the corresponding
receiver has a `webhook_configs` section instead of `email_configs`.
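For illustration, such a webhook receiver might look like this (the
URL and port are made up, not the actual configuration):

    - name: 'irc-tor-admin'
      webhook_configs:
        - url: 'http://localhost:8099/prometheus/alerts'
          send_resolved: true
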
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and defines default options that the other routes inherit.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default receiver uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
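Putting it together, the value of `prometheus::alertmanager::route`
then looks roughly like this (the receiver names match the examples
above, but the exact timing values and matchers are illustrative):

    receiver: 'fallback'
    group_by: ['alertname', 'team']
    group_wait: 5s      # delay before the first notification for a new group
    group_interval: 5m  # delay before notifying about new alerts in an existing group
    routes:
      - receiver: 'TPA-email'
        matchers:
          - 'team = "TPA"'
          - 'severity = "critical"'
        continue: true  # keep matching, so IRC is also notified
      - receiver: 'irc-tor-admin'
        matchers:
          - 'team = "TPA"'
          - 'severity =~ "critical|warning"'

The `continue: true` flag is what allows a single critical alert to
produce both an email and an IRC notification, as described above.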
## Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
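On the Prometheus side, the Pushgateway is scraped like any other
exporter, except that the job should set `honor_labels: true` so the
pushed `job` and `instance` labels are preserved. A minimal sketch
(9091 is the Pushgateway's default port; the job name is
illustrative):

    scrape_configs:
      - job_name: 'pushgateway'
        honor_labels: true
        static_configs:
          - targets:
              - 'localhost:9091'
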
## Blackbox exporter
Most exporters are pretty straightforward: a service binds to a port and exposes
metrics through HTTP requests on that port, generally at the `/metrics` URL.

The blackbox exporter, however, is a little more involved. The exporter can
be configured to run a bunch of different tests (e.g. TCP connections, HTTP
requests, ICMP pings) against a list of targets of its own. So the Prometheus
server has a single target, the host and port of the blackbox exporter, but
that exporter is in turn configured to check other hosts.
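As a sketch, this is typically done with a scrape job that relabels
each probed host into a `target` parameter for the exporter (the
module name and probed host are illustrative):

    scrape_configs:
      - job_name: 'blackbox-icmp'
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets:
              - 'pauli.torproject.org'
        relabel_configs:
          # pass the probed host as the ?target= parameter
          - source_labels: [__address__]
            target_label: __param_target
          # keep the probed host as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # actually scrape the blackbox exporter itself
          - target_label: __address__
            replacement: localhost:9115
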
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
One thing that's useful to know, in addition to how it's configured, is how to
debug it. You can query the exporter from localhost to get more
information. If you are using this method for debugging, you'll most probably
want to include debugging output. For example, to run an ICMP test on host
pauli.torproject.org:
    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
## Alertmanager
The [Alertmanager][] is a separate program that receives alerts
generated by Prometheus servers through an API, then groups and
deduplicates them before sending notifications by email or other
mechanisms.
[Alertmanager]: https://github.com/prometheus/alertmanager
Here's what the internal design of the Alertmanager looks like:
<img src="https://raw.githubusercontent.com/prometheus/alertmanager/master/doc/arch.svg" alt="Internal architecture of the Alertmanager, showing how it receives alerts from Prometheus through an API and internally pushes them through various storage queues and deduplicating notification pipelines, along with a clustered gossip protocol" />
The first deployments of the Alertmanager at TPO do not feature
a "cluster", or high availability (HA) setup.
Alerts are typically sent over email, but Alertmanager also has
builtin support for:
* Email
* Slack
* [Victorops][] (now Splunk)
* [Pagerduty][]
* [Opsgenie][] (now Atlassian)
* Wechat
There's also a [generic webhook receiver][] which can be used to send
notifications to arbitrary HTTP endpoints. Many other integrations are
implemented through that webhook, for example:
* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [matrix-alertmanager][] (JS) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
* [Phabricator][]
* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
Telegram, Sipgate, etc)
* [Sentry][]
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [nopp/alertmanager-webhook-telegram-python][] or [metalmatze/alertmanager-bot][]
* [Twilio][]
* [Wechat][]
* Zabbix: [alertmanager-zabbix-webhook][] or [zabbix-alertmanager][]
And that is only what was available at the time of writing; the
[alertmanager-webhook][] and [alertmanager tags][] topics on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
Cloudflare's [unsee][]). The web interface is
not shipped with the Debian package, because it depends on the [Elm
compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].
In general, when working on alerting, it's worth keeping in mind [the
"My Philosophy on Alerting" paper from a Google engineer][] (now the
[Monitoring distributed systems][] chapter of the [Site Reliability
Engineering][] O'Reilly book).
Another issue with alerting in Prometheus is that you can only silence
alerts for a certain amount of time; after that, you get notified
again. The [kthxbye bot][] works around that issue.
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic webhook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[matrix-alertmanager]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
[Phabricator]: https://github.com/knyar/phalerts
[Sachet]: https://github.com/messagebird/sachet
[Sentry]: https://github.com/mikeroll/alertmanager-sentry-gateway
[Signal]: https://github.com/dadevel/alertmanager-signal-receiver
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[nopp/alertmanager-webhook-telegram-python]: https://github.com/nopp/alertmanager-webhook-telegram-python
[metalmatze/alertmanager-bot]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[alertmanager-zabbix-webhook]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[zabbix-alertmanager]: https://github.com/devopyio/zabbix-alertmanager
[alertmanager-webhook]: https://github.com/topics/alertmanager-webhook
[alertmanager tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
### Alert timing details
Alert timing can be a hard topic to understand in Prometheus alerting,
because there are many components associated with it, and Prometheus
documentation is not great at explaining how things work clearly. This
is an attempt at explaining various parts of it as I (anarcat)
understand it as of 2024-09-19, based on the latest documentation
available on <https://prometheus.io> and the current [Alertmanager git
HEAD][].
First, there might be a time vector involved in the Prometheus
query. For example, take the query:
increase(django_http_exceptions_total_by_type_total[5m]) > 0
Here, the "vector range" is `5m` or five minutes. You might think this
will fire only after 5 minutes have passed. I'm not actually sure. In
my observations, I have found this fires as soon as an increase is
detected, but will *stop* after the vector range has passed.
Second, there's the `for:` parameter in the alerting rule. Say this
was set to 5 minutes again:
    - alert: DjangoExceptions
      expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
      for: 5m
This means that the alert will be considered only `pending` for that
period. Prometheus will *not* send an alert to the Alertmanager at all
unless `increase()` was sustained for the period. If *that* happens,
then the alert is marked as `firing` and Alertmanager will start
getting the alert.
(Alertmanager *might* be getting the alert in the `pending` state, but
that makes no difference to our discussion: it will not send alerts
before that period has passed.)
Third, there's another setting, `keep_firing_for`, that will make
Prometheus keep firing the alert even after the query evaluates to
false. We're ignoring this for now.
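For reference, a complete rule carries more than `expr` and `for`: it
also sets the labels that the Alertmanager routes match on, and
annotations such as a summary or a link to a run book. A rough sketch,
with made-up label values and a placeholder annotation key and URL:

    - alert: DjangoExceptions
      expr: increase(django_http_exceptions_total_by_type_total[5m]) > 0
      for: 5m
      labels:
        team: TPA
        severity: warning
      annotations:
        summary: 'Django exceptions on {{ $labels.instance }}'
        runbook: 'https://example.com/wiki/service/foo#pager-playbook'
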
At this point, the alert has reached Alertmanager and it needs to
decide what to do with it. More timers are involved.
Alerts will be evaluated against the alert routes, thus aggregated
into a new group or added to an existing group according to that
route's `group_by` setting, and then Alertmanager will evaluate the
timers set on the particular route that was matched. An alert group is
created when an alert is received and no other alerts already match
the same values for the `group_by` criteria. An alert group is removed
when all alerts in a group are in state `inactive` (e.g. resolved).
Fourth, there's the `group_wait` setting (defaults to 5 seconds, can
be [customized by route][]). This keeps Alertmanager from
routing any alerts for a while, thus allowing it to batch the _first_
notification for all alerts in the same group. It
implies that you will not receive a notification for a new alert
before that timer has elapsed. See also the (rather terse)
[documentation on grouping][].
(The `group_wait` timer is initialized when the alerting group is
created, see [`dispatch/dispatch.go`, line 415, function
`newAggrGroup`][].)
Now, *more* alerts might be sent by Prometheus if more metrics match
the above expression. They are *different* alerts because they have
different labels (say, another host might have exceptions, above, or,
more commonly, other hosts require a reboot). Prometheus will then
relay that alert to the Alertmanager, and another timer comes in.
Fifth, when a new alert joins a group that is already firing,
Alertmanager waits `group_interval` (defaults to 5m) before
resending a notification for that group.
When Alertmanager first creates an alert group, a thread is started
for that group and the _route_'s `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
So new alerts merged in a group will wait _up to_ `group_interval` before
being relayed.
(The `group_interval` timer is also initialized [in `dispatch.go`, line
460, function `aggrGroup.run()`][]. It's done *after* that function
waits for the previous timer which is normally based on the
`group_wait` value, but can be switched to `group_interval` after that
very iteration, of course.)
So, conclusions:
- If an alert flaps because it pops in and out of existence, consider
tweaking the query to cover a longer vector, by increasing the time
range (e.g. switch from `5m` to `1h`), or by comparing against a
moving average
- If an alert triggers too quickly due to a transient event (say
network noise, or someone messing up a deployment but you want to
give them a chance to fix it), increase the `for:` timer.
- Conversely, if you *fail* to detect transient outages, *reduce* the
  `for:` timer, but be aware this might pick up more noise.
- If alerts come too soon and you get a flood of alerts
  when an outage *starts*, increase `group_wait`.
- If alerts come in slowly but fail to be grouped because they don't
  arrive at the same time, increase `group_interval`.
This analysis was done in response to a [mysterious failure to send
notification in a particularly flappy alert][].
[Alertmanager git HEAD]: https://github.com/prometheus/alertmanager/tree/e9904f93a7efa063bac628ed0b74184acf1c7401
[customized by route]: https://prometheus.io/docs/alerting/latest/configuration/#route
[documentation on grouping]: https://prometheus.io/docs/alerting/latest/alertmanager/#grouping
[`dispatch/dispatch.go`, line 415, function `newAggrGroup`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L415
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
## Issues
There is no issue tracker specifically for this project, [File][new-ticket] or
[search][] for issues in the [team issue tracker][search] with the
~Prometheus label.
[new-ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Prometheus
### Known issues
Those are major issues that are worth knowing about Prometheus in
general, and our setup in particular:
- bind mounts generate duplicate metrics, upstream issue: [Way to
distinguish bind mounted path ?][], possible workaround: manually
specify known bind mount points
(e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
but that can hide actual, real mountpoints, possible fix: the
`node_filesystem_mount_info` metric, [added in PR 2970 from
2024-07-14][], unreleased as of 2024-08-28
- high cardinality metrics from exporters we do not control can fill
the disk
- no long-term metrics storage, issue: [multi-year metrics storage][]
In general, the service is still being launched, see [TPA-RFC-33][]
for the full deployment plan.
[Way to distinguish bind mounted path ?]: https://github.com/prometheus/node_exporter/issues/600
[added in PR 2970 from 2024-07-14]: https://github.com/prometheus/node_exporter/pull/2970
[multi-year metrics storage]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
### Resolved issues
No major issue resolved so far is worth mentioning here.
## Maintainer, users, and upstream
The Prometheus services were set up and are managed by anarcat
inside TPA. The internal Prometheus server is mostly used by TPA staff
to diagnose issues. The external Prometheus server is used by various
TPO teams for their own monitoring needs.
The upstream Prometheus projects are diverse and generally active as
of early 2021. Since Prometheus is used as a de facto standard in the
new "cloud native" communities like Kubernetes, it has seen an upsurge
of development and interest from various developers and
companies. The future of Prometheus should therefore be fairly bright.
The individual exporters, however, can be hit and miss. Some exporters
are "code dumps" from companies and not very well maintained. For
example, [Digital Ocean][] dumped the [bind_exporter][] on GitHub,
but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [voxpupuli
collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by
anarcat, but work still remains, see [upstream issue 32][] for
details.
[`puppet-prometheus`]: https://github.com/voxpupuli/puppet-prometheus/
[Digital Ocean]: https://github.com/digitalocean/
[bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15
[voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, both for system-level and application-specific
metrics.
## Logs and metrics
Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.
Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
would still be possible to deduce some activity patterns from the
metrics generated by Prometheus and use them in side-channel attacks,
which is why access to the external Prometheus server is restricted.
Metrics are held for about a year or less, depending on the server,
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
Note that [TPA-RFC-33][] discusses alternative metrics retention
policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
## Backups
Prometheus servers should be fully configured through Puppet and
require few backups. The metrics themselves are kept in
`/var/lib/prometheus2` and should be backed up along with our regular
[backup procedures][].
WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [tpo/tpa/team#41627][] for the
discussion. This should eventually be mitigated by a high availability
setup ([tpo/tpa/team#41643][]).
[backup procedures]: service/backup
[tpo/tpa/team#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[tpo/tpa/team#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation
* [Prometheus home page][]
* [Prometheus documentation][]
* [Prometheus developer blog][]
[Prometheus home page]: https://prometheus.io/
[Prometheus documentation]: https://prometheus.io/docs/introduction/overview/
[Prometheus developer blog]: https://www.robustperception.io/tag/prometheus/
# Discussion
## Overview
The Prometheus and [Grafana][] services were set up after anarcat
realized that there was no "trending" service inside TPA after
Munin had died ([ticket 29681][]). The "node exporter" was deployed on
all TPA hosts in mid-March 2019 ([ticket 29683][]) and remaining
traces of Munin were removed in early April 2019 ([ticket 29682][]).
[ticket 29683]: https://bugs.torproject.org/29683
[ticket 29682]: https://bugs.torproject.org/29682
Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a downsampling server in the
future.
[ticket 31244]: https://bugs.torproject.org/31244
Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.
[ticket 31159]: https://bugs.torproject.org/31159
It was originally thought Prometheus could completely replace
[Nagios][] as well ([ticket 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, while Prometheus
metrics are just that: metrics, without thresholds. This makes
replacing Nagios harder, because a large number of alerts need to be
written to replace the existing checks. A lot of reports and
functionality built into Nagios, like availability reports and
acknowledgements, would also need to be reimplemented.
## Goals
This section didn't exist when the project was launched, so this is
really just second-guessing...
### Must have
* Munin replacement: long-term trending metrics to predict resource
allocation, with graphing
* Free software, self-hosted
* Puppet automation
### Nice to have
* Possibility of eventual Nagios phase-out ([ticket 29864][])
[ticket 29864]: https://bugs.torproject.org/29864
### Non-Goals
* Data retention beyond one year
## Approvals required
The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket
29389][]). The secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][].
[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Proposed Solution
Prometheus was chosen, see also [Grafana][].
## Cost
N/A.
## Alternatives considered
We considered retaining Nagios/Icinga as an alerting system, separate
from Prometheus, but ultimately decided against it in [TPA-RFC-33][].
### Alerting rules in Puppet
Alerting rules are currently stored in an external
[`prometheus-alerts.git` repository][] that holds not only TPA's
alerts, but also those of other teams. So the rules
are _not_ directly managed by Puppet, although Puppet ensures
that the repository is checked out at the most recent commit on the
Prometheus servers.
The rationale is that rule definitions should appear only once and we
already had the above-mentioned repository that could be used to
configure alerting rules.
We were concerned we would potentially have multiple sources of truth
for alerting rules. We already have that for scrape targets, but that
doesn't seem to be an issue. It did feel, however, critical for the
more important alerting rules to have a single source of truth.
### Migrating from Munin
Here's a quick cheat sheet from people used to Munin and switching to
Prometheus:
| What | Munin | Prometheus |
|-------------------|-----------------|----------------------------------------|
| Scraper | `munin-update` | Prometheus |
| Agent | `munin-node` | Prometheus, `node-exporter` and others |
| Graphing | `munin-graph` | Prometheus or Grafana |
| Alerting | `munin-limits` | Prometheus, Alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, [text-based][] |
| Storage format | RRD | Custom time series database |
| Down-sampling | Yes | No |
| Default interval | 5 minutes | 15 seconds |
| Authentication | No | No |
| Federation | No | Yes (can fetch from other servers) |
| High availability | No | Yes (alert-manager gossip protocol) |
[text-based]: https://prometheus.io/docs/instrumenting/exposition_formats/
Basically, Prometheus is similar to Munin in many ways:
* It "pulls" metrics from the nodes, although it does it over HTTP
(to <http://host:9100/metrics>) instead of a custom TCP protocol
like Munin
* The agent running on the nodes is called `prometheus-node-exporter`
  instead of `munin-node`. It scrapes only a set of built-in
  parameters like CPU, disk space and so on; different exporters are
  necessary for different applications (like
  `prometheus-apache-exporter`), and any application can easily
  implement an exporter by exposing a Prometheus-compatible
  `/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of
  authentication built in. We rely on IP-level firewalls to avoid
  leakage
* The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron`
* Graphs are generated on the fly through the crude Prometheus web
  interface or by frontends like Grafana, instead of being constantly
  regenerated by `munin-graph`
* Samples are stored in a custom "time series database" (TSDB) in
  Prometheus instead of the (ad-hoc) RRD standard
* Unlike RRD, Prometheus performs *no* down-sampling; it relies on
  smart compression to spare disk space, but still uses more than Munin
* Prometheus scrapes samples much more aggressively than Munin by
default, but that interval is configurable
* Prometheus can scale horizontally (by sharding different services
to different servers) and vertically (by aggregating different
servers to a central one with a different sampling frequency)
natively - `munin-update` and `munin-graph` can only run on a
single (and same) server
* Prometheus can act as a high availability alerting system thanks
to its `alertmanager` that can run multiple copies in parallel
without sending duplicate alerts - `munin-limits` can only run on a
single server