[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)]:https://prometheus.torproject.org/graph?g0.expr=avg(avg_over_time(up{job%3D"node"}[30d]))
[How many hosts are online at any given point in time]:https://prometheus.torproject.org/graph?g0.expr=sum(count(up%3D=1))/sum(count(up))+by+(alias)
[How long did an alert fire over a given period of time]:https://prometheus.torproject.org/graph?g0.expr=sum_over_time(ALERTS{alertname%3D"MemFullSoon"}[1d:1s])
### Disk usage
This is a less strict version of the [`DiskWillFillSoon` alert][].
[Find disks that will be full in 6 hours]:https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
[Number of machines]:https://prometheus.torproject.org/graph?g0.expr=count(up{job%3D"node"})
[Number of machine per OS version]:https://prometheus.torproject.org/graph?g0.expr=count(node_os_info)+by+(version_id,+version_codename)
[Number of machines per exporters, or technically, number of machines per job]:https://prometheus.torproject.org/graph?g0.expr=sort_desc(sum(up{job%3D~"$job"})+by+(job))
[Number of CPU cores, memory size, filesystem and LVM sizes]:https://prometheus.torproject.org/graph?g0.expr=count(node_cpu_seconds_total{classes%3D~"$class",mode%3D"system"})
[Uptime, in days]:https://prometheus.torproject.org/graph?g0.expr=round((time()+-+node_boot_time_seconds)+/+(24*60*60))
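The same expressions can also be evaluated from the command line
through the Prometheus HTTP API. A minimal sketch, run on the
Prometheus server itself (adjust the URL and add authentication if
going through the public endpoint), using the uptime query above:

    curl -sG http://localhost:9090/api/v1/query \
        --data-urlencode 'query=round((time() - node_boot_time_seconds) / (24*60*60))'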
[not possible to push timestamps]:https://github.com/prometheus/pushgateway#about-timestamps
### Deleting metrics
Deleting metrics can be done through the Admin API. It first needs to
be enabled in `/etc/default/prometheus`, by adding
`--web.enable-admin-api` to the `ARGS` list, and then Prometheus needs
to be restarted:

    service prometheus restart
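For reference, the resulting line in `/etc/default/prometheus` looks
roughly like this (a sketch: the real `ARGS` value on the server likely
carries other flags that must be kept):

    ARGS="--web.enable-admin-api"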
WARNING: make sure there is authentication in front of Prometheus,
because exposing this API allows anyone to delete data on the server.
Then you need to issue a special query through the API. This, for
example, will wipe all metrics associated with the given instance:

    curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}'
The same, but restricted to a one-hour window, which is good for
testing that only the intended metrics are destroyed:

    curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&start=2021-10-25T19:00:00Z&end=2021-10-25T20:00:00Z'
To match only a job on a specific instance:

    curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="gitlab-02.torproject.org:9101"}&match[]={job="gitlab"}'
Deleted metrics are not necessarily immediately removed from disk but
are "eligible for compaction". Changes *should* show up immediately in
query results, however. The "Clean Tombstones" endpoint should be used
to actually remove the deleted samples from disk.
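A minimal sketch of that call, assuming the Admin API was enabled as
described above:

    curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones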
The Pushgateway web interface provides some basic information about
the metrics it collects, and allows you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.
To inspect metrics by hand, you can pull them directly from the
Pushgateway:

    curl localhost:9091/metrics
If you get this error while pulling metrics from the Pushgateway:

```
An error has occurred while serving metrics:
collected metric "some_metric" { label:<name:"instance" value:""> label:<name:"job" value:"some_job"> label:<name:"tag" value:"val1"> counter:<value:1> } was collected before with the same name and label values
```
It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, one of the [known problems][] in
earlier versions that was [fixed in 0.10][] (Debian bullseye and
later). A workaround is simply to restart the Pushgateway (and clear
the storage, if persistence is enabled; see the `--persistence.file`
flag).
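A sketch of that workaround, assuming the systemd unit name from the
Debian package; only remove the persistence file if the flag is
actually set, and note that the path below is hypothetical:

    systemctl stop prometheus-pushgateway
    rm -f /var/lib/prometheus/pushgateway.data  # hypothetical --persistence.file path
    systemctl start prometheus-pushgateway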
The error should be visible in the node exporter logs; run the
following command to see it:

    journalctl -u prometheus-node-exporter -e
Here's a list of issues found in the wild, but your particular issue
might be different.
#### Wrong permissions
```
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```
In this case, the file was created as a tempfile and moved into place
without fixing the permissions. The fix was to create the file directly
(without the `tempfile` Python library), with a `.tmp` suffix, and then
move it into place.
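A sketch of that pattern in shell, using the file name from the log
above (the metric line itself is just a placeholder):

    cd /var/lib/prometheus/node-exporter
    printf 'tpa_example_metric 1\n' > tpa_backuppg.prom.tmp
    chmod 0644 tpa_backuppg.prom.tmp
    mv tpa_backuppg.prom.tmp tpa_backuppg.prom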
#### Garbage in a text file
```
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```
This was an experimental metric designed in [tpo/tpa/team#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:
```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
```
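The malformed third line is not reproduced here, but for contrast, a
valid version of such a file would look roughly like this (the label
name and value are hypothetical; the point is that label values must be
double-quoted):

```
# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero
# TYPE node_shutdown_scheduled_timestamp_seconds gauge
node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 0
```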
Here's what the internal design of the Alertmanager looks like:
<img src="https://raw.githubusercontent.com/prometheus/alertmanager/master/doc/arch.svg" alt="Internal architecture of the Alertmanager, showing how alerts arrive from Prometheus through an API and are pushed internally through various storage queues and deduplicating notification pipelines, along with a clustered gossip protocol" />
The first deployments of the Alertmanager at TPO do not feature
a "cluster", or high availability (HA) setup.
Alerts are typically sent over email, but Alertmanager also has
builtin support for:
* Email
* Slack
* [Victorops][] (now Splunk)
* [Pagerduty][]
* [Opsgenie][] (now Atlassian)
* Wechat
There's also a [generic webhook receiver][] which is typically used
to send notifications to systems without builtin support. Many other
endpoints are implemented through that webhook (a minimal
configuration sketch follows the list below), for example:
* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [matrix-alertmanager][] (JS) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
* [Phabricator][]
* [Sachet][] supports *many* messaging systems (Twilio, Pushbullet,
  Telegram, Sipgate, etc)
* [Sentry][]
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [nopp/alertmanager-webhook-telegram-python][] or [metalmatze/alertmanager-bot][]
* [Twilio][]
* [Wechat][]
* Zabbix: [alertmanager-zabbix-webhook][] or [zabbix-alertmanager][]
And that is only what was available at the time of writing; the
[alertmanager-webhook][] and [alertmanager tags][] on GitHub might
list more.
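As an illustration of how those integrations plug in, a receiver using
the generic webhook is declared in `alertmanager.yml` roughly like this
(a sketch: the receiver name and URL are made up, and the rest of the
real configuration is omitted):

```
route:
  receiver: irc-bot        # hypothetical receiver name
receivers:
  - name: irc-bot
    webhook_configs:
      # hypothetical local bot listening for Alertmanager webhook payloads
      - url: 'http://localhost:8099/alerts'
```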
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
Cloudflare's [unsee][]). The web interface is
not shipped with the Debian package, because it depends on the [Elm
compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].
In general, when working on alerting, it is worth keeping in mind [the
"My Philosophy on Alerting" paper from a Google engineer][] (now the
[Monitoring distributed systems][] chapter of the [Site Reliability
Engineering][] O'Reilly book).
Another issue with alerting in Prometheus is that you can only silence
alerts for a certain amount of time; once the silence expires, you get
notified again. The [kthxbye bot][] works around that issue.
[customized by route]:https://prometheus.io/docs/alerting/latest/configuration/#route
[documentation on grouping]:https://prometheus.io/docs/alerting/latest/alertmanager/#grouping
[`dispatch/dispatch.go`, line 415, function `newAggrGroup`]:https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L415
[in `dispatch.go`, line 460, function `aggrGroup.run()`]:https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]:https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
## Issues
There is no issue tracker specifically for this project. [File][new-ticket] or
[search][] for issues in the [team issue tracker][search] with the