deal with breakage in Prometheus UI authored by anarcat
the classic/targets endpoint is gone, but the API lives on!

reported this in debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108095
@@ -77,6 +77,18 @@ dashboards for most purposes other than debugging.
It also shows alerts, but for that, there are better dashboards, see
below.
Note that the "classic" dashboard has been deprecated upstream and,
starting from Debian 13, has been failing at some tasks. We're slowly
replacing it with Grafana and Fabric scripts, see
[tpo/tpa/team#41790](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41790) for progress.
For general queries, in particular, use the
`prometheus.query-to-series` task, for example:

    fab prometheus.query-to-series --expression 'up!=1'

... will show jobs that are "down".
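
If Fabric is not at hand, roughly the same check can be run against
the Prometheus HTTP API directly. This is only a sketch: it assumes
Prometheus listens on `localhost:9090` and that `jq` is installed, and
`format_query_result` is a hypothetical helper name, not part of our
tooling. It is not necessarily how the Fabric task works internally.

```shell
# Hypothetical helper: format an instant-vector response from
# /api/v1/query (read on stdin) as tab-separated "job instance value" lines.
format_query_result() {
    jq -r '.data.result[] | "\(.metric.job)\t\(.metric.instance)\t\(.value[1])"'
}

# Usage (assumption: Prometheus answers on localhost:9090):
# curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=up != 1' | format_query_result
```

An empty output means the expression matched no series.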
### Alerting dashboards
There are a couple of web interfaces to see alerts in our setup:
@@ -1265,7 +1277,7 @@ to manage the Alertmanager, but in practice the Debian package does
not ship the web interface, so its interest is limited in that
regard. See the `amtool` section below for more information.
Note that the [`/targets`][] URL is also useful to diagnose problems
Note that the [`/api/v1/targets`][] URL is also useful to diagnose problems
with exporters, in general, see also the [troubleshooting section][]
below.
@@ -1748,11 +1760,83 @@ up in Prometheus either, but it's worth a try, even if only to see the
raw data.
Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the [`/targets`][]
listing. If the target is "unhealthy", it will be marked in red and an
error message will show up.
the "target" (the exporter) responsible for it in the
[`/api/v1/targets`][] listing. If the target is "unhealthy", it will
be marked as "down" and an error message will show up.
[`/api/v1/targets`]: https://prometheus.torproject.org/api/v1/targets
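
For a quick overview before digging into individual targets, the
targets endpoint can be summarized per health state. A minimal sketch,
assuming `jq` is installed and Prometheus answers on `localhost:9090`;
`targets_health_summary` is a hypothetical helper name:

```shell
# Hypothetical helper: count active targets per health state
# ("down", "unknown", "up"). Reads a /api/v1/targets response on stdin.
targets_health_summary() {
    jq -r '.data.activeTargets
           | group_by(.health)
           | map("\(.[0].health)\t\(length)")
           | .[]'
}

# Usage (assumption: Prometheus answers on localhost:9090):
# curl -s http://localhost:9090/api/v1/targets | targets_health_summary
```

Any line other than `up` in the output is worth a closer look.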
This will show all targets that are not "up", with their error messages:

    curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'

If it returns nothing, it means that all targets are up. Here's an
example of a probe that has not completed yet:
```
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
"instance": "gitlab-02.torproject.org:9188",
"health": "unknown",
"lastError": ""
}
```
... and, after a while, an error might come up:
```
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {instance: .labels, scrapeUrl, health, lastError}'
{
"instance": {
"alias": "gitlab-02.torproject.org",
"instance": "gitlab-02.torproject.org:9188",
"job": "gitlab",
"team": "TPA"
},
"scrapeUrl": "http://gitlab-02.torproject.org:9188/metrics",
"health": "down",
"lastError": "Get \"http://gitlab-02.torproject.org:9188/metrics\": dial tcp [2620:7:6002:0:266:37ff:feb8:3489]:9188: connect: connection refused"
}
```
In that case, the port number was mistyped: the correct port was 9187
and, once it was fixed, the target was scraped properly. You can
directly verify a given target with this `jq` incantation:

    curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'

For example:
```
root@hetzner-nbg1-01:~# curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance == "gitlab-02.torproject.org:9187") | {instance: .labels, health, lastError}'
{
"instance": {
"alias": "gitlab-02.torproject.org",
"instance": "gitlab-02.torproject.org:9187",
"job": "gitlab",
"team": "TPA"
},
"health": "up",
"lastError": ""
}
{
"instance": {
"alias": "gitlab-02.torproject.org",
"classes": "role::gitlab",
"instance": "gitlab-02.torproject.org:9187",
"job": "postgres",
"team": "TPA"
},
"health": "up",
"lastError": ""
}
```
[`/targets`]: https://prometheus.torproject.org/targets
Note that the above is an example of a misconfiguration: in this
case, the target was scraped *twice*, once from Puppet (the `classes`
label is a good hint of that) and once from the static
configuration. The latter was removed.
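
Duplicates like this can be spotted by grouping targets on their
`instance` label. This is a sketch under assumptions: `jq` is
installed, Prometheus answers on `localhost:9090`, and
`duplicate_targets` is a hypothetical helper name. An instance listed
under several jobs is only a hint and still needs a human to decide
which entry is redundant.

```shell
# Hypothetical helper: list instance labels that appear in more than one
# active target, with the (sorted) jobs that scrape them.
# Reads a /api/v1/targets response on stdin.
duplicate_targets() {
    jq -r '.data.activeTargets
           | group_by(.labels.instance)
           | map(select(length > 1))
           | .[]
           | "\(.[0].labels.instance)\t\(map(.labels.job) | sort | join(","))"'
}

# Usage (assumption: Prometheus answers on localhost:9090):
# curl -s http://localhost:9090/api/v1/targets | duplicate_targets
```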
If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
......
......