handle prometheus "classic" dashboard removal in trixie (or later?)
Poking around the alertmanager UI (#41733 (closed)), I found out that prometheus itself has a similar "modern" UI bundled with it, which is stripped out of the Debian package. I have tried it on my home server and it didn't quite work, but visually, it's essentially the same thing as the classic UI.
The problem, however, is that upstream seems to have removed the "classic" dashboard entirely. For now, it looks like Debian still supports it, but a week ago, a change was introduced in the unstable Debian package warning that the classic UI could eventually be removed from Debian as well. I've filed an issue in the Debian package to make sure that, if that happens, we actually do get the react app packaged. That is not that far-fetched, because react is now in Debian, amazingly, and has been since bullseye! https://packages.debian.org/sid/node-react
In any case, it seems pretty important to make sure we deal with that deprecation. We do use parts of the prometheus dashboards, particularly for debugging alerts, and we have tons of places pointing at stuff like https://prometheus.torproject.org/classic/graph?g0.range_input=1h&g0.expr=sum(needrestart_processes_with_outdated_libraries)+by+(alias)+%3E+0&g0.tab=1 (which we might want to stop doing as well).
Here's a list of features we currently use in the classic web UI, and the possible replacements we have. Keep updating this as we find stuff:
- the alerts listing, in particular:
  - firing alerts (karma is okay for this)
  - pending alerts (grafana is somewhat okay for this, in the availability dashboard)
  - inactive alerts (no replacement, shows the currently parsed alerts and links to their expressions)
- targets list, especially useful to diagnose scraping failures (the https://prometheus.torproject.org/api/v1/targets endpoint provides a JSON view, this script turns it into a table, wiki updated to show how to parse the JSON)
- ad-hoc queries (grafana explore is okay, but not accessible to all users; we have good tooling in fabric for that now (prometheus.query*))
- query links from Karma (would need to replace with grafana explore); in general we link to prometheus.tpo all over the place and would need redirections
- service discovery information (rarely used, shows which labels are attached to which scrape targets)
- rules listing, which includes timing for alerting rules evaluation, but also recording rules
- configuration dump, command-line flags (mostly unused)
- run-time and build information (unused, but I just noticed it shows important stats like high cardinality labels and so on; cardinality alternatives are documented in https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#running-out-of-disk-space)
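
To illustrate the targets item above, here is a minimal Python sketch of turning the `/api/v1/targets` JSON into a plain-text table. It is not the script or wiki snippet mentioned in the list: the function names (`fetch_targets`, `targets_table`) and the choice of columns are made up for this example, and it assumes the endpoint is reachable without extra authentication, which may not hold for our setup. The field names (`activeTargets`, `labels`, `health`, `lastError`) follow the upstream HTTP API response format.

```python
import json
import urllib.request

# URL taken from the list above; the endpoint may require authentication,
# which this sketch does not handle.
TARGETS_URL = "https://prometheus.torproject.org/api/v1/targets"


def fetch_targets(url=TARGETS_URL):
    """Download and decode the targets JSON (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def targets_table(payload):
    """Render the activeTargets of an /api/v1/targets response as text."""
    rows = [("JOB", "INSTANCE", "HEALTH", "LAST ERROR")]
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        rows.append((
            labels.get("job", ""),
            labels.get("instance", ""),
            target.get("health", ""),
            target.get("lastError", ""),
        ))
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )
```

Something like `print(targets_table(fetch_targets()))` would then show scrape failures at a glance, since failing targets carry a non-empty `lastError`.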
In general, the solution will probably be to use the web interface less and replace it with scripts and tools that talk to the API directly.
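
As one example of that approach, the alerts and rules listings could probably be approximated from the `/api/v1/rules` endpoint. The sketch below is only an illustration, not existing tooling: `alerts_by_state` and `slowest_rules` are hypothetical names, and it assumes the response carries `type`, `state`, `query` and `evaluationTime` fields as described in the upstream HTTP API documentation. Both functions take the already-decoded JSON, so fetching (and any authentication) is left out.

```python
from collections import defaultdict


def alerts_by_state(payload):
    """Group alerting rules from an /api/v1/rules response by state.

    Returns a dict mapping a state ("firing", "pending", "inactive")
    to a list of (group name, alert name, expression) tuples.
    """
    grouped = defaultdict(list)
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue  # recording rules have no alert state
            grouped[rule.get("state", "unknown")].append(
                (group["name"], rule["name"], rule["query"])
            )
    return dict(grouped)


def slowest_rules(payload, limit=5):
    """Return the rules with the longest evaluation time, slowest first.

    Covers both alerting and recording rules; evaluationTime is assumed
    to be a number of seconds and defaults to 0.0 when absent.
    """
    rules = [
        (rule.get("evaluationTime", 0.0), group["name"], rule["name"])
        for group in payload["data"]["groups"]
        for rule in group["rules"]
    ]
    return sorted(rules, reverse=True)[:limit]
```

The `inactive` entry of `alerts_by_state` would cover the case that currently has no replacement, and `slowest_rules` gives a rough equivalent of the timing information in the rules listing.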