TODO: talk about `scrape_jobs` for in-puppet configurations.

TODO: show how to hook a custom scrape job, and which server to put it
on.
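Until those TODOs are filled in, here is a sketch of what a custom scrape job looks like at the raw `prometheus.yml` level; the `scrape_jobs` mechanism in Puppet ends up generating stanzas of this shape. The job name, interval and target below are illustrative, not our actual configuration:

```yaml
# illustrative scrape job, as it would appear under scrape_configs
# in prometheus.yml (job name, interval and target are examples)
scrape_configs:
  - job_name: apache
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'gayi.torproject.org:9117'
```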
## Web dashboard access

The main web dashboard for the internal Prometheus server should be
accessible at <https://prometheus.torproject.org> using the
to the Alertmanager server, if the latter is correctly configured
(see [Installation](#installation) below).

If you're not sure alerts are working, head to the web dashboard (see
[the access instructions](#web-dashboard-access)) and look at the
`/alerts` and `/rules` pages. For example, if you're
using port forwarding:
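A plausible sketch of such a port forward, assuming the web interface listens on the Prometheus default port 9090 on the server:

```shell
# forward a local port to the Prometheus web interface; 9090 is an
# assumption (the Prometheus default), adjust to the actual setup
ssh -L 9090:localhost:9090 prometheus.torproject.org
# then browse http://localhost:9090/alerts and http://localhost:9090/rules
```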
manage the Alertmanager, but in practice the Debian package does not
ship the web interface, so its usefulness is limited in that regard. See
the `amtool` section below for more information.

Note that the [`/targets`][] URL is also useful to diagnose problems
with exporters in general; see also the [troubleshooting section](#troubleshooting-missing-metrics)
below.

If you can't access the dashboard at all or if the above seems too
complicated, [Grafana][] can be jury-rigged as a debugging tool for
metrics as well. In the "Explore" panels, you can input Prometheus
metrics, with auto-completion, and inspect the output directly.

[Grafana]: howto/grafana
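For example, a couple of queries that are handy to paste in the Explore panel; the metric names assume the standard node exporter and are illustrative:

```
# is every node exporter target up? (0 means a failed scrape)
up{job="node"}

# available disk space, in percent, per filesystem
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100
```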
### Managing alerts with amtool

use the [amtool](https://manpages.debian.org/amtool.1) command. A few useful commands:

TBD.
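Until this section is written, a few hedged examples of commands that are typically useful here (the alert name in the matcher is hypothetical):

```shell
# list currently firing alerts
amtool alert query

# silence a hypothetical alert for two hours
amtool silence add alertname=DiskWillFill --duration=2h \
    --author="$USER" --comment="known issue"

# list active silences, then expire one by ID
amtool silence query
amtool silence expire <silence-id>
```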
### Troubleshooting missing metrics

If metrics do not correctly show up in Grafana, it might be worth
checking in the [Prometheus dashboard](https://prometheus.torproject.org/) itself for the same
metrics. Typically, if they do not show up in Grafana, they won't show
up in Prometheus either, but it's worth a try, even if only to see the
raw data.

Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the [`/targets`][]
listing. If the target is "unhealthy", it will be marked in red and an
error message will show up.

[`/targets`]: https://prometheus.torproject.org/targets

If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
from the host `gayi`:

    curl -s http://gayi.torproject.org:9117/metrics | grep apache

In the case of [this bug](https://github.com/voxpupuli/puppet-prometheus/pull/541), the metrics were not showing up at all:

    root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
    # HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
    # TYPE apache_exporter_build_info gauge
    apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
    # HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
    # TYPE apache_exporter_scrape_failures_total counter
    apache_exporter_scrape_failures_total 18371
    # HELP apache_up Could the apache server be reached
    # TYPE apache_up gauge
    apache_up 0

Notice, however, the `apache_exporter_scrape_failures_total` counter,
which was incrementing. From there, we reproduced the work the
exporter was doing manually and fixed the issue, which involved
passing the correct argument to the exporter.
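To make that more concrete, this is roughly what "reproducing the work of the exporter" looks like for the Apache exporter; the flag name and the mod_status URL are assumptions based on the stock `apache_exporter`, not a record of the exact fix:

```shell
# query mod_status the way the exporter does; ?auto selects the
# machine-readable output
curl -s 'http://localhost/server-status?auto'

# the exporter's target URL is controlled by a flag along these lines
apache_exporter -scrape_uri 'http://localhost/server-status?auto'
```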
### Pushgateway errors

The Pushgateway web interface provides some basic information about
the metrics it collects, and allows you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.

To pull metrics by hand, you can fetch them directly from the
Pushgateway:

    curl localhost:9091/metrics
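Conversely, to check that the gateway accepts writes, you can push a throwaway metric by hand through the standard Pushgateway HTTP API (the metric and job names are arbitrary):

```shell
# push a single gauge value under job "test"
echo "some_test_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/test

# it should now appear in the /metrics output, labeled job="test"
curl -s localhost:9091/metrics | grep some_test_metric

# clean up: delete the whole "test" job group
curl -X DELETE http://localhost:9091/metrics/job/test
```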
If you get this error while pulling metrics from the exporter:

    An error has occurred while serving metrics:

    collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem](https://github.com/prometheus/pushgateway/issues/232) in
earlier versions that was [fixed in 0.10](https://github.com/prometheus/pushgateway/pull/290) (Debian bullseye and
later). A workaround is simply to restart the Pushgateway (and clear
the storage, if persistence is enabled; see the `--persistence.file`
flag).
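On a Debian system, that workaround looks something like this; the service name matches the Debian package, but the persistence file path is an assumption, so take the real one from the `--persistence.file` flag in use:

```shell
systemctl stop prometheus-pushgateway
# only if persistence is enabled; the path is an assumption
rm -f /var/lib/prometheus/pushgateway.data
systemctl start prometheus-pushgateway
```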
## Disaster recovery

application-specific metrics.
## Logs and metrics

<!-- TODO: where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->

Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.

Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
would still be possible to deduce some activity patterns from the
metrics generated by Prometheus, and to leverage them in side-channel
attacks, which is why access to the external Prometheus server is
restricted.

## Other documentation

<!-- TODO: references to upstream documentation, if relevant -->

* [Prometheus home page](https://prometheus.io/)
* [Prometheus documentation](https://prometheus.io/docs/introduction/overview/)
* [Prometheus developer blog](https://www.robustperception.io/tag/prometheus/)

# Discussion