TODO: talk about `scrape_jobs` for in-puppet configurations.

TODO: show how to hook a custom scrape job, and which server to put it
on.
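Until those TODOs are filled in, here is a sketch of what a custom scrape job looks like at the raw `prometheus.yml` level; the `scrape_jobs` mechanism in Puppet ends up generating stanzas of this shape. The job name, interval and target below are illustrative, not our actual configuration:

```yaml
# illustrative scrape job, as it would appear under scrape_configs
# in prometheus.yml (job name, interval and target are examples)
scrape_configs:
  - job_name: apache
    scrape_interval: 60s
    static_configs:
      - targets:
          - 'gayi.torproject.org:9117'
```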
## Web dashboard access

The main web dashboard for the internal Prometheus server should be
accessible at <https://prometheus.torproject.org> using the
to the Alertmanager server, if the latter is correctly configured
(see [Installation](#installation) below).

If you're not sure alerts are working, head to the web dashboard (see
[the access instructions](#web-dashboard-access)) and look at the
`/alerts` and `/rules` pages. For example, if you're
using port forwarding:
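A plausible sketch of such a port forward, assuming the web interface listens on the Prometheus default port 9090 on the server:

```shell
# forward a local port to the Prometheus web interface; 9090 is an
# assumption (the Prometheus default), adjust to the actual setup
ssh -L 9090:localhost:9090 prometheus.torproject.org
# then browse http://localhost:9090/alerts and http://localhost:9090/rules
```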
manage the Alertmanager, but in practice the Debian package does not
ship the web interface, so its usefulness is limited in that regard. See
the `amtool` section below for more information.

Note that the [`/targets`][] URL is also useful to diagnose problems
with exporters in general; see also the [troubleshooting section](#troubleshooting-missing-metrics)
below.

If you can't access the dashboard at all or if the above seems too
complicated, [Grafana][] can be jury-rigged as a debugging tool for
metrics as well. In the "Explore" panels, you can input Prometheus
metrics, with auto-completion, and inspect the output directly.

[Grafana]: howto/grafana
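For example, a couple of queries that are handy to paste in the Explore panel; the metric names assume the standard node exporter and are illustrative:

```
# is every node exporter target up? (0 means a failed scrape)
up{job="node"}

# available disk space, in percent, per filesystem
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100
```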
### Managing alerts with amtool

use the [amtool](https://manpages.debian.org/amtool.1) command. A few useful commands:

TBD.
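Until this section is written, a few hedged examples of commands that are typically useful here (the alert name in the matcher is hypothetical):

```shell
# list currently firing alerts
amtool alert query

# silence a hypothetical alert for two hours
amtool silence add alertname=DiskWillFill --duration=2h \
    --author="$USER" --comment="known issue"

# list active silences, then expire one by ID
amtool silence query
amtool silence expire <silence-id>
```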
### Troubleshooting missing metrics

If metrics do not correctly show up in Grafana, it might be worth
checking in the [Prometheus dashboard](https://prometheus.torproject.org/) itself for the same
metrics. Typically, if they do not show up in Grafana, they won't show
up in Prometheus either, but it's worth a try, even if only to see the
raw data.

Then, if data truly isn't present in Prometheus, you can track down
the "target" (the exporter) responsible for it in the [`/targets`][]
listing. If the target is "unhealthy", it will be marked in red and an
error message will show up.

[`/targets`]: https://prometheus.torproject.org/targets

If the target is marked healthy, the next step is to scrape the
metrics manually. This, for example, will scrape the Apache exporter
from the host `gayi`:

    curl -s http://gayi.torproject.org:9117/metrics | grep apache

In the case of [this bug](https://github.com/voxpupuli/puppet-prometheus/pull/541), the metrics were not showing up at all:

    root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
    # HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
    # TYPE apache_exporter_build_info gauge
    apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
    # HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
    # TYPE apache_exporter_scrape_failures_total counter
    apache_exporter_scrape_failures_total 18371
    # HELP apache_up Could the apache server be reached
    # TYPE apache_up gauge
    apache_up 0

Notice, however, the `apache_exporter_scrape_failures_total` counter,
which was incrementing. From there, we reproduced the work the
exporter was doing manually and fixed the issue, which involved
passing the correct argument to the exporter.
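To make that more concrete, this is roughly what "reproducing the work of the exporter" looks like for the Apache exporter; the flag name and the mod_status URL are assumptions based on the stock `apache_exporter`, not a record of the exact fix:

```shell
# query mod_status the way the exporter does; ?auto selects the
# machine-readable output
curl -s 'http://localhost/server-status?auto'

# the exporter's target URL is controlled by a flag along these lines
apache_exporter -scrape_uri 'http://localhost/server-status?auto'
```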
### Pushgateway errors

The Pushgateway web interface provides some basic information about
the metrics it collects, and allows you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.

To pull metrics by hand, you can fetch them directly from the
Pushgateway:

    curl localhost:9091/metrics
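Conversely, to check that the gateway accepts writes, you can push a throwaway metric by hand through the standard Pushgateway HTTP API (the metric and job names are arbitrary):

```shell
# push a single gauge value under job "test"
echo "some_test_metric 42" | curl --data-binary @- http://localhost:9091/metrics/job/test

# it should now appear in the /metrics output, labeled job="test"
curl -s localhost:9091/metrics | grep some_test_metric

# clean up: delete the whole "test" job group
curl -X DELETE http://localhost:9091/metrics/job/test
```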
If you get this error while pulling metrics from the exporter:

    An error has occurred while serving metrics:

    collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem](https://github.com/prometheus/pushgateway/issues/232) in
earlier versions that was [fixed in 0.10](https://github.com/prometheus/pushgateway/pull/290) (Debian bullseye and
later). A workaround is simply to restart the Pushgateway (and clear
the storage, if persistence is enabled; see the `--persistence.file`
flag).
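On a Debian system, that workaround looks something like this; the service name matches the Debian package, but the persistence file path is an assumption, so take the real one from the `--persistence.file` flag in use:

```shell
systemctl stop prometheus-pushgateway
# only if persistence is enabled; the path is an assumption
rm -f /var/lib/prometheus/pushgateway.data
systemctl start prometheus-pushgateway
```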
## Disaster recovery

application-specific metrics.
## Logs and metrics

<!-- TODO: where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->

Prometheus servers typically do not generate many logs, except when
errors and warnings occur. They should hold very little PII. The web
frontends collect logs in accordance with our regular policy.

Actual metrics *may* contain PII, although it's quite unlikely:
typically, data is anonymized and aggregated at collection time. It
would still be possible to deduce some activity patterns from the
metrics generated by Prometheus, and to leverage them in side-channel
attacks, which is why access to the external Prometheus server is
restricted.

## Other documentation

<!-- TODO: references to upstream documentation, if relevant -->

* [Prometheus home page](https://prometheus.io/)
* [Prometheus documentation](https://prometheus.io/docs/introduction/overview/)
* [Prometheus developer blog](https://www.robustperception.io/tag/prometheus/)

# Discussion