Changes

anarcat · ab6604f2
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -525,6 +525,87 @@ See [documentation for targets in the repository][] for more details

 [documentation for targets in the repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/main/targets.d/README.md

+## PromQL primer
+
+The [upstream documentation on PromQL][] can be a little daunting, so
+we provide you with a few examples from our infrastructure.
+
+ [upstream documentation on PromQL]: https://prometheus.io/docs/prometheus/latest/querying/basics/
+
+A query, fundamentally, asks the Prometheus server to look up a given
+metric in its database. For example, this simple query will return
+the status of all exporters, with a value of 0 (down) or 1
+(up):
+
+    up
+
+You can use labels to select a subset of those, for example this will
+only check the [`node_exporter`][]:
+
+    up{job="node"}
+
+You can also match the metric against a value, for example this will
+list all node exporters that are unavailable:
+
+    up{job="node"}==0
+
+The `up` metric is not very interesting because it doesn't change
+often. It's tremendously useful for availability of course, but
+typically we use more complex queries.
+
+This, for example, is the number of accesses on the Apache web server,
+according to the [`apache_exporter`][]:
+
+    apache_accesses_total
+
+In itself, however, that metric is not that useful because it's a
+constantly incrementing counter. What we want is actually the *rate*
+of that counter, for which there is of course a function, `rate()`. We
+need to apply that to a *range vector*, however: the *series* of
+samples recorded for the above metric over a given time period. This,
+for example, will give us the per-second access rate, averaged over
+the last 5 minutes:
+
+    rate(apache_accesses_total[5m])
+
+That will give us a lot of results though, one per web server. We
+might want to aggregate those, for example by role, so we would do
+something like:
+
+    sum(rate(apache_accesses_total[5m])) by (classes)
+
+This shows the access rate by "classes" (which is our
+poorly-named "role" label).
+
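As a rough illustration of what `sum(...) by (classes)` does, here is a minimal Python sketch of the grouping step. It is not how Prometheus implements aggregation; the `sum_by` helper, the host aliases, and the `role-a`/`role-b` label values are all hypothetical.

```python
from collections import defaultdict

def sum_by(series, label):
    """Sum values of (labels, value) pairs, grouped by one label.

    Series missing the label are grouped under the empty string,
    which mirrors how PromQL treats a missing label as empty.
    """
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)

# Hypothetical per-server access rates, as rate() might return them.
rates = [
    ({"alias": "web-01", "classes": "role-a"}, 2.5),
    ({"alias": "web-02", "classes": "role-a"}, 1.5),
    ({"alias": "web-03", "classes": "role-b"}, 0.25),
]

print(sum_by(rates, "classes"))  # {'role-a': 4.0, 'role-b': 0.25}
```

Each distinct value of the grouping label yields one output series, and every other label is dropped from the result.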
+Another similar example is this pair of queries, which give us the
+number of bytes outgoing and incoming, per second, averaged over the
+last 5 minutes, across the infrastructure:
+
+    sum(rate(node_network_transmit_bytes_total[5m]))
+    sum(rate(node_network_receive_bytes_total[5m]))
+
+[`apache_exporter`]: https://github.com/Lusitaniae/apache_exporter/
+
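To make the `rate()` calls above more concrete, the following Python sketch computes a per-second rate from raw counter samples. It is a simplified model under stated assumptions: real Prometheus also extrapolates to the edges of the window, which this sketch does not, and the `simple_rate` name and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # not enough data points in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (for example the
        # service restarted), so the whole new value counts as increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical samples scraped every 60 seconds over 5 minutes.
window = [(0, 1000), (60, 1060), (120, 1120), (180, 1180), (240, 1240), (300, 1300)]
print(simple_rate(window))  # 1.0
```

The counter grew by 300 over 300 seconds, hence one access per second. The reset handling is why `rate()` should only be applied to counters, not to gauges.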
+Finally, you should know about the difference between `rate` and
+`increase`. `rate()` is always "per second", and can be a little
+hard to read if you're trying to figure out things like "how many hits
+did we have in the last month", or "how much data did we actually
+transfer yesterday". For that, you need `increase()`, which
+estimates the total change over the time period. So for example, to
+answer those two questions, this is the number of hits in the last
+month:
+
+    sum(increase(apache_accesses_total[30d])) by (classes)
+
+And the data transferred in the last 24h:
+
+    sum(increase(node_network_transmit_bytes_total[24h]))
+    sum(increase(node_network_receive_bytes_total[24h]))
+
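The relation between the two functions is simple arithmetic: `increase()` over a window is the `rate()` over that window multiplied by the window length in seconds. The Python sketch below only illustrates that arithmetic; the helper name and the figure of 1.5 hits per second are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def increase_from_rate(rate_per_second, window_seconds):
    """Total change over a window, given the average per-second rate."""
    return rate_per_second * window_seconds

# Hypothetical figure: an average of 1.5 hits per second.
hits_per_day = increase_from_rate(1.5, SECONDS_PER_DAY)
hits_per_month = increase_from_rate(1.5, 30 * SECONDS_PER_DAY)

print(hits_per_day)    # 129600.0
print(hits_per_month)  # 3888000.0
```

This is also why `increase()` can return non-integer values even for counters that only ever grow by whole numbers.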
+For more complex examples of queries, see the [queries cheat sheet][],
+the [`prometheus-alerts.git` repository][], and the
+[`grafana-dashboards.git` repository][].
+
 ## Writing an alert

 Now that you have [metrics in your application][] and those are
@@ -566,6 +647,9 @@ file][]:
 In the above, Prometheus will generate an alert if the metric `up` is
 not equal to 1 for more than 15 minutes, hence `up < 1`.

+See the [PromQL primer][] for more information about queries and the
+[queries cheat sheet][] for more examples.
+
 ### Duration

 The `for` field means the alert is not immediately passed down to the
@@ -872,20 +956,28 @@ space left, to avoid warning about normal write spikes.

 ## Queries cheat sheet

-Some handy queries I often find myself looking for and forgetting.
+This section collects PromQL queries we find interesting.

-### Availability
+These are useful but more complex queries that we had to recreate a
+few times before writing them down.

-Those are almost all visible from the [availability dashboard][].
+If you're looking for more basic information about PromQL, see our
+[PromQL primer][].

-[Currently firing alerts][]:
+ [PromQL primer]: #promql-primer

-    ALERTS{alertstate="firing"}
+### Availability
+
+Those are almost all visible from the [availability dashboard][].

 [Unreachable hosts][] (technically, unavailable node exporters):

    up{job="node"} != 1

+[Currently firing alerts][]:
+
+    ALERTS{alertstate="firing"}
+
 [How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)][]:

    avg(avg_over_time(up{job="node"}[30d]))
@@ -906,19 +998,6 @@ day:
 [How many hosts are online at any given point in time]: https://prometheus.torproject.org/graph?g0.expr=sum(count(up%3D=1))/sum(count(up))+by+(alias)
 [How long did an alert fire over a given period of time]: https://prometheus.torproject.org/graph?g0.expr=sum_over_time(ALERTS{alertname%3D"MemFullSoon"}[1d:1s])

-### Disk usage
-
-This is a less strict version of the [`DiskWillFillSoon` alert][],
-see also the [disk usage dashboard][].
-
-[Find disks that will be full in 6 hours][]:
-
-    predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
-
-[Find disks that will be full in 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
-[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
-[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
-
 ### Inventory

 Those are visible in the [main Grafana dashboard][].
@@ -958,6 +1037,19 @@ See also the [CPU][], [memory][], and [disk][] dashboards.
 [memory]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage
 [disk]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-24h&to=now&var-class=All&var-node=All

+### Disk usage
+
+This is a less strict version of the [`DiskWillFillSoon` alert][],
+see also the [disk usage dashboard][].
+
+[Find disks that will be full in 24 hours, based on the last 6 hours][]:
+
+    predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
+
+[Find disks that will be full in 24 hours, based on the last 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
+[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
+[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
+
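The `predict_linear()` query above fits a straight line through the last 6 hours of samples and extrapolates it `24*60*60` seconds ahead. The Python sketch below shows that idea with an ordinary least-squares fit. It is an approximation for illustration only: it predicts relative to the last sample rather than the query evaluation time, and the sample data (a disk losing 1 GiB per hour) is hypothetical.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples,
    evaluated seconds_ahead after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

GIB = 2 ** 30
# Hypothetical hourly samples: 20 GiB free, shrinking by 1 GiB per hour.
samples = [(hour * 3600, (20 - hour) * GIB) for hour in range(7)]

predicted = predict_linear(samples, 24 * 60 * 60)
print(predicted < 0)  # True
```

With 14 GiB left and a loss of 1 GiB per hour, the extrapolated free space 24 hours later is negative, so this disk would match the query.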
 ### Running commands on hosts matching a PromQL query

 Say you have an alert or situation (e.g. high load) affecting multiple