prom: write a promql primer (authored by anarcat)
......@@ -525,6 +525,87 @@ See [documentation for targets in the repository][] for more details
[documentation for targets in the repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/main/targets.d/README.md
## PromQL primer
The [upstream documentation on PromQL][] can be a little daunting, so
we provide you with a few examples from our infrastructure.
[upstream documentation on PromQL]: https://prometheus.io/docs/prometheus/latest/querying/basics/
A query, fundamentally, asks the Prometheus server to look up a given
metric in its database. For example, this simple query will
return the status of all exporters, with a value of 0 (down) or 1
(up):
up
You can use labels to select a subset of those. For example, this will
only check the [`node_exporter`][]:
up{job="node"}
You can also match the metric against a value. For example, this will
list all node exporters that are unavailable:
up{job="node"}==0
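Label matchers are not limited to equality. As a rough sketch of the
other matcher operators, the queries below reuse the `up` metric. Each
query is meant to be run on its own, and the `apache` job name is an
assumption: adjust it to whatever jobs exist on your server.

```promql
# everything except the node exporter
up{job!="node"}

# regular expression match: node or apache exporters (hypothetical job name)
up{job=~"node|apache"}

# number of unavailable exporters, per job
count by (job) (up == 0)
```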
The `up` metric is not very interesting because it doesn't change
often. It's tremendously useful for availability of course, but
typically we use more complex queries.
This, for example, is the number of accesses on the Apache web server,
according to the [`apache_exporter`][]:
apache_accesses_total
In itself, however, that metric is not that useful because it's a
constantly incrementing counter. What we want is actually the *rate*
of that counter, for which there is of course a function, `rate()`. We
need to apply that to a *range vector*, however: a *series* of samples
for the above metric over a given time period, selected with a
duration in square brackets. This, for example, will give us the
per-second access rate, averaged over the last 5 minutes:
rate(apache_accesses_total[5m])
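To see what `rate()` actually works on, it can help to look at the
pieces separately. This is only an illustration using the same metric;
run each query on its own:

```promql
# instant vector: the latest sample of each series
apache_accesses_total

# range vector: all samples from the last 5 minutes, per series
# (visible in the table view, but it cannot be graphed directly)
apache_accesses_total[5m]

# per-second average rate of increase over those 5 minutes
rate(apache_accesses_total[5m])
```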
That will give us a lot of results though, one per web server. We
might want to regroup those, for example, so we would do something
like:
sum(rate(apache_accesses_total[5m])) by (classes)
Which would show you the access rate by "classes" (which is our
poorly-named "role" label).
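Other aggregation operators follow the same pattern. The variations
below are a sketch, not queries we necessarily use in production; they
only assume the `classes` label mentioned above:

```promql
# equivalent to the query above, with the grouping clause written first
sum by (classes) (rate(apache_accesses_total[5m]))

# the five busiest web servers
topk(5, rate(apache_accesses_total[5m]))

# average access rate per class instead of the total
avg by (classes) (rate(apache_accesses_total[5m]))
```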
Another similar example is this query, which will give us the number
of bytes incoming or outgoing, per second, in the last 5 minutes,
across the infrastructure:
sum(rate(node_network_transmit_bytes_total[5m]))
sum(rate(node_network_receive_bytes_total[5m]))
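Those queries can be narrowed down or converted to other units with
label matchers and arithmetic. The examples below are a sketch and
assume the `device` label that the node exporter normally sets on
network metrics, and the `instance` label that Prometheus attaches to
scraped targets by default:

```promql
# outgoing traffic in bits per second, ignoring the loopback interface
sum(rate(node_network_transmit_bytes_total{device!="lo"}[5m])) * 8

# outgoing traffic per host rather than a global total
sum by (instance) (rate(node_network_transmit_bytes_total[5m]))
```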
[`apache_exporter`]: https://github.com/Lusitaniae/apache_exporter/
Finally, you should know about the difference between `rate` and
`increase`. The result of `rate()` is always "per second", and can be a little
hard to read if you're trying to figure out things like "how many hits
did we have in the last month", or "how much data did we actually
transfer yesterday". For that, you need `increase()`, which will
actually count the changes in the time period. So for example, to
answer those two questions, this is the number of hits in the last
month:
sum(increase(apache_accesses_total[30d])) by (classes)
And the data transferred in the last 24h:
sum(increase(node_network_transmit_bytes_total[24h]))
sum(increase(node_network_receive_bytes_total[24h]))
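The two functions are closely related: `increase()` over a range should
give the same result as `rate()` over that range multiplied by the
number of seconds in it. A small sketch to compare them, using the
Apache metric from above:

```promql
# number of hits in the last 24 hours
sum(increase(apache_accesses_total[24h]))

# should give the same number: per-second rate times seconds in a day
sum(rate(apache_accesses_total[24h])) * 24 * 60 * 60
```

Note that both functions extrapolate to the edges of the time range, so
`increase()` can return non-integer values even for a counter that only
ever grows by whole numbers.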
For more complex examples of queries, see the [queries cheat sheet][],
the [`prometheus-alerts.git` repository][], and the
[`grafana-dashboards.git` repository][].
## Writing an alert
Now that you have [metrics in your application][] and those are
......@@ -566,6 +647,9 @@ file][]:
In the above, Prometheus will generate an alert if the metric `up` is
not equal to 1 for more than 15 minutes, hence `up < 1`.
See the [PromQL primer][] for more information about queries and the
[queries cheat sheet][] for more examples.
### Duration
The `for` field means the alert is not immediately passed down to the
......@@ -872,20 +956,28 @@ space left, to avoid warning about normal write spikes.
## Queries cheat sheet
Some handy queries I often find myself looking for and forgetting.
This section collects PromQL queries we find interesting.
### Availability
Those are useful, but more complex queries we had to recreate a few
times before writing them down.
Those are almost all visible from the [availability dashboard][].
If you're looking for more basic information about PromQL, see our
[PromQL primer][].
[Currently firing alerts][]:
[PromQL primer]: #promql-primer
ALERTS{alertstate="firing"}
### Availability
Those are almost all visible from the [availability dashboard][].
[Unreachable hosts][] (technically, unavailable node exporters):
up{job="node"} != 1
[Currently firing alerts][]:
ALERTS{alertstate="firing"}
[How much time was the given service (`node` job, in this case) `up` in the past period (`30d`)][]:
avg(avg_over_time(up{job="node"}[30d]))
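Since `up` is either 0 or 1, its average over time is the fraction of
scrapes that succeeded. Dropping the outer `avg()` gives a per-host
view. The variations below are a sketch, and the 99.9% threshold is an
arbitrary example:

```promql
# availability of each node exporter over the last 30 days, in percent
avg_over_time(up{job="node"}[30d]) * 100

# hosts below 99.9% availability over that period
avg_over_time(up{job="node"}[30d]) < 0.999
```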
......@@ -906,19 +998,6 @@ day:
[How many hosts are online at any given point in time]: https://prometheus.torproject.org/graph?g0.expr=sum(count(up%3D=1))/sum(count(up))+by+(alias)
[How long did an alert fire over a given period of time]: https://prometheus.torproject.org/graph?g0.expr=sum_over_time(ALERTS{alertname%3D"MemFullSoon"}[1d:1s])
### Disk usage
This is a less strict version of the [`DiskWillFillSoon` alert][],
see also the [disk usage dashboard][].
[Find disks that will be full in 6 hours][]:
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
[Find disks that will be full in 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
### Inventory
Those are visible in the [main Grafana dashboard][].
......@@ -958,6 +1037,19 @@ See also the [CPU][], [memory][], and [disk][] dashboards.
[memory]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage
[disk]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-24h&to=now&var-class=All&var-node=All
### Disk usage
This is a less strict version of the [`DiskWillFillSoon` alert][],
see also the [disk usage dashboard][].
[Find disks that will be full in 6 hours][]:
predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
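`predict_linear(v, t)` fits a linear regression over the samples in the
range vector `v` and predicts the value `t` seconds from now. The query
above therefore looks at the last 6 hours of data and extrapolates 24
hours ahead. The variations below are a sketch: the `fstype` values to
exclude and the shorter time windows are assumptions to adapt to your
needs.

```promql
# same prediction, skipping temporary filesystems
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*60*60) < 0

# a shorter horizon: based on the last hour, full within 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4*60*60) < 0
```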
[Find disks that will be full in 6 hours]: https://prometheus.torproject.org/graph?g0.expr=predict_linear(node_filesystem_avail_bytes[6h],+24*60*60)+<+0
[`DiskWillFillSoon` alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/6a27846edfba9b0fcb8fa3230f0f929ceeeb0fc2/rules.d/tpa_node.rules#L15-23
[disk usage dashboard]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage
### Running commands on hosts matching a PromQL query
Say you have an alert or situation (e.g. high load) affecting multiple
......
......