move alerting tutorials up in the tutorials section (#41655), authored Oct 01, 2024 by anarcat
service/prometheus.md
...
## Writing an alerting rule
TODO
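In the meantime, here is a rough sketch of the general shape of a
Prometheus alerting rule: a name, the PromQL expression that triggers
it, how long the condition must hold (`for`), plus labels and
annotations attached to the resulting alert. The alert name, metric
and threshold below are hypothetical, not an actual TPA rule:

```yaml
# Hypothetical example: alert name, metric and duration are made up
# for illustration, this is not an actual TPA rule.
- alert: ApacheDown
  expr: apache_up == 0    # PromQL expression that must hold...
  for: 5m                 # ...for this long before the alert fires
  labels:
    severity: warning     # used for routing and escalation
  annotations:
    summary: "Apache on {{ $labels.instance }} is down"
    # a `playbook` annotation is also required, see the next section
```

See the "Adding alerting rules to Prometheus" section below for where
such a rule actually lives.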
## Writing a playbook
Every alert in Prometheus *must* have a playbook annotation. This is,
if done well, a URL pointing at a service page like this one,
typically in the `Pager playbook` section, that explains how to deal
with the alert.
The playbook *must* include the following:

 1. the actual code name of the alert (e.g. `JobDown` or
    `DiskWillFillSoon`)
 2. an example of the alert output (e.g. `Exporter job gitlab_runner
    on tb-build-02.torproject.org:9252 is down`)
 3. why this alert triggered and what its impact is
 4. optionally, how to reproduce the issue
 5. how to fix it

How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
broken:

 - Where do you think the error will be visible?
 - Can we `curl` something to see it happening?
 - Is there a dashboard where you can see trends?
 - Is there a specific Prometheus query to run live?
 - Which log file can we inspect?
 - Which systemd service is running it?
The "how to fix it" can be a simple one line, or it can go into a
multiple case example of scenarios that were found in the wild. It's
the hard part: sometimes, when you make an alert, you don't actually
*know*
how to handle the situation. If so, explicitly state that
problem in the playbook, and say you're sorry, and that it should be
fixed.
If the playbook becomes too complicated, consider making a
[
Fabric
][]
script out of it.
A good example of a proper playbook is the
[
Textfile collector errors
playbook here
][]
. It has all of the above points, including actual
fixes for different actual scenarios.
Here's a template to get started:
```
### Foo errors

The `FooAlert` looks like this:

    Service Foo has too many errors on test.torproject.org

It means that the service Foo is having some kind of trouble. [Explain
why this happened, what the impact is, and what this means for which
users. Are we losing money, data, exposing users, etc.]

[Optional] You can tell this is a real issue by going to place X and
trying Y.

[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].

[Optional] We do not yet exactly know how to fix this issue, sorry.
Please document here how you fix it next time.
```
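Once such a playbook section exists, the alert's playbook annotation
should point at its anchor on the service page. A minimal sketch,
assuming the annotation key is literally `playbook` and using a
placeholder URL for the `Foo errors` section from the template above:

```yaml
# The URL is a placeholder: point it at the real service page and at
# the anchor of the playbook section (here, "Foo errors").
annotations:
  summary: "Service Foo has too many errors on {{ $labels.instance }}"
  playbook: "https://example.org/wiki/service/foo#foo-errors"
```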
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors
## Adding alerting rules to Prometheus
Adding an alert mostly comes down to writing an alerting rule
definition, matching on a PromQL expression, in a Git repository.
But it already assumes some metrics are available and scraped by
Prometheus. For this, ensure you have followed the tutorials [Adding
metrics to applications][] and [Adding scrape targets][].
[Adding scrape targets]: #adding-scrape-targets
The Prometheus servers regularly pull the [`prometheus-alerts.git`
repository][] for alerting rule and target definitions. Alert rules
are added by committing a file to the `rules.d` directory; see the
[`rules.d`][] directory for more documentation on that.
[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d
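A rules file follows the standard Prometheus layout: a top-level
`groups` list, where each group holds one or more alerting rules.
Here is a minimal sketch; the file name, metric and threshold are
hypothetical, and the conventions already in use in `rules.d` take
precedence. If `promtool` is installed locally, such a file can also
be checked with `promtool check rules <file>` before opening a merge
request:

```yaml
# Hypothetical rules.d/foo.rules; follow the conventions already used
# in the prometheus-alerts repository.
groups:
  - name: foo
    rules:
      - alert: FooTooManyErrors
        expr: rate(foo_errors_total[5m]) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Service Foo has too many errors on {{ $labels.instance }}"
          playbook: "https://example.org/wiki/service/foo#foo-errors"
```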
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
consequence of the file changes. TPA members can accelerate this by
running Puppet on the Prometheus servers, or by pulling the code and
reloading the Prometheus server with:

    git -C /etc/prometheus-alerts/ pull
    systemctl reload prometheus
# How-to

## Queries cheat sheet
...
## Alert debugging

We are now using Prometheus for alerting for TPA services. Here's a basic
overview of how things interact around alerting:
...
### Diagnosing alerting failures

Normally, alerts should fire on the Prometheus server and be sent out
...