Verified Commit 4c07f6a4 authored by lelutin

prometheus: tiny bit of rewording to make it easier to read

The first paragraph says that we are not using prom alerting, and while
it's still technically true that we haven't fully switched to it yet, we
do have alerts for TPA services in prometheus now and we're slowly
moving towards switching to that completely. So we might as well change
that now to say that we do indeed use this for our monitoring.

The "Looking for alerts" paragraph gives a better overview of things if
we present the URLs that one needs to know about as a list, with
reduced verbosity.
parent cd20b009
+47 −32
@@ -186,43 +186,58 @@ Alertmanager (port 9093) and Pushgateway (9091).
 
 ## Alerting
 
-We currently do not do alerting for TPA services with Prometheus. We
-do, however, have the Alertmanager setup to do alerting for other
-teams on the secondary Prometheus server (`prometheus2`). This
-documentation details how that works, but could also eventually cover
-the main server, if it eventually replaces [Nagios](howto/nagios) for
-alerting ([ticket 29864][]).
-
-In general, the upstream documentation for alerting starts from [the
-Alerting Overview](https://prometheus.io/docs/alerting/latest/overview/) but I have found it to be lacking at times. I
-have instead been following [this tutorial](https://ashish.one/blogs/setup-alertmanager/) which was quite
-helpful.
+We are now using Prometheus for alerting for TPA services. Prometheus is
+configured to create alerts on certain conditions on metrics and then send the
+alerts to one or more Alertmanager instance. That one in turn is responsible for
+routing the alert to the appropriate channels, be they a team's email address or
+TPA's irc channels for alerts, `#tor-alerts`.
+
+Currently, the secondary Prometheus server (`prometheus2`) reproduces this setup
+specifically for sending out alerts to other teams with metrics that are not
+made public.
+
+This section details how the alerting setup mentioned above works. 
+
+Note that the [Nagios(icinga)](howto/nagios) service is still in service, but it
+is planned to eventually be shut down and replaced by the Prometheus +
+Alertmanager setup ([ticket 29864][]).
+
+In general, the upstream documentation for alerting starts from [the Alerting
+Overview](https://prometheus.io/docs/alerting/latest/overview/) but it can be
+lacking at times. [This tutorial](https://ashish.one/blogs/setup-alertmanager/)
+can be quite helpful in better understanding how things are working.
+
+Note that Grafana also has its own [alerting
+system](https://grafana.torproject.org/alerting/) but we are _not_ using that,
+see the [Grafana for alerting section of the TPA-RFC-33
+proposal](policy/tpa-rfc-33-monitoring#grafana-for-alerting).
 
 ### Looking at alerts
 
-There are a couple of interfaces to see alerts in our setup. The
-primary one is the [Karma dashboard](https://karma.torproject.org), which shows currently firing
-alerts grouped by labels.
-
-But it won't show you history: for this, you might want to take a look
-at the [Grafana availability dashboard](https://grafana.torproject.org/d/adwbl8mxnaneoc/availability) which drills down into
-alerts and, more importantly shows their past values.
-
-The ultimate source of truth for alerts and the related alerting
-rules, however, is Prometheus itself. The [Alerts dashboard](https://prometheus.torproject.org/classic/alerts) show
-all alerting rules and which file they are from. Normally, all rules
-are defined in the [prometheus-alerts repository](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts). Another view of
-this is the [rules configuration dump](https://prometheus.torproject.org/classic/rules) which also shows when the
+There are a couple of web interfaces to see alerts in our setup:
+
+* [Karma dashboard](https://karma.torproject.org) - our primary view on
+  currently firing alerts. The alerts are grouped by labels.
+  * This web interface only shows what's current, not some form of alert
+    history.
+  * Shows links to runbooks related to alerts
+* [Grafana availability
+  dashboard](https://grafana.torproject.org/d/adwbl8mxnaneoc/availability) -
+  drills down into alerts and, more importantly shows their past values.
+* [Prometheus' Alerts
+  dashboard](https://prometheus.torproject.org/classic/alerts) - show
+  all alerting rules and which file they are from
+  * also contains links to graphs based on alerts' PromQL expressions
+
+Normally, all rules are defined in the [prometheus-alerts
+repository](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts). Another
+view of this is the [rules configuration
+dump](https://prometheus.torproject.org/classic/rules) which also shows when the
 rule was last evaluated and how long it took.
 
-Each alert should have a link to a "runbook", typically a link to this
-very wiki, in the "Pager playbook" section, which shows how to handle
-any particular outage. If it's not present, it's a bug and can be
-filed as such.
-
-Note that Grafana also has its own [alerting system](https://grafana.torproject.org/alerting/) but we are not
-using that, see the [Grafana for alerting section of the TPA-RFC-33
-proposal](policy/tpa-rfc-33-monitoring#grafana-for-alerting).
+Each alert should have a URL to a "runbook" in its annotations, typically a link
+to this very wiki, in the "Pager playbook" section, which shows how to handle
+any particular outage. If it's not present, it's a bug and can be filed as such.
 
 ### Adding alerts in Puppet