howto/prometheus.md +142 −2

@@ -89,6 +89,135 @@

TODO: talk about `scrape_jobs` for in-puppet configurations.

TODO: show how to hook a custom scrape job, and on which server to put
it.

## Web dashboard usage

The main web dashboard for the internal Prometheus server should be
accessible at <https://prometheus.torproject.org> using the
well-known, public username.

The dashboard for the external Prometheus server, however, is not
publicly available. To work around that, use the following command
line to forward ports over SSH:

    ssh -L 9090:localhost:9090 -L 9091:localhost:9091 -L 9093:localhost:9093 prometheus2.torproject.org

The above will also forward the management interfaces of the
Alertmanager (port 9093) and Pushgateway (port 9091).

## Alerting

We currently do not do alerting for TPA services with Prometheus. We
do, however, have the Alertmanager set up to do alerting for other
teams on the secondary Prometheus server (`prometheus2`). This
documentation details how that works, but could also eventually cover
the main server, if it eventually replaces [Nagios](howto/nagios) for
alerting ([ticket 29864][]).

In general, the upstream documentation for alerting starts from
[the Alerting Overview](https://prometheus.io/docs/alerting/latest/overview/)
but I have found it to be lacking at times. I have instead been
following [this tutorial](https://ashish.one/blogs/setup-alertmanager/)
which was quite helpful.

### Adding alerts

The Alertmanager is currently managed through Puppet, in
`profile::prometheus::server::external`.
An alerting rule is defined like:

    {
      'name' => 'bridgestrap',
      'rules' => [
        'alert' => 'Bridges down',
        'expr' => 'bridgestrap_fraction_functional < 0.50',
        'for' => '5m',
        'labels' => {
          'severity' => 'critical',
          'team' => 'anti-censorship',
        },
        'annotations' => {
          'title' => 'Bridges down',
          'description' => 'Too many bridges down',
          # use humanizePercentage when upgrading to prom > 2.11
          'summary' => 'Number of functional bridges is `{{$value}}%`',
          'host' => '{{$labels.instance}}',
        },
      ],
    },

The key part of the alert is the `expr` setting, which is a PromQL
expression that, when it evaluates to "true" for more than `5m` (the
`for` setting), will fire an alert at the Alertmanager. Also note the
`team` label, which will route the message to the right team. Those
routes are defined later, in the `routes` and `receivers` settings.

Note that those might move to separate files and/or Hiera later on.

### Adding alert recipients

To add a new recipient for alerts, look for the `receivers` setting
and add something like this:

    receivers => [
      {
        'name' => 'anti-censorship team',
        'email_configs' => [
          'to' => 'anti-censorship-alerts@lists.torproject.org',
          # see above
          'require_tls' => false,
        ],
      },
      # [...]

Then alerts can be routed to that receiver by adding a "route" in the
`routes` setting. For example, this will route alerts with the
`team: anti-censorship` label:

    routes => [
      {
        'receiver' => 'anti-censorship team',
        'match' => {
          'team' => 'anti-censorship',
        },
      },
    ],

### Testing alerts

Normally, alerts should fire on the Prometheus server and be sent out
to the Alertmanager server, if the latter is correctly configured
(i.e. if it's configured in the `alerting` section of
`prometheus.yml`, see [Installation](#installation) below).

If you're not sure alerts are working, head to the web dashboard (see
[the access instructions](#web-dashboard-usage)) and look at the
`/alerts` and `/rules` pages.
For example, if you're using port forwarding:

 * <http://localhost:9090/alerts> - should show the configured alerts,
   and if they are firing
 * <http://localhost:9090/rules> - should show the configured rules,
   and whether they match

Typically, the <http://localhost:9093> URL should also be useful to
manage the Alertmanager, but in practice the Debian package does not
ship the web interface, so it is of limited use in that regard. See
the `amtool` section below for more information.

Note that the `/targets` URL is also useful to diagnose problems with
exporters, in general.

### Managing alerts with amtool

Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool](https://manpages.debian.org/amtool.1) command. A few
useful commands:

 * `amtool alert`: show firing alerts
 * `amtool silence add --duration=1h --author=anarcat
   --comment="working on it" ALERTNAME`: silence alert ALERTNAME for
   an hour, with some comments

## Pager playbook

TBD.

@@ -101,6 +230,8 @@

dashboard is not available, how to bypass authentication restrictions
on said dashboard, talk about the Alertmanager (lack of?) UI, the
Pushgateway UI, how to access them, `amtool`, rules debugging...

TODO: talk about `/targets`.

## Disaster recovery

If a Prometheus/Grafana is destroyed, it should be completely

@@ -257,13 +388,22 @@

changed.

The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios](howto/nagios) ([ticket 29864](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864)).
[Nagios](howto/nagios) ([ticket 29864][]).

It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
profile if it is deployed on more than one server.

TODO: document how to add stuff to the Alertmanager.
Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the `prometheus.yml` file:

    alerting:
      alert_relabel_configs: []
      alertmanagers:
      - static_configs:
        - targets:
          - localhost:9093

### Manual node configuration