service/prometheus.md +36 −19

@@ -15,20 +15,20 @@ layer on top (see [Grafana][]).
 
 ## Training course plan
 
-- Where can I find documentation? In the wiki, in [Prometheus](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus)
-  and [Grafana](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/grafana)
+- Where can I find documentation? In the wiki, in [Prometheus service page][]
+  (this page) but also the [Grafana service page][]
 - Where do I reach the different web sites for the monitoring service?
-  See the [web dashboards section](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#web-dashboards)
+  See the [web dashboards section][]
 - Where do i watch for alerts? Join the `#tor-alerts` IRC channel! See
-  also [how to access alerting history](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#checking-alert-history)
+  also [how to access alerting history][]
 - How can we use silences to prevent some alerts from firing?
-  See [Silencing an alert in advance](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#silencing-an-alert-in-advance) and following
-- [Architecture overview](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#design)
-- [Alerting philosophy](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alerting-philosophy)
-- [Adding metrics](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#adding-metrics-to-applications)
-- [How to add alerts](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert)
-- [Queries cheat sheet](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet)
-- [Alert debugging](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging):
+  See [Silencing an alert in advance][] and following
+- [Architecture overview][]
+- [Alerting philosophy][]
+- [Adding metrics][]
+- [How to add alerts][]
+- [Queries cheat sheet][]
+- [Alert debugging][]:
   - Alert unit tests
   - Alert routing tests
   - Ensuring the tags required for routing are there
@@ -38,6 +38,18 @@ layer on top (see [Grafana][]).
 - %"TPA-RFC-33-B: Prometheus server merge, more exporters"
 - %"TPA-RFC-33-C: Prometheus high availability, long term metrics, other exporters"
 
+[Alert debugging]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging
+[Queries cheat sheet]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet
+[How to add alerts]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert
+[Adding metrics]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#adding-metrics-to-applications
+[Alerting philosophy]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alerting-philosophy
+[Architecture overview]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#design
+[Silencing an alert in advance]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#silencing-an-alert-in-advance
+[how to access alerting history]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#checking-alert-history
+[web dashboards section]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#web-dashboards
+[Grafana service page]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/grafana
+[Prometheus service page]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus
+
 ## Web dashboards
 
 The main Prometheus web interface is available at:
@@ -400,9 +412,11 @@ blackbox exporter to the target at the moment the Prometheus server is
 scraping the exporter.
 
 The blackbox exporter is rather peculiar and counter-intuitive, see
-the [how to debug the blackbox exporter](#debugging-blackbox-exporter) for
+the [how to debug the blackbox exporter][] for
 more information.
 
+[how to debug the blackbox exporter]: #debugging-blackbox-exporter
+
 #### Scrape jobs
 
 In Prometheus's point of view, two information are needed:
@@ -501,9 +515,9 @@ Prometheus targets, except that they define what the blackbox exporter
 will try to reach. The targets can be `hostname:port` pairs or URLs,
 depending on the nature of the type of check being defined.
 
-See [documentation for targets in the
-repository](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/main/targets.d/README.md)
-for more details
+See [documentation for targets in the repository][] for more details
+
+[documentation for targets in the repository]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/main/targets.d/README.md
 
 ## Writing an alert
 
@@ -527,7 +541,9 @@ Prometheus query that should evaluate to "true" (non-zero) for the
 alert to fire.
 
 Here is, for example, the first alert in the [`rules.d/tpa_node.rules`
-file](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/21d67a21ce9926b2eeef0e14b04bb317fb5c94c0/rules.d/tpa_node.rules):
+file][]:
+
+[`rules.d/tpa_node.rules` file]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/21d67a21ce9926b2eeef0e14b04bb317fb5c94c0/rules.d/tpa_node.rules
 
 ```
 - alert: JobDown
@@ -672,7 +688,7 @@ built-in functions][].
 
 [Prometheus template reference]: https://prometheus.io/docs/prometheus/latest/configuration/template_reference/
 [Alertmanager template reference]: https://prometheus.io/docs/alerting/latest/notifications/
-[limited set of built-in functions]: https://pkg.go.dev/text/template#hdr-Functions
+[Limited set of built-in functions]: https://pkg.go.dev/text/template#hdr-Functions
 [Golang templates]: https://pkg.go.dev/text/template
 
 ### Writing a playbook
@@ -840,7 +856,6 @@ space left, to avoid warning about normal write spikes.
 
 [metrics in your application]: #adding-metrics-to-applications
 [scraped by Prometheus]: #adding-scrape-targets
-[Alerting philosophy]: #alerting-philosophy
 [alerting rule]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
 [recording rules documentation]: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules
 [aggregation operators]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
@@ -1024,9 +1039,11 @@ below.
 
 If you can't access the dashboard at all or if the above seems too
 complicated, [Grafana][] can be used as a debugging tool for metrics as
-well. In the [Explore](https://grafana.torproject.org/explore) section, you can input Prometheus
+well. In the [Explore][] section, you can input Prometheus
 metrics, with auto-completion, and inspect the output directly.
 
+[Explore]: https://grafana.torproject.org/explore
+
 There's also the [Grafana availability dashboard][], see the [Alerting
 dashboards][] section for details.
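The change in this diff is largely mechanical: each inline link `[text](url)` becomes a collapsed reference `[text][]` plus a `[text]: url` definition. The following is a minimal sketch of that rewrite, not the tool used to produce this change. The helper name `to_reference_links` is hypothetical. It is assumed that definitions are simply appended at the end of the text, that labels are compared case-insensitively (as CommonMark does), and that a label already bound to a different URL is left inline rather than silently redefined.

```python
import re

# Matches inline links such as [text](url). Images (![alt](url)) are skipped.
# Simplification: no nested brackets, no link titles, no parentheses in URLs.
INLINE_LINK = re.compile(r"(?<!!)\[([^\[\]]+)\]\(([^()\s]+)\)")


def to_reference_links(markdown: str) -> str:
    """Rewrite inline links as collapsed reference links.

    Hypothetical helper for illustration only: it does not skip fenced
    code blocks and it does not look at definitions already in the text.
    """
    # Normalized label -> (label as written in the definition, URL).
    definitions: dict[str, tuple[str, str]] = {}

    def replace(match: re.Match) -> str:
        label, url = match.group(1), match.group(2)
        flat = " ".join(label.split())  # collapse wrapped labels to one line
        known = definitions.setdefault(flat.casefold(), (flat, url))
        if known[1] != url:
            # Same label, different target: keep the inline form.
            return match.group(0)
        return f"[{label}][]"

    body = INLINE_LINK.sub(replace, markdown)
    if not definitions:
        return body
    lines = [f"[{label}]: {url}" for label, url in definitions.values()]
    return body.rstrip("\n") + "\n\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "See the [web dashboards section](https://example.org/wiki#web-dashboards)\n"
    print(to_reference_links(sample), end="")
```

Keying definitions case-insensitively matters for hunks like the ones above: two definitions whose labels differ only in case, or a second definition of a label that already exists elsewhere on the page, refer to the same label, so only one of them can take effect.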