prometheus: tiny bit of rewording to make it easier to read (4c07f6a4) · Commits · The Tor Project / TPA / Wiki Replica

howto/prometheus.md

+47 −32

Original line number	Original line	Diff line number	Diff line
	@@ -186,43 +186,58 @@ Alertmanager (port 9093) and Pushgateway (9091).

	## Alerting		## Alerting

	We currently do not do alerting for TPA services with Prometheus. We		We are now using Prometheus for alerting for TPA services. Prometheus is
	do, however, have the Alertmanager setup to do alerting for other		configured to create alerts on certain conditions on metrics and then send the
	teams on the secondary Prometheus server (`prometheus2`). This		alerts to one or more Alertmanager instances. Alertmanager, in turn, is responsible for
	documentation details how that works, but could also eventually cover		routing the alert to the appropriate channels, be they a team's email address or
	the main server, if it eventually replaces [Nagios](howto/nagios) for		TPA's IRC channel for alerts, `#tor-alerts`.
	alerting ([ticket 29864][]).
			Currently, the secondary Prometheus server (`prometheus2`) reproduces this setup
	In general, the upstream documentation for alerting starts from [the		specifically for sending out alerts to other teams with metrics that are not
	Alerting Overview](https://prometheus.io/docs/alerting/latest/overview/) but I have found it to be lacking at times. I		made public.
	have instead been following [this tutorial](https://ashish.one/blogs/setup-alertmanager/) which was quite
	helpful.		This section details how the alerting setup mentioned above works.

			Note that the [Nagios (Icinga)](howto/nagios) service is still running, but it
			is planned to eventually be shut down and replaced by the Prometheus +
			Alertmanager setup ([ticket 29864][]).

			In general, the upstream documentation for alerting starts from [the Alerting
			Overview](https://prometheus.io/docs/alerting/latest/overview/) but it can be
			lacking at times. [This tutorial](https://ashish.one/blogs/setup-alertmanager/)
			can be quite helpful in better understanding how things are working.

			Note that Grafana also has its own [alerting
			system](https://grafana.torproject.org/alerting/) but we are _not_ using that,
			see the [Grafana for alerting section of the TPA-RFC-33
			proposal](policy/tpa-rfc-33-monitoring#grafana-for-alerting).

	### Looking at alerts		### Looking at alerts

	There are a couple of interfaces to see alerts in our setup. The		There are a couple of web interfaces to see alerts in our setup:
	primary one is the [Karma dashboard](https://karma.torproject.org), which shows currently firing
	alerts grouped by labels.		* [Karma dashboard](https://karma.torproject.org) - our primary view on
			currently firing alerts. The alerts are grouped by labels.
	But it won't show you history: for this, you might want to take a look		* This web interface only shows what's current, not some form of alert
	at the [Grafana availability dashboard](https://grafana.torproject.org/d/adwbl8mxnaneoc/availability) which drills down into		history.
	alerts and, more importantly shows their past values.		* Shows links to runbooks related to alerts
			* [Grafana availability
	The ultimate source of truth for alerts and the related alerting		dashboard](https://grafana.torproject.org/d/adwbl8mxnaneoc/availability) -
	rules, however, is Prometheus itself. The [Alerts dashboard](https://prometheus.torproject.org/classic/alerts) show		drills down into alerts and, more importantly, shows their past values.
	all alerting rules and which file they are from. Normally, all rules		* [Prometheus' Alerts
	are defined in the [prometheus-alerts repository](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts). Another view of		dashboard](https://prometheus.torproject.org/classic/alerts) - shows
	this is the [rules configuration dump](https://prometheus.torproject.org/classic/rules) which also shows when the		all alerting rules and which file they are from
			* also contains links to graphs based on alerts' PromQL expressions

			Normally, all rules are defined in the [prometheus-alerts
			repository](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts). Another
			view of this is the [rules configuration
			dump](https://prometheus.torproject.org/classic/rules) which also shows when the
	rule was last evaluated and how long it took.		rule was last evaluated and how long it took.

	Each alert should have a link to a "runbook", typically a link to this		Each alert should have a URL to a "runbook" in its annotations, typically a link
	very wiki, in the "Pager playbook" section, which shows how to handle		to this very wiki, in the "Pager playbook" section, which shows how to handle
	any particular outage. If it's not present, it's a bug and can be		any particular outage. If it's not present, it's a bug and can be filed as such.
	filed as such.

	Note that Grafana also has its own [alerting system](https://grafana.torproject.org/alerting/) but we are not
	using that, see the [Grafana for alerting section of the TPA-RFC-33
	proposal](policy/tpa-rfc-33-monitoring#grafana-for-alerting).

	### Adding alerts in Puppet		### Adding alerts in Puppet