document today's unexpected alert related to host retirement (tpo/tpa/team#41838) (2a9d68ce) · Commits · The Tor Project / TPA / Wiki Replica

howto/puppet.md

+58 −2

		@@ -1195,6 +1195,8 @@ One of the following is happening, in decreasing likeliness:
		impossible to run the catalog
		2. the node is down and has failed to report since the last time
		specified
		3. the node was retired but the monitoring system or the Puppet
		server doesn't know about it
		4. the Puppet server is down and all nodes will fail to
		report in the same way (in which case a lot more warnings will
		show up, and other warnings about the server will come in)
		@@ -1210,6 +1212,13 @@ extended duration. Normally, the node will recover when it goes back
		online. If a node is to be permanently retired, it should be removed
		from Puppet, using the [host retirement procedures](howto/retire-a-host).

		The third situation should not normally occur: when a host is retired
		following the [retirement procedure](howto/retire-a-host), it is also
		removed from Puppet. That normally cleans up everything, but reports
		generated by the [Puppet reporter][] stick around for 7 more days.
		The retirement procedure now includes a silence to hide the resulting
		alerts, but the alerts are still generated when a host is retired.
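
To see which hosts still have a leftover metrics file, one option is to list the files in the reporter's output directory by age. The following is a minimal sketch, not part of the retirement procedure: the helper name `files_older_than` and the one-day threshold are hypothetical, and it assumes the reporter rewrites a host's `.prom` file on every report, so that the file's modification time approximates the last report.

```python
import time
from pathlib import Path


def files_older_than(directory, min_age_seconds, now=None):
    """Return (filename, age_in_seconds) pairs for *.prom files not modified recently."""
    if now is None:
        now = time.time()
    results = []
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age >= min_age_seconds:
            results.append((path.name, age))
    return results


if __name__ == "__main__":
    # Directory taken from this page; the one-day threshold is an arbitrary example.
    for name, age in files_older_than("/var/lib/prometheus/node-exporter/", 24 * 3600):
        print(f"{name}: last written {age / 86400:.1f} days ago")
```

A file listed here for a host that was retired less than 7 days ago is expected; it should go away once the reporter's expiry delay has passed.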

		Finally, if the main Puppet server is down, it should definitely
		be brought back up. See disaster recovery, below.

		@@ -1218,8 +1227,55 @@ more information:

		ssh NODE puppet agent -t

		TODO: document the [Puppet reporter](https://github.com/voxpupuli/puppet-prometheus_reporter) after deployment, see
		[#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639).
		The Puppet metrics are generated by the [Puppet reporter][], a plugin
		deployed on the Puppet server (currently `pauli`). It accepts reports
		from nodes and writes metrics to the node exporter's
		"`textfile` collector" directory
		(`/var/lib/prometheus/node-exporter/`). For example, the metrics for
		the host `idle-fsn-01` look like this:

		```
		root@pauli:~# cat /var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom
		# HELP puppet_report Unix timestamp of the last puppet run
		# TYPE puppet_report gauge
		# HELP puppet_transaction_completed transaction completed status of the last puppet run
		# TYPE puppet_transaction_completed gauge
		# HELP puppet_cache_catalog_status whether a cached catalog was used in the run, and if so, the reason that it was used
		# TYPE puppet_cache_catalog_status gauge
		# HELP puppet_status the status of the client run
		# TYPE puppet_status gauge
		# Old metrics
		# New metrics
		puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
		puppet_transaction_completed{environment="production",host="idle-fsn-01.torproject.org"} 1
		puppet_cache_catalog_status{state="not_used",environment="production",host="idle-fsn-01.torproject.org"} 1
		puppet_cache_catalog_status{state="explicitly_requested",environment="production",host="idle-fsn-01.torproject.org"} 0
		puppet_cache_catalog_status{state="on_failure",environment="production",host="idle-fsn-01.torproject.org"} 0
		puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
		puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
		puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
		```

		If the monitoring system disagrees with reality, inspect this file
		for validity and check its timestamp. Normally, the file is updated
		every time the node runs a catalog.
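
The inspection described above can also be scripted. The sketch below parses the sample lines of such a file and reports which `puppet_status` state is set and how old the `puppet_report` timestamp is. It is a simplified reader written for this illustration, not a full parser of the Prometheus text format: the names `parse_samples` and `summarize` are hypothetical, and it assumes label values contain no escaped quotes and that sample lines carry no trailing timestamp.

```python
import re
import time

# One sample line, e.g.: puppet_status{state="failed",host="example"} 0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?[ \t]+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP/TYPE lines and comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a plain sample line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def summarize(text, now=None):
    """Return (state, age_seconds) for the last run described by the file.

    Either element is None when the corresponding metric is missing.
    """
    if now is None:
        now = time.time()
    state = None
    age = None
    for name, labels, value in parse_samples(text):
        if name == "puppet_status" and value == 1:
            state = labels.get("state")
        elif name == "puppet_report":
            age = now - value
    return state, age


EXAMPLE = '''\
# HELP puppet_report Unix timestamp of the last puppet run
# TYPE puppet_report gauge
puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
'''

if __name__ == "__main__":
    state, age = summarize(EXAMPLE)
    print(f"state={state}, last run {age / 3600:.1f} hours ago")
```

A large age for a host that is supposed to be running Puppet regularly suggests the node stopped reporting; what counts as "large" depends on the agent's run interval, which this sketch does not assume.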

		Expired nodes should disappear from that directory after 7 days, a
		delay defined in `/etc/puppet/prometheus.yaml`. The reporter is hooked
		into the Puppet server through the `/etc/puppet/puppet.conf` file,
		with the following line:

		```
		[master]
		# ...
		reports = puppetdb,prometheus
		```
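
If metrics stop being written altogether, it may be worth confirming that the `prometheus` report processor is still listed in that setting. The following is a rough sketch that reads the configuration as an INI file with Python's `configparser`. The helper name `enabled_reports` is hypothetical, and the sketch assumes every setting in the file lives under a section header and that no section or setting is repeated, which `configparser` requires.

```python
import configparser


def enabled_reports(conf_text, section="master"):
    """Return the list of report processors named in the `reports` setting."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # An absent section or setting yields an empty list rather than an error.
    raw = parser.get(section, "reports", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    with open("/etc/puppet/puppet.conf") as handle:
        reports = enabled_reports(handle.read())
    print("prometheus reporter enabled:", "prometheus" in reports)
```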

		See also issue [#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639) for notes on the deployment of that
		monitoring tool.

		[Puppet reporter]: https://github.com/voxpupuli/puppet-prometheus_reporter

		Note that this used to be monitored through Icinga before its
		retirement, and, until it's fully retired, you might also see this