Verified commit 2a9d68ce, authored by anarcat

document today's unexpected alert related to host retirement (team#41838)

parent 70806410
+58 −2
@@ -1195,6 +1195,8 @@ One of the following is happening, in decreasing likeliness:
    impossible to run the catalog
 2. the node is down and has failed to report since the last time
    specified
 3. the node was retired but the monitoring or Puppet server doesn't
    know about it
 4. the Puppet **server** is down and **all** nodes will fail to
    report in the same way (in which case a lot more warnings will
    show up, and other warnings about the server will come in)
@@ -1210,6 +1212,13 @@ extended duration. Normally, the node will recover when it goes back
online. If a node is to be permanently retired, it should be removed
from Puppet, using the [host retirement procedures](howto/retire-a-host).

The third situation should not normally occur: when a host is retired
following the [retirement procedure](howto/retire-a-host), it's also removed from
Puppet. That normally cleans up everything, but the reports generated
by the [Puppet reporter][] stick around for 7 more days. The
retirement procedure now includes a silence to hide those alerts, but
they are still generated when a host is retired.

Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.

@@ -1218,8 +1227,55 @@ more information:

    ssh NODE puppet agent -t

TODO: document the [Puppet reporter](https://github.com/voxpupuli/puppet-prometheus_reporter) after deployment, see
[#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639).
The Puppet metrics are generated by the [Puppet reporter][], a plugin
deployed on the Puppet server (currently `pauli`). It accepts reports
from nodes and writes metrics in the node exporter's "`textfile`
collector" directory (`/var/lib/prometheus/node-exporter/`). You can,
for example, see the metrics for the host `idle-fsn-01` like this:

```
root@pauli:~# cat /var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom 
# HELP puppet_report Unix timestamp of the last puppet run
# TYPE puppet_report gauge
# HELP puppet_transaction_completed transaction completed status of the last puppet run
# TYPE puppet_transaction_completed gauge
# HELP puppet_cache_catalog_status whether a cached catalog was used in the run, and if so, the reason that it was used
# TYPE puppet_cache_catalog_status gauge
# HELP puppet_status the status of the client run
# TYPE puppet_status gauge
# Old metrics
# New metrics
puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
puppet_transaction_completed{environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="not_used",environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="explicitly_requested",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_cache_catalog_status{state="on_failure",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
```

If the monitoring system disagrees with reality, inspect this file for
validity and check its timestamp. Normally, the file is updated every
time the node runs a catalog.
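
As a rough illustration of that check, the sketch below reads the
`puppet_report` sample from a node's file and prints how long ago the
node last reported. The function names (`puppet_last_report`,
`puppet_report_age`) are made up for this example and are not part of
the reporter. It assumes GNU `date` and a file laid out like the
sample above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Puppet reporter.

# Print the Unix timestamp (whole seconds) of the last report in a .prom file.
puppet_last_report() {
    # match only the puppet_report sample line, then drop the fractional part
    awk '$1 ~ /^puppet_report[{]/ { sub(/[.].*$/, "", $NF); print $NF; exit }' "$1"
}

# Print the last report time in UTC and its age in hours.
puppet_report_age() {
    last=$(puppet_last_report "$1")
    if [ -z "$last" ]; then
        echo "no puppet_report sample in $1" >&2
        return 1
    fi
    now=$(date +%s)
    # "date -d @SECONDS" is a GNU extension
    echo "last report: $(date -u -d "@$last" +%Y-%m-%dT%H:%M:%SZ) ($(( (now - last) / 3600 )) hours ago)"
}
```

On the Puppet server, this could be called as `puppet_report_age
/var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom`. A
report time much older than the node's usual run interval suggests
the node has stopped reporting.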

Files for expired nodes should disappear from that directory after 7
days, a delay defined in `/etc/puppet/prometheus.yaml`. The reporter is
hooked into the Puppet server through the `/etc/puppet/puppet.conf`
file, with the following line:

```
[master]
# ...
reports = puppetdb,prometheus
```
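
To see which report files are lingering in the `textfile` collector
directory, for example after a host retirement, something like the
following sketch can list files that have not been rewritten for a
given number of hours. The `stale_prom_files` name and its arguments
are invented for this example, it relies on GNU `find`, and it will
also list any unrelated `.prom` file in that directory that isn't
refreshed regularly.

```shell
#!/bin/sh
# Hypothetical helper: list .prom files not modified for more than HOURS hours.
# usage: stale_prom_files [HOURS] [DIRECTORY]
stale_prom_files() {
    hours="${1:-24}"
    dir="${2:-/var/lib/prometheus/node-exporter}"
    # -mmin +N matches files last modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f -name '*.prom' -mmin +"$(( hours * 60 ))" -print
}
```

For example, `stale_prom_files 48` lists the files left untouched for
more than two days in the default directory.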

See also issue [#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639) for notes on the deployment of that
monitoring tool.

 [Puppet reporter]: https://github.com/voxpupuli/puppet-prometheus_reporter

Note that this used to be monitored through Icinga before its
retirement, and, until it's fully retired, you might also see this