howto/puppet.md +58 −2

@@ -1195,6 +1195,8 @@ One of the following is happening, in decreasing likeliness:

   impossible to run the catalog
2. the node is down and has failed to report since the last time
   specified
3. the node was retired but the monitoring system or the Puppet server
   doesn't know
4. the Puppet **server** is down and **all** nodes will fail to report
   in the same way (in which case a lot more warnings will show up, and
   other warnings about the server will come in)

@@ -1210,6 +1212,13 @@ extended duration.

Normally, the node will recover when it goes back online. If a node is
to be permanently retired, it should be removed from Puppet, using the
[host retirement procedures](howto/retire-a-host).

The third situation should not normally occur: when a host is retired
following the [retirement procedure](howto/retire-a-host), it's also
retired from Puppet. That should normally clean up everything, but
reports generated by the [Puppet reporter][] stick around for 7 extra
days. There's now a silence in the retirement procedure to hide those
alerts, but they will still be generated on host retirements.

Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.

@@ -1218,8 +1227,55 @@ more information:

    ssh NODE puppet agent -t

TODO: document the [Puppet reporter](https://github.com/voxpupuli/puppet-prometheus_reporter) after deployment, see [#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639).

The Puppet metrics are generated by the [Puppet reporter][], a plugin
deployed on the Puppet server (currently `pauli`) that accepts reports
from nodes and writes metrics in the node exporter's "`textfile`
collector" directory (`/var/lib/prometheus/node-exporter/`).
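Since the reporter writes a `.prom` file per node into that directory, one rough way to spot nodes that have stopped reporting is to compare file modification times against a threshold. The following is only a sketch under stated assumptions, not part of the deployed tooling: the function name `stale_prom_files` and the four-hour threshold are made up for illustration, and only the directory path comes from the text above.

```python
import os
import time
from pathlib import Path


def stale_prom_files(directory, max_age_seconds, now=None):
    """Return (file name, age in seconds) for .prom files older than the threshold.

    Hypothetical helper: age is measured from the file's modification time,
    which is assumed to track the last report written for that node.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(directory).glob("*.prom")):
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stale.append((path.name, age))
    return stale


if __name__ == "__main__":
    # Directory from the documentation; the 4-hour threshold is an assumption.
    directory = "/var/lib/prometheus/node-exporter/"
    for name, age in stale_prom_files(directory, 4 * 3600):
        print(f"{name}: last written {age / 3600:.1f} hours ago")
```

This only looks at file timestamps; it does not parse the metrics themselves, so a file that is rewritten with a failed status would not show up here.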
You can, for example, see the metrics for the host `idle-fsn-01` like this:

```
root@pauli:~# cat /var/lib/prometheus/node-exporter/idle-fsn-01.torproject.org.prom
# HELP puppet_report Unix timestamp of the last puppet run
# TYPE puppet_report gauge
# HELP puppet_transaction_completed transaction completed status of the last puppet run
# TYPE puppet_transaction_completed gauge
# HELP puppet_cache_catalog_status whether a cached catalog was used in the run, and if so, the reason that it was used
# TYPE puppet_cache_catalog_status gauge
# HELP puppet_status the status of the client run
# TYPE puppet_status gauge
# Old metrics
# New metrics
puppet_report{environment="production",host="idle-fsn-01.torproject.org"} 1731076367.657
puppet_transaction_completed{environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="not_used",environment="production",host="idle-fsn-01.torproject.org"} 1
puppet_cache_catalog_status{state="explicitly_requested",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_cache_catalog_status{state="on_failure",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="failed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="changed",environment="production",host="idle-fsn-01.torproject.org"} 0
puppet_status{state="unchanged",environment="production",host="idle-fsn-01.torproject.org"} 1
```

If the monitoring system disagrees with reality, inspect this file for
validity and check its timestamp. Normally, those files are updated
every time the node runs a catalog. Expired nodes should disappear
from that directory after 7 days, a delay defined in
`/etc/puppet/prometheus.yaml`.

The reporter is hooked into the Puppet server through the
`/etc/puppet/puppet.conf` file, with the following line:

```
[master]
# ...
reports = puppetdb,prometheus
```

See also issue [#41639](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639) for notes on the deployment of that monitoring tool.

[Puppet reporter]: https://github.com/voxpupuli/puppet-prometheus_reporter

Note that this used to be monitored through Icinga before its
retirement, and, until it's fully retired, you might also see this