Revocation procedure problems were discussed in [33587][] and [33446][].
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->

### catalog run: PuppetDB warning: did not update since...
|
|
|
|
|
|
|
|
TODO.
|
|
If you see an error like:

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z
|
|
|
|
|
|
|
|
It may also be accompanied by the Puppet server reporting the same
problem:
|
|
|
|
|
|
|
|
    Subject: ** PROBLEM Service Alert: pauli/puppet - all catalog runs is WARNING **

    [...]

    Check last node runs from PuppetDB WARNING - cupani.torproject.org did not update since 2020-05-11T04:38:54.512Z
|
|
|
|
|
|
|
|
One of the following is happening, in decreasing order of likelihood:
|
|
|
|
|
|
|
|
1. the node's Puppet manifest has an error of some sort that makes it
   impossible to run the catalog
2. the node is down and has failed to report since the last time
   specified
3. the Puppet **server** is down and **all** nodes will fail to
   report in the same way (in which case a lot more warnings will
   show up, and other warnings about the server will come in)
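To tell cases 1 and 2 apart, it can help to check how stale the node's
last report actually is. A minimal sketch, assuming GNU `date` (as on
Debian) and reusing the timestamp from the example warning above:

```shell
# Compute how old the node's last PuppetDB report is. The timestamp
# below is the one from the example warning; in practice, paste the
# one from the actual alert.
last_report="2020-05-11T04:38:54Z"
last_epoch=$(date -u -d "$last_report" +%s)  # GNU date syntax
now_epoch=$(date -u +%s)
age_hours=$(( (now_epoch - last_epoch) / 3600 ))
echo "last report was ${age_hours}h ago"
```

A report that is only a few hours old may be a transient failure; one
that is days old points at a node that is down or being retired.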
|
|
|
|
|
|
|
|
The first situation will usually happen after someone pushed a commit
introducing the error. We try to keep all manifests compiling all the
time and such errors should be immediately fixed. Look at the history
of the Puppet source tree and try to identify the faulty
commit. Reverting such a commit is acceptable to restore the service.
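The revert itself is plain git. A sketch of the procedure, demonstrated
here in a throwaway repository with a made-up manifest (in the real
Puppet source tree checkout, identify the commit with `git log
--oneline`, revert it by hash, then push):

```shell
# Demonstrate reverting a faulty commit in a scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "TPA"
git config user.email "tpa@example.org"  # placeholder identity
echo 'package { "vim": ensure => installed }' > site.pp
git add site.pp
git commit -q -m "good manifest"
echo 'package { "vim" ensure => installed }' > site.pp  # faulty change
git commit -q -am "bad manifest"
git revert --no-edit HEAD >/dev/null  # undo the faulty commit
# (in the real repository, follow with "git push")
git log --oneline | sed -n 1p
```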
|
|
|
|
|
|
|
|
The second situation can happen if a node is in maintenance for an
extended duration. Normally, the node will recover when it goes back
online. If a node is to be permanently retired, it should be removed
from Puppet, using the [host retirement procedures][retire-a-host].
|
|
|
|
|
|
|
|
Finally, if the main Puppet **server** is down, it should definitely
be brought back up. See disaster recovery, below.
|
|
|
|
|
|
|
|
In any case, running the Puppet agent on the affected node should give
more information:

    ssh NODE puppet agent -t
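`puppet agent -t` enables `--detailed-exitcodes`, so the exit status
itself is informative. A sketch of how to interpret it (the `rc=2`
value is a stand-in for the real `$?`):

```shell
# Exit codes with --detailed-exitcodes, per the puppet agent manual:
#   0: run succeeded, no changes; 1: run failed;
#   2: run succeeded, changes applied;
#   4: run succeeded, but some resources failed;
#   6: changes applied and some resources failed.
rc=2  # stand-in for $? after: ssh NODE puppet agent -t
case "$rc" in
  0) echo "success, no changes" ;;
  2) echo "success, changes applied" ;;
  4|6) echo "catalog applied, but some resources failed" ;;
  *) echo "run failed before a catalog could be applied" ;;
esac
```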
|
|
|
|
|
|
## Disaster recovery
|
|
|
|
|