Puppet ENC deployment failure post-mortem
Yesterday, a deployment of the new Puppet External Node Classifier (ENC, #40358 (closed)) caused a cascade of failures across our Puppet infrastructure, which was thankfully identified and stopped quickly. Several hours of manual remediation were required to revert the changes and ensure all configurations returned to a proper state.
As far as we know, no user-facing services were impacted by the incident.
Timeline
The timeline (UTC) goes like this:
- 2021-09-26 - anarcat reviews the tor-puppet feature branch and approves changes
- 2021-09-27 @ 17:39 - anarcat changes pat to run from cache (52977656 pat: do not use --test to run puppet)
- 2021-09-27 @ 19:10 - lavamind runs puppet agent on all servers to ensure catalog compiles everywhere
- 2021-09-27 @ 19:10 - lavamind merges the prepare-enc-roles branch to master and pushes to the Puppet server (pauli)
- 2021-09-27 @ 19:11 - lavamind runs this command on all systems, as hinted by anarcat's change to pat: puppet agent --onetime --no-daemonize --show_diff
  - cumin reports 100% success rate
  - command completion time seems abnormal, too fast compared to previous agent runs
- 2021-09-27 @ 19:17 - lavamind realizes that puppet agent was run without --no-usecacheonfailure, which means that the catalogs may have failed to compile everywhere and a cached version was used
- 2021-09-27 @ 19:22 - lavamind identifies that the base::includes class is missing from all Puppet nodes
- 2021-09-27 @ 19:29 - lavamind identifies several common classes, defined in hiera/common.yaml, missing from all Puppet nodes
- 2021-09-27 @ 19:35 - lavamind pushes a fix to the Puppet server to include common base classes on all nodes
- 2021-09-27 @ 19:36 - lavamind notices the catalogs still fail to compile because some manifests expect non-empty $::classes
- 2021-09-27 @ 19:38 - anarcat raises alarm about Puppet agent deconfiguring nodes, e.g. ferm (firewall) rules are being removed
  - anarcat proceeds to puppet agent --disable on all nodes to prevent further damage
  - lavamind disables Puppet server on pauli
- 2021-09-27 @ 19:40 - anarcat and lavamind join up on a voice call to discuss incident and recovery options
- 2021-09-27 @ 19:45 - tests indicate that Puppet is in a state where it wants to deconfigure some firewall rules, SSH access, TLSA DNS entries and some parts of the static-mirror system (!!) but most servers have been spared from severe damage and Nagios is not showing signs of downtime for user-facing services like DNS, web and GitLab
- 2021-09-27 @ 19:54 - a first hotfix is pushed to the Puppet server
- 2021-09-27 @ 19:57 - a second version of the hotfix is pushed to the Puppet server
- 2021-09-27 @ 20:03 - Puppet agent keeps attempting to deconfigure its nodes due to missing data in PuppetDB, so a full revert of the branch and following fixes is pushed to the Puppet server, to ensure catalog compiles everywhere
- 2021-09-27 @ 20:15-20:30 - puppet agent --enable ; puppet agent --noop --test ; puppet agent --disable is run on all Puppet nodes to attempt to rebuild PuppetDB data and revert damage to configurations
  - following this, most Puppet nodes seem to go back to a normal configuration state
  - the Puppet catalog on the static-mirror system is still not compiling due to a bootstrap problem
  - two DNS servers are down, and we're unable to SSH in, presumably due to destroyed ferm rules
- 2021-09-27 @ 20:34 - anarcat identifies a potential fix for the static-mirror system and pushes it to the Puppet server
  - the fix seems to be working, catalogs are compiling again on the static-mirror system
- 2021-09-27 @ 20:40 - the focus turns to two unresponsive DNS hosts (out of six), fallax and nutans
  - anarcat is unable to get a console on either fallax or nutans
  - anarcat attempts to get a console on fallax by shutting down the VM and adjusting the Grub config on its storage: success but now the VM fails to boot up
- 2021-09-27 @ 21:10 - anarcat and lavamind leave call, with lavamind agreeing to continue recovery attempts on both DNS hosts, anarcat logging off
- 2021-09-27 @ 21:28 - lavamind reports fallax is fully back online after restoring Grub config
- 2021-09-27 @ 21:43 - lavamind reports nutans is fully back online after logging in and running the Puppet agent manually
- 2021-09-27 @ 22:01 - lavamind reports Puppet appears to be in a state where configurations are restored to a normal state across the infrastructure, and begins to manually re-enable the agent on a handful of nodes at a time
- 2021-09-27 @ 22:15-ish - lavamind logging off
- 2021-09-27 @ 23:58 - anarcat manually re-enables the agent on all the remaining Puppet nodes (~2 hours)
Root cause analysis
The core of the issue is that Puppet was run without --no-usecacheonfailure. In other words, the agents fell back to their cached catalog when compilation failed, which, in turn, led Puppet to clear out a lot of resources everywhere.
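As an illustration only, here is a minimal sketch of an agent wrapper that would not have hidden the compilation failure. It is not what pat actually does: the function names and the PUPPET override variable are made up for this example. It relies on the --no-usecacheonfailure and --detailed-exitcodes options of puppet agent, under which exit codes 0 and 2 mean a clean run (without and with changes) and 1, 4 and 6 indicate a failure.

```shell
# Hypothetical override so the wrapper can be exercised without Puppet installed.
PUPPET="${PUPPET:-puppet}"

# Translate an exit code produced with --detailed-exitcodes into a short label.
describe_exit_code() {
    case "$1" in
        0) echo "unchanged" ;;
        2) echo "changed" ;;
        4) echo "failed-resources" ;;
        6) echo "changed-with-failures" ;;
        *) echo "run-failed" ;;
    esac
}

# Run one foreground agent pass that refuses to fall back to a cached catalog,
# and return non-zero unless the run was clean.
run_agent_strict() {
    code=0
    "$PUPPET" agent --onetime --no-daemonize --show_diff \
        --no-usecacheonfailure --detailed-exitcodes "$@" || code=$?
    echo "puppet agent: $(describe_exit_code "$code") (exit $code)"
    case "$code" in
        0|2) return 0 ;;
        *) return 1 ;;
    esac
}
```

Run through cumin or any other orchestration tool, a wrapper like this should make a failed compilation show up as a failed command instead of a suspiciously fast success.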
Recovery attempts were frustrated by inaccessible consoles on moly (the libvirt host for fallax) and sunet (the OpenStack host for nutans).
What went right
The problem was identified quickly and damage was limited. The Puppet agents and server were quickly disabled, thanks to the prompt action of lavamind and anarcat running cumin on all servers.
Since the deployment was made during working hours while both sysadmins were online, we were able to collaborate on recovery efforts.
Lessons learned
We learned that when the agent runs with the --noop switch, no changes are applied to the node itself, but its exported resources are still uploaded to PuppetDB on the server.
Based on the timeline of events, it's likely that the initial agent run, in which catalog compilation failed and a cached version was used, caused the removal of exported resources from PuppetDB. We learned to be careful about the state of exported resources when deploying large-scale changes.
Before pushing the ENC changes to the Puppet server, lavamind should have disabled the agent everywhere and run tests on a handful of nodes, as described in the wiki. That would likely have allowed him to notice immediately that the missing classes were an issue, and offered an opportunity to fix it before the Puppet agent, running automatically on all nodes, did any widespread damage.
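The procedure in the wiki is not reproduced here, but the general idea can be sketched as follows. This is a hedged example, not our actual tooling: the function names, the REASON message and the DRY_RUN switch are invented for illustration, and plain ssh stands in for cumin. The remote commands are the same enable / noop test / disable sequence used during the recovery.

```shell
# DRY_RUN=1 (the default here) only prints what would be run on each host.
DRY_RUN="${DRY_RUN:-1}"
REASON="testing ENC change"   # hypothetical lock message

# Run a command line on a remote host, or just print it in dry-run mode.
remote() {
    host="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$host] $*"
    else
        ssh "$host" "$*"
    fi
}

# Step 1: stop automatic agent runs everywhere, before pushing the change.
disable_everywhere() {
    for host in "$@"; do
        remote "$host" "puppet agent --disable '$REASON'"
    done
}

# Step 2, after the push: compile and preview the catalog on the first
# COUNT hosts only, leaving the agent disabled again afterwards.
# --test implies --no-usecacheonfailure, so a broken catalog is not masked.
test_canaries() {
    count="$1"; shift
    i=0
    for host in "$@"; do
        [ "$i" -lt "$count" ] || break
        remote "$host" "puppet agent --enable && puppet agent --test --noop; puppet agent --disable '$REASON'"
        i=$((i + 1))
    done
}
```

For example, disable_everywhere fallax nutans pauli followed, after the push, by test_canaries 1 fallax nutans pauli would only touch fallax in the second step. Keep in mind the lesson above: even a --noop run updates exported resources in PuppetDB, which is one more reason to limit the test to a handful of nodes.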
The following changes were performed:
- anarcat patched modules/staticsync/templates/static-components.conf.erb to fix the bootstrap loop: e67d3e3f staticsync: turn a failure into a warning
- anarcat reverted the ENC patchset in tor-puppet.git (patches 4e3776f6..3077de2a)
- anarcat patched pat (64fe3ed2) to "bring back --no-usecacheonfailure"
Followup work
The following tickets were created to address issues discovered during the outage:
- #40421: enhance incident response procedures
- use this file as a first template for post-mortems
- next time: dedicate a person to note-taking earlier so that this post-mortem is easier to write