Puppet ENC deployment failure post-mortem

Yesterday, a deployment of the new Puppet External Node Classifier (ENC, #40358 (closed)) caused a cascade of failures across our Puppet infrastructure, which was thankfully quickly identified and stopped. Several hours of manual remediation were required to revert the changes and ensure all configurations returned to a proper state.

As far as we know, no user-facing services were impacted by the incident.

Timeline

The timeline (UTC) goes like this:

2021-09-26 - anarcat reviews the tor-puppet feature branch and approves changes
2021-09-27 @ 17:39 - anarcat changes pat to run from cache (52977656 pat: do not use --test to run puppet)
2021-09-27 @ 19:10 - lavamind runs puppet agent on all servers to ensure catalog compiles everywhere
2021-09-27 @ 19:10 - lavamind merges the prepare-enc-roles branch to master and pushes to the Puppet server (pauli)
2021-09-27 @ 19:11 - lavamind runs this command on all systems: puppet agent --onetime --no-daemonize --show_diff, as hinted by anarcat's change to pat
- cumin reports 100% success rate
- command completion time seems abnormal, too fast compared to previous agent runs
2021-09-27 @ 19:17 - lavamind realizes that puppet agent was run without --no-usecacheonfailure, which means that the catalogs may have failed to compile everywhere and a cached version was used
2021-09-27 @ 19:22 - lavamind identifies that the base::includes class is missing from all Puppet nodes
2021-09-27 @ 19:29 - lavamind identifies several missing common classes from all puppet nodes, defined in hiera/common.yaml
2021-09-27 @ 19:35 - lavamind pushes a fix to the Puppet server to include common base classes on all nodes
2021-09-27 @ 19:36 - lavamind notices the catalogs still fail to compile because some manifests expect non-empty $::classes
2021-09-27 @ 19:38 - anarcat raises alarm about Puppet agent deconfiguring nodes, e.g. ferm (firewall) rules are being removed
- anarcat proceeds to puppet agent --disable on all nodes to prevent further damage
- lavamind disables Puppet server on pauli
2021-09-27 @ 19:40 - anarcat and lavamind join up on a voice call to discuss incident and recovery options
2021-09-27 @ 19:45 - tests indicate that Puppet is in a state where it wants to deconfigure some firewall rules, SSH access, TLSA DNS entries and some parts of the static-mirror system (!!) but most servers have been spared from severe damage and Nagios is not showing signs of downtime for user-facing services like DNS, web and GitLab
2021-09-27 @ 19:54 - a first hotfix is pushed to the Puppet server
2021-09-27 @ 19:57 - a second version of the hotfix is pushed to the Puppet server
2021-09-27 @ 20:03 - Puppet agent keeps attempting to deconfigure its nodes due to missing data in PuppetDB, so a full revert of the branch and following fixes is pushed to the Puppet server, to ensure catalog compiles everywhere
2021-09-27 @ 20:15-20:30 - puppet agent --enable ; puppet agent --noop --test ; puppet agent --disable is run on all Puppet nodes to attempt to rebuild PuppetDB data and revert damage to configurations
- following this, most Puppet nodes seem to go back to a normal configuration state
- the Puppet catalog on the static-mirror system is still not compiling due to a bootstrap problem
- two DNS servers are down, and we're unable to SSH in, presumably due to destroyed ferm rules
2021-09-27 @ 20:34 - anarcat identifies a potential fix for the static-mirror system and pushes it to the Puppet server
- the fix seems to be working, catalogs are compiling again on the static-mirror system
2021-09-27 @ 20:40 - the focus turns to two unresponsive DNS hosts (out of six), fallax and nutans
- anarcat is unable to get a console on either fallax or nutans
- anarcat attempts to get a console on fallax by shutting down the VM and adjusting the Grub config on its storage: this succeeds, but now the VM fails to boot up
2021-09-27 @ 21:10 - anarcat and lavamind leave call, with lavamind agreeing to continue recovery attempts on both DNS hosts, anarcat logging off
2021-09-27 @ 21:28 - lavamind reports fallax is fully back online after restoring Grub config
2021-09-27 @ 21:43 - lavamind reports nutans is fully back online after logging in and running the Puppet agent manually
2021-09-27 @ 22:01 - lavamind reports that configurations appear to be restored to a normal state across the infrastructure, and begins to manually re-enable the agent on a handful of nodes at a time
2021-09-27 @ 22:15-ish - lavamind logging off
2021-09-27 @ 23:58 - anarcat manually re-enables the agent on all the remaining Puppet nodes (~2 hours)

Root cause analysis

The core of the issue is that Puppet was run without --no-usecacheonfailure: when the new catalogs failed to compile, the agents silently fell back to their cached catalogs and reported success, which masked the failure. This, in turn, led Puppet to clear out a lot of resources everywhere.
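As a minimal sketch of the difference, the snippet below prints the agent command line from the timeline with the missing flag appended. The helper name safe_agent_cmd is hypothetical and is not part of pat or Puppet; the flags themselves are the ones named in this post-mortem.

```shell
#!/bin/sh
# safe_agent_cmd is a hypothetical helper name, not part of pat or Puppet.
# It prints the agent command line from the timeline, plus the one flag
# that was missing during the incident (--no-usecacheonfailure).
safe_agent_cmd() {
    printf '%s\n' 'puppet agent --onetime --no-daemonize --show_diff --no-usecacheonfailure'
}

# Only print the command here; actually running it is left to the operator.
safe_agent_cmd
```

With this flag, an agent whose catalog fails to compile should report the failure rather than apply a cached catalog. Note that puppet agent --test also implies --no-usecacheonfailure, along with --onetime, --no-daemonize and --show_diff among others, which is likely why the earlier --test based invocation did not have this blind spot.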

Recovery attempts were frustrated by inaccessible consoles on moly (fallax libvirt host) and sunet (nutans OpenStack host).

What went right

The problem was identified quickly and damage was limited. The Puppet agents and server were quickly disabled, thanks to the prompt action of lavamind and anarcat running cumin on all servers.

Since the deployment was made during working hours while both sysadmins were online, we were able to collaborate on recovery efforts.

Lessons learned

We learned that when the agent runs with the --noop switch, no changes are applied to the node itself, but its exported resources are still uploaded to PuppetDB on the server.

Based on the timeline of events, it's likely that the initial agent run, in which catalog compilation failed and a cached version was used, caused the removal of exported resources from PuppetDB. We learned to be careful about the state of exported resources when deploying large-scale changes.

Before pushing the ENC changes to the Puppet server, lavamind should have disabled the agent everywhere and run tests on a handful of nodes as described in the wiki. It's likely that this would have immediately allowed him to notice that the missing classes were an issue, and offered an opportunity to fix the issue before any widespread damage was done by the Puppet agent running automatically on all nodes.
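The staged approach described above could look roughly like the following dry-run sketch. It is not the wiki procedure itself: the function names run and canary_rollout, the host name canary-01.example.org, the disable message and the use of '*' as the "all hosts" cumin query are all assumptions made for illustration. The per-node command sequence mirrors the enable / noop test / disable sequence used during recovery.

```shell
#!/bin/sh
# Dry-run sketch of a staged deployment. All names below (run, canary_rollout,
# canary-01.example.org, the disable message) are hypothetical.

# Print the command by default; execute it only when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# $1: cumin host query selecting a handful of canary nodes.
canary_rollout() {
    canaries="$1"

    # Step 1: stop automatic agent runs everywhere BEFORE pushing the change.
    # Assumption: '*' selects all hosts in the local cumin setup.
    run cumin '*' 'puppet agent --disable "ENC deployment in progress"'

    # (The change would be pushed to the Puppet server at this point.)

    # Step 2: on the canaries only, enable the agent, do a noop test run
    # (--test implies --no-usecacheonfailure), then disable it again.
    run cumin "$canaries" 'puppet agent --enable ; puppet agent --noop --test ; puppet agent --disable "ENC deployment in progress"'
}

canary_rollout 'canary-01.example.org'
```

Because the commands in step 2 are chained with semicolons, the reported exit status is that of the final --disable, so the output of the noop run has to be read rather than inferred from the success rate. As noted in the lessons above, even a --noop run updates exported resources in PuppetDB, which is one more reason to keep the canary set small.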

The following changes were performed:

anarcat patched modules/staticsync/templates/static-components.conf.erb to fix the bootstrap loop: e67d3e3f staticsync: turn a failure into a warning
anarcat reverted the ENC patchset in tor-puppet.git (patches 4e3776f6..3077de2a)
anarcat patched pat (64fe3ed2) to "bring back --no-usecacheonfailure"

Followup work

The following tickets were created to address issues discovered during the outage:

#40421: enhance incident response procedures
use this file as a first template for post-mortems
next time: dedicate a person to note-taking earlier so that this post-mortem is easier to write

Edited Sep 30, 2021 by anarcat