alerts BackupStalled and PuppetCatalogStale do not expire correctly after host retirement
four onionoo servers were retired in #41838 (closed). at the time, the retirement script didn't automatically set silences, but I think I set some manually: lelutin found silence edf5db64-f86f-49e8-8b6a-945c4789e50e, for example. those silences eventually (and correctly) expired.
now we're getting warnings about those hosts. examples:
19:03:23 -ALERTOR1:#tor-alerts- PuppetCatalogStale [firing] Stale Puppet catalog on onionoo-backend-01.torproject.org
19:03:23 -ALERTOR1:#tor-alerts- PuppetCatalogStale [firing] Stale Puppet catalog on onionoo-backend-02.torproject.org
19:03:23 -ALERTOR1:#tor-alerts- PuppetCatalogStale [firing] Stale Puppet catalog on onionoo-frontend-01.torproject.org
19:03:23 -ALERTOR1:#tor-alerts- PuppetCatalogStale [firing] Stale Puppet catalog on onionoo-frontend-02.torproject.org
[...]
Day changed to 14 Nov 2024
[...]
12:00:56 -ALERTOR1:#tor-alerts- BackupStalled [firing] A backup job is stalled on onionoo-backend-02.torproject.org
12:00:56 -ALERTOR1:#tor-alerts- BackupStalled [firing] A backup job is stalled on onionoo-frontend-01.torproject.org
12:00:56 -ALERTOR1:#tor-alerts- BackupStalled [firing] A backup job is stalled on onionoo-frontend-02.torproject.org
timestamps are UTC-5.
so something is still publishing metrics for those hosts, even though they have been retired from Puppet (presumably correctly, but perhaps that should be double-checked).
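one way to find out what is still publishing those metrics would be to ask Prometheus which series still mention the retired hosts. the sketch below only builds the URL for the standard `/api/v1/series` endpoint; the `instance` label name, the optional `:port` suffix and the `localhost:9090` base URL are assumptions and probably need adjusting to our setup. the `job` label on the returned series should hint at which exporter or target is the culprit.

```python
from urllib.parse import urlencode

# hosts named in the alerts above
RETIRED_HOSTS = [
    "onionoo-backend-01.torproject.org",
    "onionoo-backend-02.torproject.org",
    "onionoo-frontend-01.torproject.org",
    "onionoo-frontend-02.torproject.org",
]


def selector_for(hosts, label="instance"):
    """Build a series selector matching any of the hosts.

    The label name is an assumption. Prometheus regex matchers are
    fully anchored, so the optional ":port" suffix is spelled out.
    """
    # "[.]" matches a literal dot without backslash escapes in the string
    alternatives = "|".join(h.replace(".", "[.]") for h in hosts)
    return '{%s=~"(%s)(:[0-9]+)?"}' % (label, alternatives)


def series_query_url(base_url, hosts, label="instance"):
    """Return the /api/v1/series URL listing series for the given hosts."""
    query = urlencode({"match[]": selector_for(hosts, label)})
    return f"{base_url.rstrip('/')}/api/v1/series?{query}"


if __name__ == "__main__":
    # base URL is a placeholder, not our real Prometheus server
    print(series_query_url("http://localhost:9090", RETIRED_HOSTS))
```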
let's figure out a way to (a) get rid of those alerts and (b) make sure it doesn't happen the next time we retire a host.
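for (a), a stopgap could be a fresh silence covering the retired hosts until the stale metrics are gone. this is only a sketch of the payload that the Alertmanager v2 API accepts on `POST /api/v2/silences`; the `instance` label name, the duration and the `created_by` value are assumptions, and it does nothing for (b), which needs the metrics themselves to stop being published.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(hosts, days, created_by, comment, label="instance", now=None):
    """Build an Alertmanager v2 silence payload for the given hosts.

    The label name is an assumption; use whatever label the alerts
    actually carry. Alertmanager regex matchers are anchored.
    """
    now = now or datetime.now(timezone.utc)
    # "[.]" matches a literal dot without needing backslash escapes
    alternatives = "|".join(h.replace(".", "[.]") for h in hosts)
    return {
        "matchers": [
            {
                "name": label,
                "value": f"({alternatives})(:[0-9]+)?",
                "isRegex": True,
                "isEqual": True,
            }
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(days=days)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


if __name__ == "__main__":
    payload = build_silence(
        ["onionoo-backend-01.torproject.org"],
        days=7,  # arbitrary duration, for illustration only
        created_by="retirement-sketch",  # hypothetical name
        comment="host retired in #41838",
    )
    # print the JSON body; sending it to Alertmanager is left out on purpose
    print(json.dumps(payload, indent=2))
```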