colchicifolium units flapping

During the break (and the quieter alert period), I noticed recurring failures on colchicifolium.

Steps to reproduce

mostly catching up on the #tor-alerts backlog, during a break

What is the current bug behavior?

daily SystemdFailedUnits alerts on colchicifolium, each resolving on its own a few hours later.

What is the expected correct behavior?

no failures

When did this start?

unsure, needs to be investigated further.

Relevant logs and/or screenshots

Day changed to 29 Jun 2025
00:29:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:04:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org
Day changed to 30 Jun 2025
01:07:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:07:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org
          04:02 | Joins: groentor, zen-fu | 06:15
07:16:36 -ALERTOR1:#tor-alerts- HostDown [firing] Host dal-rescue-02.torproject.org is not responding CRITICAL!
Day changed to 01 Jul 2025
00:29:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:04:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org

Possible fixes

for now, i've issued a silence for a week on this, but it would be nice to fix this.

alternatively, we might want to put those alerts in the "info" level, although it would hide such transient failures completely, which might not be what we actually want.

maybe a retry could be set in the systemd unit?

in any case, more logs are needed to analyse this, which can be found on the server...

we could also figure out the pattern better. for now it looks like a daily failure: in the log above it fires between 00:29 and 01:07, resolves just after 03:00, and so lasts between two and two and a half hours. we need better history for the alerts to confirm that pattern (see also reduce alert fatigue in Prometheus (#42222))
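
as a rough way to get at the pattern from the IRC backlog alone, here is a minimal Python sketch that pairs each [firing] line with the next [resolved] line and prints how long the alert stayed up. the function name alert_windows and both regexes are made up for this example, and they assume the log format pasted above ("Day changed to" lines followed by HH:MM:SS timestamps); a different IRC client would need different patterns.

```python
import re
from datetime import datetime, timedelta

# Log excerpt copied from the "Relevant logs" section of this issue.
LOG = """\
Day changed to 29 Jun 2025
00:29:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:04:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org
Day changed to 30 Jun 2025
01:07:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:07:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org
          04:02 | Joins: groentor, zen-fu | 06:15
07:16:36 -ALERTOR1:#tor-alerts- HostDown [firing] Host dal-rescue-02.torproject.org is not responding CRITICAL!
Day changed to 01 Jul 2025
00:29:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on colchicifolium.torproject.org
03:04:06 -ALERTOR1:#tor-alerts- SystemdFailedUnits [resolved] Some systemd units are in failed state on colchicifolium.torproject.org
"""

# Assumed line formats, inferred from the excerpt above.
DAY_RE = re.compile(r"^Day changed to (\d{2} \w{3} \d{4})$")
ALERT_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) -\S+- (\w+) \[(firing|resolved)\] (.*)$")


def alert_windows(log, alert, host):
    """Return (start, duration) pairs for one alert name on one host."""
    day = None      # date taken from the last "Day changed to" line
    started = None  # timestamp of the currently open [firing] line
    windows = []
    for raw in log.splitlines():
        line = raw.strip()
        m = DAY_RE.match(line)
        if m:
            day = datetime.strptime(m.group(1), "%d %b %Y")
            continue
        m = ALERT_RE.match(line)
        if not m or day is None:
            continue  # joins/parts and other chatter
        clock, name, state, message = m.groups()
        if name != alert or not message.endswith(host):
            continue  # some other alert or host
        t = datetime.strptime(clock, "%H:%M:%S")
        stamp = day.replace(hour=t.hour, minute=t.minute, second=t.second)
        if state == "firing":
            started = stamp
        elif started is not None:
            windows.append((started, stamp - started))
            started = None
    return windows


windows = alert_windows(LOG, "SystemdFailedUnits", "colchicifolium.torproject.org")
for start, duration in windows:
    print(start, duration)
# 2025-06-29 00:29:06 2:35:00
# 2025-06-30 01:07:06 2:00:00
# 2025-07-01 00:29:06 2:35:00
```

three days of backlog is too little to conclude much, but the resolve times cluster tightly around 03:04 to 03:07. that only covers what made it to IRC, so the Prometheus alert history is still the better source once we have it.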
