okay, so we do have an actual problem here: it seems the apt-daily.service job, which was supposed to run apt-get update everywhere, at least daily, is not actually doing its job. take idle-fsn-01 for example:
root@idle-fsn-01:~# /usr/share/prometheus-node-exporter-collectors/apt_info.py
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 15
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1728443683.6660926
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
that timestamp is more than a day old:
> (1728574537-1728443683.6660926)s
(1728574537 − 1728443683,6660926) secondes ≈ 1 d + 12 h + 20 min + 53,33 s
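for reference, the same arithmetic can be redone as a small python sketch. the two numbers are the ones quoted above; `format_lag` is a made-up helper name and the one-day cutoff is only used here to illustrate "more than a day", it is not the actual alert threshold:

```python
# Sketch: how old is the apt package cache, given the exporter timestamp?
# format_lag is a hypothetical helper, not part of any existing tool.

def format_lag(seconds):
    """Render a duration in seconds as days, hours, minutes and seconds."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days} d + {hours} h + {minutes} min + {secs} s"

cache_timestamp = 1728443683.6660926  # apt_package_cache_timestamp_seconds
now = 1728574537                      # time of the check, from the calculation above

lag = now - cache_timestamp
print(format_lag(lag))   # 1 d + 12 h + 20 min + 53 s
print(lag > 86400)       # True: the cache is more than a day old
```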
the timer is scheduled to run in 12h, and the service actually ran an hour ago:
@anarcat I haven't looked at the file at all.. I just saw something recently about timer unit files needing to have OnBootSec in order for them to be activated at boot time. is this maybe missing in the unit file?
so yeah, i thought i had a fix for this in puppet by passing 'always' to the update parameter; i wrote a detailed explanation in the commit log:
commit a8c447cef0101444378ea787f50e4ac4d92db945
Author: Antoine Beaupré <anarcat@debian.org>
Date: Thu Oct 10 12:00:39 2024 -0400

run apt-get update more frequently

This setting changes the APT::Periodic::Update-Package-Lists parameter
in the APT configuration, which in turns affects how the
`/usr/lib/apt/apt.systemd.daily` script works. By default, it's set to
zero ("0") but the unattended-upgrades module somewhat reasonably
turns that on by setting it to one ("1").

Except one, here, doesn't just mean "true", it means "one day", or
"daily". What that's going to do is make the script check the update
timestamp for when it ran last and make sure we wait more than a
day. But then the script only gets called once or twice a day, which
means that it can actually take more than a day to run apt-get update.

Reduce that complexity and just run apt-get update every time we call
this script.

Upgrades, however, are still configured to run "more than daily"
here. The rationale is that we might not want *those* to upgrade twice
a day. That could be fixed later, but so far are alerts for this are
much more liberaly, and I think the current settings for that are fine.

Closes: tpo/tpa/prometheus-alerts#22

1 file changed, 4 insertions(+)
modules/profile/manifests/unattended_upgrades.pp | 4 ++++

modified modules/profile/manifests/unattended_upgrades.pp
@@ -39,5 +39,9 @@ class profile::unattended_upgrades {
     # takes effect only after unattended-upgrades 2.5, shipped in BULLSEYE
     remove_new_unused_deps => true,
     remove_unused_kernel => true,
+    # always run apt-get update in the daily job, instead of having
+    # *two* separate timers. the systemd timer takes care of making
+    # sure this only runs daily.
+    update => 'always',
   }
 }
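to make the reasoning in that commit message concrete, here is a toy simulation. it is a simplified model and not the real logic of `/usr/lib/apt/apt.systemd.daily`: the timer firing times are made up, and `simulate` and `max_gap` are hypothetical names. it only shows how a "wait at least one day" check, combined with a timer that fires roughly twice a day at uneven times, can leave more than a day between two actual runs of apt-get update:

```python
# Toy model: a timer fires at the given times, and the job only does the
# real work if at least min_interval seconds passed since the last real run.
# This is an illustration of the commit message, not the real script.

HOUR = 60 * 60
DAY = 24 * HOUR

def simulate(fire_times, min_interval):
    """Return the firing times at which the update would actually run."""
    runs = []
    last = None
    for now in fire_times:
        if last is None or now - last >= min_interval:
            runs.append(now)
            last = now
    return runs

def max_gap(runs):
    """Longest delay between two consecutive real runs."""
    return max(later - earlier for earlier, later in zip(runs, runs[1:]))

# hypothetical firing times: roughly twice a day, with some jitter
fire_times = [0, 11 * HOUR, 23.5 * HOUR, 36 * HOUR]

daily = simulate(fire_times, DAY)   # models the "1" (one day) setting
always = simulate(fire_times, 0)    # models the 'always' setting

print(max_gap(daily) / HOUR)    # 36.0: the 23.5 h firing is skipped
print(max_gap(always) / HOUR)   # 12.5: every firing does an update
```

in this model the firing at 23.5 h is rejected for being half an hour too early, so the next real update only happens at 36 h.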
but that broke because our unattended-upgrades module is out of date. and we can't update it because it drops support for puppet 5 and would break upgrades there.
so i'm just going to bump the alert latency threshold for now and punt this until we upgrade to puppet 7 everywhere.
so i tweaked the alert in 91bef26b, the next step here is to upgrade all servers to puppet 7 (#41819) and then merge the u-u-upgrade branch in Puppet, and then revert 91bef26b here.
okay, this is weird: on lists-01 right now, there's an AptUpdateLagging alert firing, and it seems legit:
root@lists-01:/etc/mailman3# systemctl status apt-daily | grep Active
     Active: inactive (dead) since Tue 2024-10-29 20:07:55 UTC; 17h ago
root@lists-01:/etc/mailman3# /usr/share/prometheus-node-exporter-collectors/apt_info.py | grep timestamp
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1730096470.0152664
root@lists-01:/etc/mailman3# date +%s
1730294846
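a quick way to double-check that the alert is legit is to pull the timestamp out of the exporter output and compare it with the `date +%s` value. this is only a sketch: `cache_timestamp` is a hypothetical helper, and the sample text is just the three lines of output shown above:

```python
# Sketch: extract apt_package_cache_timestamp_seconds from the collector's
# text output and compute how far behind the apt cache is.

def cache_timestamp(exporter_output):
    """Return the cache timestamp as a float, or None if it is missing."""
    for line in exporter_output.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        name, _, value = line.partition(" ")
        if name == "apt_package_cache_timestamp_seconds":
            return float(value)
    return None

SAMPLE = """\
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1730096470.0152664
"""

NOW = 1730294846  # output of `date +%s` on lists-01

lag = NOW - cache_timestamp(SAMPLE)
print(round(lag / 3600, 1))  # 55.1 hours, so more than two days behind
```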
the service ran recently enough:
root@lists-01:/etc/mailman3# systemctl status apt-daily | grep Active
     Active: inactive (dead) since Tue 2024-10-29 20:07:55 UTC; 17h ago
... buuuut it didn't work because our "update flag" is wrong:
so i tweaked the alert in 91bef26b, the next step here is to upgrade all servers to puppet 7 (#41819) and then merge the u-u-upgrade branch in Puppet, and then revert 91bef26b here.
i've done that merge, now that we've upgraded to Puppet 7 everywhere. i don't think this will have any negative impact, but i've run puppet by hand on the idle-* node to check, and the catalog compiles fine at least.
this should knock out yet another false positive, whoohoo!