i wonder if this issue might have actually been fixed in Debian's grub scripts... I am not sure what error we were getting on the console here exactly, but if i remember correctly, it looked something like bug 966575 in the Debian BTS, which is:
GRUB loading.
Welcome to GRUB!
error: symbol 'grub_calloc' not found.
grub rescue>
was that the exact error?
if that's the case, then the bug above is marked as fixed in grub 2.02+dfsg1-20+deb10u3, which has been shipped everywhere. also, since that version, grub-pc upgrades will completely fail if grub-install fails, which should at least mark the package as broken in situations such as this.
so should we remove this from the unattended-upgrades blocklist?
uhm. So last time I did update grub I had to select the partition to reinstall grub on. I am not sure that happens with unattended-upgrades which I think was part of the problem?
i think that's the point of the patches added to 2.02+dfsg1-20+deb10u3... the changelog says:
* When upgrading grub-pc noninteractively, bail out if grub-install fails.
  It's better to fail the upgrade than to produce a possibly-unbootable
  system.
* Explicitly check whether the target device exists before running
  grub-install, since grub-install copies modules to /boot/grub/ before
  installing the core image, and the new modules might be incompatible
  with the old core image (closes: #966575).
The first part should ensure that the unattended-upgrade will fail to upgrade grub (which we will detect in nagios) instead of just going ahead and installing the incompatible stuff in /boot. The second part adds an extra check, which, I think, might be the problem you're describing: the case where dpkg-reconfigure grub-pc has to run to properly select a device to grub-install the grub shims on...
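To make the second changelog item concrete, here is a rough sketch of the kind of check it describes. This is not the actual grub-pc maintainer script; the function name and the injectable `exists` parameter are made up for illustration, and I'm assuming the configured devices arrive as the comma-separated string debconf uses for multiselect values.

```python
import os


def missing_install_devices(install_devices, exists=os.path.exists):
    """Return the configured GRUB target devices that are absent on this host.

    `install_devices` is assumed to be a comma-separated string such as
    "/dev/sda, /dev/sdb". `exists` is injectable so the check can be
    exercised without real block devices (hypothetical helper, not part
    of grub-pc itself).
    """
    devices = [d.strip() for d in install_devices.split(",") if d.strip()]
    return [d for d in devices if not exists(d)]
```

If this returns a non-empty list, the idea is to stop before grub-install touches /boot/grub/ at all, rather than copying new modules next to an old core image.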
I have also run dpkg-reconfigure grub-pc on boxes where grub had to be upgraded here, by hand, before doing reboots. I guess what I am proposing is that, next time there's a grub update, we try to not do that and see what happens. If everything works, then we re-enable automatic upgrades...
There's been a grub-pc upgrade pending recently, and I've tested removing the package from the unattended-upgrades blacklist and running unattended-upgrades manually with --verbose. In all cases I've tested (ganeti node, ganeti instance and standalone host) the upgrade went without a hitch and the subsequent reboot was successful. I'm therefore going ahead with re-enabling it via Puppet across the fleet.
Yesterday I removed grub-pc from the unattended-upgrades blacklist and ran unattended-upgrades -v across the fleet. The vast majority of upgrades went ahead without a hitch, but the upgrade failed on 12 machines. The cause identified was an empty grub-pc/install_devices debconf parameter.
The fix was to run apt install --fix-broken and select the proper disk where GRUB should be installed, usually the first disk (e.g. /dev/sda).
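To spot affected hosts before the upgrade fails, something like the following sketch could parse the output of `debconf-show grub-pc` and report whether grub-pc/install_devices is empty. The function names are hypothetical, and I'm assuming the usual `debconf-show` line format (`* key: value`, where the leading `*` marks a question that was already seen); `debconf-show` normally needs root to read the database.

```python
import subprocess


def parse_install_devices(debconf_show_output):
    """Extract the grub-pc/install_devices value from debconf-show output.

    Returns the list of configured devices, or an empty list when the
    parameter is empty or absent (the failure case seen on 12 machines).
    """
    for line in debconf_show_output.splitlines():
        # Drop the optional "*" marker and indentation, then split on the
        # first colon only, since device paths may themselves contain colons.
        key, sep, value = line.lstrip("* ").partition(":")
        if sep and key.strip() == "grub-pc/install_devices":
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


def current_install_devices():
    """Run debconf-show on this host (assumed available, run as root)."""
    result = subprocess.run(
        ["debconf-show", "grub-pc"],
        capture_output=True, text=True, check=True,
    )
    return parse_install_devices(result.stdout)
```

An empty list from `current_install_devices()` would flag a host that needs a device selected by hand before grub-pc can upgrade unattended.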
I had to reboot approximately 15 machines for a kernel upgrade after running the grub-pc upgrade via unattended-upgrades, and none of them failed to boot, so I have high confidence this change should not cause any further issues.
Additionally, a second upgrade for grub-pc is forthcoming in Debian, so I'll keep this ticket open to check that the unattended upgrade works as expected and perform a handful of reboots to verify.
one of my concerns right now is that we had failures in grub on hosts that were not recently installed (e.g. nevii). how will changing the install procedure fix that class of problems?
let's keep in mind that one of the outcomes of this ticket might very well be that we do not want to automatically update grub. grub updates are not necessarily that frequent that we need them completely automated and besides, for them to be really automated, we'd also need reboots, so maybe it's actually fine that the package update is blocked. that way we can be more deliberate on how we deploy this through the fleet, coupled with reboots...
one of my concerns right now is that we had failures in grub on hosts that were not recently installed (e.g. nevii). how will changing the install procedure fix that class of problems?
I think the most likely explanation is that this particular machine was missing the debconf parameter because of some manual intervention made the previous time there were issues with the grub package. Another explanation could be that the correct disk path changed, but I'm not concerned about this being an issue because we rarely change disks on existing machines (even less so the disk where root lives).
let's keep in mind that one of the outcomes of this ticket might very well be that we do not want to automatically update grub. grub updates are not necessarily that frequent that we need them completely automated and besides, for them to be really automated, we'd also need reboots, so maybe it's actually fine that the package update is blocked. that way we can be more deliberate on how we deploy this through the fleet, coupled with reboots...
I think we already have too many manual maintenance steps, and if the grub-pc package upgrade succeeds it's safe to assume the machine is not broken and we don't need to reboot the whole fleet just to make sure. And if that isn't the case, kernel upgrades are regular enough that we'd find out pretty quickly.
oh, and @micah mentioned yesterday that he also had problems with the grub upgrade in another fleet of servers, maybe he can expand on that here...
Additionally, a second upgrade for grub-pc is forthcoming in Debian, so I'll keep this ticket open to check that the unattended upgrade works as expected and perform a handful of reboots to verify.
... and it's also possible that this second upgrade would be a fix for the bug that @micah experienced...
No Xen machines, all KVM... I think 3-4 machines out of ~50 failed to
load grub after upgrade when rebooted. Had to manually fix over serial,
have no idea what happened.
...
On 2022-09-23 13:03:55, Jérôme Charaoui (@lavamind) wrote:
@anarcat This issue has been waiting for information two
weeks. It needs attention. Please take care of this before
the end of 2022-10-22 or it
will be moved to the Icebox.