Add note about cases caused by ganeti instances

This covers both the easy and the long methods. The long method reduces the
amount of downtime for our VMs but takes longer to carry out, especially
because you have to figure out which instances use a 'plain' disk template and
handle them differently.

Others do upgrade automatically, but require a manual
restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
of restarting services, but it can't actually deal with everything.

Our alert in Alertmanager only shows a sum of how many hosts have pending
restarts. To check the entire fleet and simultaneously discover which hosts are
triggering the alert, run this command in [Fabric](howto/fabric):

    fab fleet.pending-restarts
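
If you would rather bypass Fabric and ask Prometheus directly which hosts are
behind the alert, a query along the following lines can work. This is only a
sketch: the metric name and the server URL are assumptions, so take the real
expression from the alert rule in the prometheus-alerts repository.

    # ASSUMPTIONS: the metric name and Prometheus URL below are placeholders.
    PROM_URL="https://prometheus.example.org"
    QUERY='needrestart_pending_restarts > 0'

    # Query the Prometheus HTTP API and print one "host value" line per result.
    curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
      | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'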

If you cannot figure out why the warning happens, you might want to
run `needrestart` on a particular host by hand:

    needrestart -v
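
To only list the services that still need a restart, without restarting
anything, `needrestart` can also be run in batch, list-only mode (a quick
sketch; see the needrestart manual page for the exact output format):

    # -b: batch (machine-readable) output; -r l: list only, restart nothing
    needrestart -b -r l
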
Important notes:
1. Ganeti instance (VM) processes (kvm) might show up as running with an
   outdated library and `needrestart` will try to restart the `ganeti.service`
   unit, but that will not fix the issue. In this situation, you can reboot the
   whole node, which will cause a downtime for all instances on it.

   * An alternative that limits the downtime on instances, but takes longer to
     operate, is to issue a series of instance migrations to their secondaries
     and then back to their primaries (see the sketch after this list).
     However, some instances with disks of type 'plain' cannot be migrated and
     need to be rebooted instead, with
     `gnt-instance stop $instance && gnt-instance start $instance`
     on the cluster's main server (issuing a reboot from within the instance,
     e.g. with the `reboot` fabric script, might not stop the instance's KVM
     process on the Ganeti node, so it is not enough).

2. There's a false alarm that occurs regularly here because there's lag between
   `needrestart` running after upgrades (which is on a `dpkg` post-invoke hook)
   and the metrics updates (which are on a timer running daily and 2 minutes
   after boot).

   If a host is showing up in an alert and the above fabric task says:

       INFO: no host found requiring a restart

   It might be that the timer hasn't run recently enough; you can diagnose
   that with:

       systemctl status tpa-needrestart-prometheus-metrics.timer tpa-needrestart-prometheus-metrics.service

   And, normally, fix it with:

       systemctl start tpa-needrestart-prometheus-metrics.service

   See [issue `prometheus-alerts#20`](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/20)
   to get rid of that false positive.
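
The "longer" approach from note 1 above boils down to something like the
following, run on the Ganeti cluster's main server. This is a sketch under
assumptions: `$instance` names a single affected instance, the migrate commands
prompt for confirmation, and you should double-check the disk template before
acting.

    # Hypothetical variable: name of one instance whose KVM process is holding
    # outdated libraries open.
    instance=example-01

    # Check the disk template first: 'plain' instances cannot be migrated.
    gnt-instance list -o name,disk_template --no-headings "$instance"

    # DRBD instances: migrate to the secondary, then back to the primary.
    # Each migration starts a fresh KVM process on the target node, so the
    # downtime stays minimal.
    gnt-instance migrate "$instance"
    gnt-instance migrate "$instance"

    # 'plain' instances: a full stop/start is needed to get a fresh KVM
    # process (a reboot from inside the guest is not enough).
    gnt-instance stop "$instance" && gnt-instance start "$instance"
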
Packages are blocked from upgrades when they cause significant
breakage during an upgrade run, enough to cause an outage and/or