|
[[_TOC_]]
|
|
[[_TOC_]]
|
|
|
|
|
|
|
|
|
|
# Major upgrades
|
|
# Major upgrades
|
|
|
|
|
|
Major upgrades are done by hand, with a "cheat sheet" created for each
|
|
Major upgrades are done by hand, with a "cheat sheet" created for each
|
... | @@ -68,25 +67,18 @@ block certain upgrades. If you want to bypass that, use regular `apt`: |
... | @@ -68,25 +67,18 @@ block certain upgrades. If you want to bypass that, use regular `apt`: |
|
|
|
|
|
cumin -b 10 '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
|
|
cumin -b 10 '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
|
|
|
|
|
|
## GitLab runner upgrades
|
|
# Special cases and manual restarts
|
|
|
|
|
|
Every month or so GitLab publishes an update to the `gitlab-runner` apt
|
|
|
|
package. The package is excluded from `unattended-upgrades` to avoid any
|
|
|
|
risk of interrupting long-running CI jobs (e.g. large shadow sims).
|
|
|
|
|
|
|
|
The recommended procedure is to go through each CI machine one at a time,
|
|
The above covers all upgrades that are automatically applied, but some
|
|
pause all the runners on that single machine, ensure no long-running
|
|
are blocked from automation and require manual intervention.
|
|
shadow sims are being executed, and launch `apt upgrade`. If any regular
|
|
|
|
CI jobs are running, systemd will wait up to one hour for them to end,
|
|
|
|
then proceed with the package upgrade.
|
|
|
|
|
|
|
|
## Restarting services by hand
|
|
Others do upgrade automatically, but require a manual
|
|
|
|
restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
|
|
|
|
of restarting services, but it can't actually deal with everything.
|
|
|
|
|
|
After upgrades, there's a Nagios check that might trigger and tell you
|
|
There is a Nagios check that might trigger and tell you that some
|
|
that some services are running with outdated libraries. Normally,
|
|
services are running with outdated libraries. You may see a warning
|
|
[needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care of restarting
|
|
like:
|
|
services, but it can't actually deal with everything. In Nagios, you
|
|
|
|
will see a warning like:
|
|
|
|
|
|
|
|
[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
|
|
[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
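The `Services:` field in that warning is the number of services still running with outdated libraries. As a minimal sketch, assuming the warning line keeps the format shown above, the count could be extracted like this (the `needrestart_service_count` helper is hypothetical, not part of needrestart or the Nagios check):

```shell
# Hypothetical helper: print the number after "Services:" in a
# needrestart warning line passed as the first argument.
# Prints nothing if the line does not contain a numeric "Services:" field.
needrestart_service_count() {
    printf '%s\n' "$1" | sed -n 's/.*Services: \([0-9][0-9]*\).*/\1/p'
}

# Example with the warning line quoted in this page:
line='[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none'
needrestart_service_count "$line"
```

This only reads the summary line; `needrestart -v` on the host remains the way to see which services are affected.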
|
|
|
|
|
... | @@ -102,40 +94,6 @@ run the check by hand: |
... | @@ -102,40 +94,6 @@ run the check by hand: |
|
|
|
|
|
needrestart -v
|
|
needrestart -v
|
|
|
|
|
|
There are a few scenarios here:
|
|
|
|
|
|
|
|
* `cron.service`: typically services that should run under `systemd
|
|
|
|
--user`, reboot the box or ask the service admin to restart their
|
|
|
|
services
|
|
|
|
|
|
|
|
* `cron.service`, special case: sometimes, userdir-ldap's
|
|
|
|
`ud-replicate` leaves a multiplexing SSH process lying
|
|
|
|
around. Logging into the LDAP server (currently `alberti`) and
|
|
|
|
killing all the `sshdist` processes will clear those:
|
|
|
|
|
|
|
|
pkill -u sshdist ssh
|
|
|
|
|
|
|
|
* `ganeti.service`: typically this is an OpenSSL upgrade that affects
|
|
|
|
qemu, and restarting ganeti (thankfully) doesn't restart VMs. To
|
|
|
|
fix this, migrate all VMs to their secondaries and back, see
|
|
|
|
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
|
|
|
|
restart](#instance-only-restarts) procedure.
|
|
|
|
|
|
|
|
* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
|
|
|
|
[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart
|
|
|
|
OVS, then migrate the machines back. It's actually easier to just
|
|
|
|
treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure, see the
|
|
|
|
[Ganeti reboot procedures](howto/ganeti#rebooting) instead.
|
|
|
|
|
|
|
|
Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
|
|
|
|
Debian is marked as fixed, but will still need to be tested on our
|
|
|
|
side first. Update: it hasn't been fixed.
|
|
|
|
|
|
|
|
- **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
|
|
|
|
well, so it is blocked. To upgrade, make sure the install device is
|
|
|
|
defined, by running `dpkg-reconfigure grub-pc`. This issue might
|
|
|
|
actually have been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
|
|
|
|
|
|
|
|
Packages are blocked from upgrades when they cause significant
|
|
Packages are blocked from upgrades when they cause significant
|
|
breakage during an upgrade run, enough to cause an outage and/or
|
|
breakage during an upgrade run, enough to cause an outage and/or
|
|
require significant recovery work. This is done through Puppet, in the
|
|
require significant recovery work. This is done through Puppet, in the
|
... | @@ -148,15 +106,82 @@ Packages can be unblocked if and only if: |
... | @@ -148,15 +106,82 @@ Packages can be unblocked if and only if: |
|
* we have good confidence that future upgrades will not break the
|
|
* we have good confidence that future upgrades will not break the
|
|
system again
|
|
system again
|
|
|
|
|
|
|
|
This section documents how to do some of those upgrades and restarts
|
|
|
|
by hand.
|
|
|
|
|
|
|
|
## GitLab runner upgrades
|
|
|
|
|
|
|
|
Every month or so GitLab publishes an update to the `gitlab-runner` apt
|
|
|
|
package. The package is excluded from `unattended-upgrades` to avoid any
|
|
|
|
risk of interrupting long-running CI jobs (e.g. large shadow sims).
|
|
|
|
|
|
|
|
The recommended procedure is to go through each CI machine one at a time,
|
|
|
|
pause all the runners on that single machine, ensure no long-running
|
|
|
|
shadow sims are being executed, and launch `apt upgrade`. If any regular
|
|
|
|
CI jobs are running, systemd will wait up to one hour for them to end,
|
|
|
|
then proceed with the package upgrade.
|
|
|
|
|
|
|
|
## cron.service
|
|
|
|
|
|
|
|
These are typically services that should be run under `systemd --user`
|
|
|
|
but instead are started with a `@reboot` cron job.
|
|
|
|
|
|
|
|
For this kind of service, reboot the server or ask the service admin
|
|
|
|
to restart their services themselves. Ideally, this service should be
|
|
|
|
converted to a systemd unit, see [this documentation](doc/services).
|
|
|
|
|
|
|
|
### ud-replicate special case
|
|
|
|
|
|
|
|
Sometimes, userdir-ldap's `ud-replicate` leaves a multiplexing SSH
|
|
|
|
process lying around and those show up as part of
|
|
|
|
`cron.service`.
|
|
|
|
|
|
|
|
Logging into the LDAP server (currently `alberti`) and killing all the
|
|
|
|
`sshdist` processes will clear those:
|
|
|
|
|
|
|
|
pkill -u sshdist ssh
|
|
|
|
|
|
|
|
## Ganeti
|
|
|
|
|
|
|
|
The `ganeti.service` warning is typically an OpenSSL upgrade that
|
|
|
|
affects qemu, and restarting ganeti (thankfully) doesn't restart
|
|
|
|
VMs. To fix this, migrate all VMs to their secondaries and back, see
|
|
|
|
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only restart](#instance-only-restarts)
|
|
|
|
procedure.
|
|
|
|
|
|
|
|
## Open vSwitch
|
|
|
|
|
|
|
|
This is generally the `openvswitch-switch` and `openvswitch-common`
|
|
|
|
services, which are blocked from upgrades because of [bug 34185](https://bugs.torproject.org/34185).
|
|
|
|
|
|
|
|
To upgrade manually, empty the server, restart OVS, then migrate the
|
|
|
|
machines back. It's actually easier to just treat this as a "[reboot
|
|
|
|
the nodes only](howto/ganeti#node-only-reboot)" procedure, see the [Ganeti reboot procedures](howto/ganeti#rebooting)
|
|
|
|
instead.
|
|
|
|
|
|
|
|
Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
|
|
|
|
Debian is marked as fixed, but will still need to be tested on our
|
|
|
|
side first. Update: it hasn't been fixed.
|
|
|
|
|
|
|
|
## Grub
|
|
|
|
|
|
|
|
`grub-pc` ([bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as well, so
|
|
|
|
it is blocked. To upgrade, make sure the install device is defined, by
|
|
|
|
running `dpkg-reconfigure grub-pc`. This issue might actually have
|
|
|
|
been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
|
|
|
|
|
|
|
|
## user@ services
|
|
|
|
|
|
Services set up with the new systemd-based startup system documented in
|
|
Services set up with the new systemd-based startup system documented in
|
|
[doc/services](doc/services) can be restarted with:
|
|
[doc/services](doc/services) may not automatically restart. They may be
|
|
|
|
(manually) restarted with:
|
|
|
|
|
|
systemctl restart user@1504.service
|
|
systemctl restart user@1504.service
|
|
|
|
|
|
There's a feature request ([bug #843778](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843778)) to implement support for
|
|
There's a feature request ([bug #843778](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843778)) to implement support for
|
|
those services directly in needrestart.
|
|
those services directly in needrestart.
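The number in `user@1504.service` is the numeric UID of the account that owns the services. As a rough sketch, the unit name for a given account could be derived with `id -u` instead of looking the UID up by hand (the `user_unit` helper and the `someuser` account name are hypothetical, used for illustration only):

```shell
# Hypothetical helper: print the systemd user-manager unit name
# ("user@<uid>.service") for the account named in the first argument.
# Returns non-zero if the account does not exist.
user_unit() {
    uid=$(id -u "$1") || return 1
    printf 'user@%s.service\n' "$uid"
}

# Possible usage, for an assumed account called "someuser":
#   systemctl restart "$(user_unit someuser)"
```

Restarting the `user@` unit restarts every service running under that account, so it may be worth warning the service admin first.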
|
|
|
|
|
|
## Kernel upgrades and reboots
|
|
# Reboots
|
|
|
|
|
|
Sometimes it is necessary to perform a reboot on the hosts, when the
|
|
Sometimes it is necessary to perform a reboot on the hosts, when the
|
|
kernel is updated. Nagios will warn about this, with something like
|
|
kernel is updated. Nagios will warn about this, with something like
|
... | @@ -164,7 +189,10 @@ this: |
... | @@ -164,7 +189,10 @@ this: |
|
|
|
|
|
WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
|
|
WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
|
|
|
|
|
|
### Rebooting guests
|
|
TODO: the above is the old way, the needrestart check has a different
|
|
|
|
output. document it above.
|
|
|
|
|
|
|
|
## Rebooting a single host
|
|
|
|
|
|
If this is only a virtual machine, and the only one affected, it can
|
|
If this is only a virtual machine, and the only one affected, it can
|
|
be rebooted directly. This can be done with the `tsa-misc` script
|
|
be rebooted directly. This can be done with the `tsa-misc` script
|
... | @@ -186,6 +214,29 @@ entered at boot time, either through the initramfs (if it has the |
... | @@ -186,6 +214,29 @@ entered at boot time, either through the initramfs (if it has the |
|
the case for the `mandos-01` server itself, for example, as it
|
|
the case for the `mandos-01` server itself, for example, as it
|
|
currently can't unlock itself, naturally.
|
|
currently can't unlock itself, naturally.
|
|
|
|
|
|
|
|
## Batch rebooting multiple hosts
|
|
|
|
|
|
|
|
IMPORTANT: before following this procedure, make sure that only a
|
|
|
|
subset of the hosts need a restart. If *all* hosts need a reboot, it's
|
|
|
|
likely going to be faster and easier to reboot entire clusters at
|
|
|
|
once, see the [Ganeti reboot procedures](howto/ganeti#reboot) instead.
|
|
|
|
|
|
|
|
LDAP hosts have information about how they can be rebooted, in the
|
|
|
|
`rebootPolicy` field. Here is what the various values mean:
|
|
|
|
|
|
|
|
* `justdoit` - can be rebooted any time, with a 10 minute delay,
|
|
|
|
possibly in parallel
|
|
|
|
* `rotation` - part of a cluster where each machine needs to be
|
|
|
|
rebooted one at a time, with a 30 minute delay for DNS to update
|
|
|
|
* `manual` - needs to be done by hand or with a special tool (fabric
|
|
|
|
in case of ganeti, reboot-host in the case of KVM, nothing for
|
|
|
|
Windows boxes)
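The policy values listed above can be summarized as a small dispatch function. This is only an illustrative sketch of how the three values differ: the `reboot_plan` function is hypothetical and is not part of the existing reboot tooling.

```shell
# Hypothetical helper: describe how a host may be rebooted, given the
# value of its rebootPolicy LDAP field as the first argument.
# Returns non-zero for values not documented on this page.
reboot_plan() {
    case "$1" in
        justdoit)
            echo "any time, 10 minute delay, possibly in parallel" ;;
        rotation)
            echo "one host at a time, 30 minute delay for DNS to update" ;;
        manual)
            echo "by hand or with a special tool" ;;
        *)
            echo "unknown rebootPolicy: $1" >&2
            return 1 ;;
    esac
}

reboot_plan rotation
```

Treating unknown values as an error rather than falling back to `justdoit` is a deliberate choice in this sketch, so that a typo in LDAP never leads to an unplanned reboot.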
|
|
|
|
|
|
|
|
Therefore, it's possible to selectively reboot some of those hosts in
|
|
|
|
batches. Again, this is pretty rare: typically, you would either
|
|
|
|
reboot only a single host or *all* hosts, in which case a cluster-wide
|
|
|
|
reboot (with Ganeti, below) would be more appropriate.
|
|
|
|
|
|
This routine should be able to reboot all hosts with a `rebootPolicy`
|
|
This routine should be able to reboot all hosts with a `rebootPolicy`
|
|
defined to `justdoit` or `rotation`:
|
|
defined to `justdoit` or `rotation`:
|
|
|
|
|
... | @@ -197,29 +248,17 @@ defined to `justdoit` or `rotation`: |
... | @@ -197,29 +248,17 @@ defined to `justdoit` or `rotation`: |
|
|
|
|
|
## Rebooting KVM hosts
|
|
## Rebooting KVM hosts
|
|
|
|
|
|
What remains is the "manual" procedure, for the KVM hosts:
|
|
What remains is the "manual" procedure, which includes one last KVM host:
|
|
|
|
|
|
./reboot-host moly.torproject.org
|
|
./reboot-host moly.torproject.org
|
|
|
|
|
|
|
|
... and Ganeti nodes, below.
|
|
|
|
|
|
## Rebooting Ganeti nodes
|
|
## Rebooting Ganeti nodes
|
|
|
|
|
|
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
|
|
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this procedure.
|
|
procedure.
|
|
|
|
|
|
|
|
## Remaining nodes
|
|
## Remaining nodes
|
|
|
|
|
|
The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
|
|
The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
|
|
might have been missed by the above procedure.
|
might have been missed by the above procedure.
|
|
|
|
|
### Generic upgrade routines
|
|
|
|
|
|
|
|
LDAP hosts have information about how they can be rebooted, in the
|
|
|
|
`rebootPolicy` field. Here is what the various values mean:
|
|
|
|
|
|
|
|
* `justdoit` - can be rebooted any time, with a 10 minute delay,
|
|
|
|
possibly in parallel
|
|
|
|
* `rotation` - part of a cluster where each machine needs to be
|
|
|
|
rebooted one at a time, with a 30 minute delay for DNS to update
|
|
|
|
* `manual` - needs to be done by hand or with a special tool (fabric
|
|
|
|
in case of ganeti, reboot-host in the case of KVM, nothing for
|
|
|
|
Windows boxes)
|
|