in legacy/trac#31957 (moved) we have worked on automating upgrades, but that's only part of the problem. we also need to reboot in some situations.
we have various mechanisms to do so right now:
tsa-misc/reboot-host - reboot script for kvm boxes, kind of a mess, to be removed when we finish the kvm-ganeti migration
tsa-misc/reboot-guest - reboot a single host. kind of a hack, but useful to reboot a single machine
misc/multi-tool/torproject-reboot-simple - iterate over all hosts with rebootPolicy=justdoit in LDAP and reboot them with torproject-reboot-many
misc/multi-tool/torproject-reboot-rotation - iterate over all hosts with rebootPolicy=rotation in LDAP and reboot them with torproject-reboot-many, with a 30 minute delay between each host
ganeti-reboot-cluster - a tool to reboot the ganeti cluster
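For reference, the core of the rotation behavior can be sketched in a few lines of Python (hypothetical helper names, not the actual multi-tool code):

```python
import time

def reboot_in_rotation(hosts, reboot_fn, sleep_fn=time.sleep, delay=30 * 60):
    """Reboot hosts one at a time, waiting `delay` seconds between each,
    which is roughly what torproject-reboot-rotation does with the
    hosts found under rebootPolicy=rotation in LDAP. `reboot_fn` stands
    in for the actual per-host reboot (torproject-reboot-many)."""
    for i, host in enumerate(hosts):
        if i > 0:
            sleep_fn(delay)  # 30 minute gap so rotated services stay up
        reboot_fn(host)
```

torproject-reboot-simple is the same idea with no delay between hosts.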
There are various problems with all this:
the torproject-reboot-* scripts do not take care of rebootPolicy=manual hosts (replaced with fabric)
the ganeti-reboot-cluster script has been known to fail if a cluster is unbalanced (the fabric script performs better)
the ganeti-reboot-cluster script currently fails when hosts talk to each other over IPv6 somehow (see legacy/trac#33412 (moved); not witnessed in the fabric script)
we have 5 different ways of performing reboots, we should have just one script that does it all (fixed in fabric)
reboot-{host,guest} do not check if hosts need reboot before rebooting (but the multi-tool does; fixed in fabric)
In short, this is kind of a mess, and we should refactor this. We should consider using needrestart, which knows when individual hosts need a reboot.
I also added a feature request to the needrestart puppet module to expose its knowledge as a puppet fact, so we can use that information from PuppetDB instead of SSH'ing in each host and calling the dsa-* tools.
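If that fact existed, finding the hosts that need a reboot could become a single PuppetDB query instead of an SSH loop. A hedged sketch of what building such a query could look like (the fact name `reboot_required` and the server URL are assumptions, not the real names):

```python
import json
import urllib.parse

def puppetdb_reboot_query(server='http://puppetdb.example.org:8080'):
    """Build the URL for a PuppetDB v4 facts query selecting hosts where
    a hypothetical needrestart-provided fact says a reboot is pending."""
    query = ['and',
             ['=', 'name', 'reboot_required'],  # assumed fact name
             ['=', 'value', True]]
    params = urllib.parse.urlencode({'query': json.dumps(query)})
    return '%s/pdb/query/v4/facts?%s' % (server, params)
```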
note that this may very well mean just removing tsa-misc/reboot-host and tsa-misc/reboot-guest, and documenting the multi-tool better. :)
i just tried ./torproject-reboot-rotation and ./torproject-reboot-simple and the unattended operation isn't great... it fires up all those reboots, and doesn't show clearly what it did. for example, it seems to have queued reboots on a bunch of hosts, but it doesn't say which.
after further inspection (with cumin '*' 'screen -ls | grep reboot-job'), i have found it has scheduled reboots on:
static-master-fsn.torproject.org
cdn-backend-sunet-01.torproject.org
web-fsn-01.torproject.org
onionoo-frontend-01.torproject.org
orestis.torproject.org
nutans.torproject.org
chives.torproject.org
onionbalance-01.torproject.org
listera.torproject.org
peninsulare.torproject.org
Most of those are okay and should return unattended. But in some cases, those could have been covered by a libvirt reboot (i had performed those before, in this case, so none were). Eventually though, that point is moot because we'll all be running under ganeti and will separate host and guest reboot procedures.
one host is problematic in there (chives) as it needs a specific warning to users. maybe chives should be taken out of "justdoit" rotation...
i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
then i don't know which hosts are left to do manually, but i guess that, with time, nagios will let us know. it would be nice to have a scenario for those as well.
> i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
That's exactly what I had in mind. The trick is whether individual hosts should connect to IRC to issue those notifications (?!) or whether the calling script should. Either way, we'd need some sort of notification bot, which has been kind of a pain in the arse before in my experience. But maybe we could leverage KGB for this?
It's one of the reasons I'm thinking of rebuilding this system in the first place as well...
just for future reference, ganeti-reboot-cluster, as we have in our puppet repo, doesn't work in our cluster, because it relies on assumptions specific to the DSA clusters (namely that the last node is an empty spare). so it fails with:
fsn-node-03.torproject.org not empty.
apparently, the latest version of the script might fix that with the crossmigratemany function:
migrate all the primaries off of the node: ssh $master gnt-migrate -f $node
if it's a master, promote another master: ssh $notmaster gnt-cluster master-failover (optional, only if we can't afford having the master down during the reboot)
the new fabric script is mostly a test to see how Fabric works and is not intended to be a replacement for all the tools just yet.
but i find the results promising: it's much nicer to work with that stuff in python: errors are (mostly) well defined and it's easy to modularize things. for example, i originally wrote the thing to migrate fsn-node-01 (and that worked) but then i could extend it to also reboot an arbitrary node (and i rebooted gayi).
it handles ganeti nodes, but not libvirt nodes. it therefore replaces the following:
tsa-misc/reboot-guest
ganeti-reboot-cluster
it could also replace the following, provided that (a) a host list is somewhat generated out of band and (b) the operator stays online long enough for the job to complete:
misc/multi-tool/torproject-reboot-simple
misc/multi-tool/torproject-reboot-rotation - with an explicit 30-minute delay
The remaining script (tsa-misc/reboot-host) has been marked as deprecated, and will be removed once we get rid of the last KVM/libvirt server (legacy/trac#33084 (moved)).
So the remaining work here is to extend the reboot script to do an automatic inventory of the hosts requiring a reboot and to schedule them according to policy. We should also make sure the ganeti reboot handlers schedule a rebalance of the cluster afterwards, as ganeti-reboot-cluster currently does. This should be documented in the ganeti and upgrades wiki pages when done.
We also don't check if a reboot is required at all right now, and we should do so. All those "TODO" items are documented in the tsa-misc source code listed above.
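A minimal version of that reboot-needed check could look like this (a simplified sketch: the real inventory would use the dsa-* tools or needrestart, and Debian version comparison really wants dpkg --compare-versions):

```python
import os

def reboot_needed(running_kernel, newest_installed,
                  flag='/var/run/reboot-required'):
    """Rough check whether a host needs a reboot: either Debian's
    reboot-required flag file exists, or the running kernel release
    (uname -r) differs from the newest installed kernel package."""
    return os.path.exists(flag) or running_kernel != newest_installed
```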
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
This is getting closer to reality now. There's a KGB bot living on chives now (but just use the kgb-bot.torproject.org alias instead) that can be used for such notifications. It's not hooked into fabric just yet, but that's the next step. With the configuration from /etc/kgb-client-tpa.conf, one can do:
kgb-client --conf kgb-client-tpa.conf --relay-msg test
... and that will say "test" in #tor-project and #tor-bots. This is obviously configurable, but the next step here is to find the best way to hook this into Fabric.
I'm tempted to just shell out locally and do exactly the above to send notifications, as opposed to implementing a full KGB client in Python (!). But then again, "it's just JSON-RPC with some authentication mechanism", and we'd just use the "relay_message" bit.
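The shell-out approach could be as small as this (a sketch; it assumes kgb-client is installed and configured as above, and the `run` parameter is injectable only so the sketch can be exercised without IRC):

```python
import subprocess

def irc_notify(message, conf='/etc/kgb-client-tpa.conf', run=subprocess.run):
    """Send a one-line notification through the KGB bot by shelling out
    to kgb-client, exactly like the manual invocation above."""
    cmd = ['kgb-client', '--conf', conf, '--relay-msg', message]
    run(cmd, check=True)
    return cmd
```

The reboot script would then call something like irc_notify('rebooting chives') before the shutdown and irc_notify('chives is back') after the host answers again.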
i did more work on the reboot procedures today, and rebooted the ganeti cluster using the reboot command. there were some issues with the initrd interfering with the wait_for_boot (now called wait_for_ping) checks so I did some refactoring, but i'm still confused about the exception that's raised by Fabric in this case.
the exception I got here is:
All instances migrated successfully.
Shutdown scheduled for Thu 2020-04-02 18:30:55 UTC, use 'shutdown -c' to cancel.
waiting 0 minutes for reboot to happen
waiting up to 30 seconds for host to go down
waiting 300 seconds for host to go up
host fsn-node-01.torproject.org should be back online, checking uptime
Traceback (most recent call last):
  File "./reboot", line 132, in <module>
    logging.getLogger(mod).setLevel('WARNING')
  File "./reboot", line 116, in main
    delay_up=args.delay_up,
  File "/usr/lib/python3/dist-packages/invoke/tasks.py", line 127, in __call__
    result = self.body(*args, **kwargs)
  File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/reboot.py", line 197, in shutdown_and_wait
    res = con.run('uptime', watchers=[responder], pty=True, warn=True)
  File "<decorator-gen-3>", line 2, in run
  File "/usr/lib/python3/dist-packages/fabric/connection.py", line 29, in opens
    self.open()
  File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/__init__.py", line 106, in safe_open
    Connection.open_orig(self)
  File "/usr/lib/python3/dist-packages/fabric/connection.py", line 634, in open
    self.client.connect(**kwargs)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python3/dist-packages/paramiko/util.py", line 280, in retry_on_signal
    return function()
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out
maybe the exception gets generated above our code, in the fabric task handler itself, in which case it might mean we shouldn't use a @task for this at all, at least in our code.
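Whatever the layer, wrapping the reconnection in a bounded retry loop would avoid dying on the first refused connection while the host is still booting. A sketch (hypothetical helper, not the actual fabric_tpa code):

```python
import time

def retry_connect(connect, attempts=10, delay=30, sleep=time.sleep):
    """Call `connect` until it stops raising a connection error, up to
    `attempts` times, sleeping `delay` seconds between tries. Re-raises
    the last error if the host never comes back."""
    for i in range(attempts):
        try:
            return connect()
        except OSError:  # TimeoutError is a subclass of OSError
            if i == attempts - 1:
                raise
            sleep(delay)
```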
i fixed the timeout error, and did today's round of upgrades without too many problems. one issue that came up is that ganeti wasn't happy to chain-reboot machines: some instances had to have activate-disks run so they would recognize their secondary. that has been added as a TODO in the code.
i also made some experiments with feeding LDAP hosts lists as an argument to the reboot command which also worked well. this, for example, rebooted the rotation hosts with a 10-minute delay:
Quick note here to add that it would be good if the tsa-misc/reboot script would trigger a failover for instances that are down instead of trying to migrate them, because the gnt-node migrate command fails if it encounters an instance that's shut down, e.g.:
Can't migrate, please use failover: Instance woronowii.torproject.org is not running
Usually when rebooting a node for upgrades, all nodes in the cluster are rebooted one after the other, because all of them require the same upgrades.
In this context, tsa-misc/reboot sometimes hits a problem when gnt-node migrate encounters an instance with a DRBD volume whose synchronisation is still ongoing from a previous node reboot. This can be determined by looking at /proc/drbd: the volume status will look like ds:SyncSource/SyncTarget and oos: (out-of-sync blocks) will be non-zero.
It would be great if the reboot script would wait until all DRBD volumes are UpToDate before running gnt-node migrate.
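That wait could be implemented by polling /proc/drbd before migrating; a sketch of the check (hypothetical helper, using the fields described above):

```python
import re

def drbd_in_sync(proc_drbd_text):
    """Return True when every DRBD volume in a /proc/drbd dump is fully
    synchronised: all disk states UpToDate and no out-of-sync blocks.
    The reboot script could poll this before running gnt-node migrate."""
    for states in re.findall(r'ds:(\S+)', proc_drbd_text):
        if any(s != 'UpToDate' for s in states.split('/')):
            return False
    for oos in re.findall(r'oos:(\d+)', proc_drbd_text):
        if int(oos) != 0:
            return False
    return True
```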
turn ganeti hosts into "rotation" once we officialize this new procedure
This is therefore likely to be completed in May.
Then this got pushed down far, far into the icebox. This definitely needs a refresh, but I should just note that I just opened #40380 to make sure that Nagios checks that mandos is up and running. This won't directly help with automation but may make the process easier if someone reboots a machine and that script just hangs there.
I'm also thinking that unattended-upgrades could reboot some boxes unattended provided the following conditions are met:
the box is justdoit or rotation
if the box is rotation, its reboot schedule is different than other rotation hosts (may be hard to do)
if the box is justdoit, a reboot delay is applied
if the box has full disk encryption, mandos is correctly configured (see #40380, and this would require the box to talk with nagios, not great)
In other words, we could save some labour by instructing u-u to directly reboot justdoit hosts as needed, with a delay, if they don't have FDE.
This seems like a low-hanging time-saving-fruit, so to speak...
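The conditions above can be written down as a tiny predicate (a sketch; the parameter names are made up for illustration, not LDAP attributes):

```python
def uu_may_reboot(policy, has_fde, mandos_ok=False, schedule_unique=False):
    """Encode the conditions above: unattended-upgrades may only reboot
    a host on its own when its policy allows it, FDE hosts additionally
    need a working mandos setup, and rotation hosts need a reboot slot
    that does not collide with the other rotation hosts."""
    if policy not in ('justdoit', 'rotation'):
        return False  # manual hosts are never rebooted automatically
    if has_fde and not mandos_ok:
        return False
    if policy == 'rotation' and not schedule_unique:
        return False
    return True
```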
The script doesn't handle backport kernel versions correctly:
./reboot -v --delay-shutdown 1 --delay-hosts 300 -H chi-node-14.torproject.org
checking if host chi-node-14.torproject.org requires a reboot
dpkg: warning: version '1.linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' has bad syntax: invalid character in version number
dpkg: warning: version '1.linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' has bad syntax: invalid character in version number
WARNING: Kernel needs upgrade [linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1) != linux-image-5.10.0-0.bpo.8-amd64]
OK: current ucode 0x5003102 greater or equal to available 0x5003102
rebooting host chi-node-14.torproject.org
checking for ganeti master on host chi-node-14.torproject.org
host chi-node-14.torproject.org is not a ganeti node
Shutdown scheduled for Tue 2021-09-21 14:15:06 UTC, use 'shutdown -c' to cancel.
waiting 1 minutes for reboot to happen, at 2021-09-21 14:15:05.942718+00:00 (now is 2021-09-21 14:14:05.942718+00:00)
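The dpkg warnings suggest the version string carries a "(=...)" suffix from the package relationship that never gets stripped before the comparison. A hedged sketch of the kind of cleanup needed (illustration only, not the actual fix in the script):

```python
import re

def split_kernel_dependency(dep):
    """Split a dependency string like
    'linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' into the
    package name and the pinned version, so each part can be fed to
    dpkg --compare-versions separately instead of as one garbled token."""
    m = re.match(r'([^(]+)\(=([^)]+)\)$', dep)
    if m:
        return m.group(1), m.group(2)
    return dep, None
```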
so, another 7 months later rotting in the icebox, this ticket has become mostly confusing and irrelevant. i've updated the description to turn the issues there into a checklist, and almost all of the entries in there are directly fixed by the fabric script we now use.
i think the remaining TODOs might be:
let unattended-upgrades reboot hosts on its own
have a single command to do a fleet-wide reboot
that said, the automation we have right now works pretty well. i think the above two are sugar on top and maybe not worth having this very old ticket lying around forever. we could, instead, open a ticket for each of those issues to track them, if we really do feel they are warranted.
i think not, for now. so, closing. i'll update the docs to make sure we have good pointers towards the magic reboot script for various scenarios now instead.