in legacy/trac#31957 (moved) we have worked on automating upgrades, but that's only part of the problem. we also need to reboot in some situations.
we have various mechanisms to do so right now:
tsa-misc/reboot-host - reboot script for kvm boxes, kind of a mess, to be removed when we finish the kvm-ganeti migration
tsa-misc/reboot-guest - reboot a single host. kind of a hack, but useful to reboot a single machine
misc/multi-tool/torproject-reboot-simple - iterate over all hosts with rebootPolicy=justdoit in LDAP and reboot them with torproject-reboot-many
misc/multi-tool/torproject-reboot-rotation - iterate over all hosts with rebootPolicy=rotation in LDAP and reboot them with torproject-reboot-many, with a 30 minute delay between each host
ganeti-reboot-cluster - a tool to reboot the ganeti cluster
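For reference, the core of the rotation behavior can be sketched in a few lines of Python (hypothetical helper names, not the actual multi-tool code):

```python
import time

def reboot_in_rotation(hosts, reboot_fn, sleep_fn=time.sleep, delay=30 * 60):
    """Reboot hosts one at a time, waiting `delay` seconds between each,
    which is roughly what torproject-reboot-rotation does with the
    hosts found under rebootPolicy=rotation in LDAP. `reboot_fn` stands
    in for the actual per-host reboot (torproject-reboot-many)."""
    for i, host in enumerate(hosts):
        if i > 0:
            sleep_fn(delay)  # 30 minute gap so rotated services stay up
        reboot_fn(host)
```

torproject-reboot-simple is the same idea with no delay between hosts.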
There are various problems with all this:
the torproject-reboot-* scripts do not take care of rebootPolicy=manual hosts (replaced with fabric)
the ganeti-reboot-cluster script has been known to fail if a cluster is unbalanced (the fabric script performs better)
the ganeti-reboot-cluster script currently fails when hosts talk to each other over IPv6 somehow (see legacy/trac#33412 (moved); not witnessed in the fabric script)
we have 5 different ways of performing reboots, we should have just one script that does it all (fixed in fabric)
reboot-{host,guest} do not check if hosts need reboot before rebooting (but the multi-tool does; fixed in fabric)
In short, this is kind of a mess, and we should refactor this. We should consider using needrestart, which knows when individual hosts need a reboot.
I also added a feature request to the needrestart puppet module to expose its knowledge as a puppet fact, so we can use that information from PuppetDB instead of SSH'ing in each host and calling the dsa-* tools.
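If that fact existed, finding the hosts that need a reboot could become a single PuppetDB query instead of an SSH loop. A hedged sketch of what building such a query could look like (the fact name `reboot_required` and the server URL are assumptions, not the real names):

```python
import json
import urllib.parse

def puppetdb_reboot_query(server='http://puppetdb.example.org:8080'):
    """Build the URL for a PuppetDB v4 facts query selecting hosts where
    a hypothetical needrestart-provided fact says a reboot is pending."""
    query = ['and',
             ['=', 'name', 'reboot_required'],  # assumed fact name
             ['=', 'value', True]]
    params = urllib.parse.urlencode({'query': json.dumps(query)})
    return '%s/pdb/query/v4/facts?%s' % (server, params)
```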
note that this may very well mean just removing tsa-misc/reboot-host and tsa-misc/reboot-guest, and documenting the multi-tool better. :)
i just tried ./torproject-reboot-rotation and ./torproject-reboot-simple and the unattended operation isn't great... it fires up all those reboots, and doesn't show clearly what it did. for example, it seems to have queued reboots on a bunch of hosts, but it doesn't say which.
after further inspection (with cumin '*' 'screen -ls | grep reboot-job'), i have found it has scheduled reboots on:
static-master-fsn.torproject.org
cdn-backend-sunet-01.torproject.org
web-fsn-01.torproject.org
onionoo-frontend-01.torproject.org
orestis.torproject.org
nutans.torproject.org
chives.torproject.org
onionbalance-01.torproject.org
listera.torproject.org
peninsulare.torproject.org
Most of those are okay and should return unattended. But in some cases, those could have been covered by a libvirt reboot (i had performed those before, in this case, so none were). Eventually though, that point is moot because we'll all be running under ganeti and will separate host and guest reboot procedures.
one host is problematic in there (chives) as it needs a specific warning to users. maybe chives should be taken out of "justdoit" rotation...
i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
then i don't know which hosts are left to do manually, but i guess that, with time, nagios will let us know. it would be nice to have a scenario for those as well.
> i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
That's exactly what I had in mind. The trick is whether individual hosts should connect to IRC to issue those notifications (?!) or whether the calling script should. Either way, we'd need some sort of notification bot, which has been kind of a pain in the arse before in my experience. But maybe we could leverage KGB for this?
It's one of the reasons I'm thinking of rebuilding this system in the first place as well...
just for future reference, ganeti-reboot-cluster, as we have in our puppet repo, doesn't work in our cluster, because it relies on assumptions specific to the DSA clusters (namely that the last node is an empty spare). so it fails with:
fsn-node-03.torproject.org not empty.
apparently, the latest version of the script might fix that with the crossmigratemany function:
migrate all the primaries off of the node: ssh $master gnt-migrate -f $node
if it's a master, promote another master: ssh $notmaster gnt-cluster master-failover (optional, only if we can't afford having the master down during the reboot)
the new fabric script is mostly a test to see how Fabric works and is not intended to be a replacement for all the tools just yet.
but i find the results promising: it's much nicer to work with that stuff in python: errors are (mostly) well defined and it's easy to modularize things. for example, i originally wrote the thing to migrate fsn-node-01 (and that worked) but then i could extend it to also reboot an arbitrary node (and i rebooted gayi).
it handles ganeti nodes, but not libvirt nodes. it therefore replaces the following:
tsa-misc/reboot-guest
ganeti-reboot-cluster
it could also replace the following, provided that (a) a host list is somewhat generated out of band and (b) the operator stays online long enough for the job to complete:
misc/multi-tool/torproject-reboot-simple
misc/multi-tool/torproject-reboot-rotation - with an explicit 30-minute delay
The remaining script (tsa-misc/reboot-host) has been marked as deprecated, and will be removed once we get rid of the last KVM/libvirt server (legacy/trac#33084 (moved)).
So the remaining work here is to extend the reboot script to do an automatic inventory of the hosts requiring a reboot and to schedule them according to policy. We should also make sure the ganeti reboot handlers schedule a rebalance of the cluster afterwards, as ganeti-reboot-cluster currently does. This should be documented in the ganeti and upgrades wiki pages when done.
We also don't check if a reboot is required at all right now, and we should do so. All those "TODO" items are documented in the tsa-misc source code listed above.
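A minimal version of that reboot-needed check could look like this (a simplified sketch: the real inventory would use the dsa-* tools or needrestart, and Debian version comparison really wants dpkg --compare-versions):

```python
import os

def reboot_needed(running_kernel, newest_installed,
                  flag='/var/run/reboot-required'):
    """Rough check whether a host needs a reboot: either Debian's
    reboot-required flag file exists, or the running kernel release
    (uname -r) differs from the newest installed kernel package."""
    return os.path.exists(flag) or running_kernel != newest_installed
```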
> > i also wonder, in general, if we should warn users about those reboots, as part of the reboot script.
>
> This idea might not at all be worth the hassle of implementing it, but your "rebooting x", "x is back" lines from #tor-project irc seem eminently automatable.
This is getting closer to reality now. There's a KGB bot living on chives now (but just use the kgb-bot.torproject.org alias instead) that can be used for such notifications. It's not hooked into fabric just yet, but that's the next step. With the configuration from /etc/kgb-client-tpa.conf, one can do:
kgb-client --conf kgb-client-tpa.conf --relay-msg test
... and that will say "test" in #tor-project and #tor-bots. This is obviously configurable, but the next step here is to find the best way to hook this into Fabric.
I'm tempted to just shell out locally and do exactly the above to send notifications, as opposed to implementing a full KGB client in Python (!). But then again, "it's just JSON-RPC with some authentication mechanism", and we'd just use the "relay_message" bit.
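The shell-out approach could be as small as this (a sketch; it assumes kgb-client is installed and configured as above, and the `run` parameter is injectable only so the sketch can be exercised without IRC):

```python
import subprocess

def irc_notify(message, conf='/etc/kgb-client-tpa.conf', run=subprocess.run):
    """Send a one-line notification through the KGB bot by shelling out
    to kgb-client, exactly like the manual invocation above."""
    cmd = ['kgb-client', '--conf', conf, '--relay-msg', message]
    run(cmd, check=True)
    return cmd
```

The reboot script would then call something like irc_notify('rebooting chives') before the shutdown and irc_notify('chives is back') after the host answers again.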
i did more work on the reboot procedures today, and rebooted the ganeti cluster using the reboot command. there were some issues with the initrd interfering with the wait_for_boot (now called wait_for_ping) checks so I did some refactoring, but i'm still confused about the exception that's raised by Fabric in this case.
the exception I got here is:
All instances migrated successfully.
Shutdown scheduled for Thu 2020-04-02 18:30:55 UTC, use 'shutdown -c' to cancel.
waiting 0 minutes for reboot to happen
waiting up to 30 seconds for host to go down
waiting 300 seconds for host to go up
host fsn-node-01.torproject.org should be back online, checking uptime
Traceback (most recent call last):
  File "./reboot", line 132, in <module>
    logging.getLogger(mod).setLevel('WARNING')
  File "./reboot", line 116, in main
    delay_up=args.delay_up,
  File "/usr/lib/python3/dist-packages/invoke/tasks.py", line 127, in __call__
    result = self.body(*args, **kwargs)
  File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/reboot.py", line 197, in shutdown_and_wait
    res = con.run('uptime', watchers=[responder], pty=True, warn=True)
  File "<decorator-gen-3>", line 2, in run
  File "/usr/lib/python3/dist-packages/fabric/connection.py", line 29, in opens
    self.open()
  File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/__init__.py", line 106, in safe_open
    Connection.open_orig(self)
  File "/usr/lib/python3/dist-packages/fabric/connection.py", line 634, in open
    self.client.connect(**kwargs)
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python3/dist-packages/paramiko/util.py", line 280, in retry_on_signal
    return function()
  File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out
maybe the exception gets generated above our code, in the fabric task handler itself, in which case it might mean we shouldn't use a @task for this at all, at least in our code.
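Whatever the layer, wrapping the reconnection in a bounded retry loop would avoid dying on the first refused connection while the host is still booting. A sketch (hypothetical helper, not the actual fabric_tpa code):

```python
import time

def retry_connect(connect, attempts=10, delay=30, sleep=time.sleep):
    """Call `connect` until it stops raising a connection error, up to
    `attempts` times, sleeping `delay` seconds between tries. Re-raises
    the last error if the host never comes back."""
    for i in range(attempts):
        try:
            return connect()
        except OSError:  # TimeoutError is a subclass of OSError
            if i == attempts - 1:
                raise
            sleep(delay)
```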
i fixed the timeout error, and did today's round of upgrades without too many problems. one issue that came up is that ganeti wasn't happy to chain-reboot machines: some instances had to have activate-disks run so they would recognize their secondary. that has been added as a TODO in the code.
i also made some experiments with feeding LDAP hosts lists as an argument to the reboot command which also worked well. this, for example, rebooted the rotation hosts with a 10-minute delay:
Quick note here to add that it would be good if the tsa-misc/reboot script would trigger a failover for instances that are down instead of trying to migrate them, because the gnt-node migrate command fails if it encounters an instance that's shut down, e.g.:
Can't migrate, please use failover: Instance woronowii.torproject.org is not running
Usually when rebooting a node for upgrades, all nodes in the cluster are rebooted one after the other, because all of them require the same upgrades.
In this context, tsa-misc/reboot sometimes hits a problem when gnt-node migrate encounters an instance with a DRBD volume whose synchronisation is still ongoing from a previous node reboot. This can be determined by looking at /proc/drbd: the volume status will look like ds:SyncSource/SyncTarget and oos: (out-of-sync blocks) will be non-zero.
It would be great if the reboot script would wait until all DRBD volumes are UpToDate before running gnt-node migrate.
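That wait could be implemented by polling /proc/drbd before migrating; a sketch of the check (hypothetical helper, using the fields described above):

```python
import re

def drbd_in_sync(proc_drbd_text):
    """Return True when every DRBD volume in a /proc/drbd dump is fully
    synchronised: all disk states UpToDate and no out-of-sync blocks.
    The reboot script could poll this before running gnt-node migrate."""
    for states in re.findall(r'ds:(\S+)', proc_drbd_text):
        if any(s != 'UpToDate' for s in states.split('/')):
            return False
    for oos in re.findall(r'oos:(\d+)', proc_drbd_text):
        if int(oos) != 0:
            return False
    return True
```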
turn ganeti hosts into "rotation" once we officialize this new procedure
This is therefore likely to be completed in May.
Then this got pushed down far, far into the icebox. This definitely needs a refresh, but I should just note that I just opened #40380 to make sure that Nagios checks that mandos is up and running. This won't directly help with automation but may make the process easier if someone reboots a machine and that script just hangs there.
I'm also thinking that unattended-upgrades could reboot some boxes unattended provided the following conditions are met:
the box is justdoit or rotation
if the box is rotation, its reboot schedule is different than other rotation hosts (may be hard to do)
if the box is justdoit, a reboot delay is applied
if the box has full disk encryption, mandos is correctly configured (see #40380, and this would require the box to talk with nagios, not great)
In other words, we could save some labour by instructing u-u to directly reboot justdoit hosts as needed, with a delay, if they don't have FDE.
This seems like a low-hanging time-saving-fruit, so to speak...
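The conditions above can be written down as a tiny predicate (a sketch; the parameter names are made up for illustration, not LDAP attributes):

```python
def uu_may_reboot(policy, has_fde, mandos_ok=False, schedule_unique=False):
    """Encode the conditions above: unattended-upgrades may only reboot
    a host on its own when its policy allows it, FDE hosts additionally
    need a working mandos setup, and rotation hosts need a reboot slot
    that does not collide with the other rotation hosts."""
    if policy not in ('justdoit', 'rotation'):
        return False  # manual hosts are never rebooted automatically
    if has_fde and not mandos_ok:
        return False
    if policy == 'rotation' and not schedule_unique:
        return False
    return True
```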
The script doesn't handle backport kernel versions correctly:
./reboot -v --delay-shutdown 1 --delay-hosts 300 -H chi-node-14.torproject.org
checking if host chi-node-14.torproject.org requires a reboot
dpkg: warning: version '1.linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' has bad syntax: invalid character in version number
dpkg: warning: version '1.linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' has bad syntax: invalid character in version number
WARNING: Kernel needs upgrade [linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1) != linux-image-5.10.0-0.bpo.8-amd64]
OK: current ucode 0x5003102 greater or equal to available 0x5003102
rebooting host chi-node-14.torproject.org
checking for ganeti master on host chi-node-14.torproject.org
host chi-node-14.torproject.org is not a ganeti node
Shutdown scheduled for Tue 2021-09-21 14:15:06 UTC, use 'shutdown -c' to cancel.
waiting 1 minutes for reboot to happen, at 2021-09-21 14:15:05.942718+00:00 (now is 2021-09-21 14:14:05.942718+00:00)
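The dpkg warnings suggest the version string carries a "(=...)" suffix from the package relationship that never gets stripped before the comparison. A hedged sketch of the kind of cleanup needed (illustration only, not the actual fix in the script):

```python
import re

def split_kernel_dependency(dep):
    """Split a dependency string like
    'linux-image-5.10.0-0.bpo.8-amd64(=5.10.46-4~bpo10+1)' into the
    package name and the pinned version, so each part can be fed to
    dpkg --compare-versions separately instead of as one garbled token."""
    m = re.match(r'([^(]+)\(=([^)]+)\)$', dep)
    if m:
        return m.group(1), m.group(2)
    return dep, None
```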
so, another 7 months later rotting in the icebox, this ticket has become mostly confusing and irrelevant. i've updated the description to turn the issues there into a checklist, and almost all of the entries in there are directly fixed by the fabric script we now use.
i think the remaining TODOs might be:
let unattended-upgrades reboot hosts on its own
have a single command to do a fleet-wide reboot
that said, the automation we have right now works pretty well. i think the above two are sugar on top and maybe not worth having this very old ticket lying around forever. we could, instead, open a ticket for each of those issues to track them, if we really do feel they are warranted.
i think not, for now. so, closing. i'll update the docs to make sure we have good pointers towards the magic reboot script for various scenarios now instead.