automate installs

Trac:
Child Ticket(s): #32901 (moved), #32902 (moved), #33387 (moved), #32914 (moved), #33332 (moved), #33143 (moved), #32283 (moved)

added component::internal services/tor sysadmin team owner::anarcat priority::low severity::normal status::assigned tpa-roadmap-november type::enhancement labels

right now the "installers" are shell scripts and snippets in tsa-misc. there's a monolithic tor-install-hetzner script that has been used to install virtual machines, and other scripts that are "chunks" of things that can be done on new servers (partitioning, LDAP entry, luks setup).

the process is documented in new-machine.

possible tools to research further:

cobbler - takes care of PXE and boot, delegates to kickstart the autoinstall, more relevant to RPM-based distros
terraform - config management for the cloud kind of thing, supports Hetzner Cloud, but not Hetzner Robot or Ganeti
FAI - built by a debian developer, used to build live images since buster, might require complex setup (e.g. an NFS server), setup-storage(8) might be reusable on its own
list of debian setup tools, see also AutomatedInstallation
himblock has some interesting post-install configure bits in Python, along with pyparted bridges
livewrapper is also one of those installers, in a way

Unfortunately, I ruled out the official debian-installer because of the complexity of the preseeding system and partman.

Update: that list is now maintained in https://help.torproject.org/tsa/howto/new-machine/#Alternatives_considered

i want to tackle this. i think we're pretty close with the ganeti stuff and the half-assed installer I wrote, but i would maybe like to write a spec on how to phase out and replace, or improve, the latter. maybe our installer could be formally released as a standalone thing, if only to get feedback from the community and provoke some discussion and maybe something better. right now, Debian is still working on the debian-installer distribution (for servers) and calamares (for desktop), neither of which is a good fit for our environment.

as far as VMs are concerned, the non-ganeti installers should be progressively phased out as we migrate everything into ganeti cluster(s), so that is probably a non-issue. there was a bug with the ganeti installer (#31781 (moved)) but that should (eventually) be fixed upstream or in puppet.

Trac:
Status: new to assigned
Owner: tpa to anarcat

link to the auto upgrade and questionnaire bits.

Trac:
Description: right now, installing machines is mostly a manual, or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.

some of it is done by hand, some is done in puppet.

we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.

this ticket aims at documenting what we already have and where we could possibly go.

to

right now, installing machines is mostly a manual, or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.

some of it is done by hand, some is done in puppet.

we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.

this ticket aims at documenting what we already have and where we could possibly go. this is one of the questions we answered "no" on in the "ops questionnaire" in #30881 (moved). see also the automated upgrade part in #31957 (moved).

i had a nice chat with Thomas Lange who confirmed a few things about FAI:

it requires a server (fai-server to be more precise)
it needs control over the boot environment (custom ISO or PXE + NFS)
it does not use the debian-installer, instead the base system is installed through tar files which have the same content as a debootstrap call
preseeding works by running dpkg-reconfigure on the packages part of the tar file
custom FAI-enabled boot images are available from https://fai-project.org/FAIme/ but you can also create your own
setup-storage can be used without an installer

i created a "discussion" section in the new machine wiki page where i copied the alternatives listed earlier here and added a few. documentation on those tools should be done over there from here on.

in #32902 (moved), hiro and I played with draw.io to draw diagrams of what the current install process looks like. it was a fun exercise, and showed a few interesting things:

too much duplication between the two disk formatters, which should be resolved
duplication between the disk formatters and luks-setup
inconsistencies between sites: hrobot writes authorized-keys in /root/.ssh, hcloud in /etc/ssh/userkeys/, one uses grml-debootstrap, the other debootstrap

I'm leaning towards scrapping the current install process and converging towards a simpler process that would be basically:

pick IP address, hostname and other static parameters
create metal/cloud upstream
get a console (ssh, web console, whatever)
use setup-storage to partition the disk, based on well-defined templates
mount everything
run debootstrap
setup network, including hostname (maybe reusing gnt-network stuff?)
populate LDAP
bootstrap Puppet in the chroot
reboot
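The steps above, from partitioning onwards, can be sketched as an ordered dry-run plan. This is only an illustration, not the actual installer: the `Step` and `build_plan` names, the `/target` mount point, the mirror URL and the example disk config path are all assumptions, and only the `setup-storage -X -f <config>` and `debootstrap <suite> <target> <mirror>` invocations are meant to mirror real command lines. Steps without a command are the ones that still need site-specific scripting.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    argv: tuple  # empty tuple: no command defined in this sketch


def build_plan(disk_config, suite, target="/target",
               mirror="https://deb.debian.org/debian"):
    """Return the install steps in the order proposed above.

    Hypothetical helper: target and mirror defaults are assumptions.
    setup-storage also expects some FAI environment (e.g. a disk list),
    which is left out here.
    """
    return [
        Step("partition disks", ("setup-storage", "-X", "-f", disk_config)),
        Step("mount everything", ()),
        Step("run debootstrap", ("debootstrap", suite, target, mirror)),
        Step("setup network and hostname", ()),
        Step("populate LDAP", ()),
        Step("bootstrap Puppet in the chroot", ()),
        Step("reboot", ("reboot",)),
    ]


def render(plan):
    """Format the plan as numbered lines, without running anything."""
    lines = []
    for number, step in enumerate(plan, 1):
        cmd = shlex.join(step.argv) if step.argv else "(not scripted in this sketch)"
        lines.append(f"{number}. {step.name}: {cmd}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_plan("disk-config/example", "buster"))))
```

Keeping the plan as data makes it easy to review the ordering (for example, moving the Puppet bootstrap earlier) before anything touches a disk.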

Every remaining manual step can then be done in Puppet, as it runs before the first boot. The following steps are currently done by hand during the install, but Puppet already manages them, so automating them is just a matter of ordering:

SSH daemon and keys configuration
automated upgrades (part of the larger #31957 (moved))
/etc/hosts management?

Those would need some coding work in Puppet:

root password management (trocla? abandon?)
swapfile (move to setup-storage?)
kernel and grub setup?
mdadm.conf, fstab and crypttab config (setup-storage?)
dropbear-initramfs setup
mandos setup
net.ifnames=0

Those steps would stay manual until they are configured in Puppet.

So the next step seems to be to experiment with changing the order of the install process to bootstrap Puppet earlier and see what happens. We should also experiment with a different partitioning tool, probably setup-storage.

TL;DR: next steps:

test setup-storage
bootstrap Puppet earlier

I agree that the current install process has too many manual bits and needs to be improved. I'd like to get to a point where we have as much as possible in puppet and as few scripts as possible to bootstrap the system. The idea to use ansible up to the point where puppet kicks in is great in this sense imo.

The idea to use ansible up to the point where puppet kicks in is great in this sense imo.

I'd be open to this idea. But before I would start messing around with Ansible, I'd do things by hand and refactor things around Puppet. I'm not familiar enough with Ansible to be confident I would go anywhere. :p

One problem I feel is inherent to Ansible is that it has its own bootstrap problems. We first need to set up SSH to get it working, and that means fiddling around with networking and SSH configuration by hand. But maybe that would be easier than bootstrapping the entire host (partitioning, networking and debootstrap) by hand?

Or is there an easier way to bootstrap ansible? Could we git clone an ansible playbook on new hosts and run it directly from there?
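One possible answer to the question above is `ansible-pull`, which clones a repository on the host and runs a playbook from it locally. The sketch below only builds the command line; the function name, repository URL and playbook name are placeholders, and it assumes Ansible is already installed on the new host, which is its own bootstrap problem.

```python
def ansible_pull_command(repo_url, playbook="local.yml", checkout=None):
    """Build an ansible-pull command line (hypothetical helper).

    ansible-pull clones repo_url on the local host and runs the given
    playbook against localhost, so no inbound SSH setup is required.
    """
    argv = ["ansible-pull", "--url", repo_url]
    if checkout:
        # optional branch, tag or commit to check out
        argv += ["--checkout", checkout]
    argv.append(playbook)
    return argv


if __name__ == "__main__":
    # placeholder repository URL, for illustration only
    print(" ".join(ansible_pull_command("https://example.org/install-playbooks.git")))
```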

one thing to consider is that if we're ready to go the pure-systemd way, we can totally get rid of /etc/fstab and rely on the magics of systemd for boot.

https://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/ https://wiki.archlinux.org/index.php/Systemd#GPT_partition_automounting https://wiki.archlinux.org/index.php/Swap#Activation_by_systemd

this way we just have to partition and format the disks 'just so', mount and debootstrap and everything follows.

#32937 (moved) has seen a fairly successful install using setup-storage that would remove the need for custom shell scripts in favor of reusable, fairly readable config files.

i've also reshuffled the new-machine-hetzner-robot docs in that direction, but the scripts still need to be removed and the docs updated accordingly.

the installer/tor-install-format-disks-nvme+hdds script was rewritten to use setup-storage. the docs don't really need an update since they just tell the operator to look around for the script.

once we have converted the other partitioner, however, we might want to change the rest of the install procedure to assume we have used setup-storage and source /tmp/fai/disk_var.sh to get the BOOT_DEVICE, which we currently prompt for.
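A minimal sketch of reading BOOT_DEVICE from that file instead of prompting, assuming the file is a plain shell fragment of variable assignments and that a POSIX `sh` is available. The `read_fai_variable` helper is hypothetical; it lets the shell source the file so the exact assignment syntax does not have to be parsed in Python.

```python
import os
import re
import subprocess


def read_fai_variable(name, path="/tmp/fai/disk_var.sh"):
    """Source a setup-storage variable file with sh and return one variable.

    Returns None when the variable is unset or empty.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a valid shell variable name: {name!r}")
    script = f'. "$1" && printf %s "${{{name}}}"'
    result = subprocess.run(
        ["sh", "-c", script, "sh", path],
        check=True,
        capture_output=True,
        text=True,
        # minimal environment so inherited variables cannot leak into the result
        env={"PATH": os.environ.get("PATH", "/usr/bin:/bin")},
    )
    return result.stdout or None


if __name__ == "__main__":
    print(read_fai_variable("BOOT_DEVICE"))
```

Since the file is executed by the shell, this should only be pointed at files the installer itself generated.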

i did just that and ditched the formatting script, which is now just a legacy wrapper.

Trac:
Keywords: N/A deleted, tpa-roadmap-february added

one part that was missing in our documentation is the firewall setup. we had network allow blocks covering all hosts configured by hand in tor-puppet/modules/ferm/templates/defs.conf.erb. instead of updating the install docs, I just fixed this and shoved it in puppet, in #33143 (moved).

removed two more steps: the /etc/aliases junk (#32283 (moved)) and the portmap/etc package removal (also done in puppet).

Trac:
Priority: Medium to Low

Document how many steps we had when we drew the diagrams:

When we started this work, the installer had this many manual steps:

new-machine (common trunk): 14 steps

new-machine-hetzner-robot: +43 steps (57 total)

new-machine-hetzner-cloud: +21 steps (35 total)

Now we're at:

new-machine (common trunk): 13 steps (3 steps possibly obsolete, 4 more being worked on)
new-machine-hetzner-robot: +25 steps left (38 total)
new-machine-hetzner-cloud: +21 steps (35 total, unchanged, needs to merge with setup-storage process)

i.e. we have eliminated a whopping 19 steps, most of which through the setup-storage refactoring.
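The 19 figure is the difference between the hetzner-robot totals, common trunk included. As a quick check of that arithmetic, using only the counts listed above:

```python
# Manual step counts copied from the lists above (hetzner-robot path only).
before = {"common": 14, "robot_extra": 43}
after = {"common": 13, "robot_extra": 25}


def robot_total(counts):
    """Total manual steps for a hetzner-robot install, common trunk included."""
    return counts["common"] + counts["robot_extra"]


saved = robot_total(before) - robot_total(after)
print(robot_total(before), robot_total(after), saved)  # prints: 57 38 19
```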

Trac:
Description: right now, installing machines is mostly a manual, or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.

some of it is done by hand, some is done in puppet.

we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.

this ticket aims at documenting what we already have and where we could possibly go. this is one of the questions we answered "no" on in the "ops questionnaire" in #30881 (moved). see also the automated upgrade part in #31957 (moved).

to

right now, installing machines is mostly a manual, or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.

some of it is done by hand, some is done in puppet.

we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.

this ticket aims at documenting what we already have and where we could possibly go. this is one of the questions we answered "no" on in the "ops questionnaire" in #30881 (moved). see also the automated upgrade part in #31957 (moved).

When we started this work, the installer had this many manual steps:

new-machine (common trunk): 14 steps
new-machine-hetzner-robot: +43 steps (57 total)
new-machine-hetzner-cloud: +21 steps (35 total)

while setting up the fsn-node-04 server, i got the checklist from 17 to 12 steps, with 5 of those being only safety checks! we're under way to have this being a single "deploy git repo and run this one command" installer :)

removed another 4 steps from the common trunk, we're now at 9 steps there, which are fairly streamlined and can't be trimmed further without changing the design (i.e. we need orchestration).

we're now at this state:

new-machine (common trunk): 9 steps
new-machine-hetzner-robot: +12 steps (21 total), many of which can be merged into hooks next time
new-machine-hetzner-cloud: unchanged

one thing that might be interesting is to look at stuff the grml people are doing in production. this here is a grml-debootstrap wrapper that does a bunch of interesting things:

https://github.com/sipwise/deployment-iso/blob/1b1e54b822b8af6b6c691993eae9d6589ed8b483/templates/scripts/includes/deployment.sh#L2175

namely:

EFI support
grub configuration (e.g. net.ifnames=0)
multiple disks support (reported upstream as bug 152)
mmdebstrap instead of debootstrap (simply export DEBOOTSTRAP=mmdebstrap!)
third-party repo configuration
etckeeper configuration
/etc/hosts configuration
a partitioning shell script
a reset /etc/debootstrap/packages (just like us)
automated grml-debootstrap run (echo y | grml-debootstrap??)
an elaborate puppet bootstrap

Trac:
Keywords: tpa-roadmap-february deleted, tpa-roadmap-april added

today, i did a new-machine-hetzner-robot process almost entirely automatically, using fabric, with the following command:

./install -H root@88.99.194.57 --fingerprint 0d:4a:c0:85:c4:e1:fe:03:15:e0:99:fe:7d:cc:34:f7 --verbose hetzner-robot fsn-node-05.torproject.org installer/disk-config/gnt-fsn-NVMe installer/packages installer/post-scripts/

the fingerprint was the ed25519 one provided by hetzner email.

this is a major step in the automation work because we reviewed the way Fabric handles remote hosts SSH keys (it doesn't, ouch), and worked around the problems found. we especially were able to add the --fingerprint argument fairly easily once I understood the internal mechanics of Paramiko (which wasn't quite obvious).
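The fingerprint passed above has the shape of the classic MD5 colon-separated form (16 hex pairs). As a rough sketch of the kind of check a --fingerprint option can perform, not the actual installer code: the helper names below are hypothetical, and `key_blob` stands for the raw public key bytes the server presents during the SSH handshake. Paramiko's `PKey.get_fingerprint()` returns the same MD5 digest as raw bytes, which could be formatted the same way.

```python
import hashlib


def md5_fingerprint(key_blob: bytes) -> str:
    """Format the MD5 digest of a public key blob as colon-separated hex."""
    digest = hashlib.md5(key_blob).digest()
    return ":".join(f"{byte:02x}" for byte in digest)


def fingerprint_matches(key_blob: bytes, expected: str) -> bool:
    """Compare the presented host key against a fingerprint received out of band."""
    return md5_fingerprint(key_blob) == expected.strip().lower()


if __name__ == "__main__":
    blob = b"placeholder key bytes, not a real host key"
    print(md5_fingerprint(blob))
```

The point of the check is to refuse the connection when the presented key does not match the fingerprint the provider sent, instead of trusting the key on first use.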

the next step of this process is to finish converting the common trunk, new-machine, into fabric, so that (e.g.) puppet procedures are fully automated.

but i believe this can wait until the next server. doing this install took about a day because of the automation work, so we shouldn't burn too much work credit on that...

we might have a problem with automated installs using debootstrap, as it sets up usrmerge by default, which seems to cause significant problems:

https://wiki.debian.org/Teams/Dpkg/MergedUsr

we might want to switch to mmdebstrap for performance if not reliability anyways.

this will at least need research and testing to confirm this is a problem.
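As a starting point for that testing, here is a small sketch that reports whether an installed tree uses the merged-/usr layout, where /bin, /sbin and /lib are symlinks into /usr. The `is_merged_usr` helper is hypothetical, and it assumes the symlinks are relative (e.g. `bin -> usr/bin`) so they resolve correctly when inspecting a chroot from outside.

```python
import os


def is_merged_usr(root="/"):
    """Return True if bin, sbin and lib under root are symlinks into usr/."""
    for name in ("bin", "sbin", "lib"):
        top = os.path.join(root, name)
        merged = os.path.join(root, "usr", name)
        if not os.path.islink(top):
            return False
        # absolute symlinks inside a chroot would resolve against the host
        # root instead, and are reported as "not merged" by this sketch
        if os.path.realpath(top) != os.path.realpath(merged):
            return False
    return True


if __name__ == "__main__":
    print(is_merged_usr("/"))
```

Running it against a freshly debootstrapped target and against existing hosts would show whether the two layouts actually differ in our fleet.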

i filed #34115 (moved) to followup on usrmerge.

running out of time to do more automation, so pushing back 6 months.

Trac:
Keywords: tpa-roadmap-april deleted, tpa-roadmap-november added

mentioned in issue #31957 (moved)

mentioned in issue #32283 (moved)

mentioned in issue #32914 (moved)

mentioned in issue #32901 (moved)

mentioned in issue #32902 (moved)

mentioned in issue #33143 (moved)

mentioned in issue #33332 (moved)

mentioned in issue #33387 (moved)
