right now, installing machines is mostly a manual or semi-manual process: we install debian, preferably with crypto, and then do stuff on top.
some of it is done by hand, some is done in puppet.
we should have a standardized install process that gives us a reproducible, identical install across platforms. then Puppet is what customizes the machine on top of that.
this ticket aims at documenting what we already have and where we could possibly go. this is one of the questions we answered "no" to in the "ops questionnaire" in #30881 (moved). see also the automated upgrade part in #31957 (moved).
When we started this work, the installer had this many manual steps:
new-machine (common trunk): 14 steps
new-machine-hetzner-robot: +43 steps (57 total)
new-machine-hetzner-cloud: +21 steps (35 total)
right now the "installers" are shell scripts and snippets in tsa-misc. there's a monolithic tor-install-hetzner script that has been used to install virtual machines, and other scripts that are "chunks" of things that can be done on new servers (partitioning, LDAP entry, luks setup).
cobbler - takes care of PXE and boot, delegates to kickstart the autoinstall, more relevant to RPM-based distros
terraform - config management for the cloud kind of thing, supports Hetzner Cloud, but not Hetzner Robot or Ganeti
FAI - built by a debian developer, used to build live images since buster, might require complex setup (e.g. an NFS server), setup-storage(8) might be reusable on its own
i want to tackle this. i think we're pretty close with the ganeti stuff and the half-assed installer I wrote, but i would like to write a spec on how to phase out and replace, or improve, the latter. maybe our installer could be formally released as a standalone thing, if only to get feedback from the community, provoke some discussion, and maybe end up with something better. right now, Debian is still working on the debian-installer distribution (for servers) and calamares (for desktop), neither of which is a good fit for our environment.
as far as VMs are concerned, the non-ganeti installers should be progressively phased out as we migrate everything into ganeti cluster(s), so that is probably a non-issue. there was a bug with the ganeti installer (#31781 (moved)) but that should (eventually) be fixed upstream or in puppet.
Trac: Status: new to assigned Owner: tpa to anarcat
i created a "discussion" section in the new machine wiki page where i copied the alternatives listed earlier here and added a few. documentation on those tools should be done over there from here on.
in #32902 (moved), hiro and I played with draw.io to draw diagrams of what the current install process looks like. it was a fun exercise, and showed a few interesting things:
too much duplication between the two disk formatters, which should be resolved
duplication between the disk formatters and luks-setup
inconsistencies between sites: hrobot writes authorized-keys in /root/.ssh, hcloud in /etc/ssh/userkeys/, one uses grml-debootstrap, the other debootstrap
I'm leaning towards scrapping the current install process and converging towards a simpler process that would be basically:
pick IP address, hostname and other static parameters
create metal/cloud upstream
get a console (ssh, web console, whatever)
use setup-storage to partition the disk, based on well-defined templates
mount everything
run debootstrap
setup network, including hostname (maybe reusing gnt-network stuff?)
populate LDAP
bootstrap Puppet in the chroot
reboot
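the part of that list that runs from the console could be sketched as an ordered plan, roughly like this. this is only an illustration, not the actual installer: the `Host` fields, the `install_plan` and `render` names, and the default suite and target path are all made up for the example. the only command lines assumed here are `setup-storage -X -f <config>`, `debootstrap <suite> <target>` and `puppet agent --test`. steps without a known command are left as manual placeholders.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Host:
    """Static parameters picked in the first step (hypothetical fields)."""
    hostname: str
    address: str
    disk_template: str      # path to a setup-storage config file
    suite: str = "buster"   # assumed Debian suite
    target: str = "/target"  # assumed mount point of the new system


Step = Tuple[str, Optional[List[str]]]


def install_plan(host: Host) -> List[Step]:
    """Return the ordered steps; a command of None means 'still manual'."""
    return [
        # -X makes setup-storage actually write to disk, -f names the
        # config file; the disk list is assumed to come from the environment
        ("partition disks", ["setup-storage", "-X", "-f", host.disk_template]),
        ("mount everything", None),
        ("run debootstrap", ["debootstrap", host.suite, host.target]),
        ("set up network and hostname", None),
        ("populate LDAP", None),
        # assumes the puppet package was already installed in the chroot
        ("bootstrap Puppet in the chroot",
         ["chroot", host.target, "puppet", "agent", "--test"]),
        ("reboot", ["reboot"]),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as a numbered checklist, without running anything."""
    lines = []
    for number, (description, command) in enumerate(plan, start=1):
        shown = shlex.join(command) if command else "(manual)"
        lines.append(f"{number}. {description}: {shown}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = Host("example-01", "192.0.2.10", "/tmp/disk_config")
    print(render(install_plan(example)))
```

keeping the plan as data and only printing it means the ordering can be reviewed (and counted) before anything touches a disk.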
Every remaining manual step can then be done in Puppet, as it runs before the first boot. Those steps, currently done manually, are already done by Puppet so automating this is just a matter of ordering:
mdadm.conf, fstab and crypttab config (setup-storage?)
dropbear-initramfs setup
mandos setup
net.ifnames=0
Those steps would stay manual until they are configured in Puppet.
So the next step seems to be to experiment with changing the order of the install process to bootstrap Puppet earlier and see what happens. We should also experiment with a different partitioning tool, probably setup-storage.
I agree that the current install process has too many manual bits and needs to be improved. I'd like to get to a point where we have as much as possible in puppet and as few scripts as possible to bootstrap the system. The idea to use ansible up to the point where puppet kicks in is great in this sense imo.
The idea to use ansible up to the point where puppet kicks in is great in this sense imo.
I'd be open to this idea. But before I would start messing around with Ansible, I'd do things by hand and refactor things around Puppet. I'm not familiar enough with Ansible to be confident I would go anywhere. :p
One problem I feel is inherent to Ansible is that it has its own bootstrap problems. We first need to set up SSH to get it working, and that means fiddling around with networking and SSH configuration by hand. But maybe that would be easier than bootstrapping the entire host (partitioning, networking and debootstrap) by hand?
Or is there an easier way to bootstrap ansible? Could we git clone an ansible playbook on new hosts and run it directly from there?
one thing to consider is that if we're ready to go the pure-systemd way, we can totally get rid of /etc/fstab and rely on the magic of systemd for boot.
#32937 (moved) has seen a fairly successful install using setup-storage that would remove the need for custom shell scripts in favor of reusable, fairly readable config files.
i've also reshuffled the new-machine-hetzner-robot docs in that direction, but the scripts still need to be removed and the docs updated accordingly.
the installer/tor-install-format-disks-nvme+hdds script was rewritten to use setup-storage. the docs don't really need an update since they just tell the operator to look around for the script.
once we have converted the other partitioner, however, we might want to change the rest of the install procedure to assume we have used setup-storage and source /tmp/fai/disk_var.sh to get the BOOT_DEVICE, which we currently prompt for.
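for a non-shell installer, one way to get at that variable is to let a shell source the file and print the value back. a minimal sketch, assuming the file is a plain shell fragment that sets BOOT_DEVICE; the `read_disk_var` helper name is made up, and the environment is cleaned so a stray BOOT_DEVICE in the caller's environment can't leak in:

```python
import os
import subprocess


def read_disk_var(path, name="BOOT_DEVICE"):
    """Source a shell fragment and return the value of one variable.

    Returns an empty string if the variable is not set by the file.
    """
    if not name.isidentifier():
        raise ValueError(f"not a valid shell variable name: {name!r}")
    # "$1" is the file to source; the variable name was validated above,
    # so it is safe to interpolate it into the script text.
    script = '. "$1" && printf "%s" "${' + name + '}"'
    # minimal environment: only PATH survives
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    result = subprocess.run(
        ["sh", "-c", script, "sh", os.path.abspath(path)],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    return result.stdout


if __name__ == "__main__":
    # path taken from the comment above; only exists after setup-storage ran
    print(read_disk_var("/tmp/fai/disk_var.sh"))
```

like `source` in a shell script, this executes whatever is in the file, so it should only be pointed at files the installer generated itself.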
one part that was missing in our documentation is the firewall setup. we had network allow blocks covering all hosts configured by hand in tor-puppet/modules/ferm/templates/defs.conf.erb. instead of updating the install docs, I just fixed this and shoved it in puppet, in #33143 (moved).
Document how many steps we had when we drew the diagrams:
When we started this work, the installer had this many manual steps:
new-machine (common trunk): 14 steps
new-machine-hetzner-robot: +43 steps (57 total)
new-machine-hetzner-cloud: +21 steps (35 total)
Now we're at:
new-machine (common trunk): 13 steps (3 steps possibly obsolete, 4 more being worked on)
new-machine-hetzner-robot: +25 steps left (38 total)
new-machine-hetzner-cloud: +21 steps (35 total, unchanged, needs to merge with setup-storage process)
i.e. we have eliminated a whopping 19 steps on the hetzner-robot path, most of them through the setup-storage refactoring.
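for the record, the 19 comes from the hetzner-robot path only. the arithmetic, using just the counts listed above (variable names are made up):

```python
# step counts for the hetzner-robot path, copied from the lists above
TRUNK_BEFORE, ROBOT_BEFORE = 14, 43
TRUNK_NOW, ROBOT_NOW = 13, 25

total_before = TRUNK_BEFORE + ROBOT_BEFORE  # 57
total_now = TRUNK_NOW + ROBOT_NOW           # 38
eliminated = total_before - total_now       # 19

print(f"before: {total_before}, now: {total_now}, eliminated: {eliminated}")
```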
while setting up the fsn-node-04 server, i got the checklist from 17 to 12 steps, with 5 of those being only safety checks! we're on our way to having this be a single "deploy git repo and run this one command" installer :)
removed another 4 steps from the common trunk, we're now at 9 steps there, which are fairly streamlined and can't be trimmed further without changing the design (ie. we need orchestration).
we're now at this state:
new-machine (common trunk): 9 steps
new-machine-hetzner-robot: +12 steps (21 total), many of which can be merged into hooks next time
one thing that might be interesting is to look at stuff the grml people are doing in production. this here is a grml-debootstrap wrapper that does a bunch of interesting things:
the fingerprint was the ed25519 one provided by hetzner email.
this is a major step in the automation work because we reviewed the way Fabric handles remote hosts' SSH keys (it doesn't, ouch), and worked around the problems found. in particular, we were able to add the --fingerprint argument fairly easily once I understood the internal mechanics of Paramiko (which weren't quite obvious).
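to give an idea of what checking a fingerprint involves, here is a stdlib-only sketch. it is not the code behind --fingerprint (which isn't shown here), and it doesn't use Fabric or Paramiko at all. it only computes the two usual OpenSSH fingerprint forms of a public key line and compares them with an expected value, such as the one from the provider's email. the function names are made up, and which form the email uses is an assumption, so both are accepted.

```python
import base64
import hashlib


def fingerprints(public_key_line):
    """Return the (SHA256, MD5) fingerprints of an OpenSSH public key line.

    A public key line looks like "ssh-ed25519 <base64 blob> [comment]".
    """
    blob = base64.b64decode(public_key_line.split()[1])
    # OpenSSH prints SHA256 fingerprints as unpadded base64
    digest = hashlib.sha256(blob).digest()
    sha256 = "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
    # ... and MD5 fingerprints as colon-separated hex pairs
    md5_hex = hashlib.md5(blob).hexdigest()
    pairs = (md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))
    md5 = "MD5:" + ":".join(pairs)
    return sha256, md5


def matches(public_key_line, expected):
    """True if `expected` is one of the key's fingerprints.

    The MD5 form is accepted with or without its "MD5:" prefix.
    """
    sha256, md5 = fingerprints(public_key_line)
    return expected.strip() in {sha256, md5, md5[len("MD5:"):]}
```

a caller would refuse to connect (or to add the key to known_hosts) when `matches()` returns False, instead of trusting the key on first use.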
the next step of this process is to finish converting the common trunk, new-machine, into fabric, so that (e.g.) puppet procedures are fully automated.
but i believe this can wait until the next server. doing this install took about a day because of the automation, so we shouldn't burn too much work credit on that...