[Fabric][] is a Python module, built on top of [Invoke][], that could be described as "make for sysadmins". It allows us to establish "best practices" for routine tasks like:

* installing a server (TODO)
* retiring a server ([howto/retire-a-host](howto/retire-a-host))
* migrating machines ([howto/ganeti](howto/ganeti#Importing-external-instances))
* retiring a user (TODO)
* reboots ([howto/upgrades](howto/upgrades#Kernel-upgrades-and-reboots))
* ... etc

Fabric makes easy things reproducible and hard things possible. It is *not* designed to handle larger-scale configuration management, for which we use [howto/puppet](howto/puppet).

[Invoke]: https://www.pyinvoke.org/
[Fabric]: https://www.fabfile.org/

[[_TOC_]]

# Tutorial

All of the instructions below assume you have a copy of the TPA fabric library. Fetch it with:

    git clone git@git-rw.torproject.org:admin/tsa-misc.git && cd tsa-misc

## Running a command on hosts

Fabric can be used from the commandline to run arbitrary commands on servers, like this:

    fab -H hostname.example.com -- COMMAND

For example:

    $ fab -H perdulce.torproject.org -- uptime
     17:53:22 up 24 days, 19:34, 1 user, load average: 0.00, 0.00, 0.07

This is equivalent to:

    ssh hostname.example.com COMMAND

... except that you can run it on *multiple* servers:

    $ fab -H perdulce.torproject.org,chives.torproject.org -- uptime
     17:54:48 up 24 days, 19:36, 1 user, load average: 0.00, 0.00, 0.06
     17:54:52 up 24 days, 17:35, 21 users, load average: 0.00, 0.00, 0.00

## Listing tasks and self-documentation

The `tsa-misc` repository has a good library of tasks that can be run from the commandline. To show the list, use:

    fab -l

Help for individual tasks can also be inspected with `--help` (`-h` for short), for example:

    $ fab -h host.fetch-ssh-host-pubkey
    Usage: fab [--core-opts] host.fetch-ssh-host-pubkey [--options] [other tasks here ...]
    Docstring:
      fetch public host key from server

    Options:
      -t STRING, --type=STRING

The name of the server to run the command against is implicit in the usage: it must be passed with the `-H` (short for `--hosts`) argument. For example:

    $ fab -H perdulce.torproject.org host.fetch-ssh-host-pubkey
    b'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGOnZX95ZQ0mliL0++Enm4oXMdf1caZrGEgMjw5Ykuwp root@perdulce\n'

# How-to

## A simple Fabric function

Each procedure mentioned in the introduction above has its own documentation. This tutorial aims more to show how to make a simple Fabric program inside TPA. Here we will create an `uptime` task which will simply run the `uptime` command on the provided hosts. It's a trivial example that shouldn't be implemented (it is easier to just tell `fab` to run the shell command) but it should give you an idea of how to write new tasks.

1. edit the source

       $EDITOR fabric_tpa/host.py

   we pick the "generic" host library (`host.py`) here, but there are other libraries that might be more appropriate, for example `ganeti`, `libvirt` or `reboot`. Fabric-specific extensions, monkeypatching and other hacks should live in `__init__.py`.

2. add a task, which is simply a Python function:

       @task
       def uptime(con):
           return con.run('uptime')

   The `@task` line is a [decorator](https://docs.python.org/3/glossary.html#term-decorator) which tells Fabric that the function should be exposed as a command-line task. The function then gets passed a `Connection` object, which we can use to run commands on the host. In this case, we run the `uptime` command over SSH.

3. the task will automatically be loaded as it is part of the `host` module, but if this is a new module, add it to `fabfile.py` in the parent directory

4. the task should now be available:

       $ fab -H perdulce.torproject.org host.uptime
        18:06:56 up 24 days, 19:48, 1 user, load average: 0.00, 0.00, 0.02

## Pager playbook

N/A for now. Fabric is an ad-hoc tool and, as such, doesn't have monitoring that should trigger a response.
It *could* however be used for some oncall work, which remains to be determined.

## Disaster recovery

N/A.

# Reference

## Installation

Fabric is available as a Debian package:

    apt install fabric

See also the [upstream instructions](https://www.fabfile.org/installing.html) for other platforms (e.g. Pip).

Fabric code grew out of the installer and reboot scripts in the `tsa-misc` repository. To get access to the code, simply clone the repository and run from the top level directory:

    git clone git@git-rw.torproject.org:admin/tsa-misc.git && cd tsa-misc && fab -l

This code could also be moved to its own repository altogether.

### Installing Fabric on Debian buster

Fabric has been [part of Debian since at least Debian jessie][], but you should install the newer, 2.x version that is only available in bullseye and later. The bullseye version is a "trivial backport" which means it can be installed directly in stable with:

    apt install -t bullseye fabric

[part of Debian since at least Debian jessie]: https://tracker.debian.org/pkg/fabric

This will also pull [invoke](https://tracker.debian.org/pkg/python-invoke) (from unstable) and [paramiko](https://tracker.debian.org/pkg/paramiko) (from stable). The latter will show a *lot* of warnings when running by default, however, so you might want to upgrade to backports as well:

    apt install -t buster-backports python3-paramiko

## SLA

N/A

## Design

TPA's fabric library lives in the `tsa-misc` repository and consists of multiple Python modules, at the time of writing:

    anarcat@curie:tsa-misc(master)$ wc -l fabric_tpa/*.py
      463 fabric_tpa/ganeti.py
      297 fabric_tpa/host.py
       46 fabric_tpa/__init__.py
      262 fabric_tpa/libvirt.py
      224 fabric_tpa/reboot.py
      125 fabric_tpa/retire.py
     1417 total

Each module encompasses Fabric tasks that can be called from the commandline `fab` tool or Python functions, both of which can be reused in other modules as well.
There are also wrapper scripts for certain jobs that are a poor fit for the `fab` tool, especially `reboot` which requires particular host scheduling.

The fabric functions currently only communicate with the rest of the infrastructure through SSH. It is assumed the operator will have direct `root` access on all the affected servers. Server lists are provided by the operator but should eventually be extracted from PuppetDB or LDAP. It's also possible scripts will eventually edit existing (but local) git repositories.

Most of the TPA-specific code was written and is maintained by `anarcat`. The Fabric project itself is headed by [Jeff Forcier AKA bitprophet](http://bitprophet.org/). It is, obviously, a much smaller community than Ansible but still active. There is a mailing list, IRC channel, and GitHub issues for upstream support (see [contact](https://www.fabfile.org/contact.html)) along with commercial support through [Tidelift](https://tidelift.com/subscription/pkg/pypi-fabric).

There are no formal releases of the code for now.

These are the main jobs being automated by fabric:

* [automate installs][]
* [automate reboots][]
* [automate retirement][]

[automate installs]: https://bugs.torproject.org/31239
[automate reboots]: https://bugs.torproject.org/33406
[automate retirement]: https://bugs.torproject.org/33477

## Issues

There is no issue tracker specifically for this project. [File][] or [search][] for issues in the [team issue tracker][search].

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Monitoring and testing

There is no monitoring of this service, as it's not running continuously.

Fabric tasks *should* implement some form of unit testing. Ideally, we would have 100% test coverage.

We use [pytest](https://www.pytest.org/) to write unit tests.
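
As an illustration of what such a test could look like, here is a minimal sketch for the `uptime` task from the how-to above. It is not taken from the `tsa-misc` test suite: the `uptime` function is redefined locally (without the `@task` decorator) so the snippet stands alone, and the connection is replaced by a `unittest.mock.MagicMock`, so no SSH connection is made. The helper name, test name and canned output are made up for the example.

```python
from unittest.mock import MagicMock


def uptime(con):
    """Run uptime on the remote host (same body as the how-to example)."""
    return con.run('uptime')


def make_fake_connection(stdout):
    """Build a stand-in for a Fabric Connection whose run() returns canned output."""
    con = MagicMock()
    con.run.return_value = MagicMock(stdout=stdout)
    return con


def test_uptime():
    """pytest collects this by its test_ prefix; it also runs as a plain function."""
    con = make_fake_connection('18:06:56 up 24 days, load average: 0.00, 0.00, 0.02\n')
    result = uptime(con)
    # the task must issue exactly one remote command and hand back its result
    con.run.assert_called_once_with('uptime')
    assert 'load average' in result.stdout
```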
To run the test suite, use:

    pytest-3 fabric_tpa

# Discussion

## Problem overview

There are multiple tasks in TPA that require manual copy-pasting of code from documentation to the shell or, worse, grepping backwards in history to find the magic command (e.g. `ldapvi`). A lot of those jobs are error-prone and hard to do correctly.

In case of the installer, this leads to significant variation and chaos in the installs, which results in instability and inconsistencies between servers. It was determined that the installs would be automated as part of [ticket 31239](https://bugs.torproject.org/31239) and that analysis and work is being done in [howto/new-machine](howto/new-machine).

It was later realised that *other* areas were suffering from a similar problem. The upgrade process, for example, had mostly been manual until ad hoc shell scripts were written. But unfortunately now we have *many* shell scripts, none of which work correctly. So work started on automating reboots as part of [ticket 33406](https://bugs.torproject.org/33406).

And then it was time to migrate the second libvirt server to [howto/ganeti](howto/ganeti) (unifolium/kvm2, [ticket 33085](https://bugs.torproject.org/33085)) and by then it was clear some more generic solution was required. An [attempt](https://gitlab.com/anarcat/tpa-ansible-libvirt-ganeti-importer) to implement this work in Ansible only led to frustration at the complexity of the task, so tests were started on Fabric instead, with positive results. A few weeks later, a library of functions was available and the migration procedure was almost entirely automated.

## LDAP notes

LDAP integration might be something we could consider, because it's a large part of the automation that's required in a lot of our work.
One alternative is to talk with `ldapvi` or commandline tools, the other is to implement some things natively in Python:

* [Python LDAP][] could be used to automate talking with ud-ldap, see the [Python LDAP functions][], in particular [add][] and [delete][]
* The above docs are very limited, and they [suggest][] external resources also:
  * https://hub.packtpub.com/python-ldap-applications-extra-ldap-operations-and-ldap-url-library/
  * https://hub.packtpub.com/configuring-and-securing-python-ldap-applications-part-2/
  * https://www.linuxjournal.com/article/6988

[Python LDAP]: https://www.python-ldap.org/
[Python LDAP functions]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#functions
[delete]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#ldap.LDAPObject.delete
[add]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#ldap.LDAPObject.add
[suggest]: https://www.python-ldap.org/en/python-ldap-3.2.0/resources.html

## Goals

### Must have

* ease of use - it should be easy to write new tasks and to understand existing ones
* operation on multiple servers - many of the tricky tasks we need to do operate on multiple servers *synchronously*, something that, for example, is hard to do in Puppet

### Nice to have

* long term maintenance - this should not be Legacy Code and must be unit tested, at least for parts that are designed to stay in the long term (e.g. not the libvirt importer)

### Non-Goals

* sharing with the community - it is assumed that those are tasks too site-specific to be reused by other groups, although the code is still shared publicly. Shared code belongs in Puppet.
* performance - this does not need to be high performance, as those tasks are done rarely

## Approvals required

TPA. Approved in [/meeting/2020-03-09/](/meeting/2020-03-09/).

## Proposed Solution

We are testing Fabric.
Fabric was picked over Ansible mostly because it allowed more flexibility in processing data from remote hosts. The YAML templating language of Ansible was seen as too limiting and difficult to use for the particular things we needed to do (such as host migration).

Furthermore, we did not want to introduce another configuration management system. Using Ansible could have led to a parallel configuration management interface "creeping in" next to Puppet. The intention of this deployment is to have the absolute minimal amount of code needed to do things Puppet cannot do, not to replace it.

## Cost

Time and labor.

## Alternatives considered

### ansible

* [Ansible](https://www.ansible.com/) makes easy things easy and scalable, but makes it hard to do hard stuff
* for example, how would you do a disk inventory and pass it to another host to recreate those disks? for someone ignorant of Ansible like me, it's far from trivial. it probably implies something like this [dictionary type][] but in Fabric, it's:

      json.loads(con.run('qemu-img info --output=json %s' % disk_path).stdout)

  Any person somewhat familiar with Python can tell what this does.
* we use Puppet for high-level configuration management, and Ansible conflicts with that problem space, leading to higher cognitive load

[dictionary type]: https://docs.ansible.com/ansible/latest/plugins/lookup/dict.html

### mcollective

* [MCollective][] was (it's deprecated) a tool that could be used to fire jobs on Puppet nodes from the Puppet master
* Not relevant for our use case because we want to bootstrap Puppet (in which case Puppet is not available yet) or retire Puppet (in which case it will go away).

[MCollective]: https://puppet.com/docs/mcollective/

### bolt

* [Bolt][] is interesting because it *can* be used to [bootstrap Puppet][]
* Unfortunately, it does *not* reuse the Puppet primitives and instead Bolt "tasks" are just arbitrary commands, usually shell commands (e.g.
[this task][]) along with a [copious][] [amount][] of JSON metadata
* does *not* have much privileged access to PuppetDB or the Puppet CA infrastructure, that needs to be [bolted on][] by hand

[amount]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.json
[copious]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install.json
[this task]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.sh
[Bolt]: https://puppet.com/docs/bolt/
[bootstrap Puppet]: https://github.com/puppetlabs/puppetlabs-puppet_agent#puppet_agentinstall
[bolted on]: https://puppet.com/docs/bolt/latest/bolt_connect_puppetdb.html

### Doing things by hand

* timing is sometimes critical
* sets best practices in code instead of in documentation
* makes recipes easily reusable

### Another custom Python script

* is it `subprocess.check_output`? or `check_call`? or `run`? what if you want both the output and the status code? can you remember?
* argument parsing code built-in, self-documenting code
* exposes Python functions as commandline jobs

### Shell scripts

* hard to reuse
* hard to read, audit
* missing a lot of basic programming primitives (hashes, objects, etc)
* no unit testing out of the box

### Perl

* notoriously hard to read

### mitogen

This is a late-comer to the "alternatives considered" section: I actually found out about the [mitogen](https://mitogen.networkgenomics.com/) project after the choice of Fabric was made and a significant amount of code was written for it (about 2000 [SLOC](https://en.wikipedia.org/wiki/Source_lines_of_code)).

A major problem with Fabric, I discovered, is that it *only* allows executing **commands** on remote servers. That is, it's a glorified shell script. Yes, it allows things like SFTP file transfers, but that's about it: it's [not possible to directly execute Python code on the remote node](https://github.com/fabric/fabric/issues/237).
This limitation makes it hard to implement more complex business logic on the remote server. It also makes error control in Fabric less intuitive as normal Python code reflexes (like exception handling) cannot be used. Exception handling, in Fabric, is particularly tricky (see for example [issue 2061](https://github.com/fabric/fabric/issues/2061)), but generally, exceptions don't work well inside Fabric.

Basically, I wish I had found out about mitogen before I wrote all this code. It would make code like the LDAP connector much easier to write (as it could run directly on the LDAP server, bypassing the firewall issues). A rewrite of the post-install grml-debootstrap hooks would also be easier to implement than right now.

Considering there isn't that much code written, it's still possible to switch to Mitogen. The major downside of mitogen is that it doesn't have a commandline interface: it's "only" a Python library and everything needs to be written on top of that. In fact, it seems like Mitogen is primarily written as an Ansible backend, so it is possible that non-Ansible use cases might be less supported.

The "makefile" (`fabfile`, really) approach is also not supported at all by mitogen. So all the nice "self-documentation" and "automatic usage" goodness brought to us by the Fabric decorator would need to be rebuilt by hand. There are existing dispatchers (say like [click](https://click.palletsprojects.com/en/7.x/) or [fire](https://github.com/google/python-fire)) which could be used to work around that.

And obviously, the dispatcher (say: run this command on all those hosts) is not directly usable from the commandline, out of the box. But it seems like a minor annoyance considering we're generally rewriting that on top of Fabric right now because of [serious](https://github.com/fabric/fabric/issues/2071) [limitations](https://github.com/fabric/fabric/issues/2069) in the current scheduler.
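
To give an idea of how little code the "run this on all those hosts" part of such a dispatcher needs, here is a minimal sketch using only the Python standard library. It is neither the TPA scheduler nor an API from Fabric or mitogen: `dispatch` and `fake_uptime` are hypothetical names, and the per-host job is any callable, which in practice would wrap a Fabric connection or a mitogen call. Failures are collected per host instead of aborting the whole run, so ordinary Python exception handling applies.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(job, hosts, max_workers=4):
    """Run job(host) on every host in parallel.

    Returns two dicts keyed by hostname: return values and raised exceptions.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit everything first so the jobs actually run concurrently
        futures = {host: pool.submit(job, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not stop the others
                failures[host] = exc
    return results, failures


def fake_uptime(host):
    """Hypothetical stand-in for a remote command, so the sketch runs offline."""
    if host.startswith('down'):
        raise ConnectionError('cannot reach %s' % host)
    return 'up 24 days'


if __name__ == '__main__':
    ok, failed = dispatch(fake_uptime, ['a.example.com', 'down.example.com'])
    print(ok)
    print(failed)
```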
Finally, mitogen seems to be better maintained than fabric. At the time of writing:

| Stat         | Mitogen    | Fabric     |
|--------------|------------|------------|
| Last commit  | 2020-07-30 | 2020-01-20 |
| Last release | 2019-11-02 | 2019-08-06 |
| Open issues  | 119        | 385        |
| Open PRs     | 12         | 47         |
| Contributors | 18         | 10         |

Those numbers are based on the current GitHub statistics.

Another comparison is the [openhub dashboard](https://www.openhub.net/p/_compare?project_0=Fabric&project_1=mitogen&project_2=pyinvoke) comparing Fabric, Mitogen and pyinvoke (the Fabric backend). It should be noted that:

* all three projects have "decreasing" activity
* the code size is in a similar range: when added together, Fabric and invoke are about 26k SLOC, while mitogen is 36k SLOC. but this does show that mitogen is more complex than Fabric
* there has been more activity in mitogen in the past 12 months
* but more contributors in Fabric (pyinvoke, specifically) over time

The Fabric author also posted a [request for help](http://bitprophet.org/blog/2020/07/02/help-wanted/) with his projects, which doesn't bode well for the project in the long term. A few people offered help, but so far no major change has happened in the issue queue (lots of duplicates and trivial PRs remain open).

On the other hand, the Mitogen author seems to have moved on to other things. His last commit to the project was [over a year ago](https://github.com/dw/mitogen/commit/91f74a04acbc0ebeae939132bfdef0b6b3817e97), shortly after [announcing](https://sweetness.hmmz.org/2019-10-28-operon.html) a "private-source" (GPL, but no public code release) rewrite of the Ansible engine, called [Operon](https://networkgenomics.com/operon/). So it's [unclear what the fate of mitogen will be](https://github.com/dw/mitogen/issues/751).