Fabric is a Python module, built on top of Invoke, that could be described as "make for sysadmins". It allows us to establish "best practices" for routine tasks like:
- installing a server (TODO)
- retiring a server (howto/retire-a-host)
- migrating machines (howto/ganeti)
- retiring a user (TODO)
- reboots (howto/reboots)
- ... etc
Fabric makes easy things reproducible and hard things possible. It is not designed to handle larger-scale configuration management, for which we use howto/puppet.
- Tutorial
- How-to
- Reference
- Discussion
Tutorial
All of the instructions below assume you have a copy of the TPA fabric library; fetch it with:
git clone https://gitlab.torproject.org/tpo/tpa/fabric-tasks.git &&
cd fabric-tasks
Don't trust the GitLab server! This should be done only once, in TOFU (Trust On First Use) mode: further uses of the repository should verify OpenPGP signatures or Git hashes from a known source.
Normally, this is done on your laptop, not on the servers. Servers including `profile::fabric` will have the code deployed globally (`/usr/local/lib/fabric-tasks` as of this writing), with the actual `fabric` package (and `fab` binary) available if `manage_package` is `true`. See tpo/tpa/team#41484 for the plans with that (currently progressive) deployment.
Running a command on hosts
Fabric can be used from the commandline to run arbitrary commands on servers, like this:
fab -H hostname.example.com -- COMMAND
For example:
$ fab -H perdulce.torproject.org -- uptime
17:53:22 up 24 days, 19:34, 1 user, load average: 0.00, 0.00, 0.07
This is equivalent to:
ssh hostname.example.com COMMAND
... except that you can run it on multiple servers:
$ fab -H perdulce.torproject.org,chives.torproject.org -- uptime
17:54:48 up 24 days, 19:36, 1 user, load average: 0.00, 0.00, 0.06
17:54:52 up 24 days, 17:35, 21 users, load average: 0.00, 0.00, 0.00
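The same multi-host run can also be done from Python with Fabric's group API. A minimal sketch, using the hostnames from the example above:

```python
from fabric import SerialGroup

# run the same command on several hosts, one after the other
results = SerialGroup(
    "perdulce.torproject.org", "chives.torproject.org"
).run("uptime", hide=True)
for connection, result in results.items():
    print(connection.host, result.stdout.strip())
```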
Listing tasks and self-documentation
The `fabric-tasks` repository has a good library of tasks that can be run from the commandline. To show the list, use:
fab -l
Help for individual tasks can also be inspected with `--help`, for example:
$ fab -h host.fetch-ssh-host-pubkey
Usage: fab [--core-opts] host.fetch-ssh-host-pubkey [--options] [other tasks here ...]
Docstring:
fetch public host key from server
Options:
-t STRING, --type=STRING
The name of the server to run the command against is implicit in the usage: it must be passed with the `-H` (short for `--hosts`) argument. For example:
$ fab -H perdulce.torproject.org host.fetch-ssh-host-pubkey
b'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGOnZX95ZQ0mliL0++Enm4oXMdf1caZrGEgMjw5Ykuwp root@perdulce\n'
How-to
A simple Fabric function
Each procedure mentioned in the introduction above has its own documentation. This tutorial aims more to show how to make a simple Fabric program inside TPA. Here we will create an `uptime` task which will simply run the `uptime` command on the provided hosts. It's a trivial example that shouldn't actually be implemented (it is easier to just tell `fab` to run the shell command) but it should give you an idea of how to write new tasks.
- edit the source:

      $EDITOR fabric_tpa/host.py

  we pick the "generic" host library (`host.py`) here, but there are other libraries that might be more appropriate, for example `ganeti`, `libvirt` or `reboot`. Fabric-specific extensions, monkeypatching and other hacks should live in `__init__.py`.

- add a task, which is simply a Python function:

      @task
      def uptime(con):
          return con.run('uptime')

  The `@task` string is a decorator which indicates to Fabric that the function should be exposed as a command-line task. In that case, it gets passed a Connection object, from which we can run commands. In this case, we run the `uptime` command over SSH.

- the task will automatically be loaded as it is part of the `host` module, but if this is a new module, add it to `fabfile.py` in the parent directory

- the task should now be available:

      $ fab -H perdulce.torproject.org host.uptime
      18:06:56 up 24 days, 19:48, 1 user, load average: 0.00, 0.00, 0.02
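Tasks can also take arguments from the commandline: Fabric (through Invoke) turns keyword parameters into flags. As a hedged sketch (this variant is hypothetical and not in the repository), a boolean parameter becomes a simple switch:

```python
from fabric import task


@task
def uptime(con, pretty=False):
    """run uptime on the host, optionally in "pretty" mode"""
    # the boolean keyword argument becomes a --pretty flag on the commandline
    return con.run('uptime --pretty' if pretty else 'uptime')
```

This would then be called as `fab -H perdulce.torproject.org host.uptime --pretty`.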
Pager playbook
N/A for now. Fabric is an ad-hoc tool and, as such, doesn't have monitoring that should trigger a response. It could however be used for some oncall work, which remains to be determined.
Disaster recovery
N/A.
Reference
Installation
Fabric is available as a Debian package:
apt install fabric
See also the upstream instructions for other platforms (e.g. Pip).
To use TPA's Fabric code, you will most likely also need at least Python LDAP support:
apt install python3-ldap
Fabric code grew out of the installer and reboot scripts in the `fabric-tasks` repository. To get access to the code, simply clone the repository and run from the top-level directory:
git clone https://gitlab.torproject.org/tpo/tpa/fabric-tasks.git &&
cd fabric-tasks &&
fab -l
This code could also be moved to its own repository altogether.
Installing Fabric on Debian buster
Fabric has been part of Debian since at least Debian jessie, but you should install the newer 2.x version that is only available in bullseye and later. The bullseye version is a "trivial backport", which means it can be installed directly in stable with:
apt install fabric/buster-backports
This will also pull invoke (from unstable) and paramiko (from stable). The latter will, however, show a lot of warnings by default when running, so you might want to upgrade it from backports as well:
apt install python3-paramiko/buster-backports
SLA
N/A
Design
TPA's fabric library lives in the `fabric-tasks` repository and consists of multiple Python modules, at the time of writing:
anarcat@curie:fabric-tasks(master)$ wc -l fabric_tpa/*.py
463 fabric_tpa/ganeti.py
297 fabric_tpa/host.py
46 fabric_tpa/__init__.py
262 fabric_tpa/libvirt.py
224 fabric_tpa/reboot.py
125 fabric_tpa/retire.py
1417 total
Each module encompasses Fabric tasks that can be called from the commandline `fab` tool or Python functions, both of which can be reused in other modules as well. There are also wrapper scripts for certain jobs that are a poor fit for the `fab` tool, especially `reboot`, which requires particular host scheduling.
The fabric functions currently only communicate with the rest of the infrastructure through SSH. It is assumed the operator will have direct `root` access on all the affected servers. Server lists are provided by the operator but should eventually be extracted from PuppetDB or LDAP. It's also possible scripts will eventually edit existing (but local) git repositories.
Most of the TPA-specific code was written and is maintained by anarcat. The Fabric project itself is headed by Jeff Forcier AKA bitprophet; it is, obviously, a much smaller community than Ansible's, but still active. There is a mailing list, IRC channel, and GitHub issues for upstream support (see contact) along with commercial support through Tidelift.
There are no formal releases of the code for now.
The main jobs being automated by Fabric are the procedures listed in the introduction above.
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker component.
Monitoring and testing
There is no monitoring of this service, as it's not running continuously.
Fabric tasks should implement some form of unit testing. Ideally, we would have 100% test coverage.
We use pytest to write unit tests. To run the test suite, use:
pytest-3 fabric_tpa
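As a hedged sketch of what such a test can look like, Invoke's `MockContext` lets us exercise a task without touching the network. The `uptime` task imported here is the hypothetical one from the tutorial above, assumed to live in `fabric_tpa/host.py`:

```python
# test_uptime.py -- a minimal sketch, assuming the tutorial's uptime task exists
from invoke import MockContext, Result

from fabric_tpa import host


def test_uptime():
    # MockContext answers con.run("uptime") with a canned Result,
    # so no SSH connection is made
    con = MockContext(run={"uptime": Result("17:53:22 up 24 days, 1 user")})
    result = host.uptime(con)
    assert "up 24 days" in result.stdout
```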
Discussion
Problem overview
There are multiple tasks in TPA that require manual copy-pasting of code from documentation to the shell or, worse, to grep backwards in history to find the magic command (e.g. `ldapvi`). A lot of those jobs are error-prone and hard to do correctly.
In the case of the installer, this leads to significant variation and chaos in the installs, which results in instability and inconsistencies between servers. It was determined that the installs would be automated as part of ticket 31239, and that analysis and work is being done in howto/new-machine.
It was later realised that other areas were suffering from a similar problem. The upgrade process, for example, had mostly been manual until ad-hoc shell scripts were written; unfortunately, we now have many shell scripts, none of which work correctly. So work started on automating reboots as part of ticket 33406.
And then it was time to migrate the second libvirt server to howto/ganeti (unifolium/kvm2, ticket 33085), and by then it was clear some more generic solution was required. An attempt to implement this work in Ansible only led to frustration at the complexity of the task, so tests were started on Fabric instead, with positive results. A few weeks later, a library of functions was available and the migration procedure was almost entirely automated.
LDAP notes
LDAP integration might be something we could consider, because it's a large part of the automation that's required in a lot of our work. One alternative is to talk with `ldapvi` or commandline tools, the other is to implement some things natively in Python:
- Python LDAP could be used to automate talking with ud-ldap; see in particular the Python LDAP functions, notably add and delete
- The above docs are very limited, and they also point to external resources
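As a very rough, hedged sketch of the native approach (the server name, base DN and filter below are assumptions, not a tested recipe), a read-only query with python3-ldap could look like:

```python
import ldap

# anonymous, read-only query against the ud-ldap server; a real task
# would authenticate and handle errors properly
con = ldap.initialize("ldaps://db.torproject.org")
con.simple_bind_s()
for dn, attrs in con.search_s(
    "ou=hosts,dc=torproject,dc=org",
    ldap.SCOPE_SUBTREE,
    "(objectClass=debianServer)",
    ["hostname", "ipHostNumber"],
):
    print(dn, attrs.get("ipHostNumber"))
```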
Goals
Must have
- ease of use - it should be easy to write new tasks and to understand existing ones
- operation on multiple servers - many of the tricky tasks we need to do operate on multiple servers synchronously, something that, for example, is hard to do in Puppet
- lifecycle management in a heterogeneous environment: we need to be able to:
  - provision bare-metal on our leased machines at Cymru, on rented machines at Hetzner, on Hetzner cloud, in Openstack (currently done by hand, with shell scripts, and Fabric)
  - reboot the entire infrastructure, considering mirrors and ganeti clusters (currently done with Fabric)
  - do ad-hoc operations like "where is php-fpm running?" (currently done with Cumin) or "grub exploded, I need to load a rescue and rebuild the boot loader" (currently done by hand) or "I need to resize a filesystem" (currently done by copy-pasting from the wiki)
  - retire machines (currently done by hand and Fabric)
Nice to have
- long term maintenance - this should not be Legacy Code and must be unit tested, at least for parts that are designed to stay in the long term (e.g. not the libvirt importer)
Non-Goals
- sharing with the community - it is assumed that those are tasks too site-specific to be reused by other groups, although the code is still shared publicly; shared code belongs in Puppet
- performance - this does not need to be high performance, as those tasks are done rarely
Approvals required
TPA. Approved in /meeting/2020-03-09/.
Proposed Solution
We are testing Fabric.
Fabric was picked mostly over Ansible because it allowed more flexibility in processing data from remote hosts. The YAML templating language of Ansible was seen as too limiting and difficult to use for the particular things we needed to do (such as host migration).
Furthermore, we did not want to introduce another configuration management system. Using Ansible could have led to a parallel configuration management interface "creeping in" next to Puppet. The intention of this deployment is to have the absolute minimal amount of code needed to do things Puppet cannot do, not to replace it.
One major problem with Fabric is that it creates pretty terrible code: it is basically a glorified Makefile, because we cannot actually run Python code on the remote servers directly. (Well, we could, but we'd first need to upload the code and call it as a shell command, so it is not real IPC.) In that sense, Mitogen is a real eye-opener and game-changer.
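To illustrate that workaround, here is a hedged sketch (the script name is hypothetical): the best Fabric can do is copy a Python script over and then invoke it like any other shell command.

```python
from fabric import Connection

con = Connection("perdulce.torproject.org")
# there is no real IPC: we push the code, then run it as a shell command
con.put("remote_check.py", remote="/tmp/remote_check.py")
result = con.run("python3 /tmp/remote_check.py", hide=True)
print(result.stdout)
```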
Cost
Time and labor.
Alternatives considered
ansible
Ansible makes easy things easy, but it can make it hard to do hard stuff.
For example, how would you do a disk inventory and pass it to another host to recreate those disks? For someone ignorant of Ansible like me, it's far from trivial, but in Fabric, it's:
json.loads(con.run('qemu-img info --output=json %s' % disk_path).stdout)
Any person somewhat familiar with Python can probably tell what this does. In Ansible, you'll need to first run the command and have a second task to parse the result, both of which involve slow round-trips with the server:
- name: gather information about disk
shell: "qemu-img info --output=json {{disk_path}}"
register: result
- name: parse disk information as JSON
set_fact:
disk_info: "{{ result.stdout | from_json }}"
That is much more verbose and harder to discover unless you're already deeply familiar with Ansible's processes and data structures.
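For completeness, a hedged sketch of the full Fabric pattern alluded to above (the host names, disk path and `qemu-img create` arguments are illustrative assumptions):

```python
import json

from fabric import Connection

source = Connection("source.torproject.org")
target = Connection("target.torproject.org")

disk_path = "/srv/vm/disk.qcow2"
# gather the disk inventory on the source host...
info = json.loads(
    source.run("qemu-img info --output=json %s" % disk_path, hide=True).stdout
)
# ... and recreate an empty disk of the same format and size on the target
target.run(
    "qemu-img create -f %s %s %d"
    % (info["format"], disk_path, info["virtual-size"])
)
```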
Compared with Puppet, Ansible's "collections" look pretty chaotic. The official collections index is weirdly disparate and incomplete while Ansible Galaxy is a wild jungle.
For example, there are 677 different Prometheus collections at the time of writing. The most popular Prometheus collection has lots of issues, namely:
- no support for installation through Debian packages (but you can "skip installation")
- even if you do, incompatible service names for exporters (e.g. `blackbox-exporter`), arguably a common problem that was also plaguing the Puppet module until @anarcat worked on it
- the module's documentation is kind of hidden inside the source code; for example, here is the source docs which show use cases and actual configurations, compared to the actual role docs, which just list supported variables
Another example is the nginx collection. In general, collections are pretty confusing coming from Puppet, where everything is united under a "module". A collection is actually closer to a module than a role is, but collections and roles are sometimes, as is the case for nginx, split into separate git repositories, which can be confusing (see the nginx role).
Taking a look at the language in general, Ansible's variables are all global, which means they all get "scoped" by using a prefix (e.g. `prometheus_alert_rules`).
Documentation is sparse and confusing. For example, I eventually figured out how to pull data from a host using a `lookup` function, but that wasn't because of the lookup documentation or the pipe plugin documentation, neither of which show this simple example:
- name: debug list hosts
debug: msg="{{ lookup('pipe', '/home/anarcat/src/prometheus.debian.net/list-debian.net.sh')}}"
YAML is hell. I could not find a way to put the following shell pipeline in a `pipe` lookup above, hence the shell script:
ldapsearch -u -x -H ldap://db.debian.org -b dc=debian,dc=org '(dnsZoneEntry=*)' dnsZoneEntry | grep ^dnsZoneEntry | grep -e ' A ' -e ' AAAA ' -e ' CNAME ' | sed -s 's/dnsZoneEntry: //;s/ .*/.debian.net/' | sort -u
For a first-time user, the distinction between a `lookup()` function and a `shell` task is really not obvious, and the documentation doesn't make it exactly clear that the former runs on the "client" and the latter runs on the "server" (although even the latter can be fuzzy, through delegation).
And since this is becoming an "Ansible crash course for Puppet developers", we might as well add a few key references:
- the working with playbooks section is possibly the most important and useful part of the Ansible documentation
- that includes variables and filters, critical and powerful functions that allow processing data from variables, files, etc.
- tags can be used to run a subset of a playbook but also to skip certain parts
Finally, Ansible is notoriously slow. A relatively simple Ansible playbook to deploy Prometheus runs in 44 seconds, while a fully-fledged Puppet configuration of a production server runs in 20 seconds, and this includes a collection of slow facts that takes 10 of those seconds; actual execution is nearer to 7 seconds. The Puppet configuration manages 757 resources while the Ansible configuration manages 115 resources. And that is with ansible-mitogen: without that hack, the playbook takes nearly two minutes to run.
In the end, the main reason we use Fabric instead of Ansible is that we use Puppet for high-level configuration management, and Ansible conflicts with that problem space, leading to higher cognitive load. It's also easier to just program custom processes in Python than in Ansible. So far, however, Fabric has effectively been creating more legacy code, as it has proven hard to unit test effectively unless a lot of care is given to keeping functions small and locally testable.
mcollective
- MCollective was a tool (now deprecated) that could be used to fire jobs on Puppet nodes from the Puppet master
- Not relevant for our use case because we want to bootstrap Puppet (in which case Puppet is not available yet) or retire Puppet (in which case it will go away).
bolt
- Bolt is interesting because it can be used to bootstrap Puppet
- Unfortunately, it does not reuse the Puppet primitives; instead, Bolt "tasks" are just arbitrary commands, usually shell commands (e.g. this task), along with a copious amount of JSON metadata
- it does not have much privileged access to PuppetDB or the Puppet CA infrastructure; that needs to be bolted on by hand
Doing things by hand
- timing is sometimes critical
- sets best practices in code instead of in documentation
- makes recipes easily reusable
Another custom Python script
- is it `subprocess.check_output`? or `check_call`? or `run`? what if you want both the output and the status code? can you remember? (see the sketch after this list)
- argument parsing code built-in, self-documenting code
- exposes Python functions as commandline jobs
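To illustrate that point: with plain `subprocess`, getting both the output and the exit status means remembering to reach for `run()` with the right flags, a sketch:

```python
import subprocess

# subprocess.run() is the one call that returns both output and exit status;
# check_output() raises on failure, check_call() discards the output
proc = subprocess.run(["uptime"], capture_output=True, text=True)
print(proc.returncode, proc.stdout.strip())
```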
Shell scripts
- hard to reuse
- hard to read, audit
- missing a lot of basic programming primitives (hashes, objects, etc)
- no unit testing out of the box
Perl
- notoriously hard to read
mitogen
A late-comer to the "alternatives considered" section: I actually found out about the mitogen project after the choice of Fabric was made, and after a significant amount of code had been written for it (about 2000 SLOC).
A major problem with Fabric, I discovered, is that it only allows executing commands on remote servers. That is, it's a glorified shell script. Yes, it allows things like SFTP file transfers, but that's about it: it's not possible to directly execute Python code on the remote node. This limitation makes it hard to implement more complex business logic on the remote server. It also makes error control in Fabric less intuitive, as normal Python code reflexes (like exception handling) cannot be used. Exception handling, in Fabric, is particularly tricky; see for example issue 2061, but generally exceptions don't work well inside Fabric.
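As a hedged sketch of what that means in practice, a failing command either raises `UnexpectedExit` or, with `warn=True`, must be checked manually (the service name below is a placeholder):

```python
from fabric import Connection
from invoke.exceptions import UnexpectedExit

con = Connection("perdulce.torproject.org")

# option 1: catch the exception Fabric/Invoke raises on a non-zero exit
try:
    con.run("systemctl is-active some-service")
except UnexpectedExit as e:
    print("command failed with exit code", e.result.exited)

# option 2: suppress the exception and inspect the result instead
result = con.run("systemctl is-active some-service", warn=True)
if result.failed:
    print("command failed with exit code", result.exited)
```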
Basically, I wish I had found out about mitogen before I wrote all this code. It would make code like the LDAP connector much easier to write (as it could run directly on the LDAP server, bypassing the firewall issues). A rewrite of the post-install grml-debootstrap hooks would also be easier to implement than right now.
Considering there isn't that much code written, it's still possible to switch to Mitogen. The major downside of mitogen is that it doesn't have a commandline interface: it's "only" a Python library and everything needs to be written on top of that. In fact, it seems like Mitogen is primarily written as an Ansible backend, so it is possible that non-Ansible use cases might be less supported.
The "makefile" (fabfile
, really) approach is also not supported at
all by mitogen. So all the nice "self-documentation" and "automatic
usage" goodness brought to use by the Fabric decorator would need to
be rebuilt by hand. There are existing dispatchers (say like
click or fire) which could be used to work around that.
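As a rough, hedged sketch of what "rebuilding by hand" with click could look like (the mitogen plumbing itself is left out):

```python
import click


@click.command()
@click.option("--host", "hosts", multiple=True, help="host(s) to operate on")
def uptime(hosts):
    """run uptime on the given hosts (placeholder body)"""
    # a real implementation would hand each host off to a mitogen router here
    for host in hosts:
        click.echo("would run uptime on %s" % host)


if __name__ == "__main__":
    uptime()
```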
And obviously, the dispatcher (say: run this command on all those hosts) is not directly usable from the commandline, out of the box. But it seems like a minor annoyance considering we're generally rewriting that on top of Fabric right now because of serious limitations in the current scheduler.
Finally, mitogen seems to be better maintained than Fabric; at the time of writing:

| Stat         | Mitogen    | Fabric     |
|--------------|------------|------------|
| Last commit  | 2021-10-23 | 2021-10-15 |
| Last release | 2021-10-28 | 2021-01-18 |
| Open issues  | 165        | 382        |
| Open PRs     | 16         | 44         |
| Contributors | 23         | 14         |
Those numbers are based on the current GitHub statistics. Another comparison is the openhub dashboard comparing Fabric, Mitogen and pyinvoke (the Fabric backend). It should be noted that:
- all three projects have "decreasing" activity
- the code size is in a similar range: when added together, Fabric and invoke are about 26k SLOC, while mitogen is 36k SLOC, but this does show that mitogen is more complex than Fabric
- there has been more activity in mitogen in the past 12 months
- but more contributors in Fabric (pyinvoke, specifically) over time
The Fabric author also posted a request for help with his projects, which doesn't bode well for the project in the long term. A few people offered help, but so far no major change has happened in the issue queue (lots of duplicates and trivial PRs remain open).
On the other hand, the Mitogen author seems to have moved on to other things. He hasn't committed to the project in over a year, shortly after announcing a "private-source" (GPL, but no public code release) rewrite of the Ansible engine, called Operon. So it's unclear what the fate of mitogen will be.
transilience
Enrico Zini has created something called transilience, which sits on top of Mitogen and is somewhat of an Ansible replacement, but without the templatized YAML. Fast, declarative, yet Python. Might be exactly what we need, and certainly better than starting on top of mitogen alone.
The biggest advantage of transilience is that it builds on top of mitogen, because we can run Python code remotely, transparently. Zini was also especially careful about creating a somewhat simple API.
The biggest flaw is that it is basically just a prototype with limited documentation and no stability promises. It's not exactly clear how to write new actions, for example, unless you count this series of blog posts. It might also suffer from second-system syndrome, in the sense that it might become as complicated as Ansible as it tries to replicate more of Ansible's features. It could still offer a good source of library items to do common tasks like installing packages and so on.
spicerack and cumin
The Wikimedia Foundation (WMF, the organisation running Wikipedia) created a set of tools called spicerack (source code). It is a framework of Python code built on top of Cumin, on top of which they wrote a set of cookbooks to automate various ad-hoc operations on the cluster.
Like Fabric, it doesn't ship Python code on the remote servers: it merely executes shell commands. The advantage over Fabric is that it bridges with the Cumin inventory system to target servers based on the domain-specific language (DSL) available there.
It is also very WMF-specific, and could be difficult to use outside of that context. Specifically, there might be a lot of hardcoded assumptions in the code that we'd need to patch out (for example, the Ganeti instance creation code), which would therefore require a fork. Fortunately, spicerack has regular releases, which makes tracking forks easier. Collaboration with upstream is possible, but requires registering and contributing to their Gerrit instance (see for example the work anarcat did on Cumin).
It does have good examples of how Cumin can be used as a library for certain operations, however.
One major limitation of Spicerack is that it uses Cumin as a transport, which implies that it can only execute shell commands on the remote server: no complex business logic can be carried over to the remote side, or, in other words, we can't run Python code remotely.
Other Python tools
This article reviews a bunch of Ansible alternatives in Python; let's take a look:
- Bundlewrap: Python-based DSL, push over SSH, needs password-less sudo over SSH for localhost operation, defers to SSH multiplexing for performance (!), uses mako templates, unclear how to extend it with new "items", active
- Pulumi: somewhat language agnostic (support for TypeScript, JavaScript, Python, Golang, C#), lots of YAML, requires a backend, too complicated, unclear how to write new backends, active
- Nuka: asyncio + SSH, unclear scoping ("how does `shell.command` know which `host` to talk with?"), minimal documentation, not active
- pyinfra: lots of facts, operations, control flow can be unclear, performance close to Fabric, popular, active
- Nornir: no DSL, just Python, plugins, YAML inventory, active
Other discarded alternatives
- FAI: might resolve the installer scenario (and maybe not in all cases), but does not resolve ad-hoc tasks or host retirement. We can still use it for parts of the installer, as we currently do, obviously.
Other ideas
One thing that all of those solutions could try to do is the "do nothing scripting" approach. The idea behind this is that, to reduce toil in a complex task, you break it down into individual steps that are documented in a script, split into many functions. This way it becomes possible to automate parts of that script, possibly with reusable code across many tasks.
That, in turn, makes automating really complex tasks possible in an incremental fashion...