|
|
|
[Fabric][] is a Python module, built on top of [Invoke][], that could
be described as "make for sysadmins". It allows us to establish "best
practices" for routine tasks like:
|
|
|
|
|
|
|
|
* installing a server (TODO)
|
|
|
|
* retiring a server ([howto/retire-a-host](howto/retire-a-host))
|
|
|
|
* migrating machines ([howto/ganeti](howto/ganeti#Importing-external-instances))
|
|
|
|
* retiring a user (TODO)
|
|
|
|
* reboots ([howto/upgrades](howto/upgrades#Kernel-upgrades-and-reboots))
|
|
|
|
* ... etc
|
|
|
|
|
|
|
|
Fabric makes easy things reproducible and hard things possible. It is
|
|
|
|
*not* designed to handle larger-scale configuration management, for
|
|
|
|
which we use [howto/puppet](howto/puppet).
|
|
|
|
|
|
|
|
[Invoke]: https://www.pyinvoke.org/
|
|
|
|
[Fabric]: https://www.fabfile.org/
|
|
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
|
|
# Tutorial
|
|
|
|
|
|
|
|
All of the instructions below assume you have a copy of the TPA fabric
library. Fetch it with:
|
|
|
|
|
|
|
|
git clone git@git-rw.torproject.org:admin/tsa-misc.git &&
|
|
|
|
cd tsa-misc
|
|
|
|
|
|
|
|
## Running a command on hosts
|
|
|
|
|
|
|
|
Fabric can be used from the commandline to run arbitrary commands on
|
|
|
|
servers, like this:
|
|
|
|
|
|
|
|
fab -H hostname.example.com -- COMMAND
|
|
|
|
|
|
|
|
For example:
|
|
|
|
|
|
|
|
$ fab -H perdulce.torproject.org -- uptime
|
|
|
|
17:53:22 up 24 days, 19:34, 1 user, load average: 0.00, 0.00, 0.07
|
|
|
|
|
|
|
|
This is equivalent to:
|
|
|
|
|
|
|
|
ssh hostname.example.com COMMAND
|
|
|
|
|
|
|
|
... except that you can run it on *multiple* servers:
|
|
|
|
|
|
|
|
$ fab -H perdulce.torproject.org,chives.torproject.org -- uptime
|
|
|
|
17:54:48 up 24 days, 19:36, 1 user, load average: 0.00, 0.00, 0.06
|
|
|
|
17:54:52 up 24 days, 17:35, 21 users, load average: 0.00, 0.00, 0.00
|
|
|
|
|
|
|
|
## Listing tasks and self-documentation
|
|
|
|
|
|
|
|
The `tsa-misc` repository has a good library of tasks that can be run
from the commandline. To show the list, use:
|
|
|
|
|
|
|
|
fab -l
|
|
|
|
|
|
|
|
Help for individual tasks can also be inspected with `--help` (short
form `-h`), for example:
|
|
|
|
|
|
|
|
$ fab -h host.fetch-ssh-host-pubkey
|
|
|
|
Usage: fab [--core-opts] host.fetch-ssh-host-pubkey [--options] [other tasks here ...]
|
|
|
|
|
|
|
|
Docstring:
|
|
|
|
fetch public host key from server
|
|
|
|
|
|
|
|
Options:
|
|
|
|
-t STRING, --type=STRING
|
|
|
|
|
|
|
|
The name of the server to run the command against is implicit in the
|
|
|
|
usage: it must be passed with the `-H` (short for `--hosts`)
|
|
|
|
argument. For example:
|
|
|
|
|
|
|
|
$ fab -H perdulce.torproject.org host.fetch-ssh-host-pubkey
|
|
|
|
b'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGOnZX95ZQ0mliL0++Enm4oXMdf1caZrGEgMjw5Ykuwp root@perdulce\n'
|
|
|
|
|
|
|
|
# How-to
|
|
|
|
|
|
|
|
## A simple Fabric function
|
|
|
|
|
|
|
|
Each procedure mentioned in the introduction above has its own
documentation. This section instead shows how to write a simple
Fabric task inside TPA. Here we will create an `uptime` task which
simply runs the `uptime` command on the provided hosts. It's a
trivial example that shouldn't actually be implemented (it is easier
to just tell `fab` to run the shell command) but it should give you an
idea of how to write new tasks.
|
|
|
|
|
|
|
|
1. edit the source
|
|
|
|
|
|
|
|
$EDITOR fabric_tpa/host.py
|
|
|
|
|
|
|
|
we pick the "generic" host library (`host.py`) here, but there are
|
|
|
|
other libraries that might be more appropriate, for example
|
|
|
|
`ganeti`, `libvirt` or `reboot`. Fabric-specific extensions,
|
|
|
|
monkeypatching and other hacks should live in `__init__.py`.
|
|
|
|
|
|
|
|
2. add a task, which is simply a Python function:
|
|
|
|
|
|
|
|
        @task
        def uptime(con):
            return con.run('uptime')
|
|
|
|
|
|
|
|
   The `@task` [decorator](https://docs.python.org/3/glossary.html#term-decorator) tells Fabric
   that the function should be exposed as a command-line task. The
   task is passed a `Connection` object as its first argument, which
   we use to run commands on the host. Here, we run the `uptime`
   command over SSH.
|
|
|
|
|
|
|
|
3. the task will automatically be loaded as it is part of the
|
|
|
|
`host` module, but if this is a new module, add it to `fabfile.py`
|
|
|
|
in the parent directory
|
|
|
|
|
|
|
|
4. the task should now be available:
|
|
|
|
|
|
|
|
$ fab -H perdulce.torproject.org host.uptime
|
|
|
|
18:06:56 up 24 days, 19:48, 1 user, load average: 0.00, 0.00, 0.02
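Tasks can also take options. The variant below is only a sketch and is not part of `tsa-misc`: it adds a hypothetical `pretty` keyword argument to the `uptime` task. Invoke, on which Fabric is built, exposes keyword arguments of a task as command-line options, which is presumably how the `--type` option of `host.fetch-ssh-host-pubkey` ends up in its help output. The `uptime_command` helper and the `ImportError` fallback are assumptions added so the sketch can be exercised without Fabric installed.

```python
try:
    from fabric import task
except ImportError:
    # Fallback for illustration only: lets the sketch load without Fabric.
    def task(func):
        return func


def uptime_command(pretty=False):
    """Build the shell command; kept separate so it is easy to test."""
    return "uptime --pretty" if pretty else "uptime"


@task
def uptime(con, pretty=False):
    """show how long the host has been running"""
    # A boolean keyword argument defaulting to False becomes a flag.
    return con.run(uptime_command(pretty))
```

Assuming this lives in `fabric_tpa/host.py`, the flag would be passed as `fab -H perdulce.torproject.org host.uptime --pretty`.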
|
|
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
|
|
N/A for now. Fabric is an ad-hoc tool and, as such, doesn't have
|
|
|
|
monitoring that should trigger a response. It *could* however be used
|
|
|
|
for some oncall work, which remains to be determined.
|
|
|
|
|
|
|
|
## Disaster recovery
|
|
|
|
|
|
|
|
N/A.
|
|
|
|
|
|
|
|
# Reference
|
|
|
|
|
|
|
|
## Installation
|
|
|
|
|
|
|
|
Fabric is available as a Debian package:
|
|
|
|
|
|
|
|
apt install fabric
|
|
|
|
|
|
|
|
See also the [upstream instructions](https://www.fabfile.org/installing.html) for other platforms
|
|
|
|
(e.g. Pip).
|
|
|
|
|
|
|
|
Fabric code grew out of the installer and reboot scripts in the
|
|
|
|
`tsa-misc` repository. To get access to the code, simply clone the
|
|
|
|
repository and run from the top level directory:
|
|
|
|
|
|
|
|
git clone git@git-rw.torproject.org:admin/tsa-misc.git &&
|
|
|
|
cd tsa-misc &&
|
|
|
|
fab -l
|
|
|
|
|
|
|
|
This code could also be moved to its own repository altogether.
|
|
|
|
|
|
|
|
### Installing Fabric on Debian buster
|
|
|
|
|
|
|
|
Fabric has been [part of Debian since at least Debian jessie][], but
|
|
|
|
you should install the newer, 2.x version that is only available in
|
|
|
|
bullseye and later. The bullseye version is a "trivial backport" which
|
|
|
|
means it can be installed directly in stable with:
|
|
|
|
|
|
|
|
apt install -t bullseye fabric
|
|
|
|
|
|
|
|
[part of Debian since at least Debian jessie]: https://tracker.debian.org/pkg/fabric
|
|
|
|
|
|
|
|
This will also pull [invoke](https://tracker.debian.org/pkg/python-invoke) (from unstable) and [paramiko](https://tracker.debian.org/pkg/paramiko)
|
|
|
|
(from stable). The latter will show a *lot* of warnings when running
|
|
|
|
by default, however, so you might want to upgrade to backports as
|
|
|
|
well:
|
|
|
|
|
|
|
|
apt install -t buster-backports python3-paramiko
|
|
|
|
|
|
|
|
## SLA
|
|
|
|
|
|
|
|
N/A
|
|
|
|
|
|
|
|
## Design
|
|
|
|
|
|
|
|
TPA's fabric library lives in the `tsa-misc` repository and consists
|
|
|
|
of multiple Python modules, at the time of writing:
|
|
|
|
|
|
|
|
anarcat@curie:tsa-misc(master)$ wc -l fabric_tpa/*.py
|
|
|
|
463 fabric_tpa/ganeti.py
|
|
|
|
297 fabric_tpa/host.py
|
|
|
|
46 fabric_tpa/__init__.py
|
|
|
|
262 fabric_tpa/libvirt.py
|
|
|
|
224 fabric_tpa/reboot.py
|
|
|
|
125 fabric_tpa/retire.py
|
|
|
|
1417 total
|
|
|
|
|
|
|
|
Each module contains Fabric tasks, which can be called from the
commandline `fab` tool, and plain Python functions; both can be
reused in other modules as well. There are also wrapper scripts for
|
|
|
|
certain jobs that are a poor fit for the `fab` tool, especially
|
|
|
|
`reboot` which requires particular host scheduling.
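The split between tasks and plain functions could look like the following sketch. All three names (`parse_load_average`, `load_average`, `show_load`) are hypothetical and do not exist in `tsa-misc`; the `ImportError` fallback is likewise only there so the sketch loads without Fabric. The idea is that the parsing and the remote call stay in ordinary functions that other modules can import, while the task is a thin command-line wrapper.

```python
try:
    from fabric import task
except ImportError:
    # Fallback for illustration only: lets the sketch load without Fabric.
    def task(func):
        return func


def parse_load_average(uptime_output):
    """Extract the three load averages from `uptime` output as floats."""
    _, _, loads = uptime_output.rpartition("load average:")
    return [float(value) for value in loads.split(",")]


def load_average(con):
    """Plain function: reusable from other modules and wrapper scripts."""
    # hide=True keeps the raw command output out of the terminal.
    return parse_load_average(con.run("uptime", hide=True).stdout)


@task
def show_load(con):
    """print the load averages of the host"""
    print(load_average(con))
```

A wrapper script or another module could then call `load_average(con)` directly, without going through the `fab` command line.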
|
|
|
|
|
|
|
|
The fabric functions currently only communicate with the rest of the
|
|
|
|
infrastructure through SSH. It is assumed the operator will have
|
|
|
|
direct `root` access on all the affected servers. Server lists are
|
|
|
|
provided by the operator but should eventually be extracted from
|
|
|
|
PuppetDB or LDAP. It's also possible scripts will eventually edit
|
|
|
|
existing (but local) git repositories.
|
|
|
|
|
|
|
|
Most of the TPA-specific code was written and is maintained by
|
|
|
|
`anarcat`. The Fabric project itself is headed by [Jeff Forcier AKA
bitprophet](http://bitprophet.org/). It is, obviously, a much smaller community than Ansible's
but it is still active. There is a mailing list, IRC channel, and GitHub
issues for upstream support (see [contact](https://www.fabfile.org/contact.html)) along with commercial
support through [Tidelift](https://tidelift.com/subscription/pkg/pypi-fabric).
|
|
|
|
|
|
|
|
There are no formal releases of the code for now.
|
|
|
|
|
|
|
|
These are the main jobs being automated by Fabric:
|
|
|
|
|
|
|
|
* [automate installs][]
|
|
|
|
* [automate reboots][]
|
|
|
|
* [automate retirement][]
|
|
|
|
|
|
|
|
[automate installs]: https://bugs.torproject.org/31239
|
|
|
|
[automate reboots]: https://bugs.torproject.org/33406
|
|
|
|
[automate retirement]: https://bugs.torproject.org/33477
|
|
|
|
|
|
|
|
## Issues
|
|
|
|
|
|
|
|
There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [generic internal services][search] component.
|
|
|
|
|
|
|
|
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
|
|
|
|
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
|
|
|
|
|
|
|
|
## Monitoring and testing
|
|
|
|
|
|
|
|
There is no monitoring of this service, as it's not running continuously.
|
|
|
|
|
|
|
|
Fabric tasks *should* implement some form of unit testing. Ideally, we
|
|
|
|
would have 100% test coverage.
|
|
|
|
|
|
|
|
We use [pytest](https://www.pytest.org/) to write unit tests. To run the test suite, use:
|
|
|
|
|
|
|
|
pytest-3 fabric_tpa
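A unit test for a function that talks to a host does not need a real server. The sketch below is an assumption about how such a test could be written, not a copy of the existing test suite: `kernel_release` is a hypothetical helper, and a `Mock` from the standard library stands in for the Fabric connection.

```python
from unittest.mock import Mock


def kernel_release(con):
    """Return the running kernel release of the host behind `con`."""
    return con.run("uname -r", hide=True).stdout.strip()


def test_kernel_release():
    # The Mock replaces the Fabric connection, so no SSH session is opened.
    con = Mock()
    con.run.return_value = Mock(stdout="5.10.0-8-amd64\n")

    assert kernel_release(con) == "5.10.0-8-amd64"
    # Check that exactly the expected command was sent to the host.
    con.run.assert_called_once_with("uname -r", hide=True)
```

pytest collects functions named `test_*` from files matching its discovery pattern (`test_*.py` by default), so a test like this only needs to live in such a file.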
|
|
|
|
|
|
|
|
# Discussion
|
|
|
|
|
|
|
|
## Overview
|
|
|
|
|
|
|
|
There are multiple tasks in TPA that require manually copy-pasting
code from documentation to the shell or, worse, grepping backwards in
history to find the magic command (e.g. `ldapvi`). A lot of those jobs
|
|
|
|
are error-prone and hard to do correctly.
|
|
|
|
|
|
|
|
In case of the installer, this leads to significant variation and
|
|
|
|
chaos in the installs, which results in instability and
|
|
|
|
inconsistencies between servers. It was determined that the installs
|
|
|
|
would be automated as part of [ticket 31239](https://bugs.torproject.org/31239) and that analysis and
|
|
|
|
work is being done in [howto/new-machine](howto/new-machine).
|
|
|
|
|
|
|
|
It was later realised that *other* areas were suffering from a similar
|
|
|
|
problem. The upgrade process, for example, has mostly been manual
|
|
|
|
until ad-hoc shell scripts were written. But unfortunately now we have
|
|
|
|
*many* shell scripts, none of which work correctly. So work started on
|
|
|
|
automating reboots as part of [ticket 33406](https://bugs.torproject.org/33406).
|
|
|
|
|
|
|
|
And then it was time to migrate the second libvirt server to
|
|
|
|
[howto/ganeti](howto/ganeti) (unifolium/kvm2, [ticket 33085](https://bugs.torproject.org/33085)) and by then it was
|
|
|
|
clear some more generic solution was required. An [attempt](https://gitlab.com/anarcat/tpa-ansible-libvirt-ganeti-importer) to
implement this work in Ansible only led to frustration at the
complexity of the task, so tests were started on Fabric instead, and
those were positive. A few weeks later, a library of functions was
available and the migration procedure was almost entirely automated.
|
|
|
|
|
|
|
|
## LDAP notes
|
|
|
|
|
|
|
|
LDAP integration might be something we could consider, because it's a
|
|
|
|
large part of the automation that's required in a lot of our work. One
|
|
|
|
alternative is to talk with `ldapvi` or commandline tools, the other
|
|
|
|
is to implement some things natively in Python:
|
|
|
|
|
|
|
|
* [Python LDAP][] could be used to automate talking with ud-ldap,
  see the [Python LDAP functions][], in particular
  [add][] and [delete][]
* The above docs are very limited, and they also [suggest][] external
  resources:
|
|
|
|
* https://hub.packtpub.com/python-ldap-applications-extra-ldap-operations-and-ldap-url-library/
|
|
|
|
* https://hub.packtpub.com/configuring-and-securing-python-ldap-applications-part-2/
|
|
|
|
* https://www.linuxjournal.com/article/6988
|
|
|
|
|
|
|
|
[Python LDAP]: https://www.python-ldap.org/
|
|
|
|
[Python LDAP functions]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#functions
|
|
|
|
[delete]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#ldap.LDAPObject.delete
|
|
|
|
[add]: https://www.python-ldap.org/en/python-ldap-3.2.0/reference/ldap.html#ldap.LDAPObject.add
|
|
|
|
[suggest]: https://www.python-ldap.org/en/python-ldap-3.2.0/resources.html
|
|
|
|
|
|
|
|
## Goals
|
|
|
|
|
|
|
|
### Must have
|
|
|
|
|
|
|
|
* ease of use - it should be easy to write new tasks and to
|
|
|
|
understand existing ones
|
|
|
|
|
|
|
|
* operation on multiple servers - many of the tricky tasks we need to
  do operate on multiple servers *synchronously*, something that, for
  example, is hard to do in Puppet
|
|
|
|
|
|
|
|
### Nice to have
|
|
|
|
|
|
|
|
* long term maintenance - this should not be Legacy Code and must be
|
|
|
|
unit tested, at least for parts that are designed to stay in the
|
|
|
|
long term (e.g. not the libvirt importer)
|
|
|
|
|
|
|
|
### Non-Goals
|
|
|
|
|
|
|
|
* sharing with the community - it is assumed that those are tasks too
  site-specific to be reused by other groups, although the code is
  still shared publicly. Shared code belongs in Puppet.
|
|
|
|
|
|
|
|
* performance - this does not need to be high performance, as those
|
|
|
|
tasks are done rarely
|
|
|
|
|
|
|
|
## Approvals required
|
|
|
|
|
|
|
|
TPA. Approved in [/meeting/2020-03-09/](/meeting/2020-03-09/).
|
|
|
|
|
|
|
|
## Proposed Solution
|
|
|
|
|
|
|
|
We are testing Fabric.
|
|
|
|
|
|
|
|
## Cost
|
|
|
|
|
|
|
|
Time and labor.
|
|
|
|
|
|
|
|
## Alternatives considered
|
|
|
|
|
|
|
|
### ansible
|
|
|
|
|
|
|
|
* [Ansible](https://www.ansible.com/) makes easy things easy and scalable, but makes it hard
|
|
|
|
to do hard stuff
|
|
|
|
|
|
|
|
* for example, how would you do a disk inventory and pass it to
  another host to recreate those disks? For someone unfamiliar with
  Ansible, like me, it's far from trivial. It probably implies
  something like this [dictionary type][dictionnary type] but in Fabric, it's:
|
|
|
|
|
|
|
|
json.loads(con.run('qemu-img info --output=json %s' % disk_path).stdout)
|
|
|
|
|
|
|
|
Any person somewhat familiar with Python can tell what this does.
|
|
|
|
|
|
|
|
* we use Puppet for high-level configuration management, and Ansible
|
|
|
|
conflicts with that problem space, leading to higher cognitive load
|
|
|
|
|
|
|
|
[dictionnary type]: https://docs.ansible.com/ansible/latest/plugins/lookup/dict.html
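Expanded into a reusable function, the Fabric one-liner above could look like this sketch. The function name `disk_info` is hypothetical; `shlex.quote` is an addition here so that paths containing spaces survive the remote shell, and `hide=True` keeps the raw JSON out of the terminal.

```python
import json
import shlex


def disk_info(con, disk_path):
    """Return `qemu-img info` for `disk_path` on the remote host as a dict."""
    # Quote the path: the command line is interpreted by the remote shell.
    command = "qemu-img info --output=json %s" % shlex.quote(disk_path)
    return json.loads(con.run(command, hide=True).stdout)
```

The returned dictionary is plain Python data, so it can be handed to another function that operates on a different connection, for example to recreate a disk with the same `virtual-size` on the target host.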
|
|
|
|
|
|
|
|
### mcollective
|
|
|
|
|
|
|
|
* [MCollective][] was (it's deprecated) a tool that could be used to
|
|
|
|
fire jobs on Puppet nodes from the Puppet master
|
|
|
|
|
|
|
|
[MCollective]: https://puppet.com/docs/mcollective/
|
|
|
|
|
|
|
|
* Not relevant for our use case because we want to bootstrap Puppet
|
|
|
|
(in which case Puppet is not available yet) or retire Puppet (in
|
|
|
|
which case it will go away).
|
|
|
|
|
|
|
|
### bolt
|
|
|
|
|
|
|
|
* [Bolt][] is interesting because it *can* be used to [bootstrap
|
|
|
|
Puppet][]
|
|
|
|
|
|
|
|
* Unfortunately, it does *not* reuse the Puppet primitives and
|
|
|
|
instead Bolt "tasks" are just arbitrary commands, usually shell
|
|
|
|
commands (e.g. [this task][]) along with a [copious][]
|
|
|
|
[amount][] of JSON metadata
|
|
|
|
|
|
|
|
[amount]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.json
|
|
|
|
[copious]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install.json
|
|
|
|
[this task]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.sh
|
|
|
|
[Bolt]: https://puppet.com/docs/bolt/
|
|
|
|
[bootstrap Puppet]: https://github.com/puppetlabs/puppetlabs-puppet_agent#puppet_agentinstall
|
|
|
|
|
|
|
|
* does *not* have much privileged access to PuppetDB or the Puppet CA
  infrastructure; that needs to be [bolted on][] by hand
|
|
|
|
|
|
|
|
[bolted on]: https://puppet.com/docs/bolt/latest/bolt_connect_puppetdb.html
|
|
|
|
|
|
|
|
### Doing things by hand
|
|
|
|
|
|
|
|
* timing is sometimes critical
|
|
|
|
* sets best practices in code instead of in documentation
|
|
|
|
* makes recipes easily reusable
|
|
|
|
|
|
|
|
### Another custom Python script
|
|
|
|
|
|
|
|
* is it `subprocess.check_output`? or `check_call`? or `run`? what if
|
|
|
|
you want both the output and the status code? can you remember?
|
|
|
|
* argument parsing code built-in, self-documenting code
|
|
|
|
* exposes Python functions as commandline jobs
|
|
|
|
|
|
|
|
### Shell scripts
|
|
|
|
|
|
|
|
* hard to reuse
|
|
|
|
* hard to read, audit
|
|
|
|
* missing a lot of basic programming primitives (hashes, objects,
|
|
|
|
etc)
|
|
|
|
* no unit testing out of the box
|
|
|
|
|
|
|
|
### Perl
|
|
|
|
|
|
|
|
* notoriously hard to read