Verified commit 0d400c8c, authored by anarcat

add fabric documentation

parent 0a7d90ba
[Fabric][] is a Python module, built on top of [Invoke][], that could
be described as "make for sysadmins". It allows us to establish "best
practices" for routine tasks like:
* installing a server (TODO)
* retiring a server ([[retire-a-host]])
* migrating machines ([[ganeti#Importing-external-instances]])
* retiring a user (TODO)
* reboots ([[upgrades#Kernel-upgrades-and-reboots]])
* ... etc
Fabric makes easy things reproducible and hard things possible. It is
*not* designed to handle larger-scale configuration management, for
which we use [[puppet]].
[Invoke]: https://www.pyinvoke.org/
[Fabric]: https://www.fabfile.org/
[[!toc levels=3]]
# Tutorial
## Running a command on hosts
Fabric can be used from the commandline to run arbitrary commands on
servers, like this:

    fab -H hostname.example.com -- COMMAND

For example:

    $ fab -H perdulce.torproject.org -- uptime
    17:53:22 up 24 days, 19:34, 1 user, load average: 0.00, 0.00, 0.07

This is equivalent to:

    ssh hostname.example.com COMMAND

... except that you can run it on *multiple* servers:

    $ fab -H perdulce.torproject.org,chives.torproject.org -- uptime
    17:54:48 up 24 days, 19:36, 1 user, load average: 0.00, 0.00, 0.06
    17:54:52 up 24 days, 17:35, 21 users, load average: 0.00, 0.00, 0.00
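
The same fan-out can be done from Python. The following is a minimal sketch, assuming Fabric 2.x is installed; `run_everywhere` and its `connection_factory` parameter are hypothetical names introduced for illustration, not part of Fabric or `tsa-misc`:

```python
def run_everywhere(hosts, command, connection_factory=None):
    """Run `command` on each host in turn and return {host: stdout}.

    `connection_factory` is a hypothetical hook: it defaults to Fabric's
    Connection class, but any callable returning an object with a
    compatible run() method works (handy for tests).
    """
    if connection_factory is None:
        # imported lazily so this sketch loads even without Fabric
        from fabric import Connection
        connection_factory = Connection
    output = {}
    for host in hosts:
        con = connection_factory(host)
        # hide=True keeps the remote output off our terminal; it is
        # still captured in the result's stdout attribute
        result = con.run(command, hide=True)
        output[host] = result.stdout.strip()
    return output
```

Calling `run_everywhere(['perdulce.torproject.org', 'chives.torproject.org'], 'uptime')` would return a dictionary mapping each host to its `uptime` output. Fabric also ships `SerialGroup` and `ThreadingGroup` classes that cover the same need.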
## Listing tasks and self-documentation
The `tsa-misc` repository has a good library of tasks that can be run
from the commandline. To show the list, use:

    fab -l

Help for individual tasks can also be inspected with `--help` (or
`-h`), for example:

    $ fab -h host.fetch-ssh-host-pubkey
    Usage: fab [--core-opts] host.fetch-ssh-host-pubkey [--options] [other tasks here ...]

    Docstring:
      fetch public host key from server

    Options:
      -t STRING, --type=STRING

The server to run the command against does not appear in the usage
message: it must be passed with the `-H` (short for `--hosts`)
argument. For example:

    $ fab -H perdulce.torproject.org host.fetch-ssh-host-pubkey
    b'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGOnZX95ZQ0mliL0++Enm4oXMdf1caZrGEgMjw5Ykuwp root@perdulce\n'
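
The `--type` option in the help output is not declared separately: Invoke derives command-line flags from the keyword arguments of the task function. The actual implementation in `host.py` is not reproduced in this document, so the following is only a guess at its shape. The default value, the remote path and the use of `con.get()` are all assumptions. In the real module the function would carry the `@task` decorator, which is left out here so the sketch runs without Fabric:

```python
import io


def fetch_ssh_host_pubkey(con, type='ed25519'):
    """fetch public host key from server"""
    # assumption: the key lives at the usual OpenSSH location
    remote_path = '/etc/ssh/ssh_host_%s_key.pub' % type
    # Connection.get() accepts a file-like object as the local target,
    # so the key can be read into memory instead of a local file
    buffer = io.BytesIO()
    con.get(remote_path, local=buffer)
    return buffer.getvalue()
```

Once decorated, the string default on `type` is what produces the `-t STRING, --type=STRING` flag, and the underscores in the function name become dashes in the task name.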
# How-to
## A simple Fabric function
Each procedure mentioned in the introduction above has its own
documentation. This section instead shows how to write a simple
Fabric program inside TPA. Here we will create an `uptime` task which
will simply run the `uptime` command on the provided hosts. It's a
trivial example that shouldn't be implemented (it is easier to just
tell `fab` to run the shell command) but it should give you an idea of
how to write new tasks.

 1. clone and edit the source

        git clone git@git-rw.torproject.org:admin/tsa-misc.git
        cd tsa-misc/fabric_tpa
        $EDITOR host.py

    we pick the "generic" host library (`host.py`) here, but there are
    other libraries that might be more appropriate, for example
    `ganeti`, `libvirt` or `reboot`. Fabric-specific extensions,
    monkeypatching and other hacks should live in `__init__.py`.

 2. add a task, which is simply a Python function:

        @task
        def uptime(con):
            """show how long the host has been up"""
            return con.run('uptime')

    The `@task` line is a [decorator](https://docs.python.org/3/glossary.html#term-decorator) which tells Fabric that
    the function should be exposed as a command-line task. Fabric
    then passes it a `Connection` object, on which we can run
    commands. In this case, we run the `uptime` command over SSH.

 3. the task will automatically be loaded as it is part of the
    `host` module, but if this is a new module, add it to `fabfile.py`
    in the parent directory

 4. the task should now be available:

        $ fab -H perdulce.torproject.org host.uptime
        18:06:56 up 24 days, 19:48, 1 user, load average: 0.00, 0.00, 0.02

## Pager playbook
N/A for now. Fabric is an ad-hoc tool and, as such, doesn't have
monitoring that should trigger a response. It *could* however be used
for some oncall work, which remains to be determined.
## Disaster recovery
N/A.
# Reference
## Installation
Fabric code grew out of the installer and reboot scripts in the
`tsa-misc` repository. It could also be moved to its own repository
altogether.
### Installing Fabric on Debian buster
Fabric has been [part of Debian since at least Debian jessie][], but
you should install the newer, 2.x version that is only available in
bullseye and later. The bullseye version is a "trivial backport" which
means it can be installed directly in stable with:

    apt install -t bullseye fabric

[part of Debian since at least Debian jessie]: https://tracker.debian.org/pkg/fabric
This will also pull [invoke](https://tracker.debian.org/pkg/python-invoke) (from unstable) and [paramiko](https://tracker.debian.org/pkg/paramiko)
(from stable). The latter will show a *lot* of warnings when running
by default, however, so you might want to upgrade to backports as
well:

    apt install -t buster-backports python3-paramiko

## SLA
N/A
## Design
TPA's fabric library lives in the `tsa-misc` repository and consists
of multiple Python modules, at the time of writing:

    anarcat@curie:tsa-misc(master)$ wc -l fabric_tpa/*.py
      463 fabric_tpa/ganeti.py
      297 fabric_tpa/host.py
       46 fabric_tpa/__init__.py
      262 fabric_tpa/libvirt.py
      224 fabric_tpa/reboot.py
      125 fabric_tpa/retire.py
     1417 total

Each module encompasses Fabric tasks that can be called from the
commandline `fab` tool or Python functions, both of which can be
reused in other modules as well. There are also wrapper scripts for
certain jobs that are a poor fit for the `fab` tool, especially
`reboot` which requires particular host scheduling.
The fabric functions currently only communicate with the rest of the
infrastructure through SSH. It is assumed the operator will have
direct `root` access on all the affected servers. Server lists are
provided by the operator but should eventually be extracted from
PuppetDB or LDAP. It's also possible scripts will eventually edit
existing (but local) git repositories.
Most of the TPA-specific code was written and is maintained by
`anarcat`. The Fabric project itself is headed by [Jeff Forcier AKA
bitprophet](http://bitprophet.org/). It is, obviously, a much smaller community than Ansible
but still active. There is a mailing list, IRC channel, and GitHub
issues for upstream support (see [contact](https://www.fabfile.org/contact.html)) along with commercial
support through [Tidelift](https://tidelift.com/subscription/pkg/pypi-fabric).
There are no formal releases of the code for now.
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
## Issues
There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [generic internal services][search] component.
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
## Monitoring and testing
There is no monitoring of this service, as it's not running continuously.
Fabric tasks *should* implement some form of unit testing. Ideally, we
would have 100% test coverage.
We use [pytest](https://www.pytest.org/) to write unit tests. To run the test suite, use:

    pytest-3 fabric_tpa

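
As an illustration of what such a unit test could look like, here is a minimal sketch that exercises the `uptime` example from the how-to without touching the network. It is not taken from the `fabric_tpa` test suite: it tests a plain, undecorated copy of the function, and the connection is a stand-in built with the standard library's `unittest.mock`:

```python
from unittest.mock import Mock


def uptime(con):
    """plain-function copy of the task body from the how-to"""
    return con.run('uptime')


def test_uptime():
    # stand-in for a Fabric Connection: run() returns a canned result
    con = Mock()
    con.run.return_value = Mock(stdout=' 18:06:56 up 24 days, 19:48\n')
    result = uptime(con)
    # the task must issue exactly one remote command...
    con.run.assert_called_once_with('uptime')
    # ...and hand back whatever the connection returned
    assert 'up 24 days' in result.stdout
```

Saved in a file whose name starts with `test_`, the function would be collected by pytest automatically.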
# Discussion
## Overview
There are multiple tasks in TPA that require manually copy-pasting
code from documentation to the shell or, worse, grepping backwards in
shell history to find the magic command (e.g. `ldapvi`). A lot of those jobs
are error-prone and hard to do correctly.

In the case of the installer, this leads to significant variation and
chaos in the installs, which results in instability and
inconsistencies between servers. It was determined that the installs
would be automated as part of [ticket 31239](https://trac.torproject.org/projects/tor/ticket/31239) and that analysis and
work is being done in [[new-machine]].
It was later realised that *other* areas were suffering from a similar
problem. The upgrade process, for example, was mostly manual
until ad-hoc shell scripts were written. But unfortunately now we have
*many* shell scripts, none of which work correctly. So work started on
automating reboots as part of [ticket 33406](https://trac.torproject.org/projects/tor/ticket/33406).
And then it was time to migrate the second libvirt server to
[[ganeti]] (unifolium/kvm2, [ticket 33085](https://trac.torproject.org/projects/tor/ticket/33085)) and by then it was
clear some more generic solution was required. An [attempt](https://gitlab.com/anarcat/tpa-ansible-libvirt-ganeti-importer) to
implement this work in Ansible only led to frustration at the
complexity of the task, so tests were started on Fabric instead, with
positive results. A few weeks later, a library of functions was available
and the migration procedure was almost entirely automated.
## Goals
### Must have
* ease of use - it should be easy to write new tasks and to
understand existing ones
* operation on multiple servers - many of the tricky tasks we need to
do operate on multiple servers *synchronously*, something that, for
example, is hard to do in Puppet
### Nice to have
* long term maintenance - this should not be Legacy Code and must be
unit tested, at least for parts that are designed to stay in the
long term (e.g. not the libvirt importer)
### Non-Goals
* sharing with the community - it is assumed that those are tasks too
site-specific to be reused by other groups, although the code is
still shared publicly. Shared code belongs in Puppet.
* performance - this does not need to be high performance, as those
tasks are done rarely
## Approvals required
TPA. Approved in [[tsa/meeting/2020-03-09/]].
## Proposed Solution
We are testing Fabric.
## Cost
Time and labor.
## Alternatives considered
### ansible
* [Ansible](https://www.ansible.com/) makes easy things easy and scalable, but makes it hard
to do hard stuff
* for example, how would you do a disk inventory and pass it to
  another host to recreate those disks? for an Ansible novice like
  me, it's far from trivial. it probably implies something like this
  [dictionary type][dictionnary type] but in Fabric, it's:

        json.loads(con.run('qemu-img info --output=json %s' % disk_path).stdout)

  Any person somewhat familiar with Python can tell what this does.
* we use Puppet for high-level configuration management, and Ansible
conflicts with that problem space, leading to higher cognitive load
[dictionnary type]: https://docs.ansible.com/ansible/latest/plugins/lookup/dict.html
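
The Fabric one-liner quoted above can be expanded into a two-host sketch: read the image metadata on the source host, then use it on the destination. The function names are hypothetical and this is not the code from `fabric_tpa`. It assumes `qemu-img` is installed on both hosts and relies on the `format` and `virtual-size` keys of its JSON output:

```python
import json
import shlex


def disk_info(con, disk_path):
    """Return the parsed `qemu-img info` metadata for a remote disk image."""
    command = 'qemu-img info --output=json %s' % shlex.quote(disk_path)
    return json.loads(con.run(command, hide=True).stdout)


def recreate_disk(con, disk_path, info):
    """Create an empty image on another host with the same format and size."""
    command = 'qemu-img create -f %s %s %d' % (
        shlex.quote(info['format']),
        shlex.quote(disk_path),
        info['virtual-size'],  # size in bytes, as reported by qemu-img
    )
    return con.run(command)
```

With two connections, `recreate_disk(destination, path, disk_info(source, path))` passes the inventory from one host to the other as an ordinary Python dictionary.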
### mcollective
* [MCollective][] was (it's deprecated) a tool that could be used to
fire jobs on Puppet nodes from the Puppet master
[MCollective]: https://puppet.com/docs/mcollective/
* Not relevant for our use case because we want to bootstrap Puppet
(in which case Puppet is not available yet) or retire Puppet (in
which case it will go away).
### bolt
* [Bolt][] is interesting because it *can* be used to [bootstrap
Puppet][]
* Unfortunately, it does *not* reuse the Puppet primitives and
instead Bolt "tasks" are just arbitrary commands, usually shell
commands (e.g. [this task][]) along with a [copious][]
[amount][] of JSON metadata
[amount]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.json
[copious]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install.json
[this task]: https://github.com/puppetlabs/puppetlabs-puppet_agent/blob/master/tasks/install_shell.sh
[Bolt]: https://puppet.com/docs/bolt/
[bootstrap Puppet]: https://github.com/puppetlabs/puppetlabs-puppet_agent#puppet_agentinstall
* does *not* have much privileged access to PuppetDB or the Puppet CA
infrastructure: that needs to be [bolted on][] by hand
[bolted on]: https://puppet.com/docs/bolt/latest/bolt_connect_puppetdb.html
### Doing things by hand
* timing is sometimes critical
* sets best practices in code instead of in documentation
* makes recipes easily reusable
### Another custom Python script
* is it `subprocess.check_output`? or `check_call`? or `run`? what if
you want both the output and the status code? can you remember?
* argument parsing code built-in, self-documenting code
* exposes Python functions as commandline jobs
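
For reference, the standard library answer to the question above is `subprocess.run()`, which returns both the output and the status code, but only for local commands and with several flags to remember. A minimal sketch, assuming Python 3.7 or later (the child command is a made-up example):

```python
import subprocess
import sys

# run a child process that prints something and exits with status 3
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print("hello"); sys.exit(3)'],
    capture_output=True,  # collect stdout and stderr instead of inheriting them
    text=True,            # decode the output to str
    check=False,          # do not raise on a non-zero exit status
)
print(result.returncode, repr(result.stdout))
```

Fabric's `con.run()` returns a comparable result object (with `stdout`, `stderr` and `exited` attributes) for remote commands.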
### Shell scripts
* hard to reuse
* hard to read, audit
* missing a lot of basic programming primitives (hashes, objects,
etc)
* no unit testing out of the box
### Perl
* notoriously hard to read
TPA uses [Puppet](https://puppet.com/) to manage all servers it operates. It handles
most of the configuration management of the base operating system and
some services. It is *not* designed to handle ad-hoc tasks, for which
we favor the use of [[fabric]].
[[!toc levels=3]]