Commit 47f8bb84 authored by anarcat

expand on failure modes a bit

parent 57eb6099
......@@ -296,19 +296,50 @@ downtime, because users and passwords are *copied* over to all
hosts. In other words, authentication doesn't rely on the LDAP server
being up.
The `ud-ldap` software, on the other hand, is a little more
complicated and can be hard to diagnose. It has a large number of
moving parts (Python, Perl, Bash, Shell scripts) and talks over a
large number of protocols (email, DNS, HTTPS, SSH, finger). The
failure modes documented here are far from exhaustive and you should
expect exotic failures and error messages.
### LDAP server failure
That said, if the LDAP server goes down, password changes will not
work, and the server inventory (at <https://db.torproject.org/>) will
be gone. A mitigation is to use Puppet manifests and/or PuppetDB to
get a host list and server inventory, see the [Puppet
documentation](puppet) for details.
In general, OpenLDAP is very stable and rarely crashes, so we
haven't had many emergency scenarios with it yet. If anything
happens, first make sure the `slapd` service is running.
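The check above can be sketched as a small shell helper. This is a minimal sketch, assuming the LDAP server is managed by systemd and that the unit is named `slapd` like the service; the `check_slapd` function name is hypothetical and not part of `ud-ldap`.

```shell
#!/bin/sh
# Hypothetical helper: report whether the slapd unit is active.
# Assumes a systemd host where the OpenLDAP unit is called "slapd".
check_slapd() {
    if systemctl is-active --quiet slapd; then
        echo "slapd is running"
    else
        echo "slapd is NOT running"
        return 1
    fi
}
```

If the helper reports a failure, `systemctl status slapd` and `journalctl -u slapd` are the usual next places to look for the reason, again assuming systemd.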
### Git server failure
The `ud-ldap` software, on the other hand, is a little more
complicated and can be hard to diagnose. TODO: expand on the failure
modes.
The LDAP server will fail to regenerate (and therefore update) zone
files and zone records if the Git server is unavailable. This is
described in [issue 33766](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33766). The fix is to recover the Git server. A
workaround is to run this command on the primary DNS server (currently
`nevii`):

    sudo -u dnsadm /srv/dns.torproject.org/bin/update --force

### ud-replicate failures
TODO: I seem to recall `ud-replicate` failing somehow, possibly
because of SSH multiplexing or something?
### Dependency loop on new installs
Installing a new server requires granting the new server access to
various machines, including [puppet](puppet) and the LDAP server
itself. This is granted ... by Puppet through LDAP!
So a server cannot register itself on the LDAP server and needs an
operator to first create a `host` snippet on the LDAP server, and then
run Puppet on the Puppet server. This is documented in the
[installation notes](new-machine).
## Disaster recovery
......@@ -316,7 +347,11 @@ The LDAP server is mostly built by hand and should therefore be
restored from backups in case of a catastrophic failure. Care should
be taken to keep the SSH keys of the server intact.
TODO: analyse <https://gitlab.torproject.org/tpo/tpa/team/-/issues/33908>.
The IP address (and name?) of the LDAP server should not be hardcoded
anywhere. When the server was last renumbered ([issue 33908](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33908)), the
only changes necessary were on the server itself, in `/etc`. So in
theory, a fresh new server could be deployed (from backups) in a new
location (and new address) without having to do much.
# Reference
## Installation
......