## Disaster recovery

<a name="Restoring-the-directory-server"></a>

Also known as "Restoring the directory server".

### Restoring the directory server

If the storage daemon disappears catastrophically, there's nothing we
can do: the data is lost. But if the *director* disappears, we can

[...]

TODO: some psql users still refer to host-specific usernames like
`bacula-dictyotum-reader`, maybe they should refer to role-specific
names instead?

#### Troubleshooting

If you get this error:

[...]

If the director takes a long time to start and ultimately fails with:

[...]

It's because you forgot to reset the director password, in step 9.

### Recovering deleted files

This is not specific to the backup server, but could be seen as a
(no)backup/restore situation, and besides, not sure where else this
would fit.

If a file was deleted by mistake *and* it is gone from the backup
server, not all is lost. This is the story of how an entire PostgreSQL
cluster was deleted in production, then, 7 days later, from the backup
servers. Files were completely gone from the filesystem, both on the
production server and on the backup server, see [issue 41388](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/41388/).

In the following, we'll assume you're working on files deleted
multiple days in the past. For files deleted more recently, you might
have better luck with [ext4magic](https://manpages.debian.org/bookworm/ext4magic/ext4magic.8.en.html), which can tap into the journal
to find recently deleted files more easily. Example commands you might
try:

    umount /srv/backup/pg
    extundelete --restore-all /dev/mapper/vg_bulk-backups--pg
    ext4magic /dev/vg_bulk/backups-pg -f weather-01-13
    ext4magic /dev/vg_bulk/backups-pg -RQ -f weather-01-13
    ext4magic /dev/vg_bulk/backups-pg -Lx -f weather-01-13
    ext4magic /dev/mapper/vg_bulk-backups--pg -b $(date -d "2023-11-01 12:00:00" +%s) -a $(date -d "2023-10-30 12:00:00" +%s) -l

In this case, we're actually going to scrub the entire "free space"
area of the disk to hunt for file signatures.

1. unmount the affected filesystem:

        umount /srv/backup/pg

2. start `photorec`, part of the [testdisk package](https://tracker.debian.org/pkg/testdisk):

        photorec /dev/mapper/vg_bulk-backups--pg

3. this will get you into an interactive interface; there you should
   choose to inspect free space and leave most options as is, although
   you should probably only select `tar` and `gz` files to
   restore. pick a directory with a lot of free space to restore
   to. (a scripted alternative to this interactive session is sketched
   after this list.)

4. start the procedure. `photorec` will inspect the entire disk
   looking for signatures. in this case we're assuming we will be
   able to restore the "BASE" backups.

5. once `photorec` starts reporting it found `.gz` files, you can
   already start inspecting those, for example with this shell rune:

        for file in recup_dir.*/*gz; do
            tar -O -x -z -f "$file" backup_label 2>/dev/null \
                | grep weather && ls -alh "$file"
        done

   here we're iterating over all restored files in the current
   directory (`photorec` puts files in `recup_dir.N` directories,
   where `N` is some arbitrary-looking integer), trying to decompress
   the file, ignoring errors because restored files are typically
   truncated or padded with garbage, then extracting only the
   `backup_label` file to stdout, and looking for the hostname (in
   this case `weather`) and, if it matches, listing the file size
   (phew!)

6. once the recovery is complete, you will end up with a ton of
   recovered files. using the above pipeline, you might be lucky and
   find a base backup that makes sense. copy those files over to the
   actual server (or a new one), e.g. (assuming you set up SSH keys
   right):

        rsync --progress /srv/backups/bacula/recup_dir.20/f3005349888.gz root@weather-01.torproject.org:/srv

7. then, on the target server, restore that file to a directory with
   enough disk space:

        mkdir f1959051264
        cd f1959051264/
        tar zfx ../f1959051264.gz

8. inspect the backup to verify its integrity (postgresql backups
   have a manifest that can be checked; the `-n` flag skips WAL
   parsing, since we don't have the WAL files anyway):

        /usr/lib/postgresql/13/bin/pg_verifybackup -n .

   Here's an example of a working backup, even if `gzip` and `tar`
   complain about the archive itself:

        root@weather-01:/srv# mkdir f1959051264
        root@weather-01:/srv# cd f1959051264/
        root@weather-01:/srv/f1959051264# tar zfx ../f1959051264.gz

        gzip: stdin: decompression OK, trailing garbage ignored
        tar: Child returned status 2
        tar: Error is not recoverable: exiting now
        root@weather-01:/srv/f1959051264# cd ^C
        root@weather-01:/srv/f1959051264# du -sch .
        39M .
        39M total
        root@weather-01:/srv/f1959051264# ls -alh ../f1959051264.gz
        -rw-r--r-- 1 root root 3.5G Nov 8 17:14 ../f1959051264.gz
        root@weather-01:/srv/f1959051264# cat backup_label
        START WAL LOCATION: E/46000028 (file 000000010000000E00000046)
        CHECKPOINT LOCATION: E/46000060
        BACKUP METHOD: streamed
        BACKUP FROM: master
        START TIME: 2023-10-08 00:51:04 UTC
        LABEL: bungei.torproject.org-20231008-005104-weather-01.torproject.org-main-13-backup
        START TIMELINE: 1

   and it's quite promising, that thing, actually:

        root@weather-01:/srv/f1959051264# /usr/lib/postgresql/13/bin/pg_verifybackup -n .
        backup successfully verified

9. disable Puppet. you're going to mess with stopping and starting
   services and you don't want it in the way:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2023-11-08 tpo/tpa/team#41388'

10. install the right PostgreSQL server (we're entering the actual
    PostgreSQL restore procedure here, getting out of scope):

        apt install postgresql-13

11. move the cluster out of the way:

        mv /var/lib/postgresql/13/main{,.orig}

12. restore files, making sure the entire tree belongs to `postgres`:

        rsync -a ./ /var/lib/postgresql/13/main/
        chown -R postgres:postgres /var/lib/postgresql/13/main/
        chmod 750 /var/lib/postgresql/13/main/

13. create a recovery.conf file and tweak the postgres configuration
    (the `recovery.signal` file puts the server in recovery mode,
    which requires a `restore_command`; `true` is a no-op, since we
    have no WAL archive to restore from):

        echo "restore_command = 'true'" > /etc/postgresql/13/main/conf.d/recovery.conf
        touch /var/lib/postgresql/13/main/recovery.signal
        rm /var/lib/postgresql/13/main/backup_label

        echo max_wal_senders = 0 > /etc/postgresql/13/main/conf.d/wal.conf
        echo hot_standby = no >> /etc/postgresql/13/main/conf.d/wal.conf

14. reset the WAL (Write Ahead Log) since we don't have those (this
    implies possible data loss, but we're already missing a lot of
    WALs since we're restoring to a past base backup anyway):

        sudo -u postgres /usr/lib/postgresql/13/bin/pg_resetwal -f /var/lib/postgresql/13/main/

15. cross your fingers, pray to the flying spaghetti monster, and
    start the server:

        systemctl start postgresql@13-main.service & journalctl -u postgresql@13-main.service -f

16. if you're extremely lucky, it will start and then you should be
    able to dump the database and restore it in the new cluster:

        sudo -u postgres pg_dumpall -p 5433 | pv > /srv/dump/dump.sql
        sudo -u postgres psql < /srv/dump/dump.sql

    DO NOT USE THE DATABASE AS IS! Only dump the content and restore
    it in a new cluster. (see the note after this list about the two
    clusters involved here.)

17. if all goes well, clear out the old cluster, and restart Puppet;
    for example, something like the sketch below:
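
    This sketch hasn't been tested as such; it assumes the paths and
    configuration snippets from the previous steps:

        # remove the cluster we moved aside in step 11
        rm -rf /var/lib/postgresql/13/main.orig
        # drop the recovery overrides created in step 13
        rm -f /etc/postgresql/13/main/conf.d/recovery.conf \
              /etc/postgresql/13/main/conf.d/wal.conf
        # hand control back to Puppet
        puppet agent --enable
        puppet agent --test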
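
A note on steps 2 to 4 above: `photorec` also has a scripted `/cmd`
mode that can, in principle, replace the interactive session. The
following is an untested sketch based on the [PhotoRec documentation](https://www.cgsecurity.org/wiki/PhotoRec);
double-check the keyword list before relying on it, and note that
`/srv/recovery/recup` is just an example destination directory:

    # unverified sketch: enable only tar and gz signatures, carve free space only
    photorec /log /d /srv/recovery/recup /cmd /dev/mapper/vg_bulk-backups--pg \
        fileopt,everything,disable,tar,enable,gz,enable,freespace,search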
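
A note on step 16: the commands there assume two clusters, the
restored one (listening on port 5433 in this example) and a fresh one
to load the dump into (on the default port 5432); check
`pg_lsclusters` to see which ports your clusters actually use. If the
fresh cluster doesn't exist yet, creating it could look like this
sketch, using Debian's `postgresql-common` tools (the cluster name
`fresh` is made up):

    # hypothetical: initialize an empty cluster and load the dump into it
    pg_createcluster 13 fresh --start
    sudo -u postgres psql --cluster 13/fresh -f /srv/dump/dump.sql
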
# Reference

## Installation