## Disaster recovery

<a name="Restoring-the-directory-server"></a>

Also known as "Restoring the directory server".

### Restoring the directory server

If the storage daemon disappears catastrophically, there's nothing we
can do: the data is lost. But if the *director* disappears, we can

[...]

TODO: some psql users still refer to host-specific usernames like
`bacula-dictyotum-reader`, maybe they should refer to role-specific
names instead?

#### Troubleshooting

If you get this error:

[...]

If the director takes a long time to start and ultimately fails with:

[...]

It's because you forgot to reset the director password, in step 9.

### Recovering deleted files

This is not specific to the backup server, but could be seen as a
(no)backup/restore situation, and besides, not sure where else this
would fit.

If a file was deleted by mistake *and* it is gone from the backup
server, not all is lost. This is the story of how an entire PostgreSQL
cluster was deleted in production, then, 7 days later, from the backup
servers. Files were completely gone from the filesystem, both on the
production server and on the backup server, see [issue 41388](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/41388/).

In the following, we'll assume you're working on files deleted
multiple days in the past. For files deleted more recently, you might
have better luck with [ext4magic](https://manpages.debian.org/bookworm/ext4magic/ext4magic.8.en.html), which can tap into the journal
to find recently deleted files more easily. Example commands you might
try:

    umount /srv/backup/pg
    extundelete --restore-all /dev/mapper/vg_bulk-backups--pg
    ext4magic /dev/vg_bulk/backups-pg -f weather-01-13
    ext4magic /dev/vg_bulk/backups-pg -RQ -f weather-01-13
    ext4magic /dev/vg_bulk/backups-pg -Lx -f weather-01-13
    ext4magic /dev/mapper/vg_bulk-backups--pg -b $(date -d "2023-11-01 12:00:00" +%s) -a $(date -d "2023-10-30 12:00:00" +%s) -l

In this case, we're actually going to scrub the entire "free space"
area of the disk to hunt for file signatures.

1. unmount the affected filesystem:

        umount /srv/backup/pg

2. start `photorec`, part of the [testdisk package](https://tracker.debian.org/pkg/testdisk):

        photorec /dev/mapper/vg_bulk-backups--pg

3. this will get you into an interactive interface; there you should
   choose to inspect free space and leave most options as is, although
   you should probably only select `tar` and `gz` files to
   restore. pick a directory with a lot of free space to restore
   to. (a scripted alternative to this interactive session is sketched
   after this list.)

4. start the procedure. `photorec` will inspect the entire disk
   looking for signatures. in this case we're assuming we will be
   able to restore the "BASE" backups.

5. once `photorec` starts reporting it found `.gz` files, you can
   already start inspecting those, for example with this shell rune:

        for file in recup_dir.*/*gz; do
            tar -O -x -z -f "$file" backup_label 2>/dev/null \
                | grep weather && ls -alh "$file"
        done

   here we're iterating over all restored files in the current
   directory (`photorec` puts files in `recup_dir.N` directories,
   where `N` is some arbitrary-looking integer), trying to decompress
   the file, ignoring errors because restored files are typically
   truncated or padded with garbage, then extracting only the
   `backup_label` file to stdout, and looking for the hostname (in
   this case `weather`) and, if it matches, listing the file size
   (phew!)

6. once the recovery is complete, you will end up with a ton of
   recovered files. using the above pipeline, you might be lucky and
   find a base backup that makes sense. copy those files over to the
   actual server (or a new one), e.g. (assuming you set up SSH keys
   right):

        rsync --progress /srv/backups/bacula/recup_dir.20/f3005349888.gz root@weather-01.torproject.org:/srv

7. then, on the target server, restore that file to a directory with
   enough disk space:

        mkdir f1959051264
        cd f1959051264/
        tar zfx ../f1959051264.gz

8. inspect the backup to verify its integrity (postgresql backups
   have a manifest that can be checked; the `-n` flag skips WAL
   parsing, since we don't have the WAL files anyway):

        /usr/lib/postgresql/13/bin/pg_verifybackup -n .

   Here's an example of a working backup, even if `gzip` and `tar`
   complain about the archive itself:

        root@weather-01:/srv# mkdir f1959051264
        root@weather-01:/srv# cd f1959051264/
        root@weather-01:/srv/f1959051264# tar zfx ../f1959051264.gz

        gzip: stdin: decompression OK, trailing garbage ignored
        tar: Child returned status 2
        tar: Error is not recoverable: exiting now
        root@weather-01:/srv/f1959051264# cd ^C
        root@weather-01:/srv/f1959051264# du -sch .
        39M .
        39M total
        root@weather-01:/srv/f1959051264# ls -alh ../f1959051264.gz
        -rw-r--r-- 1 root root 3.5G Nov 8 17:14 ../f1959051264.gz
        root@weather-01:/srv/f1959051264# cat backup_label
        START WAL LOCATION: E/46000028 (file 000000010000000E00000046)
        CHECKPOINT LOCATION: E/46000060
        BACKUP METHOD: streamed
        BACKUP FROM: master
        START TIME: 2023-10-08 00:51:04 UTC
        LABEL: bungei.torproject.org-20231008-005104-weather-01.torproject.org-main-13-backup
        START TIMELINE: 1

   and it's quite promising, that thing, actually:

        root@weather-01:/srv/f1959051264# /usr/lib/postgresql/13/bin/pg_verifybackup -n .
        backup successfully verified

9. disable Puppet. you're going to mess with stopping and starting
   services and you don't want it in the way:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2023-11-08 tpo/tpa/team#41388'

10. install the right PostgreSQL server (we're entering the actual
    PostgreSQL restore procedure here, getting out of scope):

        apt install postgresql-13

11. move the cluster out of the way:

        mv /var/lib/postgresql/13/main{,.orig}

12. restore files, making sure the entire tree belongs to `postgres`:

        rsync -a ./ /var/lib/postgresql/13/main/
        chown -R postgres:postgres /var/lib/postgresql/13/main/
        chmod 750 /var/lib/postgresql/13/main/

13. create a recovery.conf file and tweak the postgres configuration
    (the `recovery.signal` file puts the server in recovery mode,
    which requires a `restore_command`; `true` is a no-op, since we
    have no WAL archive to restore from):

        echo "restore_command = 'true'" > /etc/postgresql/13/main/conf.d/recovery.conf
        touch /var/lib/postgresql/13/main/recovery.signal
        rm /var/lib/postgresql/13/main/backup_label

        echo max_wal_senders = 0 > /etc/postgresql/13/main/conf.d/wal.conf
        echo hot_standby = no >> /etc/postgresql/13/main/conf.d/wal.conf

14. reset the WAL (Write Ahead Log) since we don't have those (this
    implies possible data loss, but we're already missing a lot of
    WALs since we're restoring to a past base backup anyway):

        sudo -u postgres /usr/lib/postgresql/13/bin/pg_resetwal -f /var/lib/postgresql/13/main/

15. cross your fingers, pray to the flying spaghetti monster, and
    start the server:

        systemctl start postgresql@13-main.service & journalctl -u postgresql@13-main.service -f

16. if you're extremely lucky, it will start and then you should be
    able to dump the database and restore it in the new cluster:

        sudo -u postgres pg_dumpall -p 5433 | pv > /srv/dump/dump.sql
        sudo -u postgres psql < /srv/dump/dump.sql

    DO NOT USE THE DATABASE AS IS! Only dump the content and restore
    it in a new cluster. (see the note after this list about the two
    clusters involved here.)

17. if all goes well, clear out the old cluster, and restart Puppet;
    for example, something like the sketch below:
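
    This sketch hasn't been tested as such; it assumes the paths and
    configuration snippets from the previous steps:

        # remove the cluster we moved aside in step 11
        rm -rf /var/lib/postgresql/13/main.orig
        # drop the recovery overrides created in step 13
        rm -f /etc/postgresql/13/main/conf.d/recovery.conf \
              /etc/postgresql/13/main/conf.d/wal.conf
        # hand control back to Puppet
        puppet agent --enable
        puppet agent --test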
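
A note on steps 2 to 4 above: `photorec` also has a scripted `/cmd`
mode that can, in principle, replace the interactive session. The
following is an untested sketch based on the [PhotoRec documentation](https://www.cgsecurity.org/wiki/PhotoRec);
double-check the keyword list before relying on it, and note that
`/srv/recovery/recup` is just an example destination directory:

    # unverified sketch: enable only tar and gz signatures, carve free space only
    photorec /log /d /srv/recovery/recup /cmd /dev/mapper/vg_bulk-backups--pg \
        fileopt,everything,disable,tar,enable,gz,enable,freespace,search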
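
A note on step 16: the commands there assume two clusters, the
restored one (listening on port 5433 in this example) and a fresh one
to load the dump into (on the default port 5432); check
`pg_lsclusters` to see which ports your clusters actually use. If the
fresh cluster doesn't exist yet, creating it could look like this
sketch, using Debian's `postgresql-common` tools (the cluster name
`fresh` is made up):

    # hypothetical: initialize an empty cluster and load the dump into it
    pg_createcluster 13 fresh --start
    sudo -u postgres psql --cluster 13/fresh -f /srv/dump/dump.sql
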
# Reference

## Installation