Note that [pg_squeeze](https://github.com/cybertec-postgresql/pg_squeeze) is another alternative to `pg_repack`, but
it isn't available in Debian.

## Pager playbook

### BASE-IS-OLD

Our Nagios check warns us when a backup server has not successfully
fetched a base backup recently. The usual causes are that the
postgres server or the backup host was down (or unreachable) at the
time of the weekly cron job, or that the TLS certificate was renewed
without the postgres daemon having been restarted.

To re-run a base backup for a specific cluster, log into the backup
server(s) (currently `bungei`), then find the port for the affected
cluster:

    tail /usr/local/bin/postgres-make-base-backups

Then run this backup in a screen:

    sudo -u torbackup /usr/local/bin/postgres-make-base-backups <host>:<port>

### MISSING-BASE

WARNING: this procedure was taken from the [DSA wiki](https://dsa.debian.org/howto/postgres-backup/) and is
untested.

If you get a Nagios warning like this:

    [fasolo, dak] MISSING-BASE: dak.BASE.backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup.tar.gz

This means that we started doing a base backup (as witnessed by a
`.backup` file next to a WAL), but for some reason we don't have the
corresponding base file:

    root@backuphost:/srv/backups/pg/fasolo# ls -l *backup*
    -rw------- 1 debbackup debbackup 9201093916 Jan 14 06:18 dak.BASE.backuphost.debian.org-20180114-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9227651542 Jan 21 06:25 dak.BASE.backuphost.debian.org-20180121-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9266306750 Jan 28 07:59 dak.BASE.backuphost.debian.org-20180128-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9312602089 Feb  5 11:00 dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9346830509 Feb 12 10:25 dak.BASE.backuphost.debian.org-20180212-094930-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup        353 Jan 14 06:18 dak.WAL.0000000100000033000000A6.00000028.backup
    -rw------- 1 debbackup debbackup        350 Jan 20 11:20 dak.WAL.00000001000000350000008C.00000028.backup
    -rw------- 1 debbackup debbackup        353 Jan 21 06:25 dak.WAL.000000010000003600000068.00000028.backup
    -rw------- 1 debbackup debbackup        353 Jan 28 07:59 dak.WAL.0000000100000038000000E3.00000028.backup
    -rw------- 1 debbackup debbackup        353 Feb  5 11:00 dak.WAL.000000010000003B00000090.00000028.backup
    -rw------- 1 debbackup debbackup        350 Feb  5 15:49 dak.WAL.000000010000003B0000009B.00000108.backup
    -rw------- 1 debbackup debbackup        353 Feb 11 10:09 dak.WAL.000000010000003D000000AC.00000028.backup
    -rw------- 1 debbackup debbackup        353 Feb 12 10:25 dak.WAL.000000010000003E00000027.00000178.backup

`.backup` files are created on the postgres server and shipped to the
backup hosts whenever a base backup is initiated. We do some
labelling, so we know which backup host the corresponding tarball
should end up on. For example:

    root@backuphost:/srv/backups/pg/fasolo# cat dak.WAL.000000010000003B00000090.00000028.backup
    START WAL LOCATION: 3B/90000028 (file 000000010000003B00000090)
    STOP WAL LOCATION: 3B/97CF2138 (file 000000010000003B00000097)
    CHECKPOINT LOCATION: 3B/90000098
    BACKUP METHOD: streamed
    BACKUP FROM: master
    START TIME: 2018-02-05 10:25:28 UTC
    LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup
    STOP TIME: 2018-02-05 10:59:50 UTC

To fix this, verify that we have a later base tarball, or that we are
fine for some other reason, and remove the corresponding `.backup`
file from the backup host. In the case above, we would remove
`dak.WAL.000000010000003D000000AC.00000028.backup`.
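
The label-matching described above can be sketched as a quick check
for orphaned `.backup` files. This is a sketch run on synthetic
filenames in a scratch directory, not the real
`/srv/backups/pg/<host>`; the `dak.` prefix and label format are taken
from the listing above:

```shell
#!/bin/sh
# Flag ".backup" label files whose corresponding BASE tarball is absent.
# Synthetic data modeled on the listing above; review before trusting.
set -eu
dir=$(mktemp -d)
cd "$dir"
# a label whose tarball did arrive...
printf 'LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup\n' \
    > dak.WAL.000000010000003B00000090.00000028.backup
touch dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
# ... and one whose tarball is missing
printf 'LABEL: backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup\n' \
    > dak.WAL.000000010000003D000000AC.00000028.backup
for f in *.backup; do
    label=$(sed -n 's/^LABEL: //p' "$f")
    [ -e "dak.BASE.$label.tar.gz" ] || echo "orphan: $f"
done
```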

### WAL-MISSING-AFTER

Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`: we're missing `AE`, `AF`,
`B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that something
happened between 6:26 and 13:57 on that day.
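
A quick way to spot such gaps mechanically is to compare the low 8 hex
digits of consecutive segment names. This is a bash sketch run on the
filenames from the listing above; it ignores log-file rollover (when
the middle 8 digits change), so treat it as a hint, not a proof:

```shell
#!/bin/bash
# Report gaps in a WAL segment sequence by checking that the low 8 hex
# digits of consecutive names increment by one. Ignores log rollover.
wal_gaps() {
    local prev="" seg f
    for f in "$@"; do
        seg=$((16#${f: -8}))   # low 8 hex digits of the segment name
        if [[ -n "$prev" && "$seg" -ne "$((prev + 1))" ]]; then
            echo "gap: $((seg - prev - 1)) segment(s) missing before $f"
        fi
        prev=$seg
    done
}
# the listing from the example above
wal_gaps main.WAL.00000001000000D9000000A{A,B,C,D} \
         main.WAL.00000001000000D9000000B5
# → gap: 7 segment(s) missing before main.WAL.00000001000000D9000000B5
```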

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` and the filename from the
   warning (`main.WAL.00000001000000D9000000AD` above)

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, the gap
     is harmless, to the extent that the missing timeframe is not
     needed. To clear the warnings, the WAL files *preceding* the gap
     *do* need to be removed, by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup integrity is compromised, and a new base backup needs
     to be performed. See [Running a full
     backup](#running-a-full-backup) above.

Here is an example of the second case above, and its recovery:

    root@bungei:~# cd /srv/backups/pg/meronense
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] WAL-MISSING-AFTER: meronense/main.WAL.00000001000005A900000073
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz
    root@bungei:/srv/backups/pg/meronense# ls -altr
    total 98312
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A90000006F
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A900000070
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:58 main.WAL.00000001000005A900000071
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000072
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000073
    -rw------- 1 torbackup torbackup 16777216 Jun 23 14:14 main.WAL.00000001000005A90000007F
    drwxr-xr-x 10 torbackup torbackup     4096 Jun 23 14:14 ..
    drwxr-xr-x  2 torbackup torbackup     4096 Jun 23 14:14 .
    root@bungei:/srv/backups/pg/meronense# rm main.WAL.00000001000005A90000006F main.WAL.00000001000005A900000070 main.WAL.00000001000005A900000071 main.WAL.00000001000005A900000072 main.WAL.00000001000005A900000073
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz

There was a gap between `main.WAL.00000001000005A900000073` and
`main.WAL.00000001000005A90000007F`, so everything up to and
including the former was removed by hand. Then a full backup was
performed. The `BASE` backup was missing because this followed a
failed upgrade (see [tpo/tpa/team#40809](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40809)).
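
The manual `rm` above can be tedious with many segments; listing the
removal candidates mechanically is straightforward because WAL names
are fixed-width hex and sort lexically. A sketch on synthetic names
(always review the list before deleting anything):

```shell
#!/bin/sh
# List archived WAL files up to and including the last one before the
# gap (the file named in the WAL-MISSING-AFTER warning), as candidates
# for removal. Synthetic names stand in for /srv/backups/pg/<host>.
cutoff=main.WAL.00000001000005A900000073
stale=$(printf '%s\n' \
    main.WAL.00000001000005A90000006F \
    main.WAL.00000001000005A900000070 \
    main.WAL.00000001000005A90000007F \
    | awk -v c="$cutoff" '$0 <= c')
printf '%s\n' "$stale"
# → main.WAL.00000001000005A90000006F
# → main.WAL.00000001000005A900000070
```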

## Disaster recovery

If all else fails, we need to restore PostgreSQL from backups. We
have two ways of restoring PostgreSQL backups, **indirect** and
**direct**.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
direct procedure. It might also be slower, because files are copied
around twice: once from the backup server, and again when loaded into
the database.

The **direct** procedure *streams* the backup files directly from the
backup server. The `BASE` file is streamed into the `tar` command to
restore the base snapshot, then the PostgreSQL restore command copies
each log directly from the backup server as well. It requires a
minimal amount of space, as files are not copied to a temporary
location on the new server. The downside is that it might be more
difficult to deploy and diagnose, as it has more moving parts.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

<a name="indirect-restore-procedures" />

### Indirect restore procedures

Reminder: this procedure **copies** the backup files from the backup
server to the database server and restores from those.

1. First, you will need to give the backup server access to the new
   PostgreSQL server, which we'll call the "client" for now. Log
   into the client, allow the backup server to connect, and show the
   client's public host key:

        iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
        cat /etc/ssh/ssh_host_rsa_key.pub

2. Then, on the backup server, load its private host key into an
   agent, trust the client's host key, and show the backup server's
   public key, which will be allowed on the client in the next
   step. Assuming `$IP` is the IP of the client and `$HOSTKEY` is the
   client's host key:

        ssh-agent bash
        ssh-add /etc/ssh/ssh_host_rsa_key
        mkdir -p ~/.ssh
        echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
        cat /etc/ssh/ssh_host_rsa_key.pub

3. And on the client, allow the backup server's `$HOSTKEY` (the
   output of the above `cat /etc/ssh/ssh_host_rsa_key.pub` on the
   backup server):

        echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more

4. Then, on the backup server, we can transfer the files over to the
   client:

        cd /srv/backups/pg
        rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore

5. Disable Puppet so you have control over when PostgreSQL is running:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'

6. Then, on the client, install the software, stop the server and
   move the template cluster out of the way:

        apt install postgresql rsync
        service postgresql stop
        mv /var/lib/postgresql/*/main{,.orig}
        su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres

   We'll be restoring files into that `main` directory.

   Make sure you run the SAME MAJOR VERSION of PostgreSQL as the
   backup! You cannot restore across versions. This might mean
   installing from backports or an older version of Debian.

7. Then you need to find the right `BASE` file to restore from. Each
   `BASE` file has a timestamp in its filename, so just sorting them
   by name should be enough to find the latest one. Uncompress the
   `BASE` file in place:

        cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'

   (Use `pv` instead of `cat` for a progress bar with large backups.)

8. Make sure the `pg_xlog` directory doesn't contain any files.

9. Then you need to create a `recovery.conf` file in
   `/var/lib/postgresql/9.6/main` that will tell postgres where to
   find the WAL files. At least the `restore_command` needs to be
   specified. Something like this should work:

        restore_command = 'cp /var/lib/postgresql/restore/subnotabile/main.WAL.%f %p'

   You can specify a specific recovery point in the `recovery.conf`;
   see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.

10. Then start the server and look at the logs to follow the recovery
    process:

        service postgresql start
        tail -f /var/log/postgresql/*

11. Re-enable Puppet, which should clean up the extra SSH key and
    firewall rules:

        puppet agent -t

    Make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
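
For instance, a slightly fuller `recovery.conf` for step 9 could add
a recovery target. This is a sketch: the path component `$CLIENT` and
the timestamp are placeholders to fill in. Note that when setting
`recovery_target_time` you should also set `recovery_target_action`,
otherwise recovery ends up paused at the target (see the
`pg_promote()` discussion in the troubleshooting sections below):

```
restore_command = 'cp /var/lib/postgresql/restore/$CLIENT/main.WAL.%f %p'
recovery_target_time = '2019-10-01 00:00:00+00'
recovery_target_action = 'promote'
```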

When the restore succeeds, the `recovery.conf` file will be renamed to
`recovery.done` and you will see something like:

    2019-08-12 21:36:53.453 UTC [16901] LOG: selected new timeline ID: 2
    2019-08-12 21:36:53.470 UTC [16901] LOG: archive recovery complete
    cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
    2019-08-12 21:36:53.577 UTC [16901] LOG: MultiXact member wraparound protections are now enabled
    2019-08-12 21:36:53.584 UTC [16900] LOG: database system is ready to accept connections

Ignore the error from `cp` complaining about the `.history` file; it's
harmless.

#### Troubleshooting

If you find the following error in the logs:

    FATAL: could not locate required checkpoint record

It's because postgres cannot find the WAL logs to restore from. There
could be many causes for this, but the ones I stumbled upon were:

* wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
  probably the case if you don't see an error from `cp`
* wrong permissions on the archive (put the WAL files in `~postgres`,
  not `~root`)
* wrong path or pattern for `restore_command` (double-check the path
  and make sure to include the right prefix, e.g. `main.WAL`)

<a name="direct-restore-procedures" />

### Direct restore procedure

This is probably the procedure you want.

Reminder: this procedure **streams** the files from the backup server
to the database server and restores directly. See above for more
information on how to pick a restore procedure.

This procedure assumes you already have a server installed with
enough disk space to hold the data to recover. It also assumes the
server is configured with the `profile::postgres` class in Puppet,
since it uses a script from there to copy the WAL files from backup.

1. First, disable Puppet so you have control over when PostgreSQL is
   running:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'

2. Then install the software and stop the server:

        apt install postgresql-13
        service postgresql stop

   Make sure you run the **SAME MAJOR VERSION** of PostgreSQL as the
   backup! You cannot restore across versions. This might mean
   installing from backports or an older version of Debian.

3. On that new PostgreSQL server, show the `postgres` server public
   key, creating it if missing:

   found on the postgres server above. `$CLIENT` is the hostname of
   the server we are restoring to.

   NOTE: the above will not work if the key is already present in
   `/etc/ssh/userkeys/torbackup`, as that key will override the one in
   `.more`. Edit the key in there instead in that case.

5. Then you need to find the right `BASE` file to restore from. Each
   `BASE` file has a timestamp in its filename, so just sorting them
   by name should be enough to find the latest one.

   Decompress the `BASE` file in place, as the `postgres` user:

        sudo -u postgres -i
        mkdir -m 0700 /var/lib/postgresql/13/main
        ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER $BASEFILE \
            | sed '0,/^$/d' \
            | pv -s 34G \
            | tar -C /var/lib/postgresql/13/main -x -z -f -

   In the above, replace:

   * `$BACKUPSERVER` with the backup server name and username
     (currently `bungei.torproject.org`)

     or, fully expanded:
     `bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-13-backup.tar.gz`,
     where `$CLUSTERNAME` is the non-default `bacula`

   * `34G` with the size of the BASE file found on the backup server

   The above might hang for a while (yes, maybe even a minute) in the
   beginning, but it should eventually get started and complete.
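
The `sed '0,/^$/d'` stage in the pipeline above can be puzzling: it
deletes everything up to and including the first empty line. The
assumption (based on this pipeline) is that `retrieve-file` prepends a
header block terminated by a blank line before the raw tarball bytes;
a sketch with a synthetic payload:

```shell
#!/bin/sh
# GNU sed: the 0,/^$/ address range covers everything up to and
# including the first blank line, so `d` strips a leading header block.
body=$(printf 'Header: yes\nAnother: one\n\ntar-bytes-go-here\n' | sed '0,/^$/d')
echo "$body"
# → tar-bytes-go-here
```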

        2019-10-10 01:32:12.438 UTC [16950] LOG:  database system is ready to accept connections
        2019-10-10 01:32:12.439 UTC [26501] LOG:  autovacuum launcher started

   The server is now ready for use.

9. Remove the temporary SSH access on the backup server, either by
   removing the `.more` key file or restoring the previous key
   configuration:

        rm /etc/ssh/userkeys/torbackup.more

10. Re-enable Puppet:

        puppet agent -t

### Troubleshooting

See also the "Indirect restore procedures" troubleshooting section
above, which also applies here.

#### Base file copy

If the BASE file copy fails with:

    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured
to back up to this backup host, and therefore the backup host does
not have firewall rules allowing it to connect. You can fix this with
something like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept

... but really, you should put the host in Puppet with the
`profile::postgres` class, otherwise you will have more trouble later.

If all else fails, you can also use the indirect restore procedure to
copy the BASE file manually onto the server.

#### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.
    Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host

It's because the SSH keys deployed on the server do not have the
`--read-allowed` argument.

#### Manually copying WAL files

If the restore doesn't work, try to fetch a WAL file by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo

#### WAL was generated with wal_level=minimal, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG: incomplete startup packet
    2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:44.451 UTC [14316] WARNING: WAL was generated with wal_level=minimal, data may be missing
    2019-10-08 21:25:44.451 UTC [14316] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
    2019-10-08 21:25:44.451 UTC [14316] LOG: consistent recovery state reached at 0/153A3F0
    2019-10-08 21:25:44.451 UTC [14316] LOG: invalid record length at 0/153A3F0: wanted 24, got 0
    2019-10-08 21:25:44.451 UTC [14316] LOG: redo is not required
    2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.549 UTC [14316] LOG: selected new timeline ID: 6
    2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.765 UTC [14316] LOG: archive recovery complete
    2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:46.466 UTC [14316] LOG: MultiXact member wraparound protections are now enabled
    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

#### could not link file

TODO: this warning comes up from time to time, problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

#### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`.

If it does not show up in the logs, the server might just still be
recovering a WAL file, or recovery might be `pause`d.

You can confirm what the server is doing by looking at the processes;
for example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5
In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The hinted `pg_wal_replay_resume()` is not actually the right call to
use here, however: it would put the server back into recovery mode,
where it would start fetching WAL files again. That is useful for
replicated setups, but this is not such a case.

In the above scenario, a `recovery_target_time` was added but without
a `recovery_target_action`, which led the server to pause instead of
resuming normal operation.

The correct way to recover here is to call `pg_promote()`:

    sudo -u postgres psql -c 'SELECT pg_promote();'
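
Either way, you can verify the server actually left recovery mode
afterwards; `pg_is_in_recovery()` returns `f` once the server has been
promoted:

```shell
sudo -u postgres psql -tAc 'SELECT pg_is_in_recovery();'
```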
## Indirect backup recovery
|
|
|
|
|
|
<a name="indirect-restore-procedures" />
|
|
|
|
|
|
Reminder: this procedure **copies** the backup files from the backup
|
|
|
server to the database server and restores from those.
|
|
|
|
|
|
1. First, you will need to give the backup server access to the new
|
|
|
postgresql server, which we'll call the "client" for now. First,
|
|
|
login to the client and allow the backup server to connect, and
|
|
|
show the public hostkey:
|
|
|
|
|
|
iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
2. Then load the server's private key in an agent and show it to
|
|
|
allow on the client. On the server, assuming `$IP` is the IP of
|
|
|
the client and `$HOSTKEY` is its hostkey:
|
|
|
|
|
|
ssh-agent bash
|
|
|
ssh-add /etc/ssh/ssh_host_rsa_key
|
|
|
mkdir -p ~/.ssh
|
|
|
echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
3. And on the client, allow the server `$HOSTKEY` (the above `cat
|
|
|
/etc/ssh/ssh_host_rsa_key.pub` on the backup server):
|
|
|
|
|
|
echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more
|
|
|
|
|
|
4. Then, we can transfer files over from the backup server to the
|
|
|
client:
|
|
|
|
|
|
cd /srv/backups/pg
|
|
|
rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore
|
|
|
|
|
|
5. Disable Puppet so you have control on when PostgreSQL is running:
|
|
|
|
|
|
puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
|
|
|
|
|
|
6. Then, on the client, install the software, stop the server and
|
|
|
move the template cluster out of the way:
|
|
|
|
|
|
apt install postgres rsync
|
|
|
service postgresql stop
|
|
|
mv /var/lib/postgresql/*/main{,.orig}
|
|
|
su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres
|
|
|
|
|
|
We'll be restoring files in that `main` directory.
|
|
|
|
|
|
Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
|
|
|
backup! You cannot restore across versions. This might mean
|
|
|
installing from backports or an older version of Debian.
|
|
|
|
|
|
7. Then you need to find the right `BASE` file to restore from. Each
|
|
|
`BASE` file has a timestamp in its filename, so just sorting them
|
|
|
by name should be enough to find the latest one. Uncompress the
|
|
|
`BASE` file in place:
|
|
|
|
|
|
cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'
|
|
|
|
|
|
(Use `pv` instead of `cat` for a progress bar with large backups.)
|
|
|
|
|
|
8. Make sure the `pg_xlog` directory doesn't contain any files.
|
|
|
|
|
|
9. Then you need to create a `recovery.conf` file in
|
|
|
`/var/lib/postgresql/9.6/main` that will tell postgres where to
|
|
|
find the WAL files. At least the `restore_command` need to be
|
|
|
specified. Something like this should work:
|
|
|
|
|
|
restore_command = 'cp /var/lib/postgresql/restore/subnotabile/main.WAL.%f %p'
|
|
|
|
|
|
You can specify a specific recovery point in the `recovery.conf`,
|
|
|
see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.
|
|
|
|
|
|
10. Then start the server and look at the logs to follow the recovery
|
|
|
process:
|
|
|
|
|
|
service postgresql start
|
|
|
tail -f /var/log/postgresql/*
|
|
|
|
|
|
11. re-enable Puppet, which should clean up the extra SSH key and
|
|
|
firewall rules:
|
|
|
|
|
|
puppet agent -t
|
|
|
|
|
|
make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
|
|
|
|
|
|
When the restore succeeds, the `recovery.conf` file will be renamed to
|
|
|
`recovery.done` and you will see something like:
|
|
|
|
|
|
9. Remove the temporary SSH access on the backup server, either by
|
|
|
removing the `.more` key file or restoring the previous key
|
|
|
configuration:
|
|
|
2019-08-12 21:36:53.453 UTC [16901] LOG: selected new timeline ID: 2
|
|
|
2019-08-12 21:36:53.470 UTC [16901] LOG: archive recovery complete
|
|
|
cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
|
|
|
2019-08-12 21:36:53.577 UTC [16901] LOG: MultiXact member wraparound protections are now enabled
|
|
|
2019-08-12 21:36:53.584 UTC [16900] LOG: database system is ready to accept connections
|
|
|
|
|
|
rm /etc/ssh/userkeys/torbackup.more
|
|
|
Ignore the error from `cp` complaining about the `.history` file, it's
|
|
|
harmless.
|
|
|
|
|
|
10. re-enable Puppet:
|
|
|
### Troubleshooting
|
|
|
|
|
|
puppet agent -t
|
|
|
If you find the following error in the logs:
|
|
|
|
|
|
#### Troubleshooting
|
|
|
FATAL: could not locate required checkpoint record
|
|
|
|
|
|
See also the "Direct restore procedure" troubleshooting section, which
|
|
|
also applies here.
It's because postgres cannot find the WAL files to restore from. There
could be many causes for this, but the ones we stumbled upon were:

 * wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
   probably the case if you don't see an error from `cp`
 * wrong permissions on the archive (put the WAL files in `~postgres`,
   not `~root`)
 * wrong path or pattern in `restore_command` (double-check the path
   and make sure to include the right prefix, e.g. `main.WAL`)
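To rule out the last two causes, it can help to dry-run the copy that
the `restore_command` would perform. A self-contained sketch of that
check, using throwaway directories standing in for the real archive
and `pg_wal` paths (on a real server, run the `cp` itself as the
`postgres` user instead):

```shell
# Simulate what restore_command does: cp <archive>/main.WAL.%f %p
archive=$(mktemp -d)   # stands in for /var/lib/postgresql/restore/<host>
pg_wal=$(mktemp -d)    # stands in for the cluster's pg_wal directory

segment=000000010000000000000001
touch "$archive/main.WAL.$segment"

# the actual check: does the path + prefix pattern resolve to a file
# the server can read and copy?
if cp "$archive/main.WAL.$segment" "$pg_wal/$segment"; then
    echo "restore_command pattern looks sane"
else
    echo "check the path, the main.WAL prefix, and permissions"
fi
```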
#### Base file copy

If the BASE file copy fails with:
    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured to
back up to this backup host, and therefore the backup host does not
have firewall rules allowing it to connect. You can fix this with
something like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept

... but really, you should put the host in Puppet, with the
`profile::postgres` class, otherwise you will have more trouble later.

If all fails, you can also use the indirect restore procedure to copy
the BASE file manually on the server.

## Pager playbook

### BASE-IS-OLD

Our nagios check warns us when a backup server has not successfully
fetched a base backup recently. The causes often are that either the
postgres server or the backup host went down or was down during the
time of the weekly cronjob. The TLS certificate could have also been
renewed without the postgres daemon having been restarted.

To re-run a base backup for a specific cluster, log into the backup
server(s) (currently `bungei`), then find the port for the affected
cluster:

    tail /usr/local/bin/postgres-make-base-backups

Then run this backup in a screen:

    sudo -u torbackup /usr/local/bin/postgres-make-base-backups <host>:<port>
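The check essentially looks at the age of the newest `BASE` tarball. A
self-contained sketch of that logic, using a throwaway directory
standing in for `/srv/backups/pg/<host>/` on the backup server (the
one-week threshold mirrors the weekly cronjob):

```shell
# BASE-IS-OLD, in miniature: alarm when no BASE tarball is younger
# than the weekly schedule allows.
backups=$(mktemp -d)   # stands in for /srv/backups/pg/<host>
touch "$backups/main.BASE.example-host-20240101-main-13-backup.tar.gz"

if find "$backups" -name '*BASE*' -mtime -8 | grep -q .; then
    echo "base backup is fresh"
else
    echo "BASE-IS-OLD: re-run postgres-make-base-backups"
fi
```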
### MISSING-BASE
WARNING: this procedure was taken from the [DSA wiki](https://dsa.debian.org/howto/postgres-backup/) and is
untested.

If you get a Nagios warning like this:

    [fasolo, dak] MISSING-BASE: dak.BASE.backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup.tar.gz

This means that we started doing a base backup (as witnessed by a
`.backup` file next to a WAL), but for some reason we don't have the
corresponding base file.

    root@backuphost:/srv/backups/pg/fasolo# ls -l *backup*
    -rw------- 1 debbackup debbackup 9201093916 Jan 14 06:18 dak.BASE.backuphost.debian.org-20180114-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9227651542 Jan 21 06:25 dak.BASE.backuphost.debian.org-20180121-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9266306750 Jan 28 07:59 dak.BASE.backuphost.debian.org-20180128-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9312602089 Feb 5 11:00 dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9346830509 Feb 12 10:25 dak.BASE.backuphost.debian.org-20180212-094930-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 353 Jan 14 06:18 dak.WAL.0000000100000033000000A6.00000028.backup
    -rw------- 1 debbackup debbackup 350 Jan 20 11:20 dak.WAL.00000001000000350000008C.00000028.backup
    -rw------- 1 debbackup debbackup 353 Jan 21 06:25 dak.WAL.000000010000003600000068.00000028.backup
    -rw------- 1 debbackup debbackup 353 Jan 28 07:59 dak.WAL.0000000100000038000000E3.00000028.backup
    -rw------- 1 debbackup debbackup 353 Feb 5 11:00 dak.WAL.000000010000003B00000090.00000028.backup
    -rw------- 1 debbackup debbackup 350 Feb 5 15:49 dak.WAL.000000010000003B0000009B.00000108.backup
    -rw------- 1 debbackup debbackup 353 Feb 11 10:09 dak.WAL.000000010000003D000000AC.00000028.backup
    -rw------- 1 debbackup debbackup 353 Feb 12 10:25 dak.WAL.000000010000003E00000027.00000178.backup

`.backup` files are created on the postgres server and shipped to the
backup hosts whenever a base backup is initiated. We do some
labelling, so we know which backup host the corresponding tarball
should end up with. For example:

    root@backuphost:/srv/backups/pg/fasolo# cat dak.WAL.000000010000003B00000090.00000028.backup
    START WAL LOCATION: 3B/90000028 (file 000000010000003B00000090)
    STOP WAL LOCATION: 3B/97CF2138 (file 000000010000003B00000097)
    CHECKPOINT LOCATION: 3B/90000098
    BACKUP METHOD: streamed
    BACKUP FROM: master
    START TIME: 2018-02-05 10:25:28 UTC
    LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup
    STOP TIME: 2018-02-05 10:59:50 UTC

To fix this, verify that we have a later base tarball, or that we are
fine for some other reason, and remove the corresponding `.backup` file
from the backup host. In the case above, we would remove
`dak.WAL.000000010000003D000000AC.00000028.backup`.

#### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.
    Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host

It's because the SSH key deployed on the server does not have the
`--read-allowed` argument.

#### Manually copying WAL files

If the restore doesn't work, try to connect by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo

#### WAL was generated with wal_level=minimal, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG: incomplete startup packet
    2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:44.451 UTC [14316] WARNING: WAL was generated with wal_level=minimal, data may be missing
    2019-10-08 21:25:44.451 UTC [14316] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
    2019-10-08 21:25:44.451 UTC [14316] LOG: consistent recovery state reached at 0/153A3F0
    2019-10-08 21:25:44.451 UTC [14316] LOG: invalid record length at 0/153A3F0: wanted 24, got 0
    2019-10-08 21:25:44.451 UTC [14316] LOG: redo is not required
    2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.549 UTC [14316] LOG: selected new timeline ID: 6
    2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.765 UTC [14316] LOG: archive recovery complete
    2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:46.466 UTC [14316] LOG: MultiXact member wraparound protections are now enabled
    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

### WAL-MISSING-AFTER
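When triaging such a warning, a quick way to spot the hole is to
compare consecutive segment numbers in the archive. A self-contained
sketch of that idea, with invented file names (the real archive lives
under `/srv/backups/pg/<host>/` on the backup server):

```shell
# Detect a hole in a WAL sequence by checking that the last 8 hex
# digits of consecutive file names increment by one.
# (Real segment numbering has subtleties across timelines and high
# boundaries; this is only a rough illustration.)
dir=$(mktemp -d)
for seg in A9 AA AB B5; do
    touch "$dir/main.WAL.00000001000000D9000000$seg"
done

prev=""
for f in $(ls "$dir" | sort); do
    cur=$((16#${f: -8}))   # numeric value of the last 8 hex digits
    if [ -n "$prev" ] && [ "$cur" -ne $((prev + 1)) ]; then
        echo "gap after segment $(printf '%08X' "$prev")"
    fi
    prev=$cur
done
```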
Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`: we're missing `AE`, `AF`,
`B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that something
happened between 6:26 and 13:57 on that day.

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` then the filename
   (`main.WAL.00000001000000D9000000AD`), above

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, it is
     harmless, to the extent that the missing timeframe is not
     needed. To clear the warnings, the previous WAL files *do* need
     to be removed by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup's integrity is compromised, and a new base backup
     needs to be performed. See [Running a full
     backup](#running-a-full-backup) above.

Here is an example of the second case above, and recovery:

    root@bungei:~# cd /srv/backups/pg/meronense
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] WAL-MISSING-AFTER: meronense/main.WAL.00000001000005A900000073
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz
    root@bungei:/srv/backups/pg/meronense# ls -altr
    total 98312
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A90000006F
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A900000070
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:58 main.WAL.00000001000005A900000071
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000072
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000073
    -rw------- 1 torbackup torbackup 16777216 Jun 23 14:14 main.WAL.00000001000005A90000007F
    drwxr-xr-x 10 torbackup torbackup 4096 Jun 23 14:14 ..
    drwxr-xr-x 2 torbackup torbackup 4096 Jun 23 14:14 .
    root@bungei:/srv/backups/pg/meronense# rm main.WAL.00000001000005A90000006F main.WAL.00000001000005A900000070 main.WAL.00000001000005A900000071 main.WAL.00000001000005A900000072 main.WAL.00000001000005A900000073
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz

There was a gap between `main.WAL.00000001000005A900000073` and
`main.WAL.00000001000005A90000007F`, so everything up to and
including the former was removed by hand. Then a full backup was
performed. The reason the BASE backup was missing is that this
followed a failed upgrade (see [tpo/tpa/team#40809](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40809)).

#### could not unlink file

TODO: this warning comes up from time to time, is it a problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

#### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`. It *should* show up
in the logs; if it does not, the server might just still be recovering
a WAL file, or it might be `pause`d.

You can confirm what the server is doing by looking at the processes.
For example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5

In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The `pg_wal_replay_resume()` suggested in the hint is not actually the
right call to use here, however: it would put the server back into
recovery mode, where it would start fetching WAL files again. That is
useful for replicated setups, but this is not such a case.

In the above scenario, a `recovery_target_time` was added but without
a `recovery_target_action`, which led the server to be paused instead
of resuming normal operation.

The correct way to recover here is to call the `pg_promote()`
function:

    sudo -u postgres psql -c 'SELECT pg_promote();'

## Disaster recovery

If a PostgreSQL server is destroyed completely or in part, we need to
restore from backups. We have two ways of restoring PostgreSQL
backups: the **direct** and **indirect** procedures.

The **direct** procedure *streams* the backup files directly from the
backup server. The `BASE` file is streamed into the `tar` command to
restore the base snapshot, then the PostgreSQL restore command copies
each log directly from the backup server as well. It requires a
minimal amount of space, as files are not copied to a temporary
location on the new server. The downside is that it might be more
difficult to deploy and diagnose, as it has more moving parts.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
direct procedure. It might also be slower than the direct restore
procedure, because files are copied around twice: once from the backup
server, and again when loaded into the database.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

See the procedures in [direct backup recovery](#direct-backup-recovery) and [indirect backup
recovery](#indirect-backup-recovery), above.
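The streaming idea behind the direct procedure can be sketched in
miniature. Here the producer is a local `cat` over temporary
directories; on a real restore, the producer would be `ssh` reading
the `BASE` tarball from the backup server (hostnames and paths are
illustrative, not the actual commands):

```shell
# Direct-restore idea in miniature: the BASE tarball is extracted
# straight from a stream, never stored on the destination first.
src=$(mktemp -d); dst=$(mktemp -d)
echo '13' > "$src/PG_VERSION"
tar -C "$src" -czf "$src.tar.gz" PG_VERSION

# producer | tar: on a real restore the producer would be something
# like "ssh torbackup@bungei cat /srv/backups/pg/<host>/<BASE file>"
cat "$src.tar.gz" | tar -C "$dst" -xz
echo "restored PG_VERSION: $(cat "$dst/PG_VERSION")"
```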
# Reference