    ssh torbackup@bungei.torproject.org bacula-director-01 retrieve-file pg dictyotum bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz | pv -s 33G | sed '0,/^$/d' | tar -C /var/lib/postgresql/9.6/main -x -z -f -

If the above doesn't work, use the indirect procedure to restore
the `BASE` file.
6. Make sure the `pg_xlog` directory doesn't contain any files.
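
As a hedged sketch, this step can be checked mechanically before
starting the server (`pg_xlog_empty` is a hypothetical helper, and the
path matches the 9.6 example above):

```shell
# Print "empty" when the given directory contains no regular files,
# "not empty" otherwise. pg_xlog_empty is a hypothetical helper,
# not part of our tooling.
pg_xlog_empty() {
    [ -d "$1" ] || { echo "missing"; return 1; }
    if [ -z "$(find "$1" -type f -print -quit)" ]; then
        echo "empty"
    else
        echo "not empty"
    fi
}

# Usage, with the cluster path from the restore example above:
# pg_xlog_empty /var/lib/postgresql/9.6/main/pg_xlog
```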

... and so on.

Then the recovery will complete with something like this:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738

#### Troubleshooting

See also the "Direct restore procedure" troubleshooting section, which
also applies here.

##### Base file copy

If the `BASE` file copy fails with:

    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured to
back up to this backup host, and the backup host therefore has no
firewall rules allowing it to connect. You can fix this with something
like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept
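
Before retrying the restore, it can help to confirm the backup host is
now reachable on the SSH port. A hedged sketch, using bash's `/dev/tcp`
(`reachable` is a hypothetical helper, not part of our tooling):

```shell
# Print "open" if a TCP connection to host $1 port $2 succeeds within
# 5 seconds, "closed" otherwise. Requires bash and coreutils timeout.
reachable() {
    if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Usage, against the backup host from the example above:
# reachable bungei.torproject.org 22
```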

... but really, you should put the host in Puppet with the
`profile::postgres` class, otherwise you will have more trouble later.

If all else fails, you can also use the indirect restore procedure to
copy the `BASE` file manually onto the server.

##### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.

It's because the SSH keys deployed on the server do not have the
`--read-allowed` argument.

##### Manually copying WAL files

If the restore doesn't work, try to fetch a WAL file by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo
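
Once fetched, a quick sanity check is the file size: with default
settings, a complete WAL segment is exactly 16MiB. A hedged sketch
(`wal_size_ok` is a hypothetical helper; adjust the size if the
cluster was built with a non-default `wal_segment_size`):

```shell
# Print "plausible" if the file is exactly 16MiB (the default WAL
# segment size), "suspicious" otherwise.
wal_size_ok() {
    [ -f "$1" ] || { echo "suspicious"; return 1; }
    if [ "$(wc -c < "$1")" -eq $((16 * 1024 * 1024)) ]; then
        echo "plausible"
    else
        echo "suspicious"
    fi
}

# Usage, on the file fetched above:
# wal_size_ok /tmp/foo
```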

##### WAL was generated with wal_level=minimum, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

##### could not link file

TODO: this warning comes up from time to time; is it a problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

##### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`.

It *should* show up in the logs. If it does not, the server might
still be recovering a WAL file, or it might be paused.

You can confirm what the server is doing by looking at the processes.
For example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5

In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The `pg_wal_replay_resume()` function is not actually the right one to
use here, however: it would put the server back into recovery mode,
where it would start fetching WAL files again. That is useful for
replicated setups, but this is not such a case.
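
To tell a paused server apart from one that is still replaying, you
can ask it directly: `pg_is_wal_replay_paused()` is a stock PostgreSQL
function. The wrapper below (`replay_state`, a hypothetical helper) is
just a sketch that normalizes the `t`/`f` answer into a word:

```shell
# Run the given probe command and translate its "t"/"f" answer.
# replay_state is a hypothetical helper, not part of our tooling.
replay_state() {
    case "$($1)" in
        t) echo "paused" ;;
        f) echo "replaying" ;;
        *) echo "unknown" ;;
    esac
}

# Usage, against a live cluster (assumes local superuser access):
# replay_state "sudo -u postgres psql -tAc 'SELECT pg_is_wal_replay_paused();'"
```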

In the above scenario, a `recovery_target_time` was set but without a
`recovery_target_action`, which led the server to be paused instead of
resuming normal operation.

The correct way to recover here is to call the `pg_promote()`
function:

    sudo -u postgres psql -c 'SELECT pg_promote();'
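
After calling `pg_promote()`, the server should leave recovery
shortly. A hedged sketch for waiting on that transition
(`wait_out_of_recovery` is a hypothetical helper; `pg_is_in_recovery()`
is a stock PostgreSQL function):

```shell
# Poll the given probe command until it prints "f" (out of recovery)
# or the attempts run out. wait_out_of_recovery is a hypothetical
# helper, not part of our tooling.
wait_out_of_recovery() {
    probe="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        if [ "$($probe)" = "f" ]; then
            echo "promoted"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "still in recovery"
    return 1
}

# Usage, against a live cluster (assumes local superuser access):
# wait_out_of_recovery "sudo -u postgres psql -tAc 'SELECT pg_is_in_recovery();'"
```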

# Reference