review and itemize the direct restore procedure, which now seems to work

a4747f47 · anarcat · 58cadd0c · a4747f47
Verified Commit a4747f47 authored 5 years ago by anarcat
--- a/tsa/howto/postgresql.mdwn
+++ b/tsa/howto/postgresql.mdwn
@@ -184,116 +184,119 @@ harmless.
 Direct restore procedure
 ------------------------

-TODO: this procedure does not work.
-
 The above procedure assumes a bare-bones recovery, on a new server,
-but it's also possible to sync an existing server from backups. The
-following, therefore, assume postgres is already configured, with
-something like:
+but it's also possible to sync an existing server from backups. This
+procedure is an adaptation of the [official recovery
+procedure](https://www.postgresql.org/docs/9.3/continuous-archiving.html#BACKUP-PITR-RECOVERY).

-    apt install postgres-11
+ 1. First install the right PostgreSQL version:

-Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
-backup! You cannot restore across versions. This might mean installing
-from backports or an older version of Debian.
+        apt install postgresql-9.6
+
+    Make sure you run the SAME MAJOR VERSION of PostgreSQL as the
+    backup! You cannot restore across versions. This might mean
+    installing from backports or an older version of Debian.

-On the postgres server:
+ 2. On that new PostgreSQL server, show the `postgres` user's public
+    SSH key, creating it if missing:

+    [ -f ~postgres/.ssh/id_rsa.pub ] || sudo -u postgres ssh-keygen
    cat ~postgres/.ssh/*.pub

-Then on the backup server:
+ 3. Then on the backup server, allow the user to access backups of the
+    old server:

    echo "command=\"/usr/local/bin/debbackup-ssh-wrap --read-allow=/srv/backups/pg/$OLDSERVER $CLIENT\",restrict $HOSTKEY" > /etc/ssh/userkeys/torbackup.more

-This assumes we connect to a *previous* server's backups, named
-`$OLDSERVER` (e.g. `dictyotum`). The `$HOSTKEY` is the public key
-found on the postgres server above.
-
-Warning: the above will fail if the key is already present in
-`/etc/ssh/userkeys/torbackup`, edit the key in there instead in that
-case.
-
-Then you need to find the right `BASE` file to restore from. Each
-`BASE` file has a timestamp in its filename, so just sorting them by
-name should be enough to find the latest one. Uncompress the `BASE`
-file in place, as the `postgres` user:
-
-    sudo -u postgres -i
-    sudo -u postgres ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER bacula.BASE.$BACKUPSERVER-20191004-062226-$OLDSERVER.torproject.org-$CLUSTERNAME-9.6-backup.tar.gz | tar -C /var/lib/postgresql/9.6/main -x -z -f -
-
-Add a `pv` before the `tar` call in the pipeline for a progress bar
-with large backups, and replace:
-
- 1. `$BACKUPSERVER` with the backupserver name and username (currently
-    `bungei.torproject.org`)
- 2. `$OLDSERVER` with the old server's (short) hostname
-    (e.g. `dictyotum`)
- 3. `$CLUSTERNAME` with the name of the cluster to restore
-    (e.g. usually `main`)
-
-TODO: The above might hang for a while, but it should complete. It
-`retrieve-file` sends a header which includes a `sha512sum` which
-takes a while to compute. If it doesn't work, use the indirect
-procedure to restore the BASE, which there is hopefully space for
-without the logs...
-
-Make sure the `pg_xlog` directory doesn't contain any files.
-
-Then you need to create a `recovery.conf` file in
-`/var/lib/postgresql/9.6/main` that will tell postgres where to find
-the WAL files. At least the `restore_command` need to be
-specified. Something like this should work:
-
-    restore_command = '/usr/local/bin/pg-receive-file-from-backup $OLDSERVER $CLUSTERNAME.WAL.%f %p'
-
-... where:
-
- * `$OLDSERVER` should be replaced by the previous postgresql server
-   name (e.g. `dictyotum`)
- * `$CLUSTERNAME` should be replaced by the previous cluster name
-   (e.g. `main`, generally)
-
-You can specify a specific recovery point in the `recovery.conf`, see
-the [upstream documentation](https://www.postgresql.org/docs/9.3/recovery-target-settings.html) for more information. Make sure the
-file is owned by postgres:
-
-    $EDITOR /var/lib/postgresql/9.6/main/recovery.conf
-    chown postgres /var/lib/postgresql/9.6/main/recovery.conf
-
-Then start the server and look at the logs to follow the recovery
-process:
-
-    service postgresql start
-    tail -f /var/log/postgresql/*
-
-You should see something like this:
-
-    2019-10-09 21:17:47.335 UTC [9632] LOG:  database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
-    2019-10-09 21:17:47.517 UTC [9632] LOG:  starting archive recovery
-    2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG:  incomplete startup packet
-    2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:49.354 UTC [9632] LOG:  restored log file "00000001000005B200000074" from archive
-    2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:50.836 UTC [9632] LOG:  redo starts at 5B2/74000028
-    2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL:  the database system is starting up
-    2019-10-09 21:20:35.790 UTC [9632] LOG:  restored log file "00000001000005B20000009F" from archive
-    2019-10-09 21:20:37.745 UTC [9632] LOG:  restored log file "00000001000005B2000000A0" from archive
-    2019-10-09 21:20:39.648 UTC [9632] LOG:  restored log file "00000001000005B2000000A1" from archive
-    2019-10-09 21:20:41.738 UTC [9632] LOG:  restored log file "00000001000005B2000000A2" from archive
-    2019-10-09 21:20:43.773 UTC [9632] LOG:  restored log file "00000001000005B2000000A3" from archive
-
-... and so on.
-
-Then remove the temporary SSH access on the backup server, either by
-removing the `.more` key file or restoring the previous key
-configuration:
-
-    rm /etc/ssh/userkeys/torbackup.more
+    This assumes we connect to a *previous* server's backups, named
+    `$OLDSERVER` (e.g. `dictyotum`). The `$HOSTKEY` is the public key
+    found on the postgres server above.
+
+    Warning: the above will fail if the key is already present in
+    `/etc/ssh/userkeys/torbackup`; in that case, edit the key in
+    there instead.
+
+ 4. Then you need to find the right `BASE` file to restore from. Each
+    `BASE` file has a timestamp in its filename, so just sorting them
+    by name should be enough to find the latest one. Uncompress the
+    `BASE` file in place, as the `postgres` user:
+
+        sudo -u postgres ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER bacula.BASE.$BACKUPSERVER-20191004-062226-$OLDSERVER.torproject.org-$CLUSTERNAME-9.6-backup.tar.gz | sudo -u postgres tar -C /var/lib/postgresql/9.6/main -x -z -f -
+
+    Add a `pv` before the `tar` call in the pipeline for a progress bar
+    with large backups, and replace:
+
+     * `$BACKUPSERVER` with the backup server's name and username
+       (currently `bungei.torproject.org`)
+     * `$OLDSERVER` with the old server's (short) hostname
+       (e.g. `dictyotum`)
+     * `$CLUSTERNAME` with the name of the cluster to restore
+       (usually `main`)
+
+    The above might hang for a while, but it should complete. The
+    apparent hang happens because `retrieve-file` sends a header
+    that includes a `sha512sum`, which takes a while to compute. If it
+    doesn't work, use the indirect procedure to restore the `BASE` file.
+
+ 5. Make sure the `pg_xlog` directory doesn't contain any files:
+
+        rm -f -- /var/lib/postgresql/9.6/main/pg_xlog/*
+
+ 6. Then you need to create a `recovery.conf` file in
+    `/var/lib/postgresql/9.6/main` that will tell postgres where to
+    find the WAL files. At least the `restore_command` needs to be
+    specified. Something like this should work:
+
+        restore_command = '/usr/local/bin/pg-receive-file-from-backup $OLDSERVER $CLUSTERNAME.WAL.%f %p'
+
+    ... where:
+
+     * `$OLDSERVER` should be replaced by the previous postgresql
+       server name (e.g. `dictyotum`)
+     * `$CLUSTERNAME` should be replaced by the previous cluster name
+       (generally `main`)
+
+    You can set a specific recovery point in `recovery.conf`;
+    see the [upstream documentation](https://www.postgresql.org/docs/9.3/recovery-target-settings.html) for more information. Also
+    make sure the file is owned by postgres:
+
+        $EDITOR /var/lib/postgresql/9.6/main/recovery.conf
+        chown postgres /var/lib/postgresql/9.6/main/recovery.conf
+
+ 7. Then start the server and look at the logs to follow the recovery
+    process:
+
+        service postgresql start
+        tail -f /var/log/postgresql/*
+
+    You should see something like this:
+
+        2019-10-09 21:17:47.335 UTC [9632] LOG:  database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
+        2019-10-09 21:17:47.517 UTC [9632] LOG:  starting archive recovery
+        2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG:  incomplete startup packet
+        2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:49.354 UTC [9632] LOG:  restored log file "00000001000005B200000074" from archive
+        2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.836 UTC [9632] LOG:  redo starts at 5B2/74000028
+        2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:20:35.790 UTC [9632] LOG:  restored log file "00000001000005B20000009F" from archive
+        2019-10-09 21:20:37.745 UTC [9632] LOG:  restored log file "00000001000005B2000000A0" from archive
+        2019-10-09 21:20:39.648 UTC [9632] LOG:  restored log file "00000001000005B2000000A1" from archive
+        2019-10-09 21:20:41.738 UTC [9632] LOG:  restored log file "00000001000005B2000000A2" from archive
+        2019-10-09 21:20:43.773 UTC [9632] LOG:  restored log file "00000001000005B2000000A3" from archive
+
+    ... and so on.
+
+ 8. Then remove the temporary SSH access on the backup server, either
+    by removing the `.more` key file or restoring the previous key
+    configuration:
+
+        rm /etc/ssh/userkeys/torbackup.more

 ### Troubleshooting