    ssh torbackup@bungei.torproject.org bacula-director-01 retrieve-file pg dictyotum bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz | pv -s 33G | sed '0,/^$/d' | tar -C /var/lib/postgresql/9.6/main -x -z -f -

If the above doesn't work, use the indirect procedure to restore
the `BASE` file.
6. Make sure the `pg_xlog` directory doesn't contain any files.
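
As a hedged sketch, this step can be checked mechanically before
starting the server (`pg_xlog_empty` is a hypothetical helper, and the
path matches the 9.6 example above):

```shell
# Print "empty" when the given directory contains no regular files,
# "not empty" otherwise. pg_xlog_empty is a hypothetical helper,
# not part of our tooling.
pg_xlog_empty() {
    [ -d "$1" ] || { echo "missing"; return 1; }
    if [ -z "$(find "$1" -type f -print -quit)" ]; then
        echo "empty"
    else
        echo "not empty"
    fi
}

# Usage, with the cluster path from the restore example above:
# pg_xlog_empty /var/lib/postgresql/9.6/main/pg_xlog
```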

... and so on.

Then the recovery will complete with something like this:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738

#### Troubleshooting

See also the "Direct restore procedure" troubleshooting section, which
also applies here.

##### Base file copy

If the `BASE` file copy fails with:

    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured to
back up to this backup host, and the backup host therefore has no
firewall rules allowing it to connect. You can fix this with something
like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept
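
Before retrying the restore, it can help to confirm the backup host is
now reachable on the SSH port. A hedged sketch, using bash's `/dev/tcp`
(`reachable` is a hypothetical helper, not part of our tooling):

```shell
# Print "open" if a TCP connection to host $1 port $2 succeeds within
# 5 seconds, "closed" otherwise. Requires bash and coreutils timeout.
reachable() {
    if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Usage, against the backup host from the example above:
# reachable bungei.torproject.org 22
```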

... but really, you should put the host in Puppet with the
`profile::postgres` class, otherwise you will have more trouble later.

If all else fails, you can also use the indirect restore procedure to
copy the `BASE` file manually onto the server.

##### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.

It's because the SSH keys deployed on the server do not have the
`--read-allowed` argument.

##### Manually copying WAL files

If the restore doesn't work, try to fetch a WAL file by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo
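
Once fetched, a quick sanity check is the file size: with default
settings, a complete WAL segment is exactly 16MiB. A hedged sketch
(`wal_size_ok` is a hypothetical helper; adjust the size if the
cluster was built with a non-default `wal_segment_size`):

```shell
# Print "plausible" if the file is exactly 16MiB (the default WAL
# segment size), "suspicious" otherwise.
wal_size_ok() {
    [ -f "$1" ] || { echo "suspicious"; return 1; }
    if [ "$(wc -c < "$1")" -eq $((16 * 1024 * 1024)) ]; then
        echo "plausible"
    else
        echo "suspicious"
    fi
}

# Usage, on the file fetched above:
# wal_size_ok /tmp/foo
```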

##### WAL was generated with wal_level=minimum, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

##### could not link file

TODO: this warning comes up from time to time; is it a problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

##### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`.

It *should* show up in the logs. If it does not, the server might
still be recovering a WAL file, or it might be paused.

You can confirm what the server is doing by looking at the processes.
For example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5

In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The `pg_wal_replay_resume()` function is not actually the right one to
use here, however: it would put the server back into recovery mode,
where it would start fetching WAL files again. That is useful for
replicated setups, but this is not such a case.
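
To tell a paused server apart from one that is still replaying, you
can ask it directly: `pg_is_wal_replay_paused()` is a stock PostgreSQL
function. The wrapper below (`replay_state`, a hypothetical helper) is
just a sketch that normalizes the `t`/`f` answer into a word:

```shell
# Run the given probe command and translate its "t"/"f" answer.
# replay_state is a hypothetical helper, not part of our tooling.
replay_state() {
    case "$($1)" in
        t) echo "paused" ;;
        f) echo "replaying" ;;
        *) echo "unknown" ;;
    esac
}

# Usage, against a live cluster (assumes local superuser access):
# replay_state "sudo -u postgres psql -tAc 'SELECT pg_is_wal_replay_paused();'"
```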

In the above scenario, a `recovery_target_time` was set but without a
`recovery_target_action`, which led the server to be paused instead of
resuming normal operation.

The correct way to recover here is to call the `pg_promote()`
function:

    sudo -u postgres psql -c 'SELECT pg_promote();'
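
After calling `pg_promote()`, the server should leave recovery
shortly. A hedged sketch for waiting on that transition
(`wait_out_of_recovery` is a hypothetical helper; `pg_is_in_recovery()`
is a stock PostgreSQL function):

```shell
# Poll the given probe command until it prints "f" (out of recovery)
# or the attempts run out. wait_out_of_recovery is a hypothetical
# helper, not part of our tooling.
wait_out_of_recovery() {
    probe="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        if [ "$($probe)" = "f" ]; then
            echo "promoted"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "still in recovery"
    return 1
}

# Usage, against a live cluster (assumes local superuser access):
# wait_out_of_recovery "sudo -u postgres psql -tAc 'SELECT pg_is_in_recovery();'"
```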

# Reference