Note that [pg_squeeze](https://github.com/cybertec-postgresql/pg_squeeze) is another alternative to `pg_repack`, but
it isn't available in Debian.

## Pager playbook

### BASE-IS-OLD

Our Nagios check warns us when a backup server has not successfully
fetched a base backup recently. The usual causes are that the
postgres server or the backup host was down (or unreachable) at the
time of the weekly cron job, or that the TLS certificate was renewed
without the postgres daemon having been restarted.

To re-run a base backup for a specific cluster, log into the backup
server(s) (currently `bungei`), then find the port for the affected
cluster:

    tail /usr/local/bin/postgres-make-base-backups

Then run this backup in a screen:

    sudo -u torbackup /usr/local/bin/postgres-make-base-backups <host>:<port>

### MISSING-BASE

WARNING: this procedure was taken from the [DSA wiki](https://dsa.debian.org/howto/postgres-backup/) and is
untested.

If you get a Nagios warning like this:

    [fasolo, dak] MISSING-BASE: dak.BASE.backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup.tar.gz

This means that we started doing a base backup (as witnessed by a
`.backup` file next to a WAL), but for some reason we don't have the
corresponding base file:

    root@backuphost:/srv/backups/pg/fasolo# ls -l *backup*
    -rw------- 1 debbackup debbackup 9201093916 Jan 14 06:18 dak.BASE.backuphost.debian.org-20180114-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9227651542 Jan 21 06:25 dak.BASE.backuphost.debian.org-20180121-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9266306750 Jan 28 07:59 dak.BASE.backuphost.debian.org-20180128-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9312602089 Feb  5 11:00 dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9346830509 Feb 12 10:25 dak.BASE.backuphost.debian.org-20180212-094930-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup        353 Jan 14 06:18 dak.WAL.0000000100000033000000A6.00000028.backup
    -rw------- 1 debbackup debbackup        350 Jan 20 11:20 dak.WAL.00000001000000350000008C.00000028.backup
    -rw------- 1 debbackup debbackup        353 Jan 21 06:25 dak.WAL.000000010000003600000068.00000028.backup
    -rw------- 1 debbackup debbackup        353 Jan 28 07:59 dak.WAL.0000000100000038000000E3.00000028.backup
    -rw------- 1 debbackup debbackup        353 Feb  5 11:00 dak.WAL.000000010000003B00000090.00000028.backup
    -rw------- 1 debbackup debbackup        350 Feb  5 15:49 dak.WAL.000000010000003B0000009B.00000108.backup
    -rw------- 1 debbackup debbackup        353 Feb 11 10:09 dak.WAL.000000010000003D000000AC.00000028.backup
    -rw------- 1 debbackup debbackup        353 Feb 12 10:25 dak.WAL.000000010000003E00000027.00000178.backup

`.backup` files are created on the postgres server and shipped to the
backup hosts whenever a base backup is initiated. We do some
labelling, so we know which backup host the corresponding tarball
should end up on. For example:

    root@backuphost:/srv/backups/pg/fasolo# cat dak.WAL.000000010000003B00000090.00000028.backup
    START WAL LOCATION: 3B/90000028 (file 000000010000003B00000090)
    STOP WAL LOCATION: 3B/97CF2138 (file 000000010000003B00000097)
    CHECKPOINT LOCATION: 3B/90000098
    BACKUP METHOD: streamed
    BACKUP FROM: master
    START TIME: 2018-02-05 10:25:28 UTC
    LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup
    STOP TIME: 2018-02-05 10:59:50 UTC

To fix this, verify that we have a later base tarball, or that we are
fine for some other reason, and remove the corresponding `.backup`
file from the backup host. In the case above, we would remove
`dak.WAL.000000010000003D000000AC.00000028.backup`.
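
The label-matching described above can be sketched as a quick check
for orphaned `.backup` files. This is a sketch run on synthetic
filenames in a scratch directory, not the real
`/srv/backups/pg/<host>`; the `dak.` prefix and label format are taken
from the listing above:

```shell
#!/bin/sh
# Flag ".backup" label files whose corresponding BASE tarball is absent.
# Synthetic data modeled on the listing above; review before trusting.
set -eu
dir=$(mktemp -d)
cd "$dir"
# a label whose tarball did arrive...
printf 'LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup\n' \
    > dak.WAL.000000010000003B00000090.00000028.backup
touch dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
# ... and one whose tarball is missing
printf 'LABEL: backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup\n' \
    > dak.WAL.000000010000003D000000AC.00000028.backup
for f in *.backup; do
    label=$(sed -n 's/^LABEL: //p' "$f")
    [ -e "dak.BASE.$label.tar.gz" ] || echo "orphan: $f"
done
```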

### WAL-MISSING-AFTER

Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`: we're missing `AE`, `AF`,
`B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that something
happened between 6:26 and 13:57 on that day.
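
A quick way to spot such gaps mechanically is to compare the low 8 hex
digits of consecutive segment names. This is a bash sketch run on the
filenames from the listing above; it ignores log-file rollover (when
the middle 8 digits change), so treat it as a hint, not a proof:

```shell
#!/bin/bash
# Report gaps in a WAL segment sequence by checking that the low 8 hex
# digits of consecutive names increment by one. Ignores log rollover.
wal_gaps() {
    local prev="" seg f
    for f in "$@"; do
        seg=$((16#${f: -8}))   # low 8 hex digits of the segment name
        if [[ -n "$prev" && "$seg" -ne "$((prev + 1))" ]]; then
            echo "gap: $((seg - prev - 1)) segment(s) missing before $f"
        fi
        prev=$seg
    done
}
# the listing from the example above
wal_gaps main.WAL.00000001000000D9000000A{A,B,C,D} \
         main.WAL.00000001000000D9000000B5
# → gap: 7 segment(s) missing before main.WAL.00000001000000D9000000B5
```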

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` and the filename from the
   warning (`main.WAL.00000001000000D9000000AD` above)

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, the gap
     is harmless, to the extent that the missing timeframe is not
     needed. To clear the warnings, the WAL files *preceding* the gap
     *do* need to be removed, by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup integrity is compromised, and a new base backup needs
     to be performed. See [Running a full
     backup](#running-a-full-backup) above.

Here is an example of the second case above, and its recovery:

    root@bungei:~# cd /srv/backups/pg/meronense
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] WAL-MISSING-AFTER: meronense/main.WAL.00000001000005A900000073
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz
    root@bungei:/srv/backups/pg/meronense# ls -altr
    total 98312
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A90000006F
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A900000070
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:58 main.WAL.00000001000005A900000071
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000072
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000073
    -rw------- 1 torbackup torbackup 16777216 Jun 23 14:14 main.WAL.00000001000005A90000007F
    drwxr-xr-x 10 torbackup torbackup     4096 Jun 23 14:14 ..
    drwxr-xr-x  2 torbackup torbackup     4096 Jun 23 14:14 .
    root@bungei:/srv/backups/pg/meronense# rm main.WAL.00000001000005A90000006F main.WAL.00000001000005A900000070 main.WAL.00000001000005A900000071 main.WAL.00000001000005A900000072 main.WAL.00000001000005A900000073
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz

There was a gap between `main.WAL.00000001000005A900000073` and
`main.WAL.00000001000005A90000007F`, so everything up to and
including the former was removed by hand. Then a full backup was
performed. The `BASE` backup was missing because this followed a
failed upgrade (see [tpo/tpa/team#40809](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40809)).
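
The manual `rm` above can be tedious with many segments; listing the
removal candidates mechanically is straightforward because WAL names
are fixed-width hex and sort lexically. A sketch on synthetic names
(always review the list before deleting anything):

```shell
#!/bin/sh
# List archived WAL files up to and including the last one before the
# gap (the file named in the WAL-MISSING-AFTER warning), as candidates
# for removal. Synthetic names stand in for /srv/backups/pg/<host>.
cutoff=main.WAL.00000001000005A900000073
stale=$(printf '%s\n' \
    main.WAL.00000001000005A90000006F \
    main.WAL.00000001000005A900000070 \
    main.WAL.00000001000005A90000007F \
    | awk -v c="$cutoff" '$0 <= c')
printf '%s\n' "$stale"
# → main.WAL.00000001000005A90000006F
# → main.WAL.00000001000005A900000070
```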

## Disaster recovery

If all else fails, we need to restore PostgreSQL from backups. We
have two ways of restoring PostgreSQL backups, **indirect** and
**direct**.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
direct procedure. It might also be slower, because files are copied
around twice: once from the backup server, and again when loaded into
the database.

The **direct** procedure *streams* the backup files directly from the
backup server. The `BASE` file is streamed into the `tar` command to
restore the base snapshot, then the PostgreSQL restore command copies
each log directly from the backup server as well. It requires a
minimal amount of space, as files are not copied to a temporary
location on the new server. The downside is that it might be more
difficult to deploy and diagnose, as it has more moving parts.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

<a name="indirect-restore-procedures" />

### Indirect restore procedures

Reminder: this procedure **copies** the backup files from the backup
server to the database server and restores from those.

1. First, you will need to give the backup server access to the new
   PostgreSQL server, which we'll call the "client" for now. Log
   into the client, allow the backup server to connect, and show the
   client's public host key:

        iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
        cat /etc/ssh/ssh_host_rsa_key.pub

2. Then, on the backup server, load its private host key into an
   agent, trust the client's host key, and show the backup server's
   public key, which will be allowed on the client in the next
   step. Assuming `$IP` is the IP of the client and `$HOSTKEY` is the
   client's host key:

        ssh-agent bash
        ssh-add /etc/ssh/ssh_host_rsa_key
        mkdir -p ~/.ssh
        echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
        cat /etc/ssh/ssh_host_rsa_key.pub

3. And on the client, allow the backup server's `$HOSTKEY` (the
   output of the above `cat /etc/ssh/ssh_host_rsa_key.pub` on the
   backup server):

        echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more

4. Then, on the backup server, we can transfer the files over to the
   client:

        cd /srv/backups/pg
        rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore

5. Disable Puppet so you have control over when PostgreSQL is running:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'

6. Then, on the client, install the software, stop the server and
   move the template cluster out of the way:

        apt install postgresql rsync
        service postgresql stop
        mv /var/lib/postgresql/*/main{,.orig}
        su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres

   We'll be restoring files into that `main` directory.

   Make sure you run the SAME MAJOR VERSION of PostgreSQL as the
   backup! You cannot restore across versions. This might mean
   installing from backports or an older version of Debian.

7. Then you need to find the right `BASE` file to restore from. Each
   `BASE` file has a timestamp in its filename, so just sorting them
   by name should be enough to find the latest one. Uncompress the
   `BASE` file in place:

        cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'

   (Use `pv` instead of `cat` for a progress bar with large backups.)

8. Make sure the `pg_xlog` directory doesn't contain any files.

9. Then you need to create a `recovery.conf` file in
   `/var/lib/postgresql/9.6/main` that will tell postgres where to
   find the WAL files. At least the `restore_command` needs to be
   specified. Something like this should work:

        restore_command = 'cp /var/lib/postgresql/restore/subnotabile/main.WAL.%f %p'

   You can specify a specific recovery point in the `recovery.conf`;
   see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.

10. Then start the server and look at the logs to follow the recovery
    process:

        service postgresql start
        tail -f /var/log/postgresql/*

11. Re-enable Puppet, which should clean up the extra SSH key and
    firewall rules:

        puppet agent -t

    Make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
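
For instance, a slightly fuller `recovery.conf` for step 9 could add
a recovery target. This is a sketch: the path component `$CLIENT` and
the timestamp are placeholders to fill in. Note that when setting
`recovery_target_time` you should also set `recovery_target_action`,
otherwise recovery ends up paused at the target (see the
`pg_promote()` discussion in the troubleshooting sections below):

```
restore_command = 'cp /var/lib/postgresql/restore/$CLIENT/main.WAL.%f %p'
recovery_target_time = '2019-10-01 00:00:00+00'
recovery_target_action = 'promote'
```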

When the restore succeeds, the `recovery.conf` file will be renamed to
`recovery.done` and you will see something like:

    2019-08-12 21:36:53.453 UTC [16901] LOG: selected new timeline ID: 2
    2019-08-12 21:36:53.470 UTC [16901] LOG: archive recovery complete
    cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
    2019-08-12 21:36:53.577 UTC [16901] LOG: MultiXact member wraparound protections are now enabled
    2019-08-12 21:36:53.584 UTC [16900] LOG: database system is ready to accept connections

Ignore the error from `cp` complaining about the `.history` file; it's
harmless.

#### Troubleshooting

If you find the following error in the logs:

    FATAL: could not locate required checkpoint record

It's because postgres cannot find the WAL logs to restore from. There
could be many causes for this, but the ones I stumbled upon were:

* wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
  probably the case if you don't see an error from `cp`
* wrong permissions on the archive (put the WAL files in `~postgres`,
  not `~root`)
* wrong path or pattern for `restore_command` (double-check the path
  and make sure to include the right prefix, e.g. `main.WAL`)

<a name="direct-restore-procedures" />

### Direct restore procedure

This is probably the procedure you want.

Reminder: this procedure **streams** the files from the backup server
to the database server and restores directly. See above for more
information on how to pick a restore procedure.

This procedure assumes you already have a server installed with
enough disk space to hold the data to recover. It also assumes the
server is configured with the `profile::postgres` class in Puppet,
since it uses a script from there to copy the WAL files from backup.

1. First, disable Puppet so you have control over when PostgreSQL is
   running:

        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'

2. Then install the software and stop the server:

        apt install postgresql-13
        service postgresql stop

   Make sure you run the **SAME MAJOR VERSION** of PostgreSQL as the
   backup! You cannot restore across versions. This might mean
   installing from backports or an older version of Debian.

3. On that new PostgreSQL server, show the `postgres` server public
   key, creating it if missing:

   found on the postgres server above. `$CLIENT` is the hostname of
   the server we are restoring to.

   NOTE: the above will not work if the key is already present in
   `/etc/ssh/userkeys/torbackup`, as that key will override the one in
   `.more`. Edit the key in there instead in that case.

5. Then you need to find the right `BASE` file to restore from. Each
   `BASE` file has a timestamp in its filename, so just sorting them
   by name should be enough to find the latest one.

   Decompress the `BASE` file in place, as the `postgres` user:

        sudo -u postgres -i
        mkdir -m 0700 /var/lib/postgresql/13/main
        ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER $BASEFILE \
            | sed '0,/^$/d' \
            | pv -s 34G \
            | tar -C /var/lib/postgresql/13/main -x -z -f -

   In the above, replace:

   * `$BACKUPSERVER` with the backup server name and username
     (currently `bungei.torproject.org`)

     or, fully expanded:
     `bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-13-backup.tar.gz`,
     where `$CLUSTERNAME` is the non-default `bacula`

   * `34G` with the size of the BASE file found on the backup server

   The above might hang for a while (yes, maybe even a minute) in the
   beginning, but it should eventually get started and complete.
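
The `sed '0,/^$/d'` stage in the pipeline above can be puzzling: it
deletes everything up to and including the first empty line. The
assumption (based on this pipeline) is that `retrieve-file` prepends a
header block terminated by a blank line before the raw tarball bytes;
a sketch with a synthetic payload:

```shell
#!/bin/sh
# GNU sed: the 0,/^$/ address range covers everything up to and
# including the first blank line, so `d` strips a leading header block.
body=$(printf 'Header: yes\nAnother: one\n\ntar-bytes-go-here\n' | sed '0,/^$/d')
echo "$body"
# → tar-bytes-go-here
```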

        2019-10-10 01:32:12.438 UTC [16950] LOG:  database system is ready to accept connections
        2019-10-10 01:32:12.439 UTC [26501] LOG:  autovacuum launcher started

   The server is now ready for use.

9. Remove the temporary SSH access on the backup server, either by
   removing the `.more` key file or restoring the previous key
   configuration:

        rm /etc/ssh/userkeys/torbackup.more

10. Re-enable Puppet:

        puppet agent -t

### Troubleshooting

See also the "Indirect restore procedures" troubleshooting section
above, which also applies here.

#### Base file copy

If the BASE file copy fails with:

    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured
to back up to this backup host, and therefore the backup host does
not have firewall rules allowing it to connect. You can fix this with
something like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept

... but really, you should put the host in Puppet with the
`profile::postgres` class, otherwise you will have more trouble later.

If all else fails, you can also use the indirect restore procedure to
copy the BASE file manually onto the server.

#### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.
    Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host

It's because the SSH keys deployed on the server do not have the
`--read-allowed` argument.

#### Manually copying WAL files

If the restore doesn't work, try to fetch a WAL file by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo

#### WAL was generated with wal_level=minimal, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG: incomplete startup packet
    2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:44.451 UTC [14316] WARNING: WAL was generated with wal_level=minimal, data may be missing
    2019-10-08 21:25:44.451 UTC [14316] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
    2019-10-08 21:25:44.451 UTC [14316] LOG: consistent recovery state reached at 0/153A3F0
    2019-10-08 21:25:44.451 UTC [14316] LOG: invalid record length at 0/153A3F0: wanted 24, got 0
    2019-10-08 21:25:44.451 UTC [14316] LOG: redo is not required
    2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.549 UTC [14316] LOG: selected new timeline ID: 6
    2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.765 UTC [14316] LOG: archive recovery complete
    2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:46.466 UTC [14316] LOG: MultiXact member wraparound protections are now enabled
    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

#### could not link file

TODO: this warning comes up from time to time, problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

#### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`.

If it does not show up in the logs, the server might just still be
recovering a WAL file, or recovery might be `pause`d.

You can confirm what the server is doing by looking at the processes;
for example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5
In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The hinted `pg_wal_replay_resume()` is not actually the right call to
use here, however: it would put the server back into recovery mode,
where it would start fetching WAL files again. That is useful for
replicated setups, but this is not such a case.

In the above scenario, a `recovery_target_time` was added but without
a `recovery_target_action`, which led the server to pause instead of
resuming normal operation.

The correct way to recover here is to call `pg_promote()`:

    sudo -u postgres psql -c 'SELECT pg_promote();'
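
Either way, you can verify the server actually left recovery mode
afterwards; `pg_is_in_recovery()` returns `f` once the server has been
promoted:

```shell
sudo -u postgres psql -tAc 'SELECT pg_is_in_recovery();'
```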
## Indirect backup recovery
|
|
|
|
|
|
<a name="indirect-restore-procedures" />
|
|
|
|
|
|
Reminder: this procedure **copies** the backup files from the backup
|
|
|
server to the database server and restores from those.
|
|
|
|
|
|
1. First, you will need to give the backup server access to the new
|
|
|
postgresql server, which we'll call the "client" for now. First,
|
|
|
login to the client and allow the backup server to connect, and
|
|
|
show the public hostkey:
|
|
|
|
|
|
iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
2. Then load the server's private key in an agent and show it to
|
|
|
allow on the client. On the server, assuming `$IP` is the IP of
|
|
|
the client and `$HOSTKEY` is its hostkey:
|
|
|
|
|
|
ssh-agent bash
|
|
|
ssh-add /etc/ssh/ssh_host_rsa_key
|
|
|
mkdir -p ~/.ssh
|
|
|
echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
3. And on the client, allow the server `$HOSTKEY` (the above `cat
|
|
|
/etc/ssh/ssh_host_rsa_key.pub` on the backup server):
|
|
|
|
|
|
echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more
|
|
|
|
|
|
4. Then, we can transfer files over from the backup server to the
|
|
|
client:
|
|
|
|
|
|
cd /srv/backups/pg
|
|
|
rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore
|
|
|
|
|
|
5. Disable Puppet so you have control on when PostgreSQL is running:
|
|
|
|
|
|
puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
|
|
|
|
|
|
6. Then, on the client, install the software, stop the server and
|
|
|
move the template cluster out of the way:
|
|
|
|
|
|
apt install postgres rsync
|
|
|
service postgresql stop
|
|
|
mv /var/lib/postgresql/*/main{,.orig}
|
|
|
su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres
|
|
|
|
|
|
We'll be restoring files in that `main` directory.
|
|
|
|
|
|
Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
|
|
|
backup! You cannot restore across versions. This might mean
|
|
|
installing from backports or an older version of Debian.
|
|
|
|
|
|
7. Then you need to find the right `BASE` file to restore from. Each
|
|
|
`BASE` file has a timestamp in its filename, so just sorting them
|
|
|
by name should be enough to find the latest one. Uncompress the
|
|
|
`BASE` file in place:
|
|
|
|
|
|
cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'
|
|
|
|
|
|
(Use `pv` instead of `cat` for a progress bar with large backups.)
|
|
|
|
|
|
8. Make sure the `pg_xlog` directory doesn't contain any files.
|
|
|
|
|
|
9. Then you need to create a `recovery.conf` file in
|
|
|
`/var/lib/postgresql/9.6/main` that will tell postgres where to
|
|
|
find the WAL files. At least the `restore_command` need to be
|
|
|
specified. Something like this should work:
|
|
|
|
|
|
restore_command = 'cp /var/lib/postgresql/restore/subnotabile/main.WAL.%f %p'
|
|
|
|
|
|
You can specify a specific recovery point in the `recovery.conf`,
|
|
|
see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.
|
|
|
|
|
|
10. Then start the server and look at the logs to follow the recovery
|
|
|
process:
|
|
|
|
|
|
service postgresql start
|
|
|
tail -f /var/log/postgresql/*
|
|
|
|
|
|
11. re-enable Puppet, which should clean up the extra SSH key and
|
|
|
firewall rules:
|
|
|
|
|
|
puppet agent -t
|
|
|
|
|
|
make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
|
|
|
|
|
|
When the restore succeeds, the `recovery.conf` file will be renamed to
|
|
|
`recovery.done` and you will see something like:
|
|
|
|
|
|
9. Remove the temporary SSH access on the backup server, either by
|
|
|
removing the `.more` key file or restoring the previous key
|
|
|
configuration:
|
|
|
2019-08-12 21:36:53.453 UTC [16901] LOG: selected new timeline ID: 2
|
|
|
2019-08-12 21:36:53.470 UTC [16901] LOG: archive recovery complete
|
|
|
cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
|
|
|
2019-08-12 21:36:53.577 UTC [16901] LOG: MultiXact member wraparound protections are now enabled
|
|
|
2019-08-12 21:36:53.584 UTC [16900] LOG: database system is ready to accept connections
|
|
|
|
|
|
rm /etc/ssh/userkeys/torbackup.more
|
|
|
Ignore the error from `cp` complaining about the `.history` file, it's
|
|
|
harmless.
|
|
|
|
|
|
10. re-enable Puppet:
|
|
|
### Troubleshooting
|
|
|
|
|
|
puppet agent -t
|
|
|
If you find the following error in the logs:
|
|
|
|
|
|
#### Troubleshooting
|
|
|
FATAL: could not locate required checkpoint record
|
|
|
|
|
|
See also the "Direct restore procedure" troubleshooting section, which
|
|
|
also applies here.
It's because postgres cannot find the WAL files to restore from. There
could be many causes for this, but the ones we stumbled upon were:

 * wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
   probably the case if you don't see an error from `cp`
 * wrong permissions on the archive (put the WAL files in `~postgres`,
   not `~root`)
 * wrong path or pattern in `restore_command` (double-check the path
   and make sure to include the right prefix, e.g. `main.WAL`)
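To rule out the last two causes, it can help to dry-run the copy that
the `restore_command` would perform. A self-contained sketch of that
check, using throwaway directories standing in for the real archive
and `pg_wal` paths (on a real server, run the `cp` itself as the
`postgres` user instead):

```shell
# Simulate what restore_command does: cp <archive>/main.WAL.%f %p
archive=$(mktemp -d)   # stands in for /var/lib/postgresql/restore/<host>
pg_wal=$(mktemp -d)    # stands in for the cluster's pg_wal directory

segment=000000010000000000000001
touch "$archive/main.WAL.$segment"

# the actual check: does the path + prefix pattern resolve to a file
# the server can read and copy?
if cp "$archive/main.WAL.$segment" "$pg_wal/$segment"; then
    echo "restore_command pattern looks sane"
else
    echo "check the path, the main.WAL prefix, and permissions"
fi
```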
#### Base file copy

If the BASE file copy fails with:
    ssh: connect to host bungei.torproject.org port 22: Connection refused

It's likely because the host you are restoring to is not configured to
back up to this backup host, and therefore the backup host does not
have firewall rules allowing it to connect. You can fix this with
something like:

    iptables-legacy -I INPUT -s 78.47.61.104 -j ACCEPT

or:

    nft insert rule ip filter INPUT ip saddr 78.47.61.104 counter accept

... but really, you should put the host in Puppet, with the
`profile::postgres` class, otherwise you will have more trouble later.

If all fails, you can also use the indirect restore procedure to copy
the BASE file manually on the server.

## Pager playbook

### BASE-IS-OLD

Our nagios check warns us when a backup server has not successfully
fetched a base backup recently. The causes often are that either the
postgres server or the backup host went down or was down during the
time of the weekly cronjob. The TLS certificate could have also been
renewed without the postgres daemon having been restarted.

To re-run a base backup for a specific cluster, log into the backup
server(s) (currently `bungei`), then find the port for the affected
cluster:

    tail /usr/local/bin/postgres-make-base-backups

Then run this backup in a screen:

    sudo -u torbackup /usr/local/bin/postgres-make-base-backups <host>:<port>
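The check essentially looks at the age of the newest `BASE` tarball. A
self-contained sketch of that logic, using a throwaway directory
standing in for `/srv/backups/pg/<host>/` on the backup server (the
one-week threshold mirrors the weekly cronjob):

```shell
# BASE-IS-OLD, in miniature: alarm when no BASE tarball is younger
# than the weekly schedule allows.
backups=$(mktemp -d)   # stands in for /srv/backups/pg/<host>
touch "$backups/main.BASE.example-host-20240101-main-13-backup.tar.gz"

if find "$backups" -name '*BASE*' -mtime -8 | grep -q .; then
    echo "base backup is fresh"
else
    echo "BASE-IS-OLD: re-run postgres-make-base-backups"
fi
```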
### MISSING-BASE
WARNING: this procedure was taken from the [DSA wiki](https://dsa.debian.org/howto/postgres-backup/) and is
untested.

If you get a Nagios warning like this:

    [fasolo, dak] MISSING-BASE: dak.BASE.backuphost.debian.org-20180211-012002-fasolo.debian.org-dak-9.6-backup.tar.gz

This means that we started doing a base backup (as witnessed by a
`.backup` file next to a WAL), but for some reason we don't have the
corresponding base file.

    root@backuphost:/srv/backups/pg/fasolo# ls -l *backup*
    -rw------- 1 debbackup debbackup 9201093916 Jan 14 06:18 dak.BASE.backuphost.debian.org-20180114-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9227651542 Jan 21 06:25 dak.BASE.backuphost.debian.org-20180121-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9266306750 Jan 28 07:59 dak.BASE.backuphost.debian.org-20180128-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9312602089 Feb 5 11:00 dak.BASE.backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 9346830509 Feb 12 10:25 dak.BASE.backuphost.debian.org-20180212-094930-fasolo.debian.org-dak-9.6-backup.tar.gz
    -rw------- 1 debbackup debbackup 353 Jan 14 06:18 dak.WAL.0000000100000033000000A6.00000028.backup
    -rw------- 1 debbackup debbackup 350 Jan 20 11:20 dak.WAL.00000001000000350000008C.00000028.backup
    -rw------- 1 debbackup debbackup 353 Jan 21 06:25 dak.WAL.000000010000003600000068.00000028.backup
    -rw------- 1 debbackup debbackup 353 Jan 28 07:59 dak.WAL.0000000100000038000000E3.00000028.backup
    -rw------- 1 debbackup debbackup 353 Feb 5 11:00 dak.WAL.000000010000003B00000090.00000028.backup
    -rw------- 1 debbackup debbackup 350 Feb 5 15:49 dak.WAL.000000010000003B0000009B.00000108.backup
    -rw------- 1 debbackup debbackup 353 Feb 11 10:09 dak.WAL.000000010000003D000000AC.00000028.backup
    -rw------- 1 debbackup debbackup 353 Feb 12 10:25 dak.WAL.000000010000003E00000027.00000178.backup

`.backup` files are created on the postgres server and shipped to the
backup hosts whenever a base backup is initiated. We do some
labelling, so we know which backup host the corresponding tarball
should end up with. For example:

    root@backuphost:/srv/backups/pg/fasolo# cat dak.WAL.000000010000003B00000090.00000028.backup
    START WAL LOCATION: 3B/90000028 (file 000000010000003B00000090)
    STOP WAL LOCATION: 3B/97CF2138 (file 000000010000003B00000097)
    CHECKPOINT LOCATION: 3B/90000098
    BACKUP METHOD: streamed
    BACKUP FROM: master
    START TIME: 2018-02-05 10:25:28 UTC
    LABEL: backuphost.debian.org-20180204-012001-fasolo.debian.org-dak-9.6-backup
    STOP TIME: 2018-02-05 10:59:50 UTC

To fix this, verify that we have a later base tarball, or that we are
fine for some other reason, and remove the corresponding `.backup` file
from the backup host. In the case above, we would remove
`dak.WAL.000000010000003D000000AC.00000028.backup`.

#### Unknown format line

If you get this error in the PostgreSQL restore logs:

    No directories from which read is allowed given on cmdline.
    Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host

It's because the SSH key deployed on the server does not have the
`--read-allowed` argument.

#### Manually copying WAL files

If the restore doesn't work, try to connect by hand:

    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo

#### WAL was generated with wal_level=minimal, data may be missing

If you get this kind of error, it's because you forgot to restore the
`BASE` file first:

    2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG: incomplete startup packet
    2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:44.451 UTC [14316] WARNING: WAL was generated with wal_level=minimal, data may be missing
    2019-10-08 21:25:44.451 UTC [14316] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
    2019-10-08 21:25:44.451 UTC [14316] LOG: consistent recovery state reached at 0/153A3F0
    2019-10-08 21:25:44.451 UTC [14316] LOG: invalid record length at 0/153A3F0: wanted 24, got 0
    2019-10-08 21:25:44.451 UTC [14316] LOG: redo is not required
    2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.549 UTC [14316] LOG: selected new timeline ID: 6
    2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:45.765 UTC [14316] LOG: archive recovery complete
    2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL: the database system is starting up
    2019-10-08 21:25:46.466 UTC [14316] LOG: MultiXact member wraparound protections are now enabled
    2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
    2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started

### WAL-MISSING-AFTER
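When triaging such a warning, a quick way to spot the hole is to
compare consecutive segment numbers in the archive. A self-contained
sketch of that idea, with invented file names (the real archive lives
under `/srv/backups/pg/<host>/` on the backup server):

```shell
# Detect a hole in a WAL sequence by checking that the last 8 hex
# digits of consecutive file names increment by one.
# (Real segment numbering has subtleties across timelines and high
# boundaries; this is only a rough illustration.)
dir=$(mktemp -d)
for seg in A9 AA AB B5; do
    touch "$dir/main.WAL.00000001000000D9000000$seg"
done

prev=""
for f in $(ls "$dir" | sort); do
    cur=$((16#${f: -8}))   # numeric value of the last 8 hex digits
    if [ -n "$prev" ] && [ "$cur" -ne $((prev + 1)) ]; then
        echo "gap after segment $(printf '%08X' "$prev")"
    fi
    prev=$cur
done
```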
Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`: we're missing `AE`, `AF`,
`B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that something
happened between 6:26 and 13:57 on that day.

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` then the filename
   (`main.WAL.00000001000000D9000000AD`), above

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, it is
     harmless, to the extent that the missing timeframe is not
     needed. To clear the warnings, the previous WAL files *do* need
     to be removed by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup's integrity is compromised, and a new base backup
     needs to be performed. See [Running a full
     backup](#running-a-full-backup) above.

Here is an example of the second case above, and recovery:

    root@bungei:~# cd /srv/backups/pg/meronense
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] WAL-MISSING-AFTER: meronense/main.WAL.00000001000005A900000073
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz
    root@bungei:/srv/backups/pg/meronense# ls -altr
    total 98312
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A90000006F
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:49 main.WAL.00000001000005A900000070
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:58 main.WAL.00000001000005A900000071
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000072
    -rw------- 1 torbackup torbackup 16777216 Jun 23 13:59 main.WAL.00000001000005A900000073
    -rw------- 1 torbackup torbackup 16777216 Jun 23 14:14 main.WAL.00000001000005A90000007F
    drwxr-xr-x 10 torbackup torbackup 4096 Jun 23 14:14 ..
    drwxr-xr-x 2 torbackup torbackup 4096 Jun 23 14:14 .
    root@bungei:/srv/backups/pg/meronense# rm main.WAL.00000001000005A90000006F main.WAL.00000001000005A900000070 main.WAL.00000001000005A900000071 main.WAL.00000001000005A900000072 main.WAL.00000001000005A900000073
    root@bungei:/srv/backups/pg/meronense# /usr/lib/nagios/plugins/dsa-check-backuppg -e -v
    [meronense, main] NO-BASE: no base backup found?
    [meronense, main] NO-BACKUP: no backups! (no .backup files found)
    NOT-CONFIGURED: meronense-11
    IGNORED: .nobackup
    IGNORED: lost+found
    IGNORED: BASE-meronense.torproject.org:5432-G0uKzfYhtH.tar.gz

There was a gap between `main.WAL.00000001000005A900000073` and
`main.WAL.00000001000005A90000007F`, so everything up to and
including the former was removed by hand. Then a full backup was
performed. The reason the BASE backup was missing is that this
followed a failed upgrade (see [tpo/tpa/team#40809](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40809)).

#### could not unlink file

TODO: this warning comes up from time to time, is it a problem?

    2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists

#### missing "archive recovery complete" message

A block like this should show up in the
`/var/log/postgresql/postgresql-13-main.log` file:

    2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started

The key entry here is `archive recovery complete`. It *should* show up
in the logs; if it does not, the server might just still be recovering
a WAL file, or it might be `pause`d.

You can confirm what the server is doing by looking at the processes.
For example, this one is still recovering a WAL file:

    root@meronense-backup-01:~# systemctl status postgresql@13-main.service
    ● postgresql@13-main.service - PostgreSQL Cluster 13-main
         Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
         Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
        Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
       Main PID: 67840 (postgres)
          Tasks: 5 (limit: 9510)
         Memory: 50.0M
            CPU: 626ms
         CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
                 ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
                 ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
                 ├─67851 postgres: 13/main: checkpointer
                 ├─67853 postgres: 13/main: background writer
                 └─67855 postgres: 13/main: stats collector

... because there's a process doing:

    67842 postgres: 13/main: startup recovering 0000000100000600000000F5

In that case, it was stuck in "pause" mode, as the logs indicated:

    2022-10-27 15:08:54.882 UTC [67933] LOG: starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv6 address "::1", port 5432
    2022-10-27 15:08:54.882 UTC [67933] LOG: listening on IPv4 address "127.0.0.1", port 5432
    2022-10-27 15:08:54.998 UTC [67933] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    2022-10-27 15:08:55.236 UTC [67939] LOG: database system was shut down in recovery at 2022-10-27 15:08:54 UTC
    2022-10-27 15:08:55.911 UTC [67939] LOG: starting point-in-time recovery to 2022-10-01 00:00:00+00
    2022-10-27 15:08:56.764 UTC [67939] LOG: restored log file "0000000100000600000000F4" from archive
    2022-10-27 15:08:57.316 UTC [67939] LOG: redo starts at 600/F4000028
    2022-10-27 15:08:58.497 UTC [67939] LOG: restored log file "0000000100000600000000F5" from archive
    2022-10-27 15:08:59.119 UTC [67939] LOG: consistent recovery state reached at 600/F50051F0
    2022-10-27 15:08:59.119 UTC [67933] LOG: database system is ready to accept read only connections
    2022-10-27 15:08:59.120 UTC [67939] LOG: recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
    2022-10-27 15:08:59.120 UTC [67939] LOG: pausing at the end of recovery
    2022-10-27 15:08:59.120 UTC [67939] HINT: Execute pg_wal_replay_resume() to promote.

The `pg_wal_replay_resume()` suggested in the hint is not actually the
right call to use here, however: it would put the server back into
recovery mode, where it would start fetching WAL files again. That is
useful for replicated setups, but this is not such a case.

In the above scenario, a `recovery_target_time` was added but without
a `recovery_target_action`, which led the server to be paused instead
of resuming normal operation.

The correct way to recover here is to call the `pg_promote()`
function:

    sudo -u postgres psql -c 'SELECT pg_promote();'

## Disaster recovery

If a PostgreSQL server is destroyed completely or in part, we need to
restore from backups. We have two ways of restoring PostgreSQL
backups: the **direct** and **indirect** procedures.

The **direct** procedure *streams* the backup files directly from the
backup server. The `BASE` file is streamed into the `tar` command to
restore the base snapshot, then the PostgreSQL restore command copies
each log directly from the backup server as well. It requires a
minimal amount of space, as files are not copied to a temporary
location on the new server. The downside is that it might be more
difficult to deploy and diagnose, as it has more moving parts.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
direct procedure. It might also be slower than the direct restore
procedure, because files are copied around twice: once from the backup
server, and again when loaded into the database.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

See the procedures in [direct backup recovery](#direct-backup-recovery) and [indirect backup
recovery](#indirect-backup-recovery), above.
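The streaming idea behind the direct procedure can be sketched in
miniature. Here the producer is a local `cat` over temporary
directories; on a real restore, the producer would be `ssh` reading
the `BASE` tarball from the backup server (hostnames and paths are
illustrative, not the actual commands):

```shell
# Direct-restore idea in miniature: the BASE tarball is extracted
# straight from a stream, never stored on the destination first.
src=$(mktemp -d); dst=$(mktemp -d)
echo '13' > "$src/PG_VERSION"
tar -C "$src" -czf "$src.tar.gz" PG_VERSION

# producer | tar: on a real restore the producer would be something
# like "ssh torbackup@bungei cat /srv/backups/pg/<host>/<BASE file>"
cat "$src.tar.gz" | tar -C "$dst" -xz
echo "restored PG_VERSION: $(cat "$dst/PG_VERSION")"
```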
# Reference