|
|
|
[[_TOC_]]
|
|
|
|
|
|
|
|
[PostgreSQL](https://www.postgresql.org/) is an advanced database server that is robust and
|
|
|
|
fast, although possibly less well-known and popular than it's eternal
|
|
|
|
rival in the free software world, MySQL.
|
|
|
|
|
|
|
|
Tutorial
|
|
|
|
========
|
|
|
|
|
|
|
|
Those are quick reminders on easy things to do in a cluster.
|
|
|
|
|
|
|
|
Connecting
|
|
|
|
----------
|
|
|
|
|
|
|
|
Our PostgreSQL setup is fairly standard so connecting to the database
|
|
|
|
is like any other Debian machine:
|
|
|
|
|
|
|
|
sudo -u postres psql
|
|
|
|
|
|
|
|
This drops you in a [psql shell](https://www.postgresql.org/docs/11/app-psql.html) where you can issue SQL queries
|
|
|
|
and so on.
|
|
|
|
|
|
|
|
Howto
|
|
|
|
=====
|
|
|
|
|
|
|
|
Deployment
|
|
|
|
----------
|
|
|
|
|
|
|
|
A PostgreSQL server should be deployed with puppet, using the
|
|
|
|
`postgresql` module. Make sure you also include the
|
|
|
|
`postgres::backup_source` class so that it's registered with the
|
|
|
|
backup server, see the `dip::database` class for an example on how to
|
|
|
|
deploy a server.
|
|
|
|
|
|
|
|
See the [howto/puppet](howto/puppet) documentation for more information on how to deploy
|
|
|
|
Puppet things.
|
|
|
|
|
|
|
|
TODO: turn this in a Tutorial by making it more straightforward to
|
|
|
|
deploy a cluster, for example just by including a single class.
|
|
|
|
|
|
|
|
Find what is taking up space
|
|
|
|
----------------------------
|
|
|
|
|
|
|
|
This will report size information for all tables, in both raw bytes and "pretty" form:
|
|
|
|
|
|
|
|
SELECT *, pg_size_pretty(total_bytes) AS total
|
|
|
|
, pg_size_pretty(index_bytes) AS INDEX
|
|
|
|
, pg_size_pretty(toast_bytes) AS toast
|
|
|
|
, pg_size_pretty(table_bytes) AS TABLE
|
|
|
|
FROM (
|
|
|
|
SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
|
|
|
|
SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
|
|
|
|
, c.reltuples AS row_estimate
|
|
|
|
, pg_total_relation_size(c.oid) AS total_bytes
|
|
|
|
, pg_indexes_size(c.oid) AS index_bytes
|
|
|
|
, pg_total_relation_size(reltoastrelid) AS toast_bytes
|
|
|
|
FROM pg_class c
|
|
|
|
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
|
|
|
|
WHERE relkind = 'r'
|
|
|
|
) a
|
|
|
|
) a ORDER BY total_bytes;
|
|
|
|
|
|
|
|
Same with databases:
|
|
|
|
|
|
|
|
SELECT d.datname AS Name, pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
|
|
|
|
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
|
|
|
|
THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
|
|
|
|
ELSE 'No Access'
|
|
|
|
END AS SIZE
|
|
|
|
FROM pg_catalog.pg_database d
|
|
|
|
ORDER BY
|
|
|
|
CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
|
|
|
|
THEN pg_catalog.pg_database_size(d.datname)
|
|
|
|
ELSE NULL
|
|
|
|
END DESC -- nulls first
|
|
|
|
LIMIT 20;
|
|
|
|
|
|
|
|
Source: [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Disk_Usage). See also [the upstream manual](https://www.postgresql.org/docs/11/disk-usage.html).
|
|
|
|
|
|
|
|
Running a full backup
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
Backups are normally automatically ran on the backup server (currently
|
|
|
|
`bungei`) but you can also run a backup by hand. This will run a
|
|
|
|
backup of `meronense`, for example:
|
|
|
|
|
|
|
|
sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))
|
|
|
|
|
|
|
|
See the [reference](#Reference) section for details.
|
|
|
|
|
|
|
|
Picking a restore procedures
|
|
|
|
----------------------------
|
|
|
|
|
|
|
|
We have two ways of restoring PostgreSQL backups, **indirect** and
|
|
|
|
**direct**.
|
|
|
|
|
|
|
|
The **indirect** procedures first transfers the backup files to the
|
|
|
|
server and then runs the restore, so it require more space than the
|
|
|
|
direct procedure. It might also be slower than the direct restore
|
|
|
|
procedure because files are copied around twice: once from the backup
|
|
|
|
server, and again loaded in the database.
|
|
|
|
|
|
|
|
The **direct** procedure *streams* the backup files directly from the
|
|
|
|
backup server. The `BASE` file is streamed into the `tar` command for
|
|
|
|
restore of the base snapshot, then the PostgreSQL restore command
|
|
|
|
copies each log directly from the backup server as well. It requires
|
|
|
|
minimal amount of space, as files are not copied to a temporary
|
|
|
|
location on the new server. The downside is it might be more difficult
|
|
|
|
to deploy and diagnose, as it has more moving parts.
|
|
|
|
|
|
|
|
Both procedures are adaptations of the [official recovery
|
|
|
|
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be refered to for more information.
|
|
|
|
|
|
|
|
Indirect restore procedures
|
|
|
|
---------------------------
|
|
|
|
|
|
|
|
Reminder: this procedure **copies** the backup files from the backup
|
|
|
|
server to the database server and restores from those.
|
|
|
|
|
|
|
|
1. First, you will need to give the backup server access to the new
|
|
|
|
postgresql server, which we'll call the "client" for now. First,
|
|
|
|
login to the client and allow the backup server to connect, and
|
|
|
|
show the public hostkey:
|
|
|
|
|
|
|
|
iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
|
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
|
|
2. Then load the server's private key in an agent and show it to
|
|
|
|
allow on the client. On the server, assuming `$IP` is the IP of
|
|
|
|
the client and `$HOSTKEY` is its hostkey:
|
|
|
|
|
|
|
|
ssh-agent bash
|
|
|
|
ssh-add /etc/ssh/ssh_host_rsa_key
|
|
|
|
mkdir -p ~/.ssh
|
|
|
|
echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
|
|
|
|
cat /etc/ssh/ssh_host_rsa_key.pub
|
|
|
|
|
|
|
|
3. And on the client, allow the server `$HOSTKEY` (the above `cat
|
|
|
|
/etc/ssh/ssh_host_rsa_key.pub` on the backup server):
|
|
|
|
|
|
|
|
echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more
|
|
|
|
|
|
|
|
4. Then, we can transfer files over from the backup server to the
|
|
|
|
client:
|
|
|
|
|
|
|
|
cd /srv/backups/pg
|
|
|
|
rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore
|
|
|
|
|
|
|
|
5. Disable Puppet so you have control on when PostgreSQL is running:
|
|
|
|
|
|
|
|
puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
|
|
|
|
|
|
|
|
6. Then, on the client, install the software, stop the server and
|
|
|
|
move the template cluster out of the way:
|
|
|
|
|
|
|
|
apt install postgres rsync
|
|
|
|
service postgresql stop
|
|
|
|
mv /var/lib/postgresql/*/main{,.orig}
|
|
|
|
su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres
|
|
|
|
|
|
|
|
We'll be restoring files in that `main` directory.
|
|
|
|
|
|
|
|
Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
|
|
|
|
backup! You cannot restore across versions. This might mean
|
|
|
|
installing from backports or an older version of Debian.
|
|
|
|
|
|
|
|
7. Then you need to find the right `BASE` file to restore from. Each
|
|
|
|
`BASE` file has a timestamp in its filename, so just sorting them
|
|
|
|
by name should be enough to find the latest one. Uncompress the
|
|
|
|
`BASE` file in place:
|
|
|
|
|
|
|
|
cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'
|
|
|
|
|
|
|
|
(Use `pv` instead of `cat` for a progress bar with large backups.)
|
|
|
|
|
|
|
|
8. Make sure the `pg_xlog` directory doesn't contain any files.
|
|
|
|
|
|
|
|
9. Then you need to create a `recovery.conf` file in
|
|
|
|
`/var/lib/postgresql/9.6/main` that will tell postgres where to
|
|
|
|
find the WAL files. At least the `restore_command` need to be
|
|
|
|
specified. Something like this should work:
|
|
|
|
|
|
|
|
restore_command = 'cp /var/lib/postgresql/restore/subnotablie/main.WAL.%f %p'
|
|
|
|
|
|
|
|
You can specify a specific recovery point in the `recovery.conf`,
|
|
|
|
see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.
|
|
|
|
|
|
|
|
10. Then start the server and look at the logs to follow the recovery
|
|
|
|
process:
|
|
|
|
|
|
|
|
service postgresql start
|
|
|
|
tail -f /var/log/postgresql/*
|
|
|
|
|
|
|
|
11. re-enable Puppet, which should clean up the extra SSH key and
|
|
|
|
firewall rules:
|
|
|
|
|
|
|
|
puppet agent -t
|
|
|
|
|
|
|
|
make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
|
|
|
|
|
|
|
|
When the restore succeeds, the `recovery.conf` file will be renamed to
|
|
|
|
`recovery.done` and you will see something like:
|
|
|
|
|
|
|
|
2019-08-12 21:36:53.453 UTC [16901] LOG: selected new timeline ID: 2
|
|
|
|
2019-08-12 21:36:53.470 UTC [16901] LOG: archive recovery complete
|
|
|
|
cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
|
|
|
|
2019-08-12 21:36:53.577 UTC [16901] LOG: MultiXact member wraparound protections are now enabled
|
|
|
|
2019-08-12 21:36:53.584 UTC [16900] LOG: database system is ready to accept connections
|
|
|
|
|
|
|
|
Ignore the error from `cp` complaining about the `.history` file, it's
|
|
|
|
harmless.
|
|
|
|
|
|
|
|
### Troubleshooting
|
|
|
|
|
|
|
|
If you find the following error in the logs:
|
|
|
|
|
|
|
|
FATAL: could not locate required checkpoint record
|
|
|
|
|
|
|
|
It's because postgres cannot find the WAL logs to restore from. There
|
|
|
|
could be many causes for this, but the ones I stumbled upon were:
|
|
|
|
|
|
|
|
* wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
|
|
|
|
probably the case if you don't see an error from `cp`
|
|
|
|
* wrong permissions on the archive (put the WAL files in `~postgres`,
|
|
|
|
not `~root`)
|
|
|
|
* wrong path or pattern for `restore_command` (double-check the path
|
|
|
|
and make sure to include the right prefix, e.g. `main.WAL`)
|
|
|
|
|
|
|
|
Direct restore procedure
|
|
|
|
------------------------
|
|
|
|
|
|
|
|
Reminder: this procedure **streams** the files from the backup server
|
|
|
|
to the database server and restore directly. See above for more
|
|
|
|
information on how to pick a restore procedure.
|
|
|
|
|
|
|
|
1. First, disable Puppet so you have control on when PostgreSQL is
|
|
|
|
running:
|
|
|
|
|
|
|
|
puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
|
|
|
|
|
|
|
|
2. Then install the right PostgreSQL version and stop the server:
|
|
|
|
|
|
|
|
apt install postgres-9.6
|
|
|
|
service postgresql stop
|
|
|
|
|
|
|
|
Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
|
|
|
|
backup! You cannot restore across versions. This might mean
|
|
|
|
installing from backports or an older version of Debian.
|
|
|
|
|
|
|
|
3. On that new PostgreSQL server, show the `postgres` server public
|
|
|
|
key, creating it if missing:
|
|
|
|
|
|
|
|
[ -f ~postgres/.ssh/id_rsa.pub ] || sudo -u postgres ssh-keygen
|
|
|
|
cat ~postgres/.ssh/*.pub
|
|
|
|
|
|
|
|
4. Then on the backup server, allow the user to access backups of the
|
|
|
|
old server:
|
|
|
|
|
|
|
|
echo "command="/usr/local/bin/debbackup-ssh-wrap --read-allow=/srv/backups/pg/$OLDSERVER $CLIENT",restrict $HOSTKEY" > /etc/ssh/userkeys/torbackup.more
|
|
|
|
|
|
|
|
This assumes we connect to a *previous* server's backups, named
|
|
|
|
`$OLDSERVER` (e.g. `dictyotum`). The `$HOSTKEY` is the public key
|
|
|
|
found on the postgres server above.
|
|
|
|
|
|
|
|
Warning: the above will fail if the key is already present in
|
|
|
|
`/etc/ssh/userkeys/torbackup`, edit the key in there instead in
|
|
|
|
that case.
|
|
|
|
|
|
|
|
5. Then you need to find the right `BASE` file to restore from. Each
|
|
|
|
`BASE` file has a timestamp in its filename, so just sorting them
|
|
|
|
by name should be enough to find the latest one. Uncompress the
|
|
|
|
`BASE` file in place, as the `postgres` user:
|
|
|
|
|
|
|
|
sudo -u postgres -i
|
|
|
|
mkdir -m 0700 /var/lib/postgresql/9.6/main
|
|
|
|
ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER $BASEFILE \
|
|
|
|
| sed '0,/^$/d' \
|
|
|
|
| tar -C /var/lib/postgresql/9.6/main -x -z -f -
|
|
|
|
|
|
|
|
Add a `pv` before the `tar` call in the pipeline for a progress bar
|
|
|
|
with large backups, and replace:
|
|
|
|
|
|
|
|
* `$BACKUPSERVER` with the backupserver name and username
|
|
|
|
(currently `bungei.torproject.org`)
|
|
|
|
* `$OLDSERVER` with the old server's (short) hostname
|
|
|
|
(e.g. `dictyotum`)
|
|
|
|
* `$CLUSTERNAME` with the name of the cluster to restore
|
|
|
|
(e.g. usually `main`)
|
|
|
|
* `$BASEFILE` with something like
|
|
|
|
`$CLUSTERNAME.BASE.$BACKUPSERVER-20191004-062226-$OLDSERVER.torproject.org-$CLUSTERNAME-9.6-backup.tar.gz`
|
|
|
|
or, fully expanded:
|
|
|
|
`bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz`
|
|
|
|
where the `$CLUSTERNAME` is the non-default `bacula`
|
|
|
|
|
|
|
|
The above might hang for a while (yes, maybe even a minute) in the
|
|
|
|
beginning, but it should eventually get started and complete. The
|
|
|
|
"hang" is because `retrieve-file` sends a header which includes a
|
|
|
|
`sha512sum` and it takes a while to compute. The `sed` command in
|
|
|
|
that pipeline is there to skip that header. Example of a
|
|
|
|
successful restore operated in [#31786][]:
|
|
|
|
|
|
|
|
ssh torbackup@bungei.torproject.org bacula-director-01 retrieve-file pg dictyotum bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz | pv -s 33G | sed '0,/^$/d' | tar -C /var/lib/postgresql/9.6/main -x -z -f -
|
|
|
|
|
|
|
|
If the above doesn't work, use the indirect procedure to restore
|
|
|
|
the `BASE` file.
|
|
|
|
|
|
|
|
[#31786]: https://bugs.torproject.org/31786
|
|
|
|
|
|
|
|
6. Make sure the `pg_xlog` directory doesn't contain any files.
|
|
|
|
|
|
|
|
rm -rf -- /var/lib/postgresql/9.6/main/pg_xlog/*
|
|
|
|
|
|
|
|
7. Then you need to create a `recovery.conf` file in
|
|
|
|
`/var/lib/postgresql/9.6/main` that will tell postgres where to
|
|
|
|
find the WAL files. At least the `restore_command` need to be
|
|
|
|
specified. Something like this should work:
|
|
|
|
|
|
|
|
restore_command = '/usr/local/bin/pg-receive-file-from-backup $OLDSERVER $CLUSTERNAME.WAL.%f %p'
|
|
|
|
|
|
|
|
... where:
|
|
|
|
|
|
|
|
* `$OLDSERVER` should be replaced by the previous postgresql
|
|
|
|
server name (e.g. `dictyotum`)
|
|
|
|
* `$CLUSTERNAME` should be replaced by the previous cluster name
|
|
|
|
(e.g. `main`, generally)
|
|
|
|
|
|
|
|
You can specify a specific recovery point in the `recovery.conf`,
|
|
|
|
see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information. Also
|
|
|
|
make sure the file is owned by postgres:
|
|
|
|
|
|
|
|
$EDITOR /var/lib/postgresql/9.6/main/recovery.conf
|
|
|
|
chown postgres /var/lib/postgresql/9.6/main/recovery.conf
|
|
|
|
|
|
|
|
8. Then start the server and look at the logs to follow the recovery
|
|
|
|
process:
|
|
|
|
|
|
|
|
service postgresql start
|
|
|
|
tail -f /var/log/postgresql/*
|
|
|
|
|
|
|
|
You should see something like this:
|
|
|
|
|
|
|
|
2019-10-09 21:17:47.335 UTC [9632] LOG: database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
|
|
|
|
2019-10-09 21:17:47.517 UTC [9632] LOG: starting archive recovery
|
|
|
|
2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG: incomplete startup packet
|
|
|
|
2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:49.354 UTC [9632] LOG: restored log file "00000001000005B200000074" from archive
|
|
|
|
2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:50.836 UTC [9632] LOG: redo starts at 5B2/74000028
|
|
|
|
2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-09 21:20:35.790 UTC [9632] LOG: restored log file "00000001000005B20000009F" from archive
|
|
|
|
2019-10-09 21:20:37.745 UTC [9632] LOG: restored log file "00000001000005B2000000A0" from archive
|
|
|
|
2019-10-09 21:20:39.648 UTC [9632] LOG: restored log file "00000001000005B2000000A1" from archive
|
|
|
|
2019-10-09 21:20:41.738 UTC [9632] LOG: restored log file "00000001000005B2000000A2" from archive
|
|
|
|
2019-10-09 21:20:43.773 UTC [9632] LOG: restored log file "00000001000005B2000000A3" from archive
|
|
|
|
|
|
|
|
... and so on.
|
|
|
|
|
|
|
|
TODO: this warning comes up from time to time, problem?
|
|
|
|
|
|
|
|
2019-10-09 23:47:13.446 UTC [16973] LOG: could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists
|
|
|
|
|
|
|
|
Then the recovery will complete with something like this:
|
|
|
|
|
|
|
|
2019-10-10 01:30:55.460 UTC [16953] LOG: redo done at 5B8/9C5BE738
|
|
|
|
2019-10-10 01:30:55.460 UTC [16953] LOG: last completed transaction was at log time 2019-10-10 01:04:23.238233+00
|
|
|
|
2019-10-10 01:31:03.536 UTC [16953] LOG: restored log file "00000001000005B80000009C" from archive
|
|
|
|
2019-10-10 01:31:06.458 UTC [16953] LOG: selected new timeline ID: 2
|
|
|
|
2019-10-10 01:31:17.485 UTC [16953] LOG: archive recovery complete
|
|
|
|
2019-10-10 01:32:11.975 UTC [16953] LOG: MultiXact member wraparound protections are now enabled
|
|
|
|
2019-10-10 01:32:12.438 UTC [16950] LOG: database system is ready to accept connections
|
|
|
|
2019-10-10 01:32:12.439 UTC [26501] LOG: autovacuum launcher started
|
|
|
|
|
|
|
|
9. Then remove the temporary SSH access on the backup server, either
|
|
|
|
by removing the `.more` key file or restoring the previous key
|
|
|
|
configuration:
|
|
|
|
|
|
|
|
rm /etc/ssh/userkeys/torbackup.more
|
|
|
|
|
|
|
|
10. re-enable Puppet:
|
|
|
|
|
|
|
|
puppet agent -t
|
|
|
|
|
|
|
|
### Troubleshooting
|
|
|
|
|
|
|
|
If you get this error in the PostgreSQL restore logs:
|
|
|
|
|
|
|
|
No directories from which read is allowed given on cmdline.
|
|
|
|
Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host
|
|
|
|
|
|
|
|
It's because the SSH keys deployed on the server does not have the
|
|
|
|
`--read-allowed` argument.
|
|
|
|
|
|
|
|
If the restore doesn't work, try to connect by hand:
|
|
|
|
|
|
|
|
sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo
|
|
|
|
|
|
|
|
If you get this kind of errors, it's because you forgot to restore the
|
|
|
|
`BASE` file first:
|
|
|
|
|
|
|
|
2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG: incomplete startup packet
|
|
|
|
2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-08 21:25:44.451 UTC [14316] WARNING: WAL was generated with wal_level=minimal, data may be missing
|
|
|
|
2019-10-08 21:25:44.451 UTC [14316] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
|
|
|
|
2019-10-08 21:25:44.451 UTC [14316] LOG: consistent recovery state reached at 0/153A3F0
|
|
|
|
2019-10-08 21:25:44.451 UTC [14316] LOG: invalid record length at 0/153A3F0: wanted 24, got 0
|
|
|
|
2019-10-08 21:25:44.451 UTC [14316] LOG: redo is not required
|
|
|
|
2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-08 21:25:45.549 UTC [14316] LOG: selected new timeline ID: 6
|
|
|
|
2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-08 21:25:45.765 UTC [14316] LOG: archive recovery complete
|
|
|
|
2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL: the database system is starting up
|
|
|
|
2019-10-08 21:25:46.466 UTC [14316] LOG: MultiXact member wraparound protections are now enabled
|
|
|
|
2019-10-08 21:25:46.467 UTC [14315] LOG: database system is ready to accept connections
|
|
|
|
2019-10-08 21:25:46.467 UTC [14351] LOG: autovacuum launcher started
|
|
|
|
|
|
|
|
See also the "Direct restore procedure" troubleshooting section, which
|
|
|
|
also applies here.
|
|
|
|
|
|
|
|
Monitoring warnings
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
TODO: there's some more information about backup handling in the
|
|
|
|
[Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/).
|
|
|
|
|
|
|
|
### WAL-MISSING-AFTER
|
|
|
|
|
|
|
|
Example message:
|
|
|
|
|
|
|
|
[troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD
|
|
|
|
|
|
|
|
This means that a WAL file is missing after the specified
|
|
|
|
file. Specifically, in the above scenario, the following files are
|
|
|
|
present, in chronological order:
|
|
|
|
|
|
|
|
-rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
|
|
|
|
-rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
|
|
|
|
-rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
|
|
|
|
-rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
|
|
|
|
-rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5
|
|
|
|
|
|
|
|
Notice the jump from `...AD` to `...B5`. We're missing `AE`, `AF`,
|
|
|
|
`B1`, `B2`, `B3`, `B4`, specifically. We can also tell that something
|
|
|
|
happened between 6:26 and 13:57 on that day.
|
|
|
|
|
|
|
|
1. List the files in chronological order:
|
|
|
|
|
|
|
|
ls -ltr /srv/backups/pg/troodi/ | less
|
|
|
|
|
|
|
|
2. Find the file warned about, using `/` then the filename
|
|
|
|
(`main.WAL.00000001000000D9000000AD`), above
|
|
|
|
|
|
|
|
3. Look for a `.BASE.` file *following* the missing file, using `/`
|
|
|
|
again
|
|
|
|
|
|
|
|
4. Either:
|
|
|
|
|
|
|
|
* if a `.BASE.` backup is present after the missing files, it is
|
|
|
|
harmless to the extent that the missing timeframe is not
|
|
|
|
necessary. to remove the warnings, previous WAL files *do* need
|
|
|
|
to be removed, by hand.
|
|
|
|
|
|
|
|
* if a `.BASE.` backup is *not* present after the missing files,
|
|
|
|
the backup integrity is faulty, and a new base backup needs to
|
|
|
|
be performed. See [Running a full
|
|
|
|
backup](#Running_a_full_backup) above.
|
|
|
|
|
|
|
|
Reference
|
|
|
|
=========
|
|
|
|
|
|
|
|
Backup design
|
|
|
|
-------------
|
|
|
|
|
|
|
|
We use upstream's [Continuous Archiving and Point-in-Time Recovery
|
|
|
|
(PITR)](https://www.postgresql.org/docs/11/continuous-archiving.html)
|
|
|
|
which relies on postgres's "write-ahead log" (WAL) to write regular
|
|
|
|
"transaction logs" of the cluster to the backup host. (Think of
|
|
|
|
transaction logs as incremental backups.) This is configured in
|
|
|
|
`postgresql.conf`, using a line like this:
|
|
|
|
|
|
|
|
archive_command = '/usr/local/bin/pg-backup-file main WAL %p'
|
|
|
|
|
|
|
|
That is a site-specific script which reads a config file in
|
|
|
|
`/etc/dsa/pg-backup-file.conf` where the backup host is specified
|
|
|
|
(currently `torbackup@bungei.torproject.org`). That command passes the
|
|
|
|
WAL logs onto the backup server, over SSH. A WAL file is shipped
|
|
|
|
immediately when it is full (16MB of data by default) but no later
|
|
|
|
than 6 hours (varies, see `archive_timeout` on each host) after it was
|
|
|
|
first written to. On the backup server, the `command` is set to
|
|
|
|
`debbackup-ssh-wrap` in the `authorized_keys` file and takes the
|
|
|
|
`store-file pg` argument to write the file to the right location.
|
|
|
|
|
|
|
|
WAL files are written to `/srv/backups/pg/$HOSTNAME` where `$HOSTNAME`
|
|
|
|
(without `.torproject.org`). WAL files are prefixed with `main.WAL.`
|
|
|
|
(where `main` is the cluster name) with a long unique string after,
|
|
|
|
e.g. `main.WAL.00000001000000A40000007F`.
|
|
|
|
|
|
|
|
For that system to work, we also need *full* backups to happen on a
|
|
|
|
regular basis. This happens straight from the backup server (still
|
|
|
|
`bungei`) which connects to the various postgres servers and runs a
|
|
|
|
[pg_basebackup](https://manpages.debian.org/pg_basebackup) to get a complete snapshot of the cluster. This
|
|
|
|
happens *weekly* (every 7 to 10 days) in the wrapper
|
|
|
|
`postgres-make-base-backups`, which is a wrapper (based on a Puppet
|
|
|
|
`concat::fragment` template) that calls
|
|
|
|
`postgres-make-one-base-backup` for each postgres server.
|
|
|
|
|
|
|
|
The base files are written to the same directory as WAL file and are
|
|
|
|
named using the template:
|
|
|
|
|
|
|
|
$CLUSTER.BASE.$SERVER_FQDN-$DATE-$ID-$CLIENT_FQDN-$CLUSTER-$VERSION-backup.tar.gz
|
|
|
|
|
|
|
|
... for example:
|
|
|
|
|
|
|
|
main.BASE.bungei.torproject.org-20190804-214510-troodi.torproject.org-main-9.6-backup.tar.gz
|
|
|
|
|
|
|
|
All of this works because SSH public keys and postgres credentials are
|
|
|
|
passed around between servers. That is handled in the Puppet
|
|
|
|
`postgresql` module for the most part, but some bits might still be
|
|
|
|
configured manually on some servers.
|
|
|
|
|
|
|
|
Backups are checked for freshness in Nagios using the
|
|
|
|
`dsa-check-backuppg` plugin with its configuration stored in
|
|
|
|
`/etc/dsa/postgresql-backup/dsa-check-backuppg.conf.d/`, per
|
|
|
|
cluster. The Nagios plugin also takes care of expiring backups when
|
|
|
|
they are healthy.
|
|
|
|
|
|
|
|
Other documentation
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
See also:
|
|
|
|
|
|
|
|
* [PostgreSQL official documentation](https://www.postgresql.org/docs/manuals/)
|
|
|
|
* [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Main_Page)
|
|
|
|
* [Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/) |