Changes

anarcat · f6c6b238
--- a/howto/postgresql.md
+++ b/howto/postgresql.md
+[[_TOC_]]
+[PostgreSQL](https://www.postgresql.org/) is an advanced database server that is robust and
+fast, although possibly less well-known and popular than it's eternal
+rival in the free software world, MySQL.
+Tutorial
+========
+Those are quick reminders on easy things to do in a cluster.
+Connecting
+----------
+Our PostgreSQL setup is fairly standard so connecting to the database
+is like any other Debian machine:
+    sudo -u postres psql
+This drops you in a [psql shell](https://www.postgresql.org/docs/11/app-psql.html) where you can issue SQL queries
+and so on.
+Howto
+=====
+Deployment
+----------
+A PostgreSQL server should be deployed with puppet, using the
+`postgresql` module. Make sure you also include the
+`postgres::backup_source` class so that it's registered with the
+backup server, see the `dip::database` class for an example on how to
+deploy a server.
+See the [howto/puppet](howto/puppet) documentation for more information on how to deploy
+Puppet things.
+TODO: turn this in a Tutorial by making it more straightforward to
+deploy a cluster, for example just by including a single class.
+Find what is taking up space
+----------------------------
+This will report size information for all tables, in both raw bytes and "pretty" form:
+    SELECT *, pg_size_pretty(total_bytes) AS total
+        , pg_size_pretty(index_bytes) AS INDEX
+        , pg_size_pretty(toast_bytes) AS toast
+        , pg_size_pretty(table_bytes) AS TABLE
+      FROM (
+      SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
+          SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
+                  , c.reltuples AS row_estimate
+                  , pg_total_relation_size(c.oid) AS total_bytes
+                  , pg_indexes_size(c.oid) AS index_bytes
+                  , pg_total_relation_size(reltoastrelid) AS toast_bytes
+              FROM pg_class c
+              LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+              WHERE relkind = 'r'
+      ) a
+    ) a ORDER BY total_bytes;
+Same with databases:
+    SELECT d.datname AS Name,  pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
+        CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
+            THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
+            ELSE 'No Access'
+        END AS SIZE
+    FROM pg_catalog.pg_database d
+        ORDER BY
+        CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
+            THEN pg_catalog.pg_database_size(d.datname)
+            ELSE NULL
+        END DESC -- nulls first
+        LIMIT 20;
+Source: [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Disk_Usage). See also [the upstream manual](https://www.postgresql.org/docs/11/disk-usage.html).
+Running a full backup
+---------------------
+Backups are normally automatically ran on the backup server (currently
+`bungei`) but you can also run a backup by hand. This will run a
+backup of `meronense`, for example:
+    sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))
+See the [reference](#Reference) section for details.
+Picking a restore procedures
+----------------------------
+We have two ways of restoring PostgreSQL backups, **indirect** and
+**direct**.
+The **indirect** procedures first transfers the backup files to the
+server and then runs the restore, so it require more space than the
+direct procedure. It might also be slower than the direct restore
+procedure because files are copied around twice: once from the backup
+server, and again loaded in the database.
+The **direct** procedure *streams* the backup files directly from the
+backup server. The `BASE` file is streamed into the `tar` command for
+restore of the base snapshot, then the PostgreSQL restore command
+copies each log directly from the backup server as well. It requires
+minimal amount of space, as files are not copied to a temporary
+location on the new server. The downside is it might be more difficult
+to deploy and diagnose, as it has more moving parts.
+Both procedures are adaptations of the [official recovery
+procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be refered to for more information.
+Indirect restore procedures
+---------------------------
+Reminder: this procedure **copies** the backup files from the backup
+server to the database server and restores from those.
+ 1. First, you will need to give the backup server access to the new
+    postgresql server, which we'll call the "client" for now. First,
+    login to the client and allow the backup server to connect, and
+    show the public hostkey:
+        iptables -I INPUT -s $BACKUP_SERVER -j ACCEPT
+        cat /etc/ssh/ssh_host_rsa_key.pub
+ 2. Then load the server's private key in an agent and show it to
+    allow on the client. On the server, assuming `$IP` is the IP of
+    the client and `$HOSTKEY` is its hostkey:
+        ssh-agent bash
+        ssh-add /etc/ssh/ssh_host_rsa_key
+        mkdir -p ~/.ssh
+        echo "$IP $HOSTKEY" >> ~/.ssh/known_hosts
+        cat /etc/ssh/ssh_host_rsa_key.pub
+ 3. And on the client, allow the server `$HOSTKEY` (the above `cat
+    /etc/ssh/ssh_host_rsa_key.pub` on the backup server):
+        echo "$HOSTKEY" >> /etc/ssh/userkeys/root.more
+ 4. Then, we can transfer files over from the backup server to the
+    client:
+        cd /srv/backups/pg
+        rsync -aP $CLIENT $CLIENT:/var/lib/postgresql/restore
+ 5. Disable Puppet so you have control on when PostgreSQL is running:
+        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
+ 6. Then, on the client, install the software, stop the server and
+    move the template cluster out of the way:
+        apt install postgres rsync
+        service postgresql stop
+        mv /var/lib/postgresql/*/main{,.orig}
+        su -c 'mkdir -m 0700 /var/lib/postgresql/9.6/main' postgres
+    We'll be restoring files in that `main` directory.
+    Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
+    backup! You cannot restore across versions. This might mean
+    installing from backports or an older version of Debian.
+ 7. Then you need to find the right `BASE` file to restore from. Each
+    `BASE` file has a timestamp in its filename, so just sorting them
+    by name should be enough to find the latest one. Uncompress the
+    `BASE` file in place:
+        cat ~postgres/restore/$CLIENT/main.BASE.bungei.torproject.org-20190805-145239-$CLIENT.torproject.org-main-9.6-backup.tar.gz | su postgres -c 'tar -C /var/lib/postgresql/9.6/main -x -z -f -'
+    (Use `pv` instead of `cat` for a progress bar with large backups.)
+ 8. Make sure the `pg_xlog` directory doesn't contain any files.
+ 9. Then you need to create a `recovery.conf` file in
+    `/var/lib/postgresql/9.6/main` that will tell postgres where to
+    find the WAL files. At least the `restore_command` need to be
+    specified. Something like this should work:
+        restore_command = 'cp /var/lib/postgresql/restore/subnotablie/main.WAL.%f %p'
+    You can specify a specific recovery point in the `recovery.conf`,
+    see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information.
+ 10. Then start the server and look at the logs to follow the recovery
+     process:
+        service postgresql start
+        tail -f /var/log/postgresql/*
+ 11. re-enable Puppet, which should clean up the extra SSH key and
+     firewall rules:
+        puppet agent -t
+     make sure it's okay in `/etc/ssh/userkeys/root` and `iptables -L`.
+When the restore succeeds, the `recovery.conf` file will be renamed to
+`recovery.done` and you will see something like:
+    2019-08-12 21:36:53.453 UTC [16901] LOG:  selected new timeline ID: 2
+    2019-08-12 21:36:53.470 UTC [16901] LOG:  archive recovery complete
+    cp: cannot stat '/var/lib/postgresql/restore/subnotablie/main.WAL.00000001.history': No such file or directory
+    2019-08-12 21:36:53.577 UTC [16901] LOG:  MultiXact member wraparound protections are now enabled
+    2019-08-12 21:36:53.584 UTC [16900] LOG:  database system is ready to accept connections
+Ignore the error from `cp` complaining about the `.history` file, it's
+harmless.
+### Troubleshooting
+If you find the following error in the logs:
+    FATAL:  could not locate required checkpoint record
+It's because postgres cannot find the WAL logs to restore from. There
+could be many causes for this, but the ones I stumbled upon were:
+ * wrong location for `recovery.conf` (it's in `/var`, not `/etc`),
+   probably the case if you don't see an error from `cp`
+ * wrong permissions on the archive (put the WAL files in `~postgres`,
+   not `~root`)
+ * wrong path or pattern for `restore_command` (double-check the path
+   and make sure to include the right prefix, e.g. `main.WAL`)
+Direct restore procedure
+------------------------
+Reminder: this procedure **streams** the files from the backup server
+to the database server and restore directly. See above for more
+information on how to pick a restore procedure.
+ 1. First, disable Puppet so you have control on when PostgreSQL is
+    running:
+        puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
+ 2. Then install the right PostgreSQL version and stop the server:
+        apt install postgres-9.6
+        service postgresql stop
+    Make sure you run the SAME MAJOR VERSION of PostgreSQL than the
+    backup! You cannot restore across versions. This might mean
+    installing from backports or an older version of Debian.
+ 3. On that new PostgreSQL server, show the `postgres` server public
+    key, creating it if missing:
+        [ -f ~postgres/.ssh/id_rsa.pub ] || sudo -u postgres ssh-keygen
+        cat ~postgres/.ssh/*.pub
+ 4. Then on the backup server, allow the user to access backups of the
+    old server:
+        echo "command="/usr/local/bin/debbackup-ssh-wrap --read-allow=/srv/backups/pg/$OLDSERVER $CLIENT",restrict $HOSTKEY" > /etc/ssh/userkeys/torbackup.more
+    This assumes we connect to a *previous* server's backups, named
+    `$OLDSERVER` (e.g. `dictyotum`). The `$HOSTKEY` is the public key
+    found on the postgres server above.
+    Warning: the above will fail if the key is already present in
+    `/etc/ssh/userkeys/torbackup`, edit the key in there instead in
+    that case.
+ 5. Then you need to find the right `BASE` file to restore from. Each
+    `BASE` file has a timestamp in its filename, so just sorting them
+    by name should be enough to find the latest one. Uncompress the
+    `BASE` file in place, as the `postgres` user:
+        sudo -u postgres -i
+        mkdir -m 0700 /var/lib/postgresql/9.6/main
+        ssh torbackup@$BACKUPSERVER $(hostname) retrieve-file pg $OLDSERVER $BASEFILE \
+          | sed '0,/^$/d' \
+          | tar -C /var/lib/postgresql/9.6/main -x -z -f -
+    Add a `pv` before the `tar` call in the pipeline for a progress bar
+    with large backups, and replace:
+     * `$BACKUPSERVER` with the backupserver name and username
+        (currently `bungei.torproject.org`)
+     * `$OLDSERVER` with the old server's (short) hostname
+       (e.g. `dictyotum`)
+     * `$CLUSTERNAME` with the name of the cluster to restore
+       (e.g. usually `main`)
+     * `$BASEFILE` with something like
+       `$CLUSTERNAME.BASE.$BACKUPSERVER-20191004-062226-$OLDSERVER.torproject.org-$CLUSTERNAME-9.6-backup.tar.gz`
+       or, fully expanded:
+       `bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz`
+       where the `$CLUSTERNAME` is the non-default `bacula`
+    The above might hang for a while (yes, maybe even a minute) in the
+    beginning, but it should eventually get started and complete. The
+    "hang" is because `retrieve-file` sends a header which includes a
+    `sha512sum` and it takes a while to compute. The `sed` command in
+    that pipeline is there to skip that header. Example of a
+    successful restore operated in [#31786][]:
+        ssh torbackup@bungei.torproject.org bacula-director-01 retrieve-file pg dictyotum bacula.BASE.bungei.torproject.org-20191010-184205-dictyotum.torproject.org-bacula-9.6-backup.tar.gz | pv -s 33G | sed '0,/^$/d' | tar -C /var/lib/postgresql/9.6/main -x -z -f -
+    If the above doesn't work, use the indirect procedure to restore
+    the `BASE` file.
+[#31786]: https://bugs.torproject.org/31786
+ 6. Make sure the `pg_xlog` directory doesn't contain any files.
+        rm -rf -- /var/lib/postgresql/9.6/main/pg_xlog/*
+ 7. Then you need to create a `recovery.conf` file in
+    `/var/lib/postgresql/9.6/main` that will tell postgres where to
+    find the WAL files. At least the `restore_command` need to be
+    specified. Something like this should work:
+        restore_command = '/usr/local/bin/pg-receive-file-from-backup $OLDSERVER $CLUSTERNAME.WAL.%f %p'
+    ... where:
+     * `$OLDSERVER` should be replaced by the previous postgresql
+       server name (e.g. `dictyotum`)
+     * `$CLUSTERNAME` should be replaced by the previous cluster name
+       (e.g. `main`, generally)
+    You can specify a specific recovery point in the `recovery.conf`,
+    see the [upstream documentation](https://www.postgresql.org/docs/11/recovery-target-settings.html) for more information. Also
+    make sure the file is owned by postgres:
+        $EDITOR /var/lib/postgresql/9.6/main/recovery.conf
+        chown postgres /var/lib/postgresql/9.6/main/recovery.conf
+ 8. Then start the server and look at the logs to follow the recovery
+    process:
+        service postgresql start
+        tail -f /var/log/postgresql/*
+    You should see something like this:
+        2019-10-09 21:17:47.335 UTC [9632] LOG:  database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
+        2019-10-09 21:17:47.517 UTC [9632] LOG:  starting archive recovery
+        2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG:  incomplete startup packet
+        2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:49.354 UTC [9632] LOG:  restored log file "00000001000005B200000074" from archive
+        2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:50.836 UTC [9632] LOG:  redo starts at 5B2/74000028
+        2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL:  the database system is starting up
+        2019-10-09 21:20:35.790 UTC [9632] LOG:  restored log file "00000001000005B20000009F" from archive
+        2019-10-09 21:20:37.745 UTC [9632] LOG:  restored log file "00000001000005B2000000A0" from archive
+        2019-10-09 21:20:39.648 UTC [9632] LOG:  restored log file "00000001000005B2000000A1" from archive
+        2019-10-09 21:20:41.738 UTC [9632] LOG:  restored log file "00000001000005B2000000A2" from archive
+        2019-10-09 21:20:43.773 UTC [9632] LOG:  restored log file "00000001000005B2000000A3" from archive
+    ... and so on.
+    TODO: this warning comes up from time to time, problem?
+        2019-10-09 23:47:13.446 UTC [16973] LOG:  could not link file "pg_xlog/00000001000005B3000000C3" to "pg_xlog/00000001000005B3000000F9": File exists
+    Then the recovery will complete with something like this:
+        2019-10-10 01:30:55.460 UTC [16953] LOG:  redo done at 5B8/9C5BE738
+        2019-10-10 01:30:55.460 UTC [16953] LOG:  last completed transaction was at log time 2019-10-10 01:04:23.238233+00
+        2019-10-10 01:31:03.536 UTC [16953] LOG:  restored log file "00000001000005B80000009C" from archive
+        2019-10-10 01:31:06.458 UTC [16953] LOG:  selected new timeline ID: 2
+        2019-10-10 01:31:17.485 UTC [16953] LOG:  archive recovery complete
+        2019-10-10 01:32:11.975 UTC [16953] LOG:  MultiXact member wraparound protections are now enabled
+        2019-10-10 01:32:12.438 UTC [16950] LOG:  database system is ready to accept connections
+        2019-10-10 01:32:12.439 UTC [26501] LOG:  autovacuum launcher started
+ 9. Then remove the temporary SSH access on the backup server, either
+    by removing the `.more` key file or restoring the previous key
+    configuration:
+        rm /etc/ssh/userkeys/torbackup.more
+ 10. re-enable Puppet:
+        puppet agent -t
+### Troubleshooting
+If you get this error in the PostgreSQL restore logs:
+    No directories from which read is allowed given on cmdline.
+    Unkown format in line 1 () when getting dictyotum:main.WAL.00000005.history from backup host
+It's because the SSH keys deployed on the server does not have the
+`--read-allowed` argument.
+If the restore doesn't work, try to connect by hand:
+    sudo -u postgres /usr/local/bin/pg-receive-file-from-backup dictyotum bacula.WAL.00000001000005AB0000003A /tmp/foo
+If you get this kind of errors, it's because you forgot to restore the
+`BASE` file first:
+    2019-10-08 21:25:43.711 UTC [14320] [unknown]@[unknown] LOG:  incomplete startup packet
+    2019-10-08 21:25:44.215 UTC [14326] postgres@postgres FATAL:  the database system is starting up
+    2019-10-08 21:25:44.451 UTC [14316] WARNING:  WAL was generated with wal_level=minimal, data may be missing
+    2019-10-08 21:25:44.451 UTC [14316] HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
+    2019-10-08 21:25:44.451 UTC [14316] LOG:  consistent recovery state reached at 0/153A3F0
+    2019-10-08 21:25:44.451 UTC [14316] LOG:  invalid record length at 0/153A3F0: wanted 24, got 0
+    2019-10-08 21:25:44.451 UTC [14316] LOG:  redo is not required
+    2019-10-08 21:25:44.721 UTC [14334] postgres@postgres FATAL:  the database system is starting up
+    2019-10-08 21:25:45.226 UTC [14340] postgres@postgres FATAL:  the database system is starting up
+    2019-10-08 21:25:45.549 UTC [14316] LOG:  selected new timeline ID: 6
+    2019-10-08 21:25:45.732 UTC [14343] postgres@postgres FATAL:  the database system is starting up
+    2019-10-08 21:25:45.765 UTC [14316] LOG:  archive recovery complete
+    2019-10-08 21:25:46.238 UTC [14349] postgres@postgres FATAL:  the database system is starting up
+    2019-10-08 21:25:46.466 UTC [14316] LOG:  MultiXact member wraparound protections are now enabled
+    2019-10-08 21:25:46.467 UTC [14315] LOG:  database system is ready to accept connections
+    2019-10-08 21:25:46.467 UTC [14351] LOG:  autovacuum launcher started
+See also the "Direct restore procedure" troubleshooting section, which
+also applies here.
+Monitoring warnings
+-------------------
+TODO: there's some more information about backup handling in the
+[Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/).
+### WAL-MISSING-AFTER
+Example message:
+    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD
+This means that a WAL file is missing after the specified
+file. Specifically, in the above scenario, the following files are
+present, in chronological order:
+    -rw------- 1 torbackup torbackup   16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
+    -rw------- 1 torbackup torbackup   16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
+    -rw------- 1 torbackup torbackup   16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
+    -rw------- 1 torbackup torbackup   16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
+    -rw------- 1 torbackup torbackup   16777216 May 10 13:57 main.WAL.00000001000000D9000000B5
+Notice the jump from `...AD` to `...B5`. We're missing `AE`, `AF`,
+`B1`, `B2`, `B3`, `B4`, specifically. We can also tell that something
+happened between 6:26 and 13:57 on that day.
+ 1. List the files in chronological order:
+        ls -ltr /srv/backups/pg/troodi/ | less
+ 2. Find the file warned about, using `/` then the filename
+    (`main.WAL.00000001000000D9000000AD`), above
+ 3. Look for a `.BASE.` file *following* the missing file, using `/`
+    again
+ 4. Either:
+    * if a `.BASE.` backup is present after the missing files, it is
+      harmless to the extent that the missing timeframe is not
+      necessary. to remove the warnings, previous WAL files *do* need
+      to be removed, by hand.
+    * if a `.BASE.` backup is *not* present after the missing files,
+      the backup integrity is faulty, and a new base backup needs to
+      be performed. See [Running a full
+      backup](#Running_a_full_backup) above.
+Reference
+=========
+Backup design
+-------------
+We use upstream's [Continuous Archiving and Point-in-Time Recovery
+(PITR)](https://www.postgresql.org/docs/11/continuous-archiving.html)
+which relies on postgres's "write-ahead log" (WAL) to write regular
+"transaction logs" of the cluster to the backup host. (Think of
+transaction logs as incremental backups.) This is configured in
+`postgresql.conf`, using a line like this:
+    archive_command = '/usr/local/bin/pg-backup-file main WAL %p'
+That is a site-specific script which reads a config file in
+`/etc/dsa/pg-backup-file.conf` where the backup host is specified
+(currently `torbackup@bungei.torproject.org`). That command passes the
+WAL logs onto the backup server, over SSH. A WAL file is shipped
+immediately when it is full (16MB of data by default) but no later
+than 6 hours (varies, see `archive_timeout` on each host) after it was
+first written to. On the backup server, the `command` is set to
+`debbackup-ssh-wrap` in the `authorized_keys` file and takes the
+`store-file pg` argument to write the file to the right location.
+WAL files are written to `/srv/backups/pg/$HOSTNAME` where `$HOSTNAME`
+(without `.torproject.org`). WAL files are prefixed with `main.WAL.`
+(where `main` is the cluster name) with a long unique string after,
+e.g. `main.WAL.00000001000000A40000007F`.
+For that system to work, we also need *full* backups to happen on a
+regular basis. This happens straight from the backup server (still
+`bungei`) which connects to the various postgres servers and runs a
+[pg_basebackup](https://manpages.debian.org/pg_basebackup) to get a complete snapshot of the cluster. This
+happens *weekly* (every 7 to 10 days) in the wrapper
+`postgres-make-base-backups`, which is a wrapper (based on a Puppet
+`concat::fragment` template) that calls
+`postgres-make-one-base-backup` for each postgres server.
+The base files are written to the same directory as WAL file and are
+named using the template:
+    $CLUSTER.BASE.$SERVER_FQDN-$DATE-$ID-$CLIENT_FQDN-$CLUSTER-$VERSION-backup.tar.gz
+... for example:
+    main.BASE.bungei.torproject.org-20190804-214510-troodi.torproject.org-main-9.6-backup.tar.gz
+All of this works because SSH public keys and postgres credentials are
+passed around between servers. That is handled in the Puppet
+`postgresql` module for the most part, but some bits might still be
+configured manually on some servers.
+Backups are checked for freshness in Nagios using the
+`dsa-check-backuppg` plugin with its configuration stored in
+`/etc/dsa/postgresql-backup/dsa-check-backuppg.conf.d/`, per
+cluster. The Nagios plugin also takes care of expiring backups when
+they are healthy.
+Other documentation
+-------------------
+See also:
+ * [PostgreSQL official documentation](https://www.postgresql.org/docs/manuals/)
+ * [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Main_Page)
+ * [Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/)