See the [reference](#reference) section for details.

## Picking a restore procedure

We have two ways of restoring PostgreSQL backups, **indirect** and
**direct**; see the [Disaster recovery](#disaster-recovery) section
below for how they differ.

## Monitoring the VACUUM processes

In PostgreSQL, the [VACUUM command](https://www.postgresql.org/docs/current/sql-vacuum.html) "reclaims storage occupied by
dead tuples". To quote the excellent PostgreSQL documentation:

> In normal PostgreSQL operation, tuples that are deleted or obsoleted
> by an update are not physically removed from their table; they
> remain present until a VACUUM is done. Therefore it's necessary to
> do VACUUM periodically, especially on frequently-updated tables.

By default, the [autovacuum launcher](https://www.postgresql.org/docs/9.1/routine-vacuuming.html#AUTOVACUUM) is enabled in PostgreSQL (and
in our deployments), which should automatically take care of this
problem.

This will show that the autovacuum daemon is running:

    # ps aux | grep [v]acuum
    postgres   534  0.5  4.7 454920 388012 ?  Ds   05:31   3:08 postgres: 11/main: autovacuum worker   bacula
    postgres 17259  0.0  0.1 331376  10984 ?  Ss   Nov12   0:10 postgres: 11/main: autovacuum launcher

In the above, the `launcher` is running, and we can see a worker has
been started to vacuum the `bacula` database.

If you don't see the launcher, check that it's enabled:

    bacula=# SELECT name, setting FROM pg_settings WHERE name='autovacuum' or name='track_counts';
     autovacuum   | on
     track_counts | on

Both need to be `on` for the autovacuum workers to operate. Note,
however, that individual tables might still have autovacuum disabled;
to check a given table, see:

    SELECT reloptions FROM pg_class WHERE relname='my_table';
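If `autovacuum` itself turned out to be off, it can be turned back on
without a full restart. This is a sketch; in our deployments this
setting is normally managed through Puppet, so fix it there as well:

```sql
-- requires superuser: writes postgresql.auto.conf, then reloads
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();
```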

In the above scenario, the `autovacuum worker bacula` process had been
running for *hours*, which was concerning. One way to diagnose is to
figure out how much data there is to vacuum.
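Another way to check on a long-running worker is PostgreSQL's built-in
progress view (available since 9.6, so present on our version 11
clusters):

```sql
-- one row per backend currently running a (non-FULL) VACUUM
SELECT p.pid, p.datname, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```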

This query will show the table with the most dead tuples that need to
be cleaned up by the VACUUM process:

    SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;

In our case, there were almost 200 million rows to clean:

    bacula=# SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;
     file | 183278595

That is nearly 200 *million* dead tuples to clean up!
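If waiting on autovacuum is not an option, the backlog can be
processed immediately with a manual vacuum of the offending table
(shown here for the `file` table from the example above; a plain
VACUUM does not block reads or writes):

```sql
-- run in the affected database; VERBOSE reports progress as it goes
VACUUM (VERBOSE, ANALYZE) file;
```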

One more trick: this will show the last VACUUM dates on tables:

    SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE last_vacuum IS NOT NULL or last_autovacuum IS NOT NULL ORDER BY relname;

Some of the ideas above were found in [this Datadog post](https://www.datadoghq.com/blog/postgresql-vacuum-monitoring/).

## Pager playbook

TODO: there's some more information about backup handling in the
[Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/).

### WAL-MISSING-AFTER

Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`. Specifically, we're missing
`AE`, `AF`, `B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that
something happened between 6:26 and 13:57 on that day.
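The segment names are hexadecimal: the last 8 digits count up under
the 8-digit "log" number before them (both names here share the
timeline and log prefix `00000001000000D9`). The gap above can be
enumerated with plain shell arithmetic, for example:

```shell
# list the segment names missing between ...AD and ...B5
# (same timeline and log prefix, so only the last 8 digits change)
prefix=00000001000000D9
for n in $(seq $((16#AD + 1)) $((16#B5 - 1))); do
    printf '%s%08X\n' "$prefix" "$n"
done
```

This prints the seven missing segments, `...AE` through `...B4`,
including the easy-to-miss `...B0`.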

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` then the filename
   (`main.WAL.00000001000000D9000000AD`) above

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, the gap
     is harmless, as long as the missing timeframe is not needed. To
     remove the warnings, however, the WAL files preceding that base
     backup *do* need to be removed by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup's integrity is compromised, and a new base backup
     needs to be performed. See [Running a full
     backup](#running-a-full-backup) above.

## Disaster recovery

If all fails, we need to restore PostgreSQL from backups. We have two
ways of restoring PostgreSQL backups, **indirect** and **direct**.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
**direct** procedure, which restores over the network but is harder
to deploy and diagnose, as it has more moving parts.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

### Indirect restore procedures

Reminder: this procedure **copies** the backup files from the backup
server to the database server and restores from those.

When the restore succeeds, the `recovery.conf` file will be renamed
to `recovery.done`.

Ignore the error from `cp` complaining about the `.history` file; it's
harmless.
#### Troubleshooting

If you find restore errors in the logs, there could be many causes,
but the ones I stumbled upon were:

* wrong path or pattern for `restore_command` (double-check the path
  and make sure to include the right prefix, e.g. `main.WAL`)
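For reference, on PostgreSQL 11 the `restore_command` lives in
`recovery.conf` in the data directory and looks something like this
(the path and file name pattern below are illustrative, not our actual
configuration):

```
# recovery.conf (PostgreSQL 11); %f is the WAL file name, %p the target path
restore_command = 'cp /var/backups/pg/main.WAL.%f "%p"'
recovery_target_timeline = 'latest'
```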
### Direct restore procedure

Reminder: this procedure **streams** the files from the backup server
to the database server and restores directly. See above for more
information on how to pick a restore procedure.

    puppet agent -t
#### Troubleshooting

If you get errors in the PostgreSQL restore logs, see also the
"Direct restore procedure" troubleshooting section, which also
applies here.

# Reference
## Installation
<!-- how to setup the service from scratch -->
## SLA
<!-- this describes an acceptable level of service for this service -->
## Design

Backups are checked for freshness in Nagios, for each
cluster. The Nagios plugin also takes care of expiring backups when
they are healthy.

## Issues

<!-- such projects are never over. add a pointer to well-known issues -->
<!-- and show how to report problems. usually a link to the bugtracker -->

There is no issue tracker specifically for this project; [File][] or
[search][] for issues in the [team issue tracker][search].

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Maintainer, users, and upstream

<!-- document who deployed and operates this service, who the users -->
<!-- are, who the upstreams are, if they are still active, -->
<!-- collaborative, how do we keep up to date, -->

## Monitoring and testing

<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades. describe -->
<!-- CI, test suites, linting, how security issues and upgrades are -->
<!-- tracked -->

## Logs and metrics

<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->

## Backups

<!-- does this service need anything special in terms of backups? -->
<!-- e.g. locking a database? special recovery procedures? -->

## Other documentation
See also:
* [PostgreSQL official documentation](https://www.postgresql.org/docs/manuals/)
* [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Main_Page)
* [Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/)

# Discussion

## Overview

<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->

<!-- if this is an old project being documented, summarize the known -->
<!-- issues with the project. to quote the "audit procedure":

 5. When was the last security review done on the project? What was
    the outcome? Are there any security issues currently? Should it
    have another security review?

 6. When was the last risk assessment done? Something that would cover
    risks from the data stored, the access required, etc.

 7. Are there any in-progress projects? Technical debt cleanup?
    Migrations? What state are they in? What's the urgency? What are
    the next steps?

 8. What urgent things need to be done on this project?

-->

## Goals

<!-- include bugs to be fixed -->

### Must have

### Nice to have

### Non-Goals

## Approvals required

<!-- for example, legal, "vegas", accounting, current maintainer -->

## Proposed Solution

## Cost

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->