See the [reference](#reference) section for details.

## Picking a restore procedure

We have two ways of restoring PostgreSQL backups, **indirect** and
**direct**; see the [Disaster recovery](#disaster-recovery) section
below for how they differ.

## Monitoring the VACUUM processes

In PostgreSQL, the [VACUUM command](https://www.postgresql.org/docs/current/sql-vacuum.html) "reclaims storage occupied by
dead tuples". To quote the excellent PostgreSQL documentation:

> In normal PostgreSQL operation, tuples that are deleted or obsoleted
> by an update are not physically removed from their table; they
> remain present until a VACUUM is done. Therefore it's necessary to
> do VACUUM periodically, especially on frequently-updated tables.

By default, the [autovacuum launcher](https://www.postgresql.org/docs/9.1/routine-vacuuming.html#AUTOVACUUM) is enabled in PostgreSQL (and
in our deployments), which should automatically take care of this
problem.

This will show that the autovacuum daemon is running:

    # ps aux | grep [v]acuum
    postgres   534  0.5  4.7 454920 388012 ?  Ds   05:31   3:08 postgres: 11/main: autovacuum worker   bacula
    postgres 17259  0.0  0.1 331376  10984 ?  Ss   Nov12   0:10 postgres: 11/main: autovacuum launcher

In the above, the `launcher` is running, and we can see a worker has
been started to vacuum the `bacula` database.

If you don't see the launcher, check that it's enabled:

    bacula=# SELECT name, setting FROM pg_settings WHERE name='autovacuum' or name='track_counts';
     autovacuum   | on
     track_counts | on

Both need to be `on` for the autovacuum workers to operate. Note,
however, that individual tables might still have autovacuum disabled;
to check a given table, see:

    SELECT reloptions FROM pg_class WHERE relname='my_table';
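If `autovacuum` itself turned out to be off, it can be turned back on
without a full restart. This is a sketch; in our deployments this
setting is normally managed through Puppet, so fix it there as well:

```sql
-- requires superuser: writes postgresql.auto.conf, then reloads
ALTER SYSTEM SET autovacuum = on;
SELECT pg_reload_conf();
```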

In the above scenario, the `autovacuum worker bacula` process had been
running for *hours*, which was concerning. One way to diagnose is to
figure out how much data there is to vacuum.
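Another way to check on a long-running worker is PostgreSQL's built-in
progress view (available since 9.6, so present on our version 11
clusters):

```sql
-- one row per backend currently running a (non-FULL) VACUUM
SELECT p.pid, p.datname, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```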

This query will show the table with the most dead tuples that need to
be cleaned up by the VACUUM process:

    SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;

In our case, there were almost 200 million rows to clean:

    bacula=# SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;
     file | 183278595

That is nearly 200 *million* dead tuples to clean up!
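If waiting on autovacuum is not an option, the backlog can be
processed immediately with a manual vacuum of the offending table
(shown here for the `file` table from the example above; a plain
VACUUM does not block reads or writes):

```sql
-- run in the affected database; VERBOSE reports progress as it goes
VACUUM (VERBOSE, ANALYZE) file;
```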

One more trick: this will show the last VACUUM dates on tables:

    SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE last_vacuum IS NOT NULL or last_autovacuum IS NOT NULL ORDER BY relname;

Some of the ideas above were found in [this Datadog post](https://www.datadoghq.com/blog/postgresql-vacuum-monitoring/).

## Pager playbook

TODO: there's some more information about backup handling in the
[Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/).

### WAL-MISSING-AFTER

Example message:

    [troodi, main] WAL-MISSING-AFTER: troodi/main.WAL.00000001000000D9000000AD

This means that a WAL file is missing after the specified
file. Specifically, in the above scenario, the following files are
present, in chronological order:

    -rw------- 1 torbackup torbackup 16777216 May 10 05:08 main.WAL.00000001000000D9000000AA
    -rw------- 1 torbackup torbackup 16777216 May 10 05:47 main.WAL.00000001000000D9000000AB
    -rw------- 1 torbackup torbackup 16777216 May 10 06:20 main.WAL.00000001000000D9000000AC
    -rw------- 1 torbackup torbackup 16777216 May 10 06:26 main.WAL.00000001000000D9000000AD
    -rw------- 1 torbackup torbackup 16777216 May 10 13:57 main.WAL.00000001000000D9000000B5

Notice the jump from `...AD` to `...B5`. Specifically, we're missing
`AE`, `AF`, `B0`, `B1`, `B2`, `B3`, and `B4`. We can also tell that
something happened between 6:26 and 13:57 on that day.
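The segment names are hexadecimal: the last 8 digits count up under
the 8-digit "log" number before them (both names here share the
timeline and log prefix `00000001000000D9`). The gap above can be
enumerated with plain shell arithmetic, for example:

```shell
# list the segment names missing between ...AD and ...B5
# (same timeline and log prefix, so only the last 8 digits change)
prefix=00000001000000D9
for n in $(seq $((16#AD + 1)) $((16#B5 - 1))); do
    printf '%s%08X\n' "$prefix" "$n"
done
```

This prints the seven missing segments, `...AE` through `...B4`,
including the easy-to-miss `...B0`.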

1. List the files in chronological order:

        ls -ltr /srv/backups/pg/troodi/ | less

2. Find the file warned about, using `/` then the filename
   (`main.WAL.00000001000000D9000000AD`) above

3. Look for a `.BASE.` file *following* the missing file, using `/`
   again

4. Either:

   * if a `.BASE.` backup is present after the missing files, the gap
     is harmless, as long as the missing timeframe is not needed. To
     remove the warnings, however, the WAL files preceding that base
     backup *do* need to be removed by hand.

   * if a `.BASE.` backup is *not* present after the missing files,
     the backup's integrity is compromised, and a new base backup
     needs to be performed. See [Running a full
     backup](#running-a-full-backup) above.

## Disaster recovery

If all fails, we need to restore PostgreSQL from backups. We have two
ways of restoring PostgreSQL backups, **indirect** and **direct**.

The **indirect** procedure first transfers the backup files to the
server and then runs the restore, so it requires more space than the
**direct** procedure, which restores over the network but is harder
to deploy and diagnose, as it has more moving parts.

Both procedures are adaptations of the [official recovery
procedure](https://www.postgresql.org/docs/11/continuous-archiving.html#BACKUP-PITR-RECOVERY), which can be referred to for more information.

### Indirect restore procedures

Reminder: this procedure **copies** the backup files from the backup
server to the database server and restores from those.

When the restore succeeds, the `recovery.conf` file will be renamed
to `recovery.done`.

Ignore the error from `cp` complaining about the `.history` file; it's
harmless.
#### Troubleshooting

If you find restore errors in the logs, there could be many causes,
but the ones I stumbled upon were:

* wrong path or pattern for `restore_command` (double-check the path
  and make sure to include the right prefix, e.g. `main.WAL`)
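For reference, on PostgreSQL 11 the `restore_command` lives in
`recovery.conf` in the data directory and looks something like this
(the path and file name pattern below are illustrative, not our actual
configuration):

```
# recovery.conf (PostgreSQL 11); %f is the WAL file name, %p the target path
restore_command = 'cp /var/backups/pg/main.WAL.%f "%p"'
recovery_target_timeline = 'latest'
```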
### Direct restore procedure

Reminder: this procedure **streams** the files from the backup server
to the database server and restores directly. See above for more
information on how to pick a restore procedure.

    puppet agent -t
#### Troubleshooting

If you get errors in the PostgreSQL restore logs, see also the
"Direct restore procedure" troubleshooting section, which also
applies here.

# Reference
## Installation
<!-- how to setup the service from scratch -->
## SLA
<!-- this describes an acceptable level of service for this service -->
## Design

Backups are checked for freshness in Nagios, for each
cluster. The Nagios plugin also takes care of expiring backups when
they are healthy.

## Issues

<!-- such projects are never over. add a pointer to well-known issues -->
<!-- and show how to report problems. usually a link to the bugtracker -->

There is no issue tracker specifically for this project; [File][] or
[search][] for issues in the [team issue tracker][search].

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

## Maintainer, users, and upstream

<!-- document who deployed and operates this service, who the users -->
<!-- are, who the upstreams are, if they are still active, -->
<!-- collaborative, how do we keep up to date, -->

## Monitoring and testing

<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades. describe -->
<!-- CI, test suites, linting, how security issues and upgrades are -->
<!-- tracked -->

## Logs and metrics

<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->

## Backups

<!-- does this service need anything special in terms of backups? -->
<!-- e.g. locking a database? special recovery procedures? -->

## Other documentation
See also:
* [PostgreSQL official documentation](https://www.postgresql.org/docs/manuals/)
* [PostgreSQL wiki](https://wiki.postgresql.org/wiki/Main_Page)
* [Debian DSA documentation](https://dsa.debian.org/howto/postgres-backup/)

# Discussion

## Overview

<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->

<!-- if this is an old project being documented, summarize the known -->
<!-- issues with the project. to quote the "audit procedure":

 5. When was the last security review done on the project? What was
    the outcome? Are there any security issues currently? Should it
    have another security review?

 6. When was the last risk assessment done? Something that would cover
    risks from the data stored, the access required, etc.

 7. Are there any in-progress projects? Technical debt cleanup?
    Migrations? What state are they in? What's the urgency? What are
    the next steps?

 8. What urgent things need to be done on this project?

-->

## Goals

<!-- include bugs to be fixed -->

### Must have

### Nice to have

### Non-Goals

## Approvals required

<!-- for example, legal, "vegas", accounting, current maintainer -->

## Proposed Solution

## Cost

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->