PostgreSQL 12. Debian 11 ("bullseye") has PostgreSQL 13, but progress
|
there is reported in the `pg_stat_progress_cluster` table, so the
above might not work even there.
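
For example (on PostgreSQL 12 and later), a rough way to watch a
running `CLUSTER` or `VACUUM FULL` from another session would be:

```
-- one row per backend currently running CLUSTER or VACUUM FULL
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
  FROM pg_stat_progress_cluster;
```

In `psql`, appending `\watch 5` to the query re-runs it every five
seconds.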
|
|
|
|
|
|
|
|
## Checking for wasted space
|
|
|
|
|
|
|
|
PostgreSQL is peculiar among databases in that it never actually
returns free space to the operating system unless explicitly asked
to. Modern PostgreSQL releases (8.1+) have an "auto-vacuum" daemon
which takes care of cleaning up after `DELETE` and related operations
to reclaim that disk space, but this only marks those regions of the
database as reusable: it doesn't actually return those blocks to the
operating system.
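
To see how much disk space is actually in use, the built-in size
functions can be queried, for example:

```
-- total on-disk size of the current database
SELECT pg_size_pretty(pg_database_size(current_database()));

-- the ten largest tables, including their indexes and TOAST data
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY pg_total_relation_size(oid) DESC
 LIMIT 10;
```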
|
|
|
|
|
|
|
|
Because databases typically either stay the same size or grow over
their lifetime, this usually does not matter: the next `INSERT` will
reuse that space and no space is actually wasted.
|
|
|
|
|
|
|
|
But sometimes that disk space can grow too large. How do we check if
|
|
|
|
our database is wasting space? There are many ways...
|
|
|
|
|
|
|
|
### check_postgres
|
|
|
|
|
|
|
|
There's a Nagios plugin which does this, although we are not
currently using it. It is called [check_postgres](https://bucardo.org/check_postgres/check_postgres.pl.html) and features a
[bloat check](https://bucardo.org/check_postgres/check_postgres.pl.html#bloat) which can be run regularly.
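
For an ad-hoc run, something like the following should work (the
thresholds and database name here are arbitrary examples):

    check_postgres.pl --action=bloat --dbname=bacula --warning='100 M' --critical='1 G'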
|
|
|
|
|
|
|
|
### Running bloat query by hand
|
|
|
|
|
|
|
|
The above script might be annoying to deploy for an ad-hoc
|
|
|
|
situation. You can just run the query by hand instead:
|
|
|
|
|
|
|
|
```
|
|
|
|
SELECT
|
|
|
|
current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
|
|
|
|
ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::float/otta END)::numeric,1) AS tbloat,
|
|
|
|
CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
|
|
|
|
iname, /*ituples::bigint, ipages::bigint, iotta,*/
|
|
|
|
ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::float/iotta END)::numeric,1) AS ibloat,
|
|
|
|
CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
schemaname, tablename, cc.reltuples, cc.relpages, bs,
|
|
|
|
CEIL((cc.reltuples*((datahdr+ma-
|
|
|
|
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta,
|
|
|
|
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
|
|
|
|
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
ma,bs,schemaname,tablename,
|
|
|
|
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
|
|
|
|
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
schemaname, tablename, hdr, ma, bs,
|
|
|
|
SUM((1-null_frac)*avg_width) AS datawidth,
|
|
|
|
MAX(null_frac) AS maxfracsum,
|
|
|
|
hdr+(
|
|
|
|
SELECT 1+count(*)/8
|
|
|
|
FROM pg_stats s2
|
|
|
|
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
|
|
|
|
) AS nullhdr
|
|
|
|
FROM pg_stats s, (
|
|
|
|
SELECT
|
|
|
|
(SELECT current_setting('block_size')::numeric) AS bs,
|
|
|
|
CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
|
|
|
|
CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
|
|
|
|
FROM (SELECT version() AS v) AS foo
|
|
|
|
) AS constants
|
|
|
|
GROUP BY 1,2,3,4,5
|
|
|
|
) AS foo
|
|
|
|
) AS rs
|
|
|
|
JOIN pg_class cc ON cc.relname = rs.tablename
|
|
|
|
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
|
|
|
|
LEFT JOIN pg_index i ON indrelid = cc.oid
|
|
|
|
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
|
|
|
|
) AS sml
|
|
|
|
ORDER BY wastedbytes DESC
|
|
|
|
```
|
|
|
|
|
|
|
|
### Another way
|
|
|
|
|
|
|
|
It is rumored, however, that the above query is not very accurate. A
[better option](https://blog.ioguix.net/postgresql/2014/03/28/Playing-with-indexes-and-better-bloat-estimate.html) seems to be this, more complicated, query:
|
|
|
|
|
|
|
|
```
|
|
|
|
-- change to the max number of field per index if not default.
|
|
|
|
\set index_max_keys 32
|
|
|
|
-- (readonly) IndexTupleData size
|
|
|
|
\set index_tuple_hdr 2
|
|
|
|
-- (readonly) ItemIdData size
|
|
|
|
\set item_pointer 4
|
|
|
|
-- (readonly) IndexAttributeBitMapData size
|
|
|
|
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8
|
|
|
|
|
|
|
|
SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
|
|
|
|
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END AS wastedbytes,
|
|
|
|
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
|
|
|
|
FROM (
|
|
|
|
SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
|
|
|
|
ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
|
|
|
|
CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
|
|
|
|
) AS otta
|
|
|
|
FROM (
|
|
|
|
SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
|
|
|
|
( index_tuple_hdr_bm +
|
|
|
|
maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
|
|
|
|
WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
|
|
|
|
ELSE index_tuple_hdr_bm%maxalign
|
|
|
|
END
|
|
|
|
+ nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
|
|
|
|
WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
|
|
|
|
ELSE nulldatawidth::integer%maxalign
|
|
|
|
END
|
|
|
|
)::numeric AS nulldatahdrwidth, pagehdr
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
|
|
|
|
current_setting('block_size')::numeric AS bs,
|
|
|
|
/* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
|
|
|
|
CASE
|
|
|
|
WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
|
|
|
|
ELSE 4
|
|
|
|
END AS maxalign,
|
|
|
|
/* per page header, fixed size: 20 for 7.X, 24 for others */
|
|
|
|
CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
|
|
|
|
THEN 24
|
|
|
|
ELSE 20
|
|
|
|
END AS pagehdr,
|
|
|
|
/* per tuple header: add index_attribute_bm if some cols are null-able */
|
|
|
|
CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
|
|
|
|
THEN :index_tuple_hdr
|
|
|
|
ELSE :index_tuple_hdr + :index_attribute_bm
|
|
|
|
END AS index_tuple_hdr_bm,
|
|
|
|
/* data len: we remove null values save space using it fractionnal part from stats */
|
|
|
|
sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
|
|
|
|
FROM pg_attribute AS a
|
|
|
|
JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
|
|
|
|
JOIN (
|
|
|
|
SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
|
|
|
|
FROM pg_index
|
|
|
|
JOIN pg_class ON pg_class.oid=pg_index.indexrelid
|
|
|
|
JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
|
|
|
|
) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
|
|
|
|
WHERE a.attnum > 0
|
|
|
|
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
|
|
|
|
) AS s1
|
|
|
|
) AS s2
|
|
|
|
LEFT JOIN pg_am am ON s2.relam = am.oid
|
|
|
|
) as sub
|
|
|
|
JOIN pg_class c ON c.oid=sub.table_oid
|
|
|
|
ORDER BY wastedbytes;
|
|
|
|
```
|
|
|
|
|
|
|
|
It was modified to sort the output by `wastedbytes`.
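
Note that, because of the `\set` variables at the top, this query
must be run through `psql`, which is what expands them. For example,
assuming it was saved to a (hypothetical) `index-bloat.sql` file:

    psql -d bacula -f index-bloat.sql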
|
|
|
|
|
|
|
|
### Grouped output
|
|
|
|
|
|
|
|
One disadvantage of the above query is that tables and indexes are
|
|
|
|
displayed separately. How do we know which belongs to which? It also
|
|
|
|
makes it less obvious what the big tables are, and which ones are
|
|
|
|
important.
|
|
|
|
|
|
|
|
This one comes from the [pgx_scripts GitHub repo](https://github.com/pgexperts/pgx_scripts), and is a [130+
|
|
|
|
line SQL query](https://github.com/pgexperts/pgx_scripts/blob/master/bloat/table_bloat_check.sql):
|
|
|
|
|
|
|
|
```
|
|
|
|
-- new table bloat query
|
|
|
|
-- still needs work; is often off by +/- 20%
|
|
|
|
WITH constants AS (
|
|
|
|
-- define some constants for sizes of things
|
|
|
|
-- for reference down the query and easy maintenance
|
|
|
|
SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 8 AS ma
|
|
|
|
),
|
|
|
|
no_stats AS (
|
|
|
|
-- screen out tables which have attributes
-- that don't have stats, such as JSON
|
|
|
|
SELECT table_schema, table_name,
|
|
|
|
n_live_tup::numeric as est_rows,
|
|
|
|
pg_table_size(relid)::numeric as table_size
|
|
|
|
FROM information_schema.columns
|
|
|
|
JOIN pg_stat_user_tables as psut
|
|
|
|
ON table_schema = psut.schemaname
|
|
|
|
AND table_name = psut.relname
|
|
|
|
LEFT OUTER JOIN pg_stats
|
|
|
|
ON table_schema = pg_stats.schemaname
|
|
|
|
AND table_name = pg_stats.tablename
|
|
|
|
AND column_name = attname
|
|
|
|
WHERE attname IS NULL
|
|
|
|
AND table_schema NOT IN ('pg_catalog', 'information_schema')
|
|
|
|
GROUP BY table_schema, table_name, relid, n_live_tup
|
|
|
|
),
|
|
|
|
null_headers AS (
|
|
|
|
-- calculate null header sizes
|
|
|
|
-- omitting tables which dont have complete stats
|
|
|
|
-- and attributes which aren't visible
|
|
|
|
SELECT
|
|
|
|
hdr+1+(sum(case when null_frac <> 0 THEN 1 else 0 END)/8) as nullhdr,
|
|
|
|
SUM((1-null_frac)*avg_width) as datawidth,
|
|
|
|
MAX(null_frac) as maxfracsum,
|
|
|
|
schemaname,
|
|
|
|
tablename,
|
|
|
|
hdr, ma, bs
|
|
|
|
FROM pg_stats CROSS JOIN constants
|
|
|
|
LEFT OUTER JOIN no_stats
|
|
|
|
ON schemaname = no_stats.table_schema
|
|
|
|
AND tablename = no_stats.table_name
|
|
|
|
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
|
|
|
AND no_stats.table_name IS NULL
|
|
|
|
AND EXISTS ( SELECT 1
|
|
|
|
FROM information_schema.columns
|
|
|
|
WHERE schemaname = columns.table_schema
|
|
|
|
AND tablename = columns.table_name )
|
|
|
|
GROUP BY schemaname, tablename, hdr, ma, bs
|
|
|
|
),
|
|
|
|
data_headers AS (
|
|
|
|
-- estimate header and row size
|
|
|
|
SELECT
|
|
|
|
ma, bs, hdr, schemaname, tablename,
|
|
|
|
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
|
|
|
|
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
|
|
|
|
FROM null_headers
|
|
|
|
),
|
|
|
|
table_estimates AS (
|
|
|
|
-- make estimates of how large the table should be
|
|
|
|
-- based on row and page size
|
|
|
|
SELECT schemaname, tablename, bs,
|
|
|
|
reltuples::numeric as est_rows, relpages * bs as table_bytes,
|
|
|
|
CEIL((reltuples*
|
|
|
|
(datahdr + nullhdr2 + 4 + ma -
|
|
|
|
(CASE WHEN datahdr%ma=0
|
|
|
|
THEN ma ELSE datahdr%ma END)
|
|
|
|
)/(bs-20))) * bs AS expected_bytes,
|
|
|
|
reltoastrelid
|
|
|
|
FROM data_headers
|
|
|
|
JOIN pg_class ON tablename = relname
|
|
|
|
JOIN pg_namespace ON relnamespace = pg_namespace.oid
|
|
|
|
AND schemaname = nspname
|
|
|
|
WHERE pg_class.relkind = 'r'
|
|
|
|
),
|
|
|
|
estimates_with_toast AS (
|
|
|
|
-- add in estimated TOAST table sizes
|
|
|
|
-- estimate based on 4 toast tuples per page because we dont have
|
|
|
|
-- anything better. also append the no_data tables
|
|
|
|
SELECT schemaname, tablename,
|
|
|
|
TRUE as can_estimate,
|
|
|
|
est_rows,
|
|
|
|
table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
|
|
|
|
expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
|
|
|
|
FROM table_estimates LEFT OUTER JOIN pg_class as toast
|
|
|
|
ON table_estimates.reltoastrelid = toast.oid
|
|
|
|
AND toast.relkind = 't'
|
|
|
|
),
|
|
|
|
table_estimates_plus AS (
|
|
|
|
-- add some extra metadata to the table data
|
|
|
|
-- and calculations to be reused
|
|
|
|
-- including whether we cant estimate it
|
|
|
|
-- or whether we think it might be compressed
|
|
|
|
SELECT current_database() as databasename,
|
|
|
|
schemaname, tablename, can_estimate,
|
|
|
|
est_rows,
|
|
|
|
CASE WHEN table_bytes > 0
|
|
|
|
THEN table_bytes::NUMERIC
|
|
|
|
ELSE NULL::NUMERIC END
|
|
|
|
AS table_bytes,
|
|
|
|
CASE WHEN expected_bytes > 0
|
|
|
|
THEN expected_bytes::NUMERIC
|
|
|
|
ELSE NULL::NUMERIC END
|
|
|
|
AS expected_bytes,
|
|
|
|
CASE WHEN expected_bytes > 0 AND table_bytes > 0
|
|
|
|
AND expected_bytes <= table_bytes
|
|
|
|
THEN (table_bytes - expected_bytes)::NUMERIC
|
|
|
|
ELSE 0::NUMERIC END AS bloat_bytes
|
|
|
|
FROM estimates_with_toast
|
|
|
|
UNION ALL
|
|
|
|
SELECT current_database() as databasename,
|
|
|
|
table_schema, table_name, FALSE,
|
|
|
|
est_rows, table_size,
|
|
|
|
NULL::NUMERIC, NULL::NUMERIC
|
|
|
|
FROM no_stats
|
|
|
|
),
|
|
|
|
bloat_data AS (
|
|
|
|
-- do final math calculations and formatting
|
|
|
|
select current_database() as databasename,
|
|
|
|
schemaname, tablename, can_estimate,
|
|
|
|
table_bytes, round(table_bytes/(1024^2)::NUMERIC,3) as table_mb,
|
|
|
|
expected_bytes, round(expected_bytes/(1024^2)::NUMERIC,3) as expected_mb,
|
|
|
|
round(bloat_bytes*100/table_bytes) as pct_bloat,
|
|
|
|
round(bloat_bytes/(1024::NUMERIC^2),2) as mb_bloat,
|
|
|
|
table_bytes, expected_bytes, est_rows
|
|
|
|
FROM table_estimates_plus
|
|
|
|
)
|
|
|
|
-- filter output for bloated tables
|
|
|
|
SELECT databasename, schemaname, tablename,
|
|
|
|
can_estimate,
|
|
|
|
est_rows,
|
|
|
|
pct_bloat, mb_bloat,
|
|
|
|
table_mb
|
|
|
|
FROM bloat_data
|
|
|
|
-- this where clause defines which tables actually appear
|
|
|
|
-- in the bloat chart
|
|
|
|
-- example below filters for tables which are either 50%
-- bloated with more than 10MB of bloat, or more than 25%
-- bloated with more than 1GB of bloat
|
|
|
|
WHERE ( pct_bloat >= 50 AND mb_bloat >= 10 )
|
|
|
|
OR ( pct_bloat >= 25 AND mb_bloat >= 1000 )
|
|
|
|
ORDER BY mb_bloat DESC;
|
|
|
|
```
|
|
|
|
|
|
|
|
It will only show tables with significant bloat, as defined by the
`WHERE` clause in the last few lines above, which makes the output
much more readable.
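
To avoid copy-pasting the whole thing, the query can also be run
straight from a checkout of that repository (the database name here
is just an example):

    git clone https://github.com/pgexperts/pgx_scripts
    psql -d bacula -f pgx_scripts/bloat/table_bloat_check.sql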
|
|
|
|
|
|
|
|
There's also [this other query](https://github.com/ioguix/pgsql-bloat-estimation) we haven't evaluated.
|
|
|
|
|
|
|
|
## Recovering disk space
|
|
|
|
|
|
|
|
In some cases, you do need to reclaim actual operating system disk
space from the PostgreSQL server (see above to check whether you
actually do). This can happen, for example, if you have [removed
years of old data from a database](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525).
|
|
|
|
|
|
|
|
### VACUUM FULL
|
|
|
|
|
|
|
|
Typically this is done with the `VACUUM FULL` command (as opposed to
plain `VACUUM`, which is what the auto-vacuum daemon runs, see [this
discussion for details](https://www.postgresql.org/docs/9.1/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY)). This will actually rewrite all the
tables to make sure only the relevant data is stored on disk. It's
roughly the equivalent of a dump/restore, except it is faster.
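
For example, to rewrite a single table (here the Bacula `file` table
mentioned below), or the whole current database:

```
-- rewrites one table, holding an ACCESS EXCLUSIVE lock for the duration
VACUUM FULL VERBOSE file;

-- rewrites every table in the current database
VACUUM FULL;
```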
|
|
|
|
|
|
|
|
### pg_repack
|
|
|
|
|
|
|
|
For very large changes (say, dozens of terabytes) however, `VACUUM
FULL` (and even plain `VACUUM`) can be prohibitively slow (think
days). And while `VACUUM` doesn't require an exclusive lock on the
tables it's working on, `VACUUM FULL` *does*, which implies a
significant outage.
|
|
|
|
|
|
|
|
An alternative to that method is the [pg_repack extension](https://github.com/reorg/pg_repack), which
|
|
|
|
is [packaged in Debian](https://tracker.debian.org/pkg/pg-repack). In Debian 10 buster, the following
|
|
|
|
procedure was used on `bacula-director-01` to purge old data about
|
|
|
|
removed Bacula clients that hadn't been cleaned up in years:
|
|
|
|
|
|
|
|
apt install postgresql-11-repack
|
|
|
|
|
|
|
|
Then install the extension on the database, as the `postgres` user
(`sudo -u postgres -i`); this needs to be done only once:
|
|
|
|
|
|
|
|
psql -c "CREATE EXTENSION pg_repack" -d bacula
|
|
|
|
|
|
|
|
Then, for each table:
|
|
|
|
|
|
|
|
pg_repack -d bacula --table media
|
|
|
|
|
|
|
|
It is a good idea to start with a small table we can afford to lose,
|
|
|
|
just in case something goes wrong. That job took about 2 hours on a
|
|
|
|
very large table (150GB, `file`). The entire Bacula database went from
|
|
|
|
using 161GB to 91GB after that cleanup, see [this ticket for
|
|
|
|
details](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525#note_2763212).
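
If in doubt about what a given invocation will do, `pg_repack` also
has a `--dry-run` flag which only prints what would be repacked,
without touching anything:

    pg_repack --dry-run -d bacula --table media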
|
|
|
|
|
|
|
|
When done, drop the `pg_repack` extension:
|
|
|
|
|
|
|
|
DROP EXTENSION pg_repack;
|
|
|
|
|
|
|
|
Also note that, after the repack, `VACUUM` performance improved
|
|
|
|
significantly, going from hours (if not days) to minutes.
|
|
|
|
|
|
|
|
Note that [pg_squeeze](https://github.com/cybertec-postgresql/pg_squeeze) is another alternative to `pg_repack`, but
|
|
|
|
it isn't available in Debian.
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
TODO: there's some more information about backup handling in the