PostgreSQL 12. Debian 11 ("bullseye") has PostgreSQL 13, but progress
|
there is reported in the `pg_stat_progress_cluster` table, so the
above might not work even there.
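
For example (on PostgreSQL 12 and later), a rough way to watch a
running `CLUSTER` or `VACUUM FULL` from another session would be:

```
-- one row per backend currently running CLUSTER or VACUUM FULL
SELECT pid, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
  FROM pg_stat_progress_cluster;
```

In `psql`, appending `\watch 5` to the query re-runs it every five
seconds.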
|
|
|
|
|
|
|
|
## Checking for wasted space
|
|
|
|
|
|
|
|
PostgreSQL is peculiar among databases in that it never actually
returns free space to the operating system unless explicitly asked
to. Modern PostgreSQL releases (8.1+) have an "auto-vacuum" daemon
which takes care of cleaning up after `DELETE` and related operations
to reclaim that disk space, but this only marks those regions of the
database as reusable: it doesn't actually return those blocks to the
operating system.
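
To see how much disk space is actually in use, the built-in size
functions can be queried, for example:

```
-- total on-disk size of the current database
SELECT pg_size_pretty(pg_database_size(current_database()));

-- the ten largest tables, including their indexes and TOAST data
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY pg_total_relation_size(oid) DESC
 LIMIT 10;
```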
|
|
|
|
|
|
|
|
Because databases typically either stay the same size or grow over
their lifetime, this usually does not matter: the next `INSERT` will
reuse that space and no space is actually wasted.
|
|
|
|
|
|
|
|
But sometimes that disk space can grow too large. How do we check if
|
|
|
|
our database is wasting space? There are many ways...
|
|
|
|
|
|
|
|
### check_postgres
|
|
|
|
|
|
|
|
There's a Nagios plugin which does this, although we are not
currently using it. It is called [check_postgres](https://bucardo.org/check_postgres/check_postgres.pl.html) and features a
[bloat check](https://bucardo.org/check_postgres/check_postgres.pl.html#bloat) which can be run regularly.
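
For an ad-hoc run, something like the following should work (the
thresholds and database name here are arbitrary examples):

    check_postgres.pl --action=bloat --dbname=bacula --warning='100 M' --critical='1 G'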
|
|
|
|
|
|
|
|
### Running bloat query by hand
|
|
|
|
|
|
|
|
The above script might be annoying to deploy for an ad-hoc
|
|
|
|
situation. You can just run the query by hand instead:
|
|
|
|
|
|
|
|
```
|
|
|
|
SELECT
|
|
|
|
current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
|
|
|
|
ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::float/otta END)::numeric,1) AS tbloat,
|
|
|
|
CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
|
|
|
|
iname, /*ituples::bigint, ipages::bigint, iotta,*/
|
|
|
|
ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::float/iotta END)::numeric,1) AS ibloat,
|
|
|
|
CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
schemaname, tablename, cc.reltuples, cc.relpages, bs,
|
|
|
|
CEIL((cc.reltuples*((datahdr+ma-
|
|
|
|
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta,
|
|
|
|
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
|
|
|
|
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
ma,bs,schemaname,tablename,
|
|
|
|
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
|
|
|
|
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
schemaname, tablename, hdr, ma, bs,
|
|
|
|
SUM((1-null_frac)*avg_width) AS datawidth,
|
|
|
|
MAX(null_frac) AS maxfracsum,
|
|
|
|
hdr+(
|
|
|
|
SELECT 1+count(*)/8
|
|
|
|
FROM pg_stats s2
|
|
|
|
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
|
|
|
|
) AS nullhdr
|
|
|
|
FROM pg_stats s, (
|
|
|
|
SELECT
|
|
|
|
(SELECT current_setting('block_size')::numeric) AS bs,
|
|
|
|
CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
|
|
|
|
CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
|
|
|
|
FROM (SELECT version() AS v) AS foo
|
|
|
|
) AS constants
|
|
|
|
GROUP BY 1,2,3,4,5
|
|
|
|
) AS foo
|
|
|
|
) AS rs
|
|
|
|
JOIN pg_class cc ON cc.relname = rs.tablename
|
|
|
|
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
|
|
|
|
LEFT JOIN pg_index i ON indrelid = cc.oid
|
|
|
|
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
|
|
|
|
) AS sml
|
|
|
|
ORDER BY wastedbytes DESC
|
|
|
|
```
|
|
|
|
|
|
|
|
### Another way
|
|
|
|
|
|
|
|
It is rumored, however, that the above query is not very accurate. A
[better option](https://blog.ioguix.net/postgresql/2014/03/28/Playing-with-indexes-and-better-bloat-estimate.html) seems to be this, more complicated, query:
|
|
|
|
|
|
|
|
```
|
|
|
|
-- change to the max number of field per index if not default.
|
|
|
|
\set index_max_keys 32
|
|
|
|
-- (readonly) IndexTupleData size
|
|
|
|
\set index_tuple_hdr 2
|
|
|
|
-- (readonly) ItemIdData size
|
|
|
|
\set item_pointer 4
|
|
|
|
-- (readonly) IndexAttributeBitMapData size
|
|
|
|
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8
|
|
|
|
|
|
|
|
SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
|
|
|
|
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END AS wastedbytes,
|
|
|
|
CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
|
|
|
|
FROM (
|
|
|
|
SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
|
|
|
|
ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
|
|
|
|
CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
|
|
|
|
) AS otta
|
|
|
|
FROM (
|
|
|
|
SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
|
|
|
|
( index_tuple_hdr_bm +
|
|
|
|
maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
|
|
|
|
WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
|
|
|
|
ELSE index_tuple_hdr_bm%maxalign
|
|
|
|
END
|
|
|
|
+ nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
|
|
|
|
WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
|
|
|
|
ELSE nulldatawidth::integer%maxalign
|
|
|
|
END
|
|
|
|
)::numeric AS nulldatahdrwidth, pagehdr
|
|
|
|
FROM (
|
|
|
|
SELECT
|
|
|
|
i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
|
|
|
|
current_setting('block_size')::numeric AS bs,
|
|
|
|
/* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
|
|
|
|
CASE
|
|
|
|
WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
|
|
|
|
ELSE 4
|
|
|
|
END AS maxalign,
|
|
|
|
/* per page header, fixed size: 20 for 7.X, 24 for others */
|
|
|
|
CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
|
|
|
|
THEN 24
|
|
|
|
ELSE 20
|
|
|
|
END AS pagehdr,
|
|
|
|
/* per tuple header: add index_attribute_bm if some cols are null-able */
|
|
|
|
CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
|
|
|
|
THEN :index_tuple_hdr
|
|
|
|
ELSE :index_tuple_hdr + :index_attribute_bm
|
|
|
|
END AS index_tuple_hdr_bm,
|
|
|
|
/* data len: we remove null values save space using it fractionnal part from stats */
|
|
|
|
sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
|
|
|
|
FROM pg_attribute AS a
|
|
|
|
JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
|
|
|
|
JOIN (
|
|
|
|
SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
|
|
|
|
FROM pg_index
|
|
|
|
JOIN pg_class ON pg_class.oid=pg_index.indexrelid
|
|
|
|
JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
|
|
|
|
) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
|
|
|
|
WHERE a.attnum > 0
|
|
|
|
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
|
|
|
|
) AS s1
|
|
|
|
) AS s2
|
|
|
|
LEFT JOIN pg_am am ON s2.relam = am.oid
|
|
|
|
) as sub
|
|
|
|
JOIN pg_class c ON c.oid=sub.table_oid
|
|
|
|
ORDER BY wastedbytes;
|
|
|
|
```
|
|
|
|
|
|
|
|
It was modified to sort the output by `wastedbytes`.
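
Note that, because of the `\set` variables at the top, this query
must be run through `psql`, which is what expands them. For example,
assuming it was saved to a (hypothetical) `index-bloat.sql` file:

    psql -d bacula -f index-bloat.sql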
|
|
|
|
|
|
|
|
### Grouped output
|
|
|
|
|
|
|
|
One disadvantage of the above query is that tables and indexes are
|
|
|
|
displayed separately. How do we know which belongs to which? It also
|
|
|
|
makes it less obvious what the big tables are, and which ones are
|
|
|
|
important.
|
|
|
|
|
|
|
|
This one comes from the [pgx_scripts GitHub repo](https://github.com/pgexperts/pgx_scripts), and is a [130+
|
|
|
|
line SQL query](https://github.com/pgexperts/pgx_scripts/blob/master/bloat/table_bloat_check.sql):
|
|
|
|
|
|
|
|
```
|
|
|
|
-- new table bloat query
|
|
|
|
-- still needs work; is often off by +/- 20%
|
|
|
|
WITH constants AS (
|
|
|
|
-- define some constants for sizes of things
|
|
|
|
-- for reference down the query and easy maintenance
|
|
|
|
SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 8 AS ma
|
|
|
|
),
|
|
|
|
no_stats AS (
|
|
|
|
-- screen out tables which have attributes
-- that don't have stats, such as JSON
|
|
|
|
SELECT table_schema, table_name,
|
|
|
|
n_live_tup::numeric as est_rows,
|
|
|
|
pg_table_size(relid)::numeric as table_size
|
|
|
|
FROM information_schema.columns
|
|
|
|
JOIN pg_stat_user_tables as psut
|
|
|
|
ON table_schema = psut.schemaname
|
|
|
|
AND table_name = psut.relname
|
|
|
|
LEFT OUTER JOIN pg_stats
|
|
|
|
ON table_schema = pg_stats.schemaname
|
|
|
|
AND table_name = pg_stats.tablename
|
|
|
|
AND column_name = attname
|
|
|
|
WHERE attname IS NULL
|
|
|
|
AND table_schema NOT IN ('pg_catalog', 'information_schema')
|
|
|
|
GROUP BY table_schema, table_name, relid, n_live_tup
|
|
|
|
),
|
|
|
|
null_headers AS (
|
|
|
|
-- calculate null header sizes
|
|
|
|
-- omitting tables which dont have complete stats
|
|
|
|
-- and attributes which aren't visible
|
|
|
|
SELECT
|
|
|
|
hdr+1+(sum(case when null_frac <> 0 THEN 1 else 0 END)/8) as nullhdr,
|
|
|
|
SUM((1-null_frac)*avg_width) as datawidth,
|
|
|
|
MAX(null_frac) as maxfracsum,
|
|
|
|
schemaname,
|
|
|
|
tablename,
|
|
|
|
hdr, ma, bs
|
|
|
|
FROM pg_stats CROSS JOIN constants
|
|
|
|
LEFT OUTER JOIN no_stats
|
|
|
|
ON schemaname = no_stats.table_schema
|
|
|
|
AND tablename = no_stats.table_name
|
|
|
|
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
|
|
|
|
AND no_stats.table_name IS NULL
|
|
|
|
AND EXISTS ( SELECT 1
|
|
|
|
FROM information_schema.columns
|
|
|
|
WHERE schemaname = columns.table_schema
|
|
|
|
AND tablename = columns.table_name )
|
|
|
|
GROUP BY schemaname, tablename, hdr, ma, bs
|
|
|
|
),
|
|
|
|
data_headers AS (
|
|
|
|
-- estimate header and row size
|
|
|
|
SELECT
|
|
|
|
ma, bs, hdr, schemaname, tablename,
|
|
|
|
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
|
|
|
|
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
|
|
|
|
FROM null_headers
|
|
|
|
),
|
|
|
|
table_estimates AS (
|
|
|
|
-- make estimates of how large the table should be
|
|
|
|
-- based on row and page size
|
|
|
|
SELECT schemaname, tablename, bs,
|
|
|
|
reltuples::numeric as est_rows, relpages * bs as table_bytes,
|
|
|
|
CEIL((reltuples*
|
|
|
|
(datahdr + nullhdr2 + 4 + ma -
|
|
|
|
(CASE WHEN datahdr%ma=0
|
|
|
|
THEN ma ELSE datahdr%ma END)
|
|
|
|
)/(bs-20))) * bs AS expected_bytes,
|
|
|
|
reltoastrelid
|
|
|
|
FROM data_headers
|
|
|
|
JOIN pg_class ON tablename = relname
|
|
|
|
JOIN pg_namespace ON relnamespace = pg_namespace.oid
|
|
|
|
AND schemaname = nspname
|
|
|
|
WHERE pg_class.relkind = 'r'
|
|
|
|
),
|
|
|
|
estimates_with_toast AS (
|
|
|
|
-- add in estimated TOAST table sizes
|
|
|
|
-- estimate based on 4 toast tuples per page because we dont have
|
|
|
|
-- anything better. also append the no_data tables
|
|
|
|
SELECT schemaname, tablename,
|
|
|
|
TRUE as can_estimate,
|
|
|
|
est_rows,
|
|
|
|
table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
|
|
|
|
expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
|
|
|
|
FROM table_estimates LEFT OUTER JOIN pg_class as toast
|
|
|
|
ON table_estimates.reltoastrelid = toast.oid
|
|
|
|
AND toast.relkind = 't'
|
|
|
|
),
|
|
|
|
table_estimates_plus AS (
|
|
|
|
-- add some extra metadata to the table data
|
|
|
|
-- and calculations to be reused
|
|
|
|
-- including whether we cant estimate it
|
|
|
|
-- or whether we think it might be compressed
|
|
|
|
SELECT current_database() as databasename,
|
|
|
|
schemaname, tablename, can_estimate,
|
|
|
|
est_rows,
|
|
|
|
CASE WHEN table_bytes > 0
|
|
|
|
THEN table_bytes::NUMERIC
|
|
|
|
ELSE NULL::NUMERIC END
|
|
|
|
AS table_bytes,
|
|
|
|
CASE WHEN expected_bytes > 0
|
|
|
|
THEN expected_bytes::NUMERIC
|
|
|
|
ELSE NULL::NUMERIC END
|
|
|
|
AS expected_bytes,
|
|
|
|
CASE WHEN expected_bytes > 0 AND table_bytes > 0
|
|
|
|
AND expected_bytes <= table_bytes
|
|
|
|
THEN (table_bytes - expected_bytes)::NUMERIC
|
|
|
|
ELSE 0::NUMERIC END AS bloat_bytes
|
|
|
|
FROM estimates_with_toast
|
|
|
|
UNION ALL
|
|
|
|
SELECT current_database() as databasename,
|
|
|
|
table_schema, table_name, FALSE,
|
|
|
|
est_rows, table_size,
|
|
|
|
NULL::NUMERIC, NULL::NUMERIC
|
|
|
|
FROM no_stats
|
|
|
|
),
|
|
|
|
bloat_data AS (
|
|
|
|
-- do final math calculations and formatting
|
|
|
|
select current_database() as databasename,
|
|
|
|
schemaname, tablename, can_estimate,
|
|
|
|
table_bytes, round(table_bytes/(1024^2)::NUMERIC,3) as table_mb,
|
|
|
|
expected_bytes, round(expected_bytes/(1024^2)::NUMERIC,3) as expected_mb,
|
|
|
|
round(bloat_bytes*100/table_bytes) as pct_bloat,
|
|
|
|
round(bloat_bytes/(1024::NUMERIC^2),2) as mb_bloat,
|
|
|
|
table_bytes, expected_bytes, est_rows
|
|
|
|
FROM table_estimates_plus
|
|
|
|
)
|
|
|
|
-- filter output for bloated tables
|
|
|
|
SELECT databasename, schemaname, tablename,
|
|
|
|
can_estimate,
|
|
|
|
est_rows,
|
|
|
|
pct_bloat, mb_bloat,
|
|
|
|
table_mb
|
|
|
|
FROM bloat_data
|
|
|
|
-- this where clause defines which tables actually appear
|
|
|
|
-- in the bloat chart
|
|
|
|
-- example below filters for tables which are either 50%
-- bloated with more than 10MB of bloat, or more than 25%
-- bloated with more than 1GB of bloat
|
|
|
|
WHERE ( pct_bloat >= 50 AND mb_bloat >= 10 )
|
|
|
|
OR ( pct_bloat >= 25 AND mb_bloat >= 1000 )
|
|
|
|
ORDER BY mb_bloat DESC;
|
|
|
|
```
|
|
|
|
|
|
|
|
It will only show tables with significant bloat, as defined by the
`WHERE` clause in the last few lines above, which makes the output
much more readable.
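
To avoid copy-pasting the whole thing, the query can also be run
straight from a checkout of that repository (the database name here
is just an example):

    git clone https://github.com/pgexperts/pgx_scripts
    psql -d bacula -f pgx_scripts/bloat/table_bloat_check.sql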
|
|
|
|
|
|
|
|
There's also [this other query](https://github.com/ioguix/pgsql-bloat-estimation) we haven't evaluated.
|
|
|
|
|
|
|
|
## Recovering disk space
|
|
|
|
|
|
|
|
In some cases, you do need to reclaim actual operating system disk
space from the PostgreSQL server (see above to check whether you
actually do). This can happen, for example, if you have [removed
years of old data from a database](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525).
|
|
|
|
|
|
|
|
### VACUUM FULL
|
|
|
|
|
|
|
|
Typically this is done with the `VACUUM FULL` command (as opposed to
plain `VACUUM`, which is what the auto-vacuum daemon runs, see [this
discussion for details](https://www.postgresql.org/docs/9.1/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY)). This will actually rewrite all the
tables to make sure only the relevant data is stored on disk. It's
roughly the equivalent of a dump/restore, except it is faster.
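
For example, to rewrite a single table (here the Bacula `file` table
mentioned below), or the whole current database:

```
-- rewrites one table, holding an ACCESS EXCLUSIVE lock for the duration
VACUUM FULL VERBOSE file;

-- rewrites every table in the current database
VACUUM FULL;
```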
|
|
|
|
|
|
|
|
### pg_repack
|
|
|
|
|
|
|
|
For very large changes (say, dozens of terabytes) however, `VACUUM
FULL` (and even plain `VACUUM`) can be prohibitively slow (think
days). And while `VACUUM` doesn't require an exclusive lock on the
tables it's working on, `VACUUM FULL` *does*, which implies a
significant outage.
|
|
|
|
|
|
|
|
An alternative to that method is the [pg_repack extension](https://github.com/reorg/pg_repack), which
|
|
|
|
is [packaged in Debian](https://tracker.debian.org/pkg/pg-repack). In Debian 10 buster, the following
|
|
|
|
procedure was used on `bacula-director-01` to purge old data about
|
|
|
|
removed Bacula clients that hadn't been cleaned up in years:
|
|
|
|
|
|
|
|
apt install postgresql-11-repack
|
|
|
|
|
|
|
|
Then install the extension on the database, as the `postgres` user
(`sudo -u postgres -i`); this needs to be done only once:
|
|
|
|
|
|
|
|
psql -c "CREATE EXTENSION pg_repack" -d bacula
|
|
|
|
|
|
|
|
Then, for each table:
|
|
|
|
|
|
|
|
pg_repack -d bacula --table media
|
|
|
|
|
|
|
|
It is a good idea to start with a small table we can afford to lose,
|
|
|
|
just in case something goes wrong. That job took about 2 hours on a
|
|
|
|
very large table (150GB, `file`). The entire Bacula database went from
|
|
|
|
using 161GB to 91GB after that cleanup, see [this ticket for
|
|
|
|
details](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525#note_2763212).
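
If in doubt about what a given invocation will do, `pg_repack` also
has a `--dry-run` flag which only prints what would be repacked,
without touching anything:

    pg_repack --dry-run -d bacula --table media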
|
|
|
|
|
|
|
|
When done, drop the `pg_repack` extension:
|
|
|
|
|
|
|
|
DROP EXTENSION pg_repack;
|
|
|
|
|
|
|
|
Also note that, after the repack, `VACUUM` performance improved
|
|
|
|
significantly, going from hours (if not days) to minutes.
|
|
|
|
|
|
|
|
Note that [pg_squeeze](https://github.com/cybertec-postgresql/pg_squeeze) is another alternative to `pg_repack`, but
|
|
|
|
it isn't available in Debian.
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
TODO: there's some more information about backup handling in the