disk on weather-01 is almost full

assigned to @lelutin

The partition is currently used mostly by PostgreSQL's on-disk storage. The WAL in particular occupies 5.4 GB, which is the vast majority of the PostgreSQL storage.
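One way to check the WAL size from inside postgres, as a minimal sketch: this assumes PostgreSQL 10 or later (where `pg_ls_waldir()` exists) and a session with superuser or `pg_monitor` privileges.

```sql
-- Count the WAL segment files and sum up their sizes.
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```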

According to the graphs, disk usage started growing linearly on September 23rd.

There's probably something that has been preventing postgres from committing its WAL and then getting rid of old entries.

this is interesting:

postgres=# SELECT slot_name,
       pg_wal_lsn_diff(
          pg_current_wal_lsn(),
          restart_lsn
       ) AS bytes_behind,
       active,
       wal_status
FROM pg_replication_slots
WHERE wal_status <> 'lost'
ORDER BY restart_lsn;
 slot_name | bytes_behind | active | wal_status 
-----------+--------------+--------+------------
 barman    |   5811047248 | f      | extended
(1 row)

maybe something related to barman is holding the WAL back? the barman slot is inactive and its bytes_behind value is roughly the size of the WAL.

cleared out the replication slot related to barman:

postgres=# select pg_drop_replication_slot('barman');
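A slightly more defensive variant of the drop above, as a sketch only. `pg_drop_replication_slot()` raises an error on an active slot anyway, so the `NOT active` filter mostly makes the intent explicit and returns zero rows instead of an error.

```sql
-- Drop the barman slot only if no consumer is currently connected to it.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'barman'
  AND NOT active;
```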

but disk usage still hasn't gone down. the checkpoint interval seems to be set to 15 minutes (default value), so maybe I need to wait until postgres performs its next checkpoint.
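If waiting is not an option, a checkpoint can also be requested by hand. This is a sketch that assumes a superuser session; whether old WAL segments actually disappear right away still depends on settings such as `wal_keep_size` and `min_wal_size`.

```sql
-- Show the configured interval between automatic checkpoints.
SHOW checkpoint_timeout;

-- Request an immediate checkpoint; WAL segments that are no longer
-- needed are removed or recycled as part of it.
CHECKPOINT;
```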

nice, postgres finally cleaned up its WAL and disk usage is now back down to 37%. so the problem is solved.

FYI the query in the comment above for finding the replication slot was found on: https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/

I don't see a mention of the problem and/or solution in our documentation -- most probably because we don't use replication anywhere, that's a technique that barman is adding to our setup -- so I'll add a tiny something before closing this incident.

mentioned in commit wiki-replica@1ae70da4

closed

changed the incident status to Resolved by closing the incident

mentioned in issue #40950
