colchicifolium full backups take too long
colchicifolium (AKA colchi) has been having trouble doing full backups, and we need to figure out a way to fix this.
Steps to reproduce
- do nothing, relax, have the good life, good old days you know?
- realize there's an LLM apocalypse under way
- periodically frantically reboot the fleet in a panic because of the 0day flood
What is the current bug behavior?
colchi hasn't succeeded in performing a full backup since 2026-03-31 19:46:59. the current full has been running for almost 4 days at this point.
coincidentally, reboots and a postgresql upgrade interrupted another upgrade, which clogged the entire backup queue (an issue that was not reported, but dealt with separately).
What is the expected correct behavior?
the previous full backup ran in under 2 days.
When did this start?
the first failure, strangely, is on 2026-04-02 04:43:37, only a couple days after the successful full backup.
but this is not the first time we've had issues like this. we've also had such a problem in 2022, see #40650 (closed) which we could investigate for possible fixes.
Relevant logs and/or screenshots
here's the history of full backups from the database:
bacula=# SELECT level, jobstatus, starttime, endtime, (CASE WHEN endtime IS NULL THEN NOW() ELSE endtime END)-starttime AS duration, jobfiles, pg_size_pretty(jobbytes) FROM job WHERE name='colchicifolium.torproject.org' and level='F' ORDER by starttime;
level | jobstatus | starttime | endtime | duration | jobfiles | pg_size_pretty
-------+-----------+---------------------+---------------------+------------------------+----------+----------------
F | f | 2019-07-22 14:06:13 | 2019-07-22 14:06:13 | 00:00:00 | 0 | 0 bytes
F | f | 2025-11-18 14:49:56 | 2025-11-18 14:49:56 | 00:00:00 | 0 | 0 bytes
F | T | 2026-02-15 04:30:46 | 2026-02-16 15:12:08 | 1 day 10:41:22 | 4954881 | 1728 GB
F | T | 2026-03-31 19:46:59 | 2026-04-02 04:42:19 | 1 day 08:55:20 | 4871474 | 1335 GB
F | f | 2026-04-02 04:43:37 | 2026-04-03 02:47:30 | 22:03:53 | 506243 | 546 GB
F | f | 2026-04-03 06:47:35 | 2026-04-03 06:47:35 | 00:00:00 | 0 | 0 bytes
F | f | 2026-05-13 17:39:21 | 2026-05-13 17:39:21 | 00:00:00 | 0 | 0 bytes
F | f | 2026-05-16 03:18:11 | 2026-05-16 03:18:11 | 00:00:00 | 0 | 0 bytes
F | A | 2026-05-22 01:51:46 | | 7 days 11:46:05.122086 | 0 | 0 bytes
F | R | 2026-05-25 18:06:22 | | 3 days 19:31:29.122086 | 0 | 0 bytes
(10 rows)
the backup has been running for 3 days 19:31:29, almost 4 days! according to the console, it has backed up a bit under 60% of the ~5M files:
JobId Type Level Files Bytes Name Status
======================================================================
339448 Back Full 2,865,792 984.3 G colchicifolium.torproject.org is running
By a simple rule of three, taking the 4,954,881 files of the February full as the expected total, we can estimate the full backup will take another 66 hours, or about 2 days and 18h.
qalc> x / (3 days + 19h + 31min) = (4954881−2865792)/2865792 to h
(x/((3 days) + (19 hours) + (31 minutes))) = ((4 954 881 − 2 865 792)/2 865 792) to hour
≈ x = 66,713 307 054 4 h
Possible fixes
- switch to borg (see #42677)
- tweak the server (see #40650 (closed), check grafana node exporter graphs)
- check with @hiro whether it's possible to reduce the number of files on that server
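For reference, the remaining-time estimate from the logs section can be reproduced without qalc. This is only a rough sketch: it assumes files are processed at a constant rate (which ignores file sizes), and that the total matches the 4,954,881 files of the February full. The function name `estimate_remaining` is made up for this example.

```python
from datetime import timedelta


def estimate_remaining(elapsed: timedelta, files_done: int, files_total: int) -> timedelta:
    """Linear extrapolation (rule of three): remaining = elapsed * files_left / files_done."""
    if files_done <= 0:
        raise ValueError("no files backed up yet, cannot extrapolate")
    files_left = max(files_total - files_done, 0)
    return elapsed * (files_left / files_done)


# Numbers taken from the job table and console output in this issue.
elapsed = timedelta(days=3, hours=19, minutes=31)
remaining = estimate_remaining(elapsed, files_done=2_865_792, files_total=4_954_881)

hours = remaining.total_seconds() / 3600
print(f"{hours:.1f} h")  # prints "66.7 h"
```

Re-running this with fresh numbers from the console should show whether the job is keeping pace or slowing down.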