colchicifolium backups are barely functional
colchicifolium's backup frequently come under the attention of Nagios:
2022-02-07 09:25:38 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is WARNING: WARN: Last backup of colchicifolium.torproject.org/any is 1.27 days old.
2022-02-09 06:25:48 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is CRITICAL: CRITICAL: Last backup of colchicifolium.torproject.org/any is 3.14 days old.
2022-02-11 08:25:44 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is OK: OK: Last backup of colchicifolium.torproject.org/any is 0.04 days old.
2022-02-17 00:40:40 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is WARNING: WARN: Last backup of colchicifolium.torproject.org/any is 1.42 days old.
2022-02-18 16:40:38 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is CRITICAL: CRITICAL: Last backup of colchicifolium.torproject.org/any is 3.08 days old.
2022-02-19 02:40:45 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is OK: OK: Last backup of colchicifolium.torproject.org/any is 0.03 days old.
2022-02-26 15:55:43 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is WARNING: WARN: Last backup of colchicifolium.torproject.org/any is 1.35 days old.
2022-02-28 07:55:50 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is CRITICAL: CRITICAL: Last backup of colchicifolium.torproject.org/any is 3.02 days old.
2022-03-01 08:25:47 <nsa> tor-nagios: [colchicifolium] backup - bacula - last backup is CRITICAL: CRITICAL: Last backup of colchicifolium.torproject.org/any is 3.85 days old.
The current backup has been running since 2022-02-27 18:37:53
, about
4 days ago at the time of writing.
This is the timing of the known jobs on that server in the current Bacula director:
bacula=# SELECT level, jobstatus, starttime, endtime, (CASE WHEN endtime IS NULL THEN NOW() ELSE endtime END)-starttime AS duration, jobfiles, jobbytes FROM job WHERE name='colchicifolium.torproject.org' ORDER by starttime;
level | jobstatus | starttime | endtime | duration | jobfiles | jobbytes
-------+-----------+---------------------+---------------------+------------------------+----------+--------------
I | f | 2015-12-03 09:57:02 | 2015-12-03 09:57:02 | 00:00:00 | 0 | 0
D | f | 2017-12-09 00:35:08 | 2017-12-09 00:35:08 | 00:00:00 | 0 | 0
D | f | 2019-03-05 18:15:28 | 2019-03-05 18:15:28 | 00:00:00 | 0 | 0
F | f | 2019-07-22 14:06:13 | 2019-07-22 14:06:13 | 00:00:00 | 0 | 0
I | f | 2019-09-07 20:02:52 | 2019-09-07 20:02:52 | 00:00:00 | 0 | 0
I | f | 2020-12-11 02:06:57 | 2020-12-11 02:06:57 | 00:00:00 | 0 | 0
F | T | 2021-10-30 04:18:48 | 2021-10-31 05:32:59 | 1 day 01:14:11 | 2973523 | 409597402632
F | T | 2021-12-10 06:06:18 | 2021-12-12 01:41:37 | 1 day 19:35:19 | 3404504 | 456273938172
D | E | 2022-01-12 15:03:53 | 2022-01-14 21:57:32 | 2 days 06:53:39 | 5029124 | 123658942337
D | T | 2022-01-15 01:57:38 | 2022-01-17 17:24:20 | 2 days 15:26:42 | 5457677 | 130269432219
F | T | 2022-01-19 22:33:54 | 2022-01-22 14:41:49 | 2 days 16:07:55 | 4336473 | 516207537019
I | T | 2022-01-26 14:12:52 | 2022-01-26 16:25:40 | 02:12:48 | 185016 | 7712392837
I | T | 2022-01-27 14:06:35 | 2022-01-27 16:47:50 | 02:41:15 | 188625 | 8433225061
D | T | 2022-01-28 06:21:56 | 2022-01-28 18:13:24 | 11:51:28 | 1364571 | 28815354895
I | T | 2022-01-29 06:41:31 | 2022-01-29 10:12:46 | 03:31:15 | 178896 | 33790932680
I | T | 2022-01-30 04:46:21 | 2022-01-30 07:10:41 | 02:24:20 | 177074 | 7298789209
I | T | 2022-01-31 04:19:19 | 2022-01-31 13:18:59 | 08:59:40 | 203085 | 37604120762
I | T | 2022-02-01 04:11:16 | 2022-02-01 07:11:08 | 02:59:52 | 195922 | 41592974842
I | T | 2022-02-02 04:30:15 | 2022-02-02 06:39:15 | 02:09:00 | 190243 | 8548513453
I | T | 2022-02-03 02:55:37 | 2022-02-03 06:25:57 | 03:30:20 | 186250 | 6138223644
I | T | 2022-02-04 01:06:54 | 2022-02-04 04:19:46 | 03:12:52 | 187868 | 8892468359
I | T | 2022-02-05 01:46:11 | 2022-02-05 04:09:50 | 02:23:39 | 194623 | 8754299644
I | T | 2022-02-06 01:45:57 | 2022-02-06 08:02:29 | 06:16:32 | 208416 | 9582975941
D | T | 2022-02-06 21:07:00 | 2022-02-11 12:31:37 | 4 days 15:24:37 | 3428690 | 57424284749
I | T | 2022-02-11 12:38:30 | 2022-02-11 18:52:52 | 06:14:22 | 590289 | 18987945922
I | T | 2022-02-12 14:03:10 | 2022-02-12 16:36:49 | 02:33:39 | 190798 | 6760825592
I | T | 2022-02-13 13:45:42 | 2022-02-13 15:34:05 | 01:48:23 | 189130 | 7132469485
I | T | 2022-02-14 15:19:05 | 2022-02-14 18:58:24 | 03:39:19 | 199895 | 6797607219
I | T | 2022-02-15 15:25:05 | 2022-02-15 19:40:27 | 04:15:22 | 199052 | 8115940960
D | T | 2022-02-15 20:24:17 | 2022-02-19 06:54:49 | 3 days 10:30:32 | 4967994 | 77854030910
I | T | 2022-02-19 07:02:32 | 2022-02-19 18:23:59 | 11:21:27 | 496812 | 24270098875
I | T | 2022-02-20 07:45:46 | 2022-02-20 10:45:13 | 02:59:27 | 174086 | 7179666980
I | T | 2022-02-21 06:57:49 | 2022-02-21 11:51:18 | 04:53:29 | 182035 | 15512560970
I | T | 2022-02-22 05:10:39 | 2022-02-22 07:57:01 | 02:46:22 | 172397 | 7210544658
I | T | 2022-02-23 06:36:44 | 2022-02-23 13:17:10 | 06:40:26 | 211809 | 29150059606
I | T | 2022-02-24 05:39:43 | 2022-02-24 09:57:25 | 04:17:42 | 179419 | 7469834934
I | T | 2022-02-25 05:30:58 | 2022-02-25 12:32:09 | 07:01:11 | 202945 | 30792174057
D | f | 2022-02-25 12:33:48 | 2022-02-25 12:33:48 | 00:00:00 | 0 | 0
D | R | 2022-02-27 18:37:53 | | 4 days 02:16:38.136257 | 0 | 0
(39 rows)
In there, the current backup has been running for 4 days, and is nowhere near done, as it has "only" backed up 1.8M files so far:
JobId Type Level Files Bytes Name Status
======================================================================
187892 Back Diff 1,799,185 56.09 G colchicifolium.torproject.org is running
In comparison, the last differential backed up 4.9M files, so this could go on for a long time.
So we need to find a solution to fix collector's backup. A few observations:
-
almost every file seems to change about every other week, a full backup being ~5M files and the last differential being also ~5M files
-
maybe it's just Bacula that is slow at crawling: backing up 5M files and 500GB shouldn't be that slow... at 2.5days, it's "only" 18mbps, far from the line rate of the network (gigabit) or disk (200+MB/s AKA 1600Mbps on the backup server disk and the ganeti node)
-
in general, incrementals are somewhat fast enough (a handful of hours), but full backups and differentials are slow enough to block backups for a number of days and trigger those warnings
-
during that time, needless to say the backup itself is inconsistent: files get added, removed, or changed constantly during that time and Bacula is not taking this into account
-
the current backup is particularly suspicious in that it has "only" backed up 1.8M files in four days while transfering a comparatively large 56GB... typically, when we transfer ~50GB, we end up with 3.6M files, or, inversely, 1.5M files is closer to 29GB of content
-
we also recently had to add disk space on that server (#40535 (closed))
/cc @hiro