it's not very clear what's happening right now, but it looks like there's an ongoing outage in the ganeti dal cluster.
i'll try to document this the best i can, but for now i'll just open this issue.
pending issues:
telegram-bot not reachable
telegram-bot bacula crashing
static-shim down
tb-pkg-stage-01 down
redis liveness on crm-int-01 from crm-ext-01
henryi systemd degraded (unrelated)
relay-01 NRPE socket timeout (unrelated)
the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01 the file was simply a short series of # characters (ASCII 0x23) and nothing else. it was also somehow marked as corrupt in the filesystem, and inaccessible until a reboot (which failed and dropped into the initramfs, forcing a fsck). the fsck found the file to be damaged and moved it to /lost+found, where it lies now.
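to check whether other nodes or instances are silently carrying the same corruption, a quick sweep of /etc/network/interfaces is enough. this is just a rough sketch: the host list is hypothetical and it assumes root ssh access:

# dump the first bytes of the file on each host; a healthy file starts with readable "auto"/"iface" lines
for h in telegram-bot-01 static-gitlab-shim tb-pkgstage-01; do echo "== $h"; ssh root@$h.torproject.org 'od -c /etc/network/interfaces | head -5'; done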
current theory is a memory corruption error. follow-up tasks:
dal-node-03 memtest
dal-node-02 evac
dal-node-02 memtest
dal-node-01 evac
dal-node-01 memtest
rebalance cluster
netconsole setup to send kernel logs to dal-rescue-01
DRBD verification test (drbdsetup verify and --verify-alg; @lavamind found issues on gnt-dal but also gnt-fsn); see the rough sketch after this list
cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
enable data-integrity-alg on a subset of the VMs, at least one VM per node
try to reproduce the issue again
fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
move storage traffic to the intel NIC (eth1 -> eth3)
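for the DRBD and netconsole items above, here's a rough sketch of what this could look like on one node. the resource name, addresses and MAC are made up, and since ganeti drives its DRBD devices through drbdsetup directly (no /etc/drbd.d resources), the drbdadm form below is only the generic illustration:

# 1. online verification: set verify-alg (and optionally data-integrity-alg) in the resource's
#    net section, e.g. net { verify-alg sha1; data-integrity-alg sha1; }, then:
drbdadm adjust r0      # r0 is a hypothetical resource name
drbdadm verify r0      # progress is visible in /proc/drbd
grep -i 'out of sync' /var/log/kern.log   # out-of-sync blocks are reported in the kernel log

# 2. netconsole: stream kernel messages to dal-rescue-01 (addresses and MAC are made up)
modprobe netconsole netconsole=6665@10.0.0.11/eth0,6666@10.0.0.250/aa:bb:cc:dd:ee:ff
# on dal-rescue-01, something like "nc -u -l -p 6666 | tee kernel.log" will capture them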
Current status: mitigation removed after network reconfiguration, watching for signs of recurrence
It seems like what happened is that yesterday @kez did routine reboots for security upgrades, and some machines on gnt-dal, though they rebooted successfully, lost their network access because /etc/network/interfaces somehow got corrupted.
Restoring this file and rebooting fixed the problem on telegram-bot-01 and static-gitlab-shim.
On static-gitlab-shim, I kept the corrupted /etc/network/interfaces file around as /etc/network/interfaces.bak. It contains only the characters 77 (no line break).
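For reference, "restoring" here just means putting a plain ifupdown config back in place. A minimal sketch of what these files normally contain (the addresses below are made up; the real values come from the host's existing records):

# /etc/network/interfaces (hypothetical addresses)
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 203.0.113.10/24
    gateway 203.0.113.1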
we're taking a short break on this to think about the root cause. we haven't figured anything out yet, but i made a timeline. it seems unrelated to yesterday's datacenter intervention to deploy dal-rescue-01 (#41135 (closed)), and more related to the ganeti node reboots.
@kez, did anything out of the ordinary happen in the gnt-dal cluster reboot?
when i rebooted dal-node-01, the reboot script timed out and i had to manually gnt-instance start all the instances. from the bash history:
for h in dangerzone-01 forum-test-01 gitlab-dev-01 metrics-psqlts-01 onionbalance-02 static-gitlab-shim tb-tester-01 telegram-bot-01 web-dal-07; do gnt-instance start $h.torproject.org; done
they all seemed happy when i left, but i wonder if the reboot was what corrupted those files
The contents of the faulty /etc/network/interfaces on tb-pkgstage-01 look like syslog/journald messages being written in the wrong place on the filesystem.
on tb-pkgstage-01 there was also a corrupted /etc/network/interfaces.d/eth0. it was moved to /lost+found/#377 by fsck (where I renamed it again to keep track of its original name). here are the contents:
Memory test started. It took a while because I mistakenly thought booting the GRML live image would give me the usual menu prompt to go to Memtest but actually we only get that when booting the actual ISO.
So I installed the memtest86+ package and booted into that, but it kept hanging until I realized I should try with SMP off, and now it's finally memtesting!
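For future reference, the working recipe boils down to installing memtest86+ on the node itself instead of relying on the live image's menu; a rough sketch on Debian with GRUB:

apt install memtest86+   # ships a /etc/grub.d snippet that adds a memory test menu entry
update-grub              # regenerate the GRUB menu so the entry shows up
reboot                   # then pick the memtest86+ entry at the boot menu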
also, about that particular issue: it seems like it was unrelated. the problem was that the ipsec tunnel between crm-int-01 and crm-ext-01 was not correctly set up, which (naturally) broke the redis backend.
it's not very clear why it happened, but when i tried to manually bring up the tunnel, it failed with some sort of "permission denied" error. i had a hunch this was due to some kernel module not loading, so i rebooted both boxes and everything came back normally.
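for posterity, a rough sketch of the checks that would have confirmed (or ruled out) the missing-kernel-module hunch; this assumes a strongSwan-style setup and a made-up connection name, so adjust to the actual stack:

lsmod | grep -E 'xfrm|esp|af_key'   # are the IPsec-related kernel modules loaded?
ipsec statusall                     # strongSwan: list configured and established tunnels
ipsec up crm-tunnel                 # hypothetical connection name; this is the step that failed with "permission denied"
dmesg | tail -50                    # look for xfrm/module errors around the time of the failure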
there was an outage at Hetzner's vSwitch setup earlier and i suspect this was the cause, possibly combined with crm-int-01's reboot. in any case, it's solved now, and shouldn't be considered part of this incident other than that it was noticed at the same time...
this outage lasted:
Date: Tue, 16 May 2023 20:03:17 +0000
Date: Wed, 17 May 2023 15:24:47 +0000
i opened #41178 (closed) about the hetzner outage, and after making the timeline i noticed that the hetzner outage happened after the outage here started: it was between 2023-05-17 12:18UTC and 12:41UTC, while the problem here started a full 16 hours earlier and ended about 3 hours later.
so the crm-int outage does not seem related to the hetzner issue at all.