dsa-check-libs triggers the OOM-killer on shadow/chi-node-14
nagios is unhappy about services being down on chi-node-14:
10:03:44 <nsa> tor-nagios: [chi-node-14] process - ntpd is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, command name ntpd, args /usr/sbin/ntpd -p /var/run/ntpd.pid
10:04:44 <nsa> tor-nagios: [chi-node-14] process - postfix - master is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name master, args /usr/lib/postfix/sbin/master
according to systemd, at least ntpd was terminated with a SIGKILL, which, according to dmesg, is the OOM-killer's fault:
[775967.370568] Out of memory: Killed process 2027654 (ntpd) total-vm:78480kB, anon-rss:224kB, file-rss:0kB, shmem-rss:0kB, UID:108 pgtables:68kB oom_score_adj:0
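to see which other services were hit, the same kind of dmesg line can be pulled out mechanically. this is only a rough sketch: the regex is modelled on the single line above, and `parse_oom_kills` is a made-up helper name, not part of any existing tooling.

```python
import re

# Modelled on the "Out of memory: Killed process" line pasted above.
# Assumption: other kernel versions may word this message differently.
OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kills(lines):
    """Return a list of (pid, command, anon_rss_kb), one per OOM-kill line."""
    kills = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            kills.append(
                (int(match["pid"]), match["comm"], int(match["anon_rss_kb"]))
            )
    return kills


sample = (
    "[775967.370568] Out of memory: Killed process 2027654 (ntpd) "
    "total-vm:78480kB, anon-rss:224kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:108 pgtables:68kB oom_score_adj:0"
)
print(parse_oom_kills([sample]))  # [(2027654, 'ntpd', 224)]
```

note that ntpd only had 224kB resident when it was killed, so it was a victim rather than the culprit.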
at first i thought shadow was eating up all the memory, but it's actually behaving pretty well (2.5g resident). what's eating all the memory is the dsa-check-libs processes or, more specifically... lsof! (five lsof processes at roughly 140g to 170g resident each):
top - 14:06:31 up 8 days, 23:47, 1 user, load average: 161.16, 136.99, 141.52
Tasks: 2123 total, 11 running, 2112 sleeping, 0 stopped, 0 zombie
%Cpu(s): 22.4 us, 39.9 sy, 0.0 ni, 36.9 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1546797.+total, 134029.6 free, 1410797.+used, 1970.0 buff/cache
MiB Swap: 30720.0 total, 3407.8 free, 27312.2 used. 129521.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3338378 root 20 0 222.4g 2.5g 752744 S 3796 0.2 258:56.77 shadow
3258869 root 20 0 58.0g 57.8g 2200 R 100.0 3.8 11:42.23 dsa-check-libs
3273070 root 20 0 46.6g 46.3g 1228 R 100.0 3.1 12:50.24 dsa-check-libs
3287055 root 20 0 34.1g 33.4g 1268 R 100.0 2.2 14:09.52 dsa-check-libs
3302215 root 20 0 22.7g 22.6g 1236 R 100.0 1.5 15:37.86 dsa-check-libs
3316260 root 20 0 11.6g 11.6g 992 R 100.0 0.8 17:11.18 dsa-check-libs
3204203 root 20 0 174.0g 168.3g 284 R 99.3 11.1 504:46.41 lsof
3227095 root 20 0 157.3g 153.0g 312 R 99.3 10.1 447:42.66 lsof
3219968 root 20 0 169.8g 169.4g 288 R 98.4 11.2 478:26.18 lsof
3223533 root 20 0 163.7g 163.2g 300 R 97.4 10.8 463:12.19 lsof
3230838 root 20 0 151.0g 141.6g 332 D 85.3 9.4 433:40.74 lsof
3244850 root 20 0 71.8g 70.8g 2412 R 59.5 4.7 7:21.62 dsa-check-libs
504 root 20 0 0 0 0 S 28.4 0.0 13:32.20 kcompactd0
3219967 root 20 0 71.8g 71.1g 320 S 20.3 4.7 7:25.55 dsa-check-libs
3227094 root 20 0 71.2g 69.1g 288 S 19.9 4.6 7:31.91 dsa-check-libs
3223530 root 20 0 68.4g 68.3g 320 S 19.3 4.5 7:02.61 dsa-check-libs
3204202 root 20 0 56.2g 55.3g 332 S 18.3 3.7 6:17.70 dsa-check-libs
3230837 root 20 0 69.2g 67.5g 280 S 16.3 4.5 7:12.65 dsa-check-libs
4816 root 20 0 5922180 17728 0 S 2.0 0.0 36:54.13 containerd
3314923 root 20 0 0 0 0 I 2.0 0.0 0:21.53 kworker/u162:14-kcryptd/253:0
3306182 root 20 0 0 0 0 I 1.6 0.0 0:21.63 kworker/u162:7-kcryptd/253:0
3332128 root 20 0 0 0 0 I 1.6 0.0 0:02.36 kworker/u162:37-kcryptd/253:0
3353566 root 20 0 12416 5948 3040 R 1.6 0.0 0:01.10 top
3322573 root 20 0 0 0 0 I 1.3 0.0 0:19.27 kworker/u162:19-kcryptd/253:0
3330624 root 20 0 0 0 0 I 1.3 0.0 0:11.93 kworker/u162:24-kcryptd/253:0
3353609 root 20 0 0 0 0 I 1.3 0.0 0:00.30 kworker/u162:2-kcryptd/253:0
3328556 root 20 0 0 0 0 I 1.0 0.0 0:15.72 kworker/u162:28-kcryptd/253:0
3348164 root 20 0 328704 29212 28576 S 1.0 0.0 0:03.55 tor
3326231 root 20 0 0 0 0 I 0.7 0.0 0:15.25 kworker/u162:21-kcryptd/253:0
3348166 root 20 0 329084 28872 28236 S 0.7 0.0 0:04.14 tor
3349381 root 20 0 307164 6492 5856 S 0.7 0.0 0:00.18 tor
215 root 20 0 0 0 0 S 0.3 0.0 3:33.53 ksoftirqd/40
2883 root 20 0 0 0 0 S 0.3 0.0 30:47.42 dmcrypt_write/2
2920 root 0 -20 0 0 0 I 0.3 0.0 0:03.06 kworker/59:1H-kblockd
3322 root 0 -20 0 0 0 I 0.3 0.0 0:02.79 kworker/61:1H-kblockd
3330 root 0 -20 0 0 0 I 0.3 0.0 0:03.09 kworker/63:1H-kblockd
3391 root 0 -20 0 0 0 I 0.3 0.0 0:03.30 kworker/55:1H-kblockd
3322041 root 20 0 0 0 0 I 0.3 0.0 0:08.65 kworker/u161:10-kcryptd/253:0
3332025 root 20 0 0 0 0 I 0.3 0.0 0:03.59 kworker/0:1-events
3338290 root 20 0 2267244 50872 3996 S 0.3 0.0 0:05.23 tornettools
3348191 root 20 0 307164 6396 5760 S 0.3 0.0 0:00.20 tor
3348281 root 20 0 307164 6364 5728 S 0.3 0.0 0:00.23 tor
3348295 root 20 0 307164 6544 5908 S 0.3 0.0 0:00.27 tor
3348495 root 20 0 307164 6484 5848 S 0.3 0.0 0:00.21 tor
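to make the per-command totals in that top output easier to see, the RES column can be summed by command name. a minimal sketch, assuming top's default column order (`PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND`) and default KiB units with `m`/`g`/`t` suffixes; `res_to_kib` and `rss_by_command` are hypothetical helper names.

```python
from collections import defaultdict

# Assumption: top shows memory in KiB unless a suffix is present.
_UNIT_KIB = {"": 1, "k": 1, "m": 1024, "g": 1024**2, "t": 1024**3}


def res_to_kib(field):
    """Convert a top memory field such as '17728' or '168.3g' to KiB."""
    suffix = field[-1].lower() if field[-1].isalpha() else ""
    number = field[:-1] if suffix else field
    return float(number) * _UNIT_KIB[suffix]


def rss_by_command(top_lines):
    """Sum the RES column per command name; non-process lines are skipped."""
    totals = defaultdict(float)
    for line in top_lines:
        cols = line.split(None, 11)
        # Process rows have 12 columns and start with a numeric PID.
        if len(cols) < 12 or not cols[0].isdigit():
            continue
        totals[cols[11].strip()] += res_to_kib(cols[5])
    return dict(totals)


sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "3204203 root 20 0 174.0g 168.3g 284 R 99.3 11.1 504:46.41 lsof",
    "3227095 root 20 0 157.3g 153.0g 312 R 99.3 10.1 447:42.66 lsof",
    "4816 root 20 0 5922180 17728 0 S 2.0 0.0 36:54.13 containerd",
]
for command, kib in sorted(rss_by_command(sample).items(), key=lambda kv: -kv[1]):
    print(f"{command}: {kib / 1024**2:.1f} GiB")
```

run over the full listing above, the five lsof processes alone add up to about 795g resident, on a box with roughly 1.5T of memory.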
this is weird and should be investigated, because we're basically killing that box with our own monitoring.
/cc @jnewsome