meronense database server crashes on I/O errors (#32692) · Issues · The Tor Project / TPA / TPA team

meronense database server crashes on I/O errors

meronense crashed today. local time (UTC-5) ``` 08:35:18 <nsa> tor-nagios: [meronense] system - filesystem check is WARNING: /dev/mapper/vg_meronense-srv (/home) has 2 errors 09:03:53 <nsa> tor-nagios: [meronense] process - postgresql96 - master is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (postgres), command name postgres, args /usr/lib/postgresql/9.6/bin/postgres 09:08:33 <nsa> tor-nagios: [meronense] system - all services running is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds. 09:22:43 <nsa> tor-nagios: [meronense] system - all services running is WARNING: degraded ``` i restarted psql when i noticed: ``` 10:11:50 <anarcat> restarted postgres on meronense, unknown reason crash ``` but it crashed again. the errors actually started here (UTC), in /var/log/postgresql/postgresql-9.6-main.log: ``` 2019-12-07 12:09:30 GMT WARNING: corrupted statistics file "pg_stat_tmp/global.stat" [repeats until] 2019-12-07 12:44:28 GMT WARNING: corrupted statistics file "pg_stat_tmp/global.stat" 2019-12-07 12:44:28 GMT LOG: using stale statistics instead of current ones because stats collector is not responding 2019-12-07 12:44:28 GMT WARNING: corrupted statistics file "pg_stat_tmp/global.stat" 2019-12-07 12:47:26 GMT PANIC: could not fsync file "base/16401/17372": Input/output error 2019-12-07 12:47:26 GMT LOG: checkpointer process (PID 1284) was terminated by signal 6: Aborted 2019-12-07 12:47:26 GMT LOG: terminating any other active server processes 2019-12-07 12:47:26 GMT WARNING: terminating connection because of crash of another server process 2019-12-07 12:47:26 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2019-12-07 12:47:26 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2019-12-07 12:47:26 GMT LOG: archiver process (PID 1288) exited with exit code 1 2019-12-07 12:47:26 GMT WARNING: terminating connection because of crash of another server process 2019-12-07 12:47:26 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2019-12-07 12:47:26 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command. 2019-12-07 12:47:26 GMT CONTEXT: SQL statement "UPDATE bwhist SET read = array_merge(read, insert_read), written = array_merge(written, insert_written), dirread = array_merge(dirread, insert_dirread), dirwritten = array_merge(dirwritten, insert_dirwritten) WHERE fingerprint = insert_fingerprint AND date = insert_date" PL/pgSQL function insert_bwhist(character,date,bigint[],bigint[],bigint[],bigint[]) line 12 at SQL statement 2019-12-07 12:47:36 GMT LOG: statistics collector process (PID 1289) was terminated by signal 9: Killed 2019-12-07 12:47:36 GMT LOG: all server processes terminated; reinitializing 2019-12-07 12:47:58 GMT LOG: database system was interrupted; last known up at 2019-12-07 11:02:32 GMT 2019-12-07 12:48:09 GMT LOG: could not fsync file "./base/16401/17372_fsm": Input/output error 2019-12-07 12:48:09 GMT LOG: could not fsync file "./base/16401/17372_vm": Input/output error 2019-12-07 12:48:09 GMT LOG: could not fsync file "./base/16401/17488": Input/output error 2019-12-07 12:48:17 GMT LOG: could not fsync file "./pg_stat_tmp/global.stat": Input/output error 2019-12-07 12:48:17 GMT LOG: database system was not properly shut down; automatic recovery in progress 2019-12-07 12:48:17 GMT LOG: redo starts at 3D7/57E470D8 2019-12-07 12:48:17 GMT LOG: redo done at 3D7/5BB04860 2019-12-07 13:49:37 GMT PANIC: could not fsync file "base/16401/17488": Input/output error 2019-12-07 13:49:37 GMT LOG: startup process (PID 20894) was terminated by signal 6: Aborted 2019-12-07 13:49:37 GMT LOG: aborting startup due to startup process failure 2019-12-07 13:49:37 GMT LOG: database system is shut down ``` here's my attempt at restarting it: ``` 2019-12-07 15:11:33 GMT FATAL: the database system is starting up 2019-12-07 15:11:33 GMT FATAL: the database system is starting up 2019-12-07 15:11:34 GMT LOG: could not stat file "./pg_stat_tmp/global.stat": Structure needs cleaning 2019-12-07 15:11:34 GMT LOG: database system was not properly shut down; automatic recovery in progress 2019-12-07 15:11:34 GMT FATAL: the database system is starting up 2019-12-07 15:11:34 GMT FATAL: the database system is starting up 2019-12-07 15:11:35 GMT FATAL: the database system is starting up 2019-12-07 15:11:35 GMT FATAL: the database system is starting up 2019-12-07 15:11:36 GMT FATAL: the database system is starting up 2019-12-07 15:11:36 GMT LOG: redo starts at 3D7/57E470D8 2019-12-07 15:11:36 GMT FATAL: the database system is starting up 2019-12-07 15:11:37 GMT FATAL: the database system is starting up 2019-12-07 15:11:37 GMT FATAL: the database system is starting up 2019-12-07 15:11:38 GMT FATAL: the database system is starting up 2019-12-07 15:11:38 GMT LOG: incomplete startup packet 2019-12-07 15:11:46 GMT LOG: redo done at 3D7/5BB04860 2019-12-07 15:48:19 GMT PANIC: could not fsync file "base/16401/17488": Input/output error 2019-12-07 15:48:19 GMT LOG: startup process (PID 27916) was terminated by signal 6: Aborted 2019-12-07 15:48:19 GMT LOG: aborting startup due to startup process failure 2019-12-07 15:48:19 GMT LOG: database system is shut down ``` then i shutdown the host from kvm4, and restarted it. then psql was able to recover: ``` 2019-12-07 16:54:17 GMT LOG: incomplete startup packet 2019-12-07 16:54:17 GMT LOG: database system shutdown was interrupted; last known up at 2019-12-07 15:26:47 GMT 2019-12-07 16:54:18 GMT FATAL: the database system is starting up 2019-12-07 16:54:18 GMT FATAL: the database system is starting up 2019-12-07 16:54:18 GMT LOG: database system was not properly shut down; automatic recovery in progress 2019-12-07 16:54:18 GMT LOG: redo starts at 3D7/57E470D8 2019-12-07 16:54:19 GMT FATAL: the database system is starting up 2019-12-07 16:54:19 GMT FATAL: the database system is starting up 2019-12-07 16:54:20 GMT FATAL: the database system is starting up 2019-12-07 16:54:20 GMT FATAL: the database system is starting up 2019-12-07 16:54:21 GMT FATAL: the database system is starting up 2019-12-07 16:54:21 GMT FATAL: the database system is starting up 2019-12-07 16:54:22 GMT FATAL: the database system is starting up 2019-12-07 16:54:22 GMT FATAL: the database system is starting up 2019-12-07 16:54:23 GMT FATAL: the database system is starting up 2019-12-07 16:54:23 GMT LOG: incomplete startup packet 2019-12-07 16:54:23 GMT LOG: redo done at 3D7/5BB04860 2019-12-07 16:57:28 GMT LOG: MultiXact member wraparound protections are now enabled 2019-12-07 16:57:28 GMT LOG: autovacuum launcher started 2019-12-07 16:57:28 GMT LOG: database system is ready to accept connections ``` disk arrays are fine according to mdadm on kvm4,the parent host: ``` root@kvm4:~# cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md2 : active raid1 sda[2] sdb[3] 3906887488 blocks super 1.2 [2/2] [UU] bitmap: 12/30 pages [48KB], 65536KB chunk md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0] 479002944 blocks super 1.2 [2/2] [UU] bitmap: 3/4 pages [12KB], 65536KB chunk md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0] 20955136 blocks super 1.2 [2/2] [UU] unused devices: <none> ``` this is the second I see this happen. not sure what's going on. i need to run now, would love another pair of eyes on this.

issue