I think I fixed a bunch of things related to the recent Debian upgrade, so the website is up and accessible again. Great. However, our Onionoo job is still crashing and I am inclined to think that I don't have the powers to fix that:
11/08/2023 02:00:15 PM : JOB : INFO : Onionoo Job Crashed - (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: password authentication failed for user "torweather"
connection to server at "localhost" (::1), port 5432 failed: FATAL: password authentication failed for user "torweather"
(Background on this error at: https://sqlalche.me/e/20/e3q8)
I was looking over the service description but it's not obvious to me right now how to debug this further...
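For the record, one way to take the application out of the equation when debugging this kind of error is to try the same connection by hand with psql. This is just a sketch mirroring what the error message reports (user torweather, localhost, port 5432), not something taken from the service docs:

psql "postgresql://torweather@localhost:5432/torweather" -c 'select 1'

If that prompts for a password and then fails the same way, the problem is on the PostgreSQL side (role, password or pg_hba.conf) rather than in the weather code.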
delete /srv/backup/bacula/recup_dir.* and /root/RECOVER* on bungei (@anarcat)
delete backups in /srv/f* and /srv/dump on weather-01 (@anarcat)
remove old psql 13 cluster on weather-01 (@anarcat)
delete /root/weather-01-rootfs-20231108T154128.dump on fsn-node-01 (@anarcat)
Future improvements
improve upgrade procedure to avoid dropping non-empty clusters (@anarcat, done in wiki-replica@199627ca but really not pretty)
ensure proper contact information for servers (we already have contact information in the service) and ensure the upgrade procedure says that server admins MUST be notified to test their services (@anarcat)
extend the backup retention period on upgrade, or just keep the old backups in the same backup rotation (@anarcat) (for now we're going to assume there's a good reason to move the old version aside and keep this setup; we matched the 21-day rotation now in wiki-replica@267a9818)
Ok the authentication issue is related to PostgreSQL. It seems the torweather database doesn't exist on the PostgreSQL 15 cluster, and the old PostgreSQL 13 cluster is not present anymore, which suggests that the cluster was upgraded to 15.
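Roughly the kind of checks that lead to this conclusion, using the standard Debian PostgreSQL tooling (not necessarily the exact commands that were run):

pg_lsclusters                               # list existing clusters: only 15/main shows up
sudo -u postgres psql -l | grep torweather  # the torweather database is absent from 15/main
ls /var/lib/postgresql/                     # no 13/ data directory left on disk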
@anarcat Did anything unusual happen with the PostgreSQL upgrade on this machine?
yes, as a matter of fact it did! i remember losing the cluster for a while, and i was about to go through the restore procedure and then i found it again. so clearly i must have messed up the psql upgrade...
okay so clearly something was deeply messed up in the postgresql upgrade procedure. the whole cluster is gone. normally, we make a backup of the old backups when we rotate the new cluster in place, and we did do so here, but that backup was cleared after a week, as per our policies.
so i don't know how else to put this, but the database is completely gone. zlit. boom. deleted. forever. there might be a way to do forensics on the server to try to recover the deleted files, but it's been so long that I doubt we'll find anything.
how can we rebuild this server from scratch? how important is the historical data here? i guess we lost the user database of people that registered? maybe we can recover some of this through the mail logs?
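To make the mail logs idea concrete: if the outgoing MTA logs are still around, something along these lines might at least recover a list of addresses weather has mailed. This is purely hypothetical: the log path and the postfix-style to=<...> format are assumptions, and it would only give us addresses, not which subscriptions they had.

zgrep -oh 'to=<[^>]*>' /var/log/mail.log* | sort -u   # hypothetical: assumes postfix-format logs are retained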
root@bungei:/srv/backups/bacula# for file in /root/RECOVERDIR/recup_dir.1/*.gz recup_dir.*/*gz; do tar -O -x -z -f $file backup_label 2>/dev/null | grep weather && ls -alh $file; done
LABEL: bungei.torproject.org-20221222-005448-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root  13G Dec 22  2022 recup_dir.10/f1191248128.gz
LABEL: bungei.torproject.org-20231008-005104-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root 3.5G Oct  8 00:51 recup_dir.18/f1959051264.gz
LABEL: bungei.torproject.org-20231027-062310-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root 3.1G Oct 27 06:23 recup_dir.20/f3005349888.gz
... but that older copy has more potential, since it's larger... i am still confused by this because I don't see a drop in the disk usage in grafana when i supposedly destroyed that database... that should have been visible...
in #41252 (closed), I "marked the checklist item weather-01.torproject.org (@anarcat) as completed 31 Oct 2023, 16:05", which means some time on Oct 31st, i cleared out the old database. but looking at the last 14 days, i see no such drop:
PhotoRec 7.1, Data Recovery Utility, July 2019
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org

Disk /dev/mapper/vg_bulk-backups--pg - 3298 GB / 3072 GiB (RO)
     Partition                  Start        End    Size in sectors
 P ext4                     0   0  1 2147483647   0  1 6442450944

24827 files saved in /srv/backups/bacula/recup_dir directory.
Recovery completed.

You are welcome to donate to support and encourage further development
https://www.cgsecurity.org/wiki/Donation
i think it's now safe to remount the pg partition on bungei, as i don't think we can recover anything else from there, agreed @lavamind?
root@bungei:/srv/backups/bacula# for file in /root/RECOVERDIR/recup_dir.1/*.gz recup_dir.*/*gz; do tar -O -x -z -f $file backup_label 2>/dev/null | grep weather && ls -alh $file; done
LABEL: bungei.torproject.org-20221222-005448-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root  13G Dec 22  2022 recup_dir.10/f1191248128.gz
LABEL: bungei.torproject.org-20231008-005104-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root 3.5G Oct  8 00:51 recup_dir.18/f1959051264.gz
LABEL: bungei.torproject.org-20231027-062310-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root 3.1G Oct 27 06:23 recup_dir.20/f3005349888.gz
LABEL: bungei.torproject.org-20231017-095114-weather-01.torproject.org-main-13-backup
-rw-r--r-- 1 root root 1.2G Oct 17 09:51 recup_dir.35/f5669519360.gz
needless to say i'm not even attempting to recover the WAL files here. i think we'll be happy with losing a few days of data: it would be pretty hard to find the WAL files and order them properly, considering how important their file names are for that.
i believe i have restored the latter dump with relative fidelity. it's possible that deleted records might reappear or that committed records were lost, but it beats having absolutely fuckall:
root@weather-01:/srv/dump# sudo -u postgres psql -p 5433
psql (15.3 (Debian 15.3-0+deb12u1), server 13.11 (Debian 13.11-0+deb11u1))
Type "help" for help.

postgres=# \c torweather
psql (15.3 (Debian 15.3-0+deb12u1), server 13.11 (Debian 13.11-0+deb11u1))
You are now connected to database "torweather" as user "postgres".
torweather=# \d
                      List of relations
 Schema |            Name             |   Type   |   Owner
--------+-----------------------------+----------+------------
 public | admin                       | table    | torweather
 public | admin_id_seq                | sequence | torweather
 public | node_bandwidth_sub          | table    | torweather
 public | node_bandwidth_sub_id_seq   | sequence | torweather
 public | node_down_sub               | table    | torweather
 public | node_down_sub_id_seq        | sequence | torweather
 public | node_flag_exit_sub          | table    | torweather
 public | node_flag_exit_sub_id_seq   | sequence | torweather
 public | node_flag_fast_sub          | table    | torweather
 public | node_flag_fast_sub_id_seq   | sequence | torweather
 public | node_flag_guard_sub         | table    | torweather
 public | node_flag_guard_sub_id_seq  | sequence | torweather
 public | node_flag_stable_sub        | table    | torweather
 public | node_flag_stable_sub_id_seq | sequence | torweather
 public | node_flag_valid_sub         | table    | torweather
 public | node_flag_valid_sub_id_seq  | sequence | torweather
 public | node_version_sub            | table    | torweather
 public | node_version_sub_id_seq     | sequence | torweather
 public | relay                       | table    | torweather
 public | relay_id_seq                | sequence | torweather
 public | subscriber                  | table    | torweather
 public | subscriber_id_seq           | sequence | torweather
 public | subscription                | table    | torweather
 public | subscription_id_seq         | sequence | torweather
(24 rows)

torweather=# select count(*) from subscriber;
 count
-------
   185
(1 row)

torweather=# select count(*) from subscription;
 count
-------
  7670
(1 row)
i have done a timeline analysis of this incident, as part of a post-mortem. interesting takeaways:
this issue was opened about 13 hours after the old backups were cleared. had it been filed a day before, recovery would have been much easier, without data loss. obviously, @anarcat should have notified the service admin to test the service after the upgrade as well. not having clear contact information on servers is partly to blame (I wasn't sure who the service admin was)
Nagios did notice something wrong with the service (a 503 error), but this was somehow fixed by @lavamind even though the actual service was still down. Our monitoring of this service is therefore inadequate and we need to figure out a better approach
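As a stopgap until proper monitoring exists, even a crude end-to-end probe might have caught this: a 200 from the frontend is not enough, the response needs to actually come from a working application. A sketch only, assuming the service lives at weather.torproject.org and that "subscribe" appears on a healthy page:

curl -fsS https://weather.torproject.org/ | grep -qi subscribe || echo "weather is up but not serving the app" >&2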
i've also added a checklist to the issue summary, for next steps. the next step now is for me to review the upgrade log to see what exactly I did wrong in the upgrade procedure, so that this never happens again.
@anarcat should have notified the service admin to test the service after the upgrade as well. not having clear contact information on servers is partly to blame (I wasn't sure who the service admin was)
tor-weather is maintained by the network-health team.
That boils down to pinging either me or hiro as a starting point (although, arguably, the link in that document might better point to https://gitlab.torproject.org/tpo/network-health/team). But if you need more than just a general pointer to network health, feel free to mention particular people directly in the doc (even though I think a pointer to the team, as we have it right now, is the smarter idea).
I think one problem is we don't have that "in-band", on the server. Like
this could be part of the motd for example, or in LDAP. It's not
always obvious which "service" is bound to the server as well... in this
case, I should really just have looked in the wiki, but I don't really
feel the wiki is actually the best place to document this. It's just a
temporary solution.
So no blame here but my own: simply put, we need better inventory
mechanisms (#30273).
step 8 actually purged the postgresql-13 package...
... but the package prompted as to whether or not we wanted to remove the data, and I responded no, so that's not where the data was lost (although it could have been there)
after step 8, the PostgreSQL upgrade procedure was followed with the following command:
root@weather-01:~# export LC_ALL=C.UTF-8 &&
    printf "about to stop cluster main on postgresql-13, press enter to continue" &&
    read _ &&
    pg_dropcluster --stop 13 main &&
    pg_upgradecluster -m upgrade -k 13 main &&
    for cluster in `ls /etc/postgresql/13/`; do
        mv /etc/postgresql/13/$cluster/conf.d/* /etc/postgresql/15/$cluster/conf.d/
    done
about to stop cluster main on postgresql-13, press enter to continue
That chain of commands failed with the error:
Error: specified cluster does not exist
The problem there is that I thought it was the pg_dropcluster command that failed, but that wasn't it! The failure was:
root@weather-01:~# pg_upgradecluster -m upgrade -k 13 main
Error: specified cluster does not exist
And at this point, the 13 cluster was gone. This was a critical error in the upgrade procedure: the dropcluster command should have targeted the new cluster, not the old one. An improvement to the procedure might be to make sure the new cluster is actually "empty", somehow.
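One way such a guard could look, as a rough sketch: the 15/main names are just this incident's values, and "no user databases" is only one possible definition of "empty".

NEW=15/main   # the freshly-installed cluster we intend to throw away
n=$(sudo -u postgres psql --cluster "$NEW" -At \
      -c "select count(*) from pg_database where datname not in ('postgres', 'template0', 'template1')")
if [ "$n" = 0 ]; then
    pg_dropcluster --stop 15 main
else
    echo "refusing to drop $NEW: it still contains $n user database(s)" >&2
    exit 1
fi

That still relies on the operator typing the right version twice, so automating the whole thing (or deriving both versions from pg_lsclusters) remains the better fix.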
I even noticed the database being destroyed, according to the log (ls /var/lib/postgresql only showed the 15 cluster and psql only showed an empty database). At that point, it seems I started recovering the old database by installing the postgresql-13 package. But somehow, during the restore procedure, I got confused and thought the database had already been restored.
So instead of completing the recovery from the backup server, I went back to the upgrade procedure to upgrade the (now empty) 13 cluster to 15. Part of the confusion was due to this back-and-forth between the restore and the upgrade procedure. I got stuck on trying to make the new cluster talk with the backup server, for example, which was very distracting.
Having all PostgreSQL servers properly managed by Puppet (as opposed to having to manually update pg_hba.conf) would improve that situation as well.
So that's it! The root cause of the problem was a faulty upgrade procedure that destroyed the live, production database cluster, compounded by a lack of proper monitoring, a lack of communication around the upgrade, and a lack of automation of the database servers.
Recommendations:
fix the upgrade procedure to ensure the pg_dropcluster command is only run on an empty cluster (alternatively: automate the procedure so that it's hard or impossible to confuse the production (13 in this case) and target (15) clusters)
properly monitor the Weather service, possibly through prometheus and the prometheus-alerts repository
properly communicate the completion of upgrades to service admins
i documented the data recovery procedure in howto/backup, but it goes a bit beyond just "backups" as it delves into the hairy postgresql stuff... i guess that's good enough for now, phew.
i just realized another thing: a key problem here is the backups were destroyed after 7 days. that's a tad short: database backups are normally kept for four weeks, if my memory is correct, so maybe we could change the procedure to do that.
in fact, i wonder why we move that directory around in the first place. shouldn't we just keep it in the same directory and let the normal backup rotation operate?
something to review anyway, adding to the checklist here.
i opened status-site#32 (closed) to propose to remove the staging as a consequence of this and also other similar situations, rationale detailed there.