Launching Tor Weather breaks with password authentication error
I think I fixed a bunch of things related to the recent Debian upgrade, so the website is up and accessible again. Great. However, our Onionoo job is still crashing, and I suspect I don't have the permissions needed to fix that:
11/08/2023 02:00:15 PM : JOB : INFO : Onionoo Job Crashed - (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: password authentication failed for user "torweather"
connection to server at "localhost" (::1), port 5432 failed: FATAL: password authentication failed for user "torweather"
(Background on this error at: https://sqlalche.me/e/20/e3q8)
I was looking over the service description but it's not obvious to me right now how to debug this further...
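One way to narrow this down might be to retry the login by hand, outside the job, against both loopback addresses. The sketch below is only an illustration: the helper name `try_login`, the database name, and the way the password is supplied are assumptions and are not taken from the Tor Weather configuration. It relies on `psycopg2.connect` accepting libpq keyword parameters and raising `psycopg2.OperationalError` on a failed login, the same exception the job reports.

```python
def try_login(host, port, user, password, dbname, timeout=3):
    """Attempt a single PostgreSQL login and return (succeeded, detail).

    Hypothetical helper for manual debugging; not part of Tor Weather.
    """
    try:
        # Imported lazily so that a missing driver is reported, not raised.
        import psycopg2
    except ImportError:
        return False, "psycopg2 is not installed"

    try:
        conn = psycopg2.connect(
            host=host,
            port=port,
            user=user,
            password=password,
            dbname=dbname,  # assumption: the real database name may differ
            connect_timeout=timeout,
        )
    except psycopg2.OperationalError as exc:
        # Same exception class as in the job log above.
        return False, str(exc).strip() or "connection failed"

    conn.close()
    return True, "login succeeded"
```

Calling it once with host `::1` and once with `127.0.0.1` (port 5432, user `torweather`, and the password the service is configured with) would show whether both loopback addresses reject the credentials. PostgreSQL's `pg_hba.conf` can carry separate rules for IPv4 and IPv6 loopback, so a difference between the two would point at the server-side configuration rather than at the job. This does not by itself identify the cause.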
/cc @sarthikg
Emergency checklist
- status site update (@gk, status-site!39 (merged))
- check if we can recover from backups (@anarcat yes, barely)
- recover PostgreSQL 15 cluster from backups (@anarcat)
- confirm service works properly (tested by @anarcat, @trinity-1686a, would like confirmation from @sarthikg and @gk)
Cleanup work
- post-mortem (@anarcat)
- mark incident as resolved in status site (@anarcat, see status-site!40 (merged))
- delete /srv/backup/bacula/recup_dir.* and /root/RECOVER* on bungei (@anarcat)
- delete backups in /srv/f* and /srv/dump on weather-01 (@anarcat)
- remove old psql 13 cluster on weather-01 (@anarcat)
- delete /root/weather-01-rootfs-20231108T154128.dump on fsn-node-01 (@anarcat)
Future improvements
- improve upgrade procedure to avoid dropping non-empty clusters (@anarcat, done in wiki-replica@199627ca but really not pretty)
- ensure proper contact information for servers (we already have contact information in the service) and that the upgrade procedure MUST notify server admins to test their services (@anarcat)
- puppetize all PostgreSQL servers (@lavamind, moved to issue #41401 (closed))
- properly monitor tor-weather service (@gk, moved to issue tpo/network-health/metrics/monitoring-and-alerting#24)
- extend backup retention period on upgrade, or just keep in the same backup rotation (@anarcat) (for now we're going to assume there's a good reason to move the old version aside and keep this setup, we matched the 21 days rotation now in wiki-replica@267a9818)
- remove staging component from status site (@lavamind, status-site#32 (closed))
Edited by anarcat