The Tor Project / TPA / TPA team / Issues / #41388

Launching Tor Weather breaks with password authentication error

I think I fixed a number of problems caused by the recent Debian upgrade, so the website is up and accessible again. Great. However, our Onionoo job is still crashing, and I suspect I don't have the permissions to fix that:

11/08/2023 02:00:15 PM : JOB : INFO : Onionoo Job Crashed - (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL:  password authentication failed for user "torweather"
connection to server at "localhost" (::1), port 5432 failed: FATAL:  password authentication failed for user "torweather"

(Background on this error at: https://sqlalche.me/e/20/e3q8)

I was looking over the service description but it's not obvious to me right now how to debug this further...
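
As a possible first debugging step, the failure text in the log line above can be picked apart to tell whether the server was reachable at all. The sketch below is only an illustration: `classify_pg_failure` is a hypothetical helper, not part of Tor Weather, and the regular expression is written against the exact message quoted in this issue. A "password authentication failed" reply means PostgreSQL answered on that host and port, so the problem is more likely the role or its password than the network. The same client-side message may also appear when the role does not exist on the server, which would be worth checking in the server logs.

```python
import re
from typing import Optional

# Matches the libpq failure text as quoted in the job log in this issue.
_CONN_FAILURE = re.compile(
    r'connection to server at "(?P<host>[^"]+)" \((?P<addr>[^)]+)\), '
    r'port (?P<port>\d+) failed: (?P<severity>[A-Z]+):\s+(?P<reason>.+)'
)
_USER = re.compile(r'for user "([^"]+)"')


def classify_pg_failure(message: str) -> Optional[dict]:
    """Extract connection target and failure kind from a log message.

    Hypothetical helper: returns None when the message does not look
    like the connection failure quoted in this issue.
    """
    match = _CONN_FAILURE.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["port"] = int(info["port"])

    reason = info["reason"]
    user = _USER.search(reason)
    info["user"] = user.group(1) if user else None

    if "password authentication failed" in reason:
        # The server replied, so it is up and listening on this address.
        info["kind"] = "auth"
    elif "no pg_hba.conf entry" in reason:
        # The server replied but refuses this client/user/database combination.
        info["kind"] = "hba"
    else:
        info["kind"] = "other"
    return info


if __name__ == "__main__":
    line = (
        '11/08/2023 02:00:15 PM : JOB : INFO : Onionoo Job Crashed - '
        '(psycopg2.OperationalError) connection to server at "localhost" (::1), '
        'port 5432 failed: FATAL:  password authentication failed for user "torweather"'
    )
    print(classify_pg_failure(line))
```

If the result is of kind `auth`, a reasonable next check is whether the `torweather` role still exists in the cluster that is now listening on port 5432.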

/cc @sarthikg

Emergency checklist

  • status site update (@gk, status-site!39 (merged))
  • check if we can recover from backups (@anarcat: yes, barely)
  • recover PostgreSQL 15 cluster from backups (@anarcat)
  • confirm service works properly (tested by @anarcat, @trinity-1686a, would like confirmation from @sarthikg and @gk)

Cleanup work

  • post-mortem (@anarcat)
    • timeline analysis (@anarcat)
    • document data recovery procedures (@anarcat)
    • analyze upgrade logs to figure out what happened and how to prevent it (@anarcat)
    • recommend (monitoring?) improvements to prevent this from happening again (@anarcat)
  • mark incident as resolved in status site (@anarcat, see status-site!40 (merged))
  • delete /srv/backup/bacula/recup_dir.* and /root/RECOVER* on bungei (@anarcat)
  • delete backups in /srv/f* and /srv/dump on weather-01 (@anarcat)
  • remove old psql 13 cluster on weather-01 (@anarcat)
  • delete /root/weather-01-rootfs-20231108T154128.dump on fsn-node-01 (@anarcat)
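
For the monitoring item in the post-mortem list above, one observation from the log excerpt in this issue is that the crash was logged at `INFO` level, so an alert keyed only on the log level would probably not have fired. The following is a minimal sketch of a text-based check instead. It assumes the log format shown in this issue (`timestamp : component : level : message`, with month-first dates) and matches on the phrase "Job Crashed". The function name `find_crashes` is hypothetical, and this is not the monitoring that was eventually deployed.

```python
from datetime import datetime

# Timestamp layout assumed from the log line quoted in this issue,
# e.g. "11/08/2023 02:00:15 PM" (month/day/year, 12-hour clock).
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def find_crashes(lines):
    """Return (timestamp, component, message) for each crashed-job line.

    Lines that do not follow the assumed " : "-separated layout are skipped.
    """
    crashes = []
    for line in lines:
        parts = line.split(" : ", 3)
        if len(parts) != 4:
            continue
        stamp, component, _level, message = parts
        # Match on the message text, not the level: the crash is logged as INFO.
        if "Job Crashed" not in message:
            continue
        try:
            when = datetime.strptime(stamp.strip(), LOG_TIME_FORMAT)
        except ValueError:
            continue
        crashes.append((when, component.strip(), message.strip()))
    return crashes


if __name__ == "__main__":
    sample = [
        '11/08/2023 02:00:15 PM : JOB : INFO : Onionoo Job Crashed - '
        '(psycopg2.OperationalError) connection to server at "localhost" (::1), '
        'port 5432 failed: FATAL:  password authentication failed for user "torweather"',
    ]
    for when, component, message in find_crashes(sample):
        print(when.isoformat(), component, message[:40])
```

A check like this could run periodically and alert when the most recent job entry is a crash. Where the log lives and how alerts are delivered are left open here.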

Future improvements

  • improve upgrade procedure to avoid dropping non-empty clusters (@anarcat, done in wiki-replica@199627ca but really not pretty)
  • ensure servers have proper contact information (we already have contact information in the service) and that the upgrade procedure MUST notify server admins so they can test their services (@anarcat)
  • puppetize all PostgreSQL servers (@lavamind, moved to issue #41401)
  • properly monitor tor-weather service (@gk, moved to issue tpo/network-health/metrics/monitoring-and-alerting#24)
  • extend backup retention period on upgrade, or just keep the old version in the same backup rotation (@anarcat) (for now we assume there's a good reason to move the old version aside and keep this setup; the retention now matches the 21-day rotation as of wiki-replica@267a9818)
  • remove staging component from status site (@lavamind, status-site#32 (closed))
Edited Nov 14, 2023 by anarcat