prepare for the break

make sure we'll survive the all hands break without too many interruptions.

this issue collects the issues (filed or not yet filed) that we're worried about for the break.

  • prometheus1's disk is close to being full (#42219 - closed) (declared safe for the break in #42219 (comment 3218635), dashboard)
  • https://gitlab.torproject.org/tpo/tpa/team/-/issues/42152+ (performance relatively acceptable, filed prometheus-alerts!72 (merged) to raise latency tolerance in monitoring, latency dashboard, cpu dashboard)
  • lists-01 performance issues (OOM, latency) (#41957 - closed) (performance relatively acceptable, filed prometheus-alerts!72 (merged) to raise latency tolerance in monitoring, dashboard)
  • internal network saturation in gnt-dal cluster (#42174 - closed) (switched instance to plain mode, network dashboard, VM IO dashboard, per day write graph)
  • https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/172+ (@mattlav found good mitigations, dashboard)
  • https://gitlab.torproject.org/tpo/web/support/-/issues/399+ (deployed, waiting for confirmation from submitter)
  • NVMe RAID disk failure on dragon.tails.net (tails-sysadmin#18215 - closed) (can wait 10 days, according to @zen)
  • assess new trixie kernel (minor point update, CVEs checked and not critical for us)
  • review alerts sent in the past week, silence or fix
  • review alerts sent in the past month, silence or fix
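
for the two alert review items above, here is a minimal sketch of how the alert history could be pulled from Prometheus instead of scrolling through notifications. it relies on the synthetic `ALERTS` time series that Prometheus records for pending and firing alerts, and on the standard `/api/v1/query` HTTP endpoint. the base URL and all function names are assumptions made up for illustration, not something taken from our setup.

```python
from urllib.parse import urlencode

# assumption: hypothetical base URL, replace with the real Prometheus server
PROMETHEUS_URL = "https://prometheus.example.org"


def alert_history_query(window: str) -> str:
    """Build a PromQL query counting firing-alert samples per alert name.

    count_over_time counts samples, so the result is proportional to how
    long each alert was firing, not to the number of notifications sent.
    """
    return (
        'sort_desc(sum by (alertname) '
        f'(count_over_time(ALERTS{{alertstate="firing"}}[{window}])))'
    )


def query_url(window: str, base_url: str = PROMETHEUS_URL) -> str:
    """URL for an instant query against the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urlencode(
        {"query": alert_history_query(window)}
    )


def summarize(response: dict) -> list[tuple[str, int]]:
    """Turn an instant-query vector response into (alertname, samples) pairs."""
    rows = []
    for series in response["data"]["result"]:
        name = series["metric"].get("alertname", "<unknown>")
        # instant vector values are [timestamp, "value-as-string"]
        rows.append((name, int(float(series["value"][1]))))
    # noisiest alerts first
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

`query_url("7d")` covers the past week and `query_url("30d")` the past month; fetching that URL and passing the decoded JSON to `summarize` gives a list with the noisiest alerts first, which are the candidates to silence or fix.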

this ticket should have been created a week ago, but alas...

current status:

  • some alerts are still present (but silenced) in Karma, namely:
    • NeedsReboot on trixie: new kernel, minor, should be done on return
    • ObsoletePackages on trixie: left over kernel packages
    • rdsys-staging reachability issues (still in deployment, #41769 (closed))
  • otherwise ready for the break as of 2025-06-27T15:42:21-04:00
Edited Jun 28, 2025 by anarcat