prepare for the break

make sure we survive the all-hands break without too many interruptions.

this issue collects the issues (filed or not) that we're worried about for the break:

- prometheus1's disk is close to being full (#42219 - closed) (declared safe for the break in #42219 (comment 3218635), dashboard)
- https://gitlab.torproject.org/tpo/tpa/team/-/issues/42152+ (performance relatively acceptable, filed prometheus-alerts!72 (merged) to raise latency tolerance in monitoring, latency dashboard, cpu dashboard)
- lists-01 performance issues (OOM, latency) (#41957 - closed) (performance relatively acceptable, filed prometheus-alerts!72 (merged) to raise latency tolerance in monitoring, dashboard)
- internal network saturation in gnt-dal cluster (#42174 - closed) (switched instance to plain mode, network dashboard, VM IO dashboard, per day write graph)
- https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/172+ (@mattlav found good mitigations, dashboard)
- https://gitlab.torproject.org/tpo/web/support/-/issues/399+ (deployed, waiting for confirmation from submitter)
- NVMe RAID disk failure on dragon.tails.net (tails-sysadmin#18215 - closed) (can wait 10 days, according to @zen)
- assess new trixie kernel (minor point update, CVEs checked and not critical for us)
- review alerts sent in the past week, silence or fix
- same for the month

this ticket should have been created a week ago, but alas...

current status:

some alerts are still present (but silenced) in Karma, namely:
- NeedsReboot on trixie: new kernel, minor, should be done on return
- ObsoletePackages on trixie: left over kernel packages
- rdsys-staging reachability issues (still in deployment, #41769 (closed))

otherwise ready for the break as of 2025-06-27T15:42:21-04:00

Edited Jun 28, 2025 by anarcat
