prepare for the break
make sure we survive the all-hands break without too many interruptions.
this issue collects the issues (filed or not yet filed) that we're worried about during the break.
- prometheus1's disk is close to being full (#42219 - closed) (declared safe for the break in #42219 (comment 3218635), dashboard)
- https://gitlab.torproject.org/tpo/tpa/team/-/issues/42152+ (performance relatively acceptable; filed prometheus-alerts!72 (merged) to raise the latency tolerance in monitoring; latency dashboard, CPU dashboard)
- lists-01 performance issues (OOM, latency) (#41957 - closed) (performance relatively acceptable; filed prometheus-alerts!72 (merged) to raise the latency tolerance in monitoring; dashboard)
- internal network saturation in the gnt-dal cluster (#42174 - closed) (switched the instance to plain mode; network dashboard, VM IO dashboard, per-day write graph)
- https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/172+ (@mattlav found good mitigations; dashboard)
- https://gitlab.torproject.org/tpo/web/support/-/issues/399+ (deployed, waiting for confirmation from the submitter)
- NVMe RAID disk failure on dragon.tails.net (tails-sysadmin#18215 - closed) (can wait 10 days, according to @zen)
- assess the new trixie kernel (minor point update; CVEs checked, none critical for us)
- review alerts sent in the past week, then silence or fix them
- same for alerts sent in the past month
this ticket should have been created a week ago, but alas...
current status:
- some alerts are still present (but silenced) in Karma, namely:
  - NeedsReboot on trixie: new kernel (minor update), reboots should be done on return
  - ObsoletePackages on trixie: leftover kernel packages
  - rdsys-staging reachability issues (still in deployment, #41769 (closed))
- otherwise ready for the break as of 2025-06-27T15:42:21-04:00
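To re-check on return which alerts are still silenced, a minimal sketch like this one could query Alertmanager (the backend Karma displays) through its v2 HTTP API and group alerts by name. It assumes Python 3.9+, an Alertmanager reachable at a hypothetical `ALERTMANAGER_URL`, and alerts that carry `alertname` and `instance` labels. Note that the endpoint only reports alerts Alertmanager currently holds, not a history of the past week.

```python
import json
import urllib.request
from collections import defaultdict

# Hypothetical Alertmanager address; adjust to the real deployment.
ALERTMANAGER_URL = "http://localhost:9093"


def fetch_alerts(base_url: str = ALERTMANAGER_URL) -> list[dict]:
    """Fetch the alerts Alertmanager currently holds via GET /api/v2/alerts."""
    with urllib.request.urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)


def summarize(alerts: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Group alerts into "silenced" and "unsilenced", then by alertname.

    An alert counts as silenced if its status lists at least one silence ID
    in "silencedBy"; everything else (including inhibited alerts) is
    "unsilenced". Each alertname maps to the sorted instances it fires for.
    """
    buckets: dict[str, defaultdict] = {
        "unsilenced": defaultdict(list),
        "silenced": defaultdict(list),
    }
    for alert in alerts:
        labels = alert.get("labels", {})
        status = alert.get("status", {})
        bucket = "silenced" if status.get("silencedBy") else "unsilenced"
        name = labels.get("alertname", "<unnamed>")
        buckets[bucket][name].append(labels.get("instance", "<no instance>"))
    # Convert to plain dicts with sorted keys and values for stable output.
    return {
        state: {name: sorted(hosts) for name, hosts in sorted(groups.items())}
        for state, groups in buckets.items()
    }
```

For example, `summarize(fetch_alerts())` would show whether NeedsReboot and ObsoletePackages are still only under `"silenced"`; keeping the network call separate from `summarize` lets the grouping logic be tested on canned JSON.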
Edited by anarcat