lists-01 performance issues (OOM, latency)
I got this email today:
Date: Thu, 26 Dec 2024 15:22:10 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: HTTPSUnreachable Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS
Total firing alerts: 1
Pager playbook: TODO
## Firing Alerts
-----
Time: 2024-12-26 15:21:40.651 +0000 UTC
Summary: Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS
Description: Unable to connect to HTTPS port or response code is not the expected code for https://lists.torproject.org/mailman3/postorius/lists/
-----
There were actually a bunch of alerts flapping all over the place there:
Day changed to 26 Dec 2024
09:42:10 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
10:22:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
10:57:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
11:36:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
11:40:46 <zen-fu> flap flap flap
11:41:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:07:10 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:16:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:31:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:38:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:43:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:58:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:03:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:24:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:39:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:57:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
14:02:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
Day changed to 27 Dec 2024
Times in the IRC log above are UTC-5.
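For the record, the amount of flapping can be quantified from Prometheus itself by counting state transitions of the probe metric. A minimal sketch, assuming the alerts are driven by the blackbox exporter's `probe_success` metric; the server URL is a placeholder and the label matcher would need to be adjusted to the real scrape config:

```python
#!/usr/bin/env python3
"""Count probe_success flaps over the last 6 hours via the Prometheus API.

Sketch only: the Prometheus URL is a placeholder, and it assumes the
HTTPSUnreachable alert is derived from the blackbox exporter's
probe_success metric.
"""

import requests

PROMETHEUS = "https://prometheus.example.org"  # placeholder, not the real host
# changes() counts how often the series value changed in the range,
# i.e. how many times the probe flipped between up (1) and down (0)
QUERY = 'changes(probe_success{instance=~".*lists.torproject.org.*"}[6h])'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("instance"), "flapped", result["value"][1], "times")
```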
@lelutin had reported that the VM was running out of memory and triggering the OOM killer: perhaps we could simply bump the memory on that box and move on, but for now it seems important to at least document this issue.
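Before throwing more memory at the box, it's worth confirming what the OOM killer was actually reaping. A minimal sketch (assuming `dmesg` is readable, which may require root, and the stock kernel "Out of memory: Killed process" message format) that tallies OOM victims by process name:

```python
#!/usr/bin/env python3
"""Scan the kernel log for OOM-killer victims and tally them by process name."""

import collections
import re
import subprocess

# -T prints human-readable timestamps instead of seconds-since-boot
log = subprocess.run(
    ["dmesg", "-T"], capture_output=True, text=True, check=True
).stdout

# Matches e.g. "Out of memory: Killed process 1234 (uwsgi) total-vm:..."
pattern = re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)")

victims = collections.Counter(m.group(2) for m in pattern.finditer(log))
for name, count in victims.most_common():
    print(f"{count:4d}  {name}")
```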
Next steps
- Enable per-process memory graphs
- Pin the exact process that was eating up memory → uwsgi
- Replace uWSGI with Gunicorn (see the config sketch after this list)
- Check if that change prevents OOMs
- Investigate python3 spikes
- Delete or revert the now-unmanaged files to their original state
- Lower the amount of memory of the VM once again
- Reroll xapian-haystack patch upstream
- Grow the disk to take into account the xapian index size (or investigate why it's so big)
- Deploy xapian-haystack patched package
- Wait a while (a week?) from 2025-02-07 to confirm OOMs are gone
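For the uWSGI → Gunicorn swap, note that a Gunicorn config file is itself Python. Here's a minimal sketch of what the replacement service could look like; the WSGI module path, bind address, and worker counts are assumptions to be adapted to the Debian mailman3-web layout, not the deployed config:

```python
# gunicorn.conf.py -- hedged sketch, not the deployed config.
# The module path below is the upstream mailman-web entry point; Debian's
# mailman3-web package may ship its wsgi.py elsewhere.
wsgi_app = "mailman_web.wsgi:application"

# Bind locally; the front-end web server keeps terminating TLS.
bind = "127.0.0.1:8000"

# A small fixed pool keeps the memory ceiling predictable on a small VM.
workers = 2
threads = 4

# Recycle workers after a bounded number of requests, with jitter so they
# do not all restart at once; this caps the damage from slow memory leaks,
# the suspected cause of the OOMs.
max_requests = 1000
max_requests_jitter = 50

# Reap workers that hang; 60s is an arbitrary starting point.
timeout = 60
```

Such a config would be picked up with `gunicorn -c gunicorn.conf.py`, and the worker-recycling knobs are the main reason to expect a better memory ceiling than the current uWSGI setup.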
Dashboards
- node exporter
- disk usage
- memory usage (includes OOM graphs)
- per-process memory
- reachability
External issues
- hyperkitty crash #408
- xapian-haystack PR #238, to fix hyperkitty crash
- debian mailman3-web bug #1014037 about OOMs
- hyperkitty xapian disk usage issue #533
Docs