lists-01 performance issues (OOM, latency)

I got this email today:

Date: Thu, 26 Dec 2024 15:22:10 +0000
From: alertmanager@hetzner-nbg1-01.torproject.org
To: root@localhost
Subject: HTTPSUnreachable Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS

Total firing alerts: 1

Pager playbook: TODO

## Firing Alerts

-----
Time: 2024-12-26 15:21:40.651 +0000 UTC
Summary:  Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS 
Description:  Unable to connect to HTTPS port or response code is not the expected code for https://lists.torproject.org/mailman3/postorius/lists/ 

-----

There were actually a bunch of alerts flapping all over the place there:

Day changed to 26 Dec 2024
09:42:10 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
10:22:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
10:57:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
11:36:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
11:40:46 <zen-fu> flap flap flap
11:41:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:07:10 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:16:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:31:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
12:38:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:43:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
12:58:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:03:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:24:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:39:55 -ALERTOR1:#tor-alerts- HTTPSResponseDelayExceeded [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is too slow.
13:57:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
14:02:10 -ALERTOR1:#tor-alerts- HTTPSUnreachable [resolved] Website https://lists.torproject.org/mailman3/postorius/lists/ is unreachable via HTTPS CRITICAL!
Day changed to 27 Dec 2024

Times in the IRC log above are UTC-5 (the email's Date header is UTC).
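To quantify the flapping, here is a quick sketch that pairs each `[firing]` with the next `[resolved]` for the same alert; the timestamps are transcribed by hand from the HTTPSUnreachable lines in the log above:

```python
from datetime import datetime, timedelta

# (alert, state, time) tuples transcribed from the IRC log above
events = [
    ("HTTPSUnreachable", "firing",   "10:22:10"),
    ("HTTPSUnreachable", "resolved", "10:57:10"),
    ("HTTPSUnreachable", "firing",   "11:36:10"),
    ("HTTPSUnreachable", "resolved", "11:41:10"),
    ("HTTPSUnreachable", "firing",   "12:16:10"),
    ("HTTPSUnreachable", "resolved", "12:31:10"),
    ("HTTPSUnreachable", "firing",   "13:57:10"),
    ("HTTPSUnreachable", "resolved", "14:02:10"),
]

def flap_durations(events):
    """Pair each 'firing' event with the next 'resolved' for the same alert."""
    open_at = {}
    for alert, state, ts in events:
        t = datetime.strptime(ts, "%H:%M:%S")
        if state == "firing":
            open_at[alert] = t
        elif alert in open_at:
            yield alert, t - open_at.pop(alert)

for alert, duration in flap_durations(events):
    print(alert, duration)  # four outages: 35, 5, 15 and 5 minutes
```

So the site was down four separate times in under four hours, never for long enough to look permanently broken, which is exactly what makes flapping alerts annoying to triage.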

@lelutin had reported that the VM was hitting OOM (out-of-memory) kills: perhaps we could simply bump the memory on that box and move on, but for now it seems important to at least document this issue.
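To confirm the kernel OOM killer actually fired, and which process it picked, the kernel log is the place to look. A minimal sketch, assuming systemd-journald on the VM; the sample line below is illustrative, not an actual excerpt from the box:

```shell
# On the lists VM, the incident window would be checked with:
#   journalctl -k --since 2024-12-26 --until 2024-12-27 | grep -iE 'out of memory|oom-kill'
# The same filter, demonstrated on an illustrative kernel log line:
sample='Dec 26 15:20:01 lists kernel: Out of memory: Killed process 1234 (uwsgi)'
echo "$sample" | grep -ioE 'killed process [0-9]+ \([^)]+\)'
# prints: Killed process 1234 (uwsgi)
```

The process name in parentheses is what points the finger at uwsgi (or python3) rather than, say, the MTA.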

Next steps

  • Enable per-process memory graphs
  • Pin the exact process that was eating up memory → uwsgi
  • Replace uWSGI with Gunicorn
  • Check if that change prevents OOMs
  • Investigate python3 spikes
  • Delete or revert the now-unmanaged files to their original state
  • Lower the amount of memory of the VM once again
  • Reroll the xapian-haystack patch upstream
  • Grow the disk to take into account the xapian index size (or investigate why it's so big)
  • Deploy the patched xapian-haystack package
  • Wait a while (a week?) from 2024-02-07 to confirm the OOMs are gone
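For the uWSGI → Gunicorn swap above, a sketch of what the replacement service could look like. This is a guess at the shape, not the actual deployment: the paths, user, WSGI module name and worker count are all assumptions that would need to be checked against the Debian mailman3-web package:

```ini
# Hypothetical /etc/systemd/system/mailman3-web.service
[Unit]
Description=Mailman3 web UI (Gunicorn)
After=network.target

[Service]
User=www-data
Group=www-data
# Assumed location of the Debian mailman3-web Django project
WorkingDirectory=/usr/share/mailman3-web
# --max-requests recycles workers after N requests, which bounds
# per-worker memory growth when the WSGI app leaks
ExecStart=/usr/bin/gunicorn --workers 2 --max-requests 1000 \
    --bind 127.0.0.1:8000 wsgi:application
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The `--max-requests` worker recycling is the part that matters for this issue: even if the underlying leak (hyperkitty/xapian) is not fixed, it caps how much memory any single worker can accumulate before being replaced.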

Dashboards

  • node exporter
  • disk usage
  • memory usage (includes OOM graphs)
  • per-process memory
  • reachability

External issues

  • hyperkitty crash #408
  • xapian-haystack PR #238, to fix hyperkitty crash
  • debian mailman3-web bug #1014037 about OOMs
  • hyperkitty xapian disk usage issue #533

Docs

  • TPA wiki mailman search engine docs
Edited Feb 17, 2025 by anarcat