materculae hits the OOM killer since bullseye upgrade
last night (from my perspective), PostgreSQL crashed on materculae. in systemd's logs, we see:
May 05 05:25:33 materculae systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
then a bunch of errors happened in the postgresql log:
2022-05-05 05:25:33 GMT LOG: server process (PID 16279) was terminated by signal 9: Killed
2022-05-05 05:25:33 GMT DETAIL: Failed process was running: select * from search_by_date_address24($1, $2) as result
2022-05-05 05:25:33 GMT LOG: terminating any other active server processes
2022-05-05 05:25:33 GMT WARNING: terminating connection because of crash of another server process
2022-05-05 05:25:33 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-05-05 05:25:33 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
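the systemd message above only tells us that *something* in the unit got killed; the kernel log should have the full OOM report with a per-process memory table, which would tell us what was actually eating the RAM. a rough sketch of how i'd pull that out (the time window is just guessed from the timestamps above):

```
# untested sketch: grab the kernel-side OOM report around the crash time;
# it includes a per-process memory dump and the "Out of memory: Killed process" line
journalctl -k --since "2022-05-05 05:00" --until "2022-05-05 05:30" | grep -i -B5 -A40 "out of memory"
```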
it's unclear why this is happening, but it's clearly a regression from the upgrade. here's a memory graph from the last 3 days:
https://grafana.torproject.org/d/xfpJB9FGz/1-node-exporter-for-prometheus-dashboard-en-v20201010?orgId=1&var-origin_prometheus=&var-job=node&var-hostname=All&var-node=materculae.torproject.org:9100&var-device=All&var-interval=2m&var-maxmount=%2Fhome&var-show_hostname=materculae&var-total=93&viewPanel=156&from=now-3d&to=now&refresh=1m
i think the upgrade completed at about 15:26 UTC yesterday, at least according to the graph. (that comment is timestamped later, but that's probably just me reporting after the fact: #40692 (comment 2799945).)
then we can see the server restarting (the blank in the graph) and slowly building its memory usage back up. then there's an unusual jump at 22:18 and things go a little out of whack for a few hours, but they seem to stabilise into a somewhat reasonable pattern at around 11:00 the next day. that's about 1GB more memory usage than the previous normal though, so that's already a little worrisome.
but then, at 22:46 UTC, memory usage just starts to grow linearly, eventually hitting the above OOM at around 5:30 or so.
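one thing i'd want to rule out is the upgrade having changed the memory-related tuning of the new cluster. a quick sketch of what i'd dump from the running cluster and compare against the pre-upgrade config (the list of settings is just the usual suspects, not exhaustive):

```
# untested sketch: show the memory-related settings and where they come from
# (config file vs. default), to compare with the old cluster's configuration
sudo -u postgres psql -c "SELECT name, setting, unit, source FROM pg_settings WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'effective_cache_size', 'max_connections');"
```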
it seems we don't have any prometheus instrumentation for postgresql on that host right now, so i guess setting that up would be one next step.
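for reference, a rough sketch of what that could look like on the host, assuming the exporter packaged in debian is good enough for a start (the prometheus side would still need a scrape job added for it):

```
# untested sketch: install and start the packaged postgres exporter
apt install prometheus-postgres-exporter
systemctl enable --now prometheus-postgres-exporter
# the exporter listens on :9187 by default; prometheus then needs a scrape job
# pointing at materculae.torproject.org:9187
```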
/cc @hiro