
materculae hits the OOM killer since bullseye upgrade

last night (from my perspective), PostgreSQL crashed on materculae. in systemd's logs, we see:

May 05 05:25:33 materculae systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.

then a bunch of errors happened in the postgresql log:

2022-05-05 05:25:33 GMT LOG:  server process (PID 16279) was terminated by signal 9: Killed
2022-05-05 05:25:33 GMT DETAIL:  Failed process was running: select * from search_by_date_address24($1, $2) as result
2022-05-05 05:25:33 GMT LOG:  terminating any other active server processes
2022-05-05 05:25:33 GMT WARNING:  terminating connection because of crash of another server process
2022-05-05 05:25:33 GMT DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-05-05 05:25:33 GMT HINT:  In a moment you should be able to reconnect to the database and repeat your command.
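
for the record, the kernel's own OOM report should tell us how much memory the killed backend (PID 16279) actually had resident when it was killed. something like this ought to pull it out of the journal (timestamps taken from the postgres log above):

journalctl -k --since "2022-05-05 05:20" --until "2022-05-05 05:30" | grep -i -B 5 -A 40 oom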

it's unclear why this is happening, but it's clearly a regression from the upgrade. here's a memory graph from the last 3 days:

memory graph (grafana): https://grafana.torproject.org/d/xfpJB9FGz/1-node-exporter-for-prometheus-dashboard-en-v20201010?orgId=1&var-origin_prometheus=&var-job=node&var-hostname=All&var-node=materculae.torproject.org:9100&var-device=All&var-interval=2m&var-maxmount=%2Fhome&var-show_hostname=materculae&var-total=93&viewPanel=156&from=now-3d&to=now&refresh=1m

i think the upgrade completed at about 15:26 UTC yesterday, at least according to the graph. (the comment in #40692 (comment 2799945) is timestamped later, but that's probably just me reporting after the fact.)

then we can see the server restarting (the blank in the graph) and slowly reclaiming memory. there's an unusual jump at 22:18 and things go a little out of whack for a few hours, but they seem to stabilise into a somewhat reasonable pattern by 11:00 the next day. that's about 1GB more memory usage than the previous normal though, which is already a little worrisome.

but then, at 22:46 UTC, memory usage just starts to grow linearly, eventually hitting the above OOM at around 5:30 or so.
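
in the meantime, if memory starts climbing like that again, something like this should show which postgres backend is growing before the OOM killer gets to it (then the PID can be matched against pg_stat_activity to see what that backend is running):

ps -o pid,rss,etime,args -C postgres --sort=-rss | head -15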

it seems we don't have prometheus instrumentation for postgresql on that host at all right now, so i guess that would be one next step.
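
a rough sketch of what that could look like, assuming the Debian prometheus-postgres-exporter package and its default port (9187); in practice this would presumably go through puppet rather than by hand:

apt install prometheus-postgres-exporter
systemctl enable --now prometheus-postgres-exporter
# then add materculae.torproject.org:9187 as a scrape target on the prometheus side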

/cc  @hiro
