Snowflake broker process was OOM killed on 2025-08-02
I noticed that the snowflake broker process restarted on 2025-08-02 while investigating a metrics bug. @shelikhoo checked the snowflake broker logs during the anti-censorship team meeting on 2025-08-14 and found that the process had been killed by the OOM killer:
16:31 <+shelikhoo> cohosh: the restart happened at 1990804:Aug 02 12:03:41 snowflake-broker-40349 broker[488594]: 2025/08/02 12:03:41 Loading geoip databases
16:34 <+shelikhoo> 2248897:Aug 02 12:03:34 snowflake-broker-40349 kernel: Out of memory: Killed process 629 (broker) total-vm:7873300kB, anon-rss:5600452kB, file-rss:0kB, shmem-rss:0kB, UID:1003 pgtables:11224kB oom_score_adj:200
16:34 <+shelikhoo> it is oom killer
This coincides with a spike in client polls caused by a snowflake bridge outage (#40475 (closed)). The outage itself has been addressed, but it is worth noting, and responding to, the fact that a bridge failure caused a broker failure through load: the kernel log above shows the broker holding roughly 5.3 GiB of resident memory (anon-rss:5600452kB) when it was killed.
We temporarily increased the broker's CPU and memory resources during the spike in client polls caused by the Iran shutdown (#40465 (comment 3211319)), though that change was motivated by CPU usage. I am curious how much of the broker machine's available memory is actually used during a period of normal load.
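Since the broker is a Go process, one low-effort way to answer that from inside the process is to sample the Go runtime's memory statistics during normal operation. A minimal sketch, assuming we just log them periodically; the interval, field selection, and log destination are illustrative choices, not taken from the broker code (the runtime/pprof and expvar endpoints would be alternatives):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically logs the Go runtime's view of memory usage.
// HeapAlloc is live heap memory; Sys is total memory obtained from the
// OS, which is closer to the resident footprint the OOM killer accounts.
func logMemStats(interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d MiB sys=%d MiB num_gc=%d",
			m.HeapAlloc/(1<<20), m.Sys/(1<<20), m.NumGC)
	}
}

func main() {
	go logMemStats(5 * time.Minute)
	select {} // stand-in for the broker's real main loop
}
```

Comparing a few days of such samples against the machine's total memory would tell us how much headroom normal load leaves.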
However, unless we are already close to maxing out memory, I don't think we need to increase the broker machine's resources preemptively. A better response to this issue may be to prevent the overloads we saw in both #40465 (closed) and #40475 (closed), which occur when connections to proxies or bridges stop working for clients. #40466 may help with that.
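To make that concrete, here is one hypothetical shape a load-shedding guard could take on the broker's HTTP side. The handler path, concurrency cap, and 503 response are all assumptions for illustration, not the design of #40466 or the broker's actual handler code:

```go
package main

import "net/http"

// maxConcurrentPolls caps in-flight client polls; the value is a
// hypothetical placeholder, not tuned for the real broker.
const maxConcurrentPolls = 1000

var pollSlots = make(chan struct{}, maxConcurrentPolls)

// shedLoad rejects requests once the concurrency cap is reached, so a
// flood of client polls degrades into fast 503s instead of unbounded
// memory growth from queued request state.
func shedLoad(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case pollSlots <- struct{}{}:
			defer func() { <-pollSlots }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "broker overloaded, retry later",
				http.StatusServiceUnavailable)
		}
	})
}

func main() {
	// Stand-in for the broker's real client poll handler.
	http.Handle("/client", shedLoad(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok"))
		})))
	http.ListenAndServe(":8080", nil)
}
```

Whether shedding at the broker, backing off in the client, or some combination is the right fix depends on what #40466 ends up specifying.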