GitLab is slow - high CPU and I/O wait
We've been having issues with GitLab since last week. The trouble started on Thursday the 11th.
The symptoms that we can see are:
- RAM usage is high, but 2 GB are still reported as free (not counting caches)
- swap is being used
- CPU usage is high, mostly from git and ruby processes
- I/O wait is high on all CPUs
- nginx requests per second are not very high: https://grafana.torproject.org/d/bdibil2hfyu4gf/nginx-exporter?orgId=1&from=now-1y&to=now&refresh=5s
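For anyone wanting to cross-check the I/O wait symptom directly on the host, here is a minimal sketch that computes the share of CPU time spent in iowait between two samples of the aggregate `cpu` line in `/proc/stat`. The function names `iowait_percent` and `read_cpu_line` are hypothetical helpers made up for illustration, not part of any existing tooling, and this only looks at the aggregate line rather than each CPU.

```python
def iowait_percent(before: str, after: str) -> float:
    """Return the percentage of CPU time spent in iowait between two samples.

    Each argument is the aggregate "cpu ..." line from /proc/stat (Linux).
    """
    def fields(line: str) -> list[int]:
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:]]

    deltas = [b - a for a, b in zip(fields(before), fields(after))]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Calling `read_cpu_line()` twice a few seconds apart and passing both results to `iowait_percent` gives a rough figure to compare against the dashboards; the same arithmetic applies to the per-CPU lines (`cpu0`, `cpu1`, ...) if the prefix check is relaxed.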
Current status:
- load spikes are still an issue as of early September 2024
- load spikes correlate with large CI runs (in tor-browser and friends, in particular), which do lots of concurrent fetches; tracked in tpo/applications/tor-browser#43121 (closed); possible workaround: object cache (#41705)
- mitigations previously deployed by @brizental seem incomplete, possibly because artifacts storage is also slow; a possible fix is to move to object storage (#41403), but then we need to handle backups (#41415)
- a multiple-cause scenario is more and more likely; bots could also be involved, as in the last issues in May 2024 (#41597 (closed))
- @brizental's experiments ruled out the "noisy neighbor" theory for now, although we have a proposal to insert "idle canaries" to test that hypothesis (#41750 (closed))
- TPA has been considering moving the GitLab VM to another, faster cluster (#41431 (closed)) and scaling up the service by splitting GitLab components (#40479)
Edited by anarcat