Issue #32660: Closed (moved)
Created Dec 02, 2019 by anarcat@anarcat

onionoo-backend is killing the ganeti cluster

hello!

today i noticed that, since last friday (UTC) morning, there have been pretty big spikes on the internal network between the ganeti nodes, every hour. it looks like this, in grafana:

snap-2019.12.02-16.06.11.png

We can clearly see a correlation between the two nodes' traffic, in reverse. This was confirmed using iftop and tcpdump on the nodes during a surge.

It seems this is due to onionoo-backend-01 blasting the disk and CPU for some reason. These are the disk I/O graphs for that host, which correlate pretty cleanly with the above graphs:

snap-2019.12.02-16.30.33.png

This was confirmed by an inspection of drbd, the mechanism that synchronizes the disks across the network. It seems there's a huge surge of "writes" on the network every hour which lasts anywhere between 20 and 30 minutes. This was (somewhat) confirmed by running:

watch -n 0.1 -d cat /proc/drbd

on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in DRBD. 13 and 17 are the web nodes, so that's expected - probably log writes? But device ID 4 is onionoo-backend, which is what led me to the big traffic graph.
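Eyeballing watch output only goes so far. To put rough numbers on which device is doing the writing, the counters in /proc/drbd could be sampled twice and diffed. The following is a minimal Python sketch, not something that was run for this ticket. It assumes the DRBD 8.x /proc/drbd layout (a "N: cs:..." line per device followed by a line starting with "ns:"), where ns is data sent over the network and dw is data written to disk, both in KiB as far as I know. The function names (parse_drbd, counter_deltas, sample_proc_drbd) are made up for illustration.

```python
import re
import time

# Start of a device stanza, e.g. " 4: cs:Connected ro:Primary/Secondary ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:")
# Numeric "key:value" counters such as "ns:1234" or "dw:5678"
COUNTER_RE = re.compile(r"\b([a-z]+):(\d+)\b")


def parse_drbd(text):
    """Return {device_minor: {counter_name: value}} from /proc/drbd text.

    Assumes the DRBD 8.x format; only the line starting with "ns:" is
    read for counters, so resync progress lines are ignored.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = int(match.group(1))
            devices[current] = {}
            continue
        if current is not None and line.strip().startswith("ns:"):
            for key, value in COUNTER_RE.findall(line):
                devices[current][key] = int(value)
    return devices


def counter_deltas(before, after, keys=("ns", "dw")):
    """Per-device difference of the chosen counters between two samples."""
    deltas = {}
    for minor, counters in after.items():
        old = before.get(minor, {})
        deltas[minor] = {k: counters.get(k, 0) - old.get(k, 0) for k in keys}
    return deltas


def sample_proc_drbd(interval=10.0, path="/proc/drbd"):
    """Read the file twice, `interval` seconds apart, and return the deltas."""
    with open(path) as handle:
        before = parse_drbd(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_drbd(handle.read())
    return counter_deltas(before, after)
```

Calling sample_proc_drbd() on a node during a surge and sorting the result by the ns delta should show whether device 4 really dominates the replication traffic, or whether 13 and 17 contribute more than expected.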

could someone from metrics investigate?

can i just turn off this machine altogether, considering it's basically trying to murder the cluster every hour? :)
