fsn VMs lost connectivity this morning

This morning several of our VMs at fsn were without network.

The instances were still running, and gnt-console still got me a console that I could log into, but the machines were not reachable from the network, nor could they reach the network. tcpdumping the bridge interface on the node did not show any network traffic for the instance.

Migrating them brought them back online (tried with vineale, for instance). Rebooting also helped (tried with everything else).

The running openvswitch config on a node whose instances had no network looked like this:

root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "br0"
            Interface "br0"
                type: internal
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
    ovs_version: "2.10.1"

When it's working, it should look more like this:

root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port "tap3"
            tag: 4000
            trunks: [4000]
            Interface "tap3"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "tap4"
            tag: 4000
            trunks: [4000]
            Interface "tap4"
        Port "br0"
            Interface "br0"
                type: internal
        Port "tap5"
            tag: 4000
            trunks: [4000]
            Interface "tap5"
        Port "tap1"
            tag: 4000
            trunks: [4000]
            Interface "tap1"
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
        Port "tap2"
            tag: 4000
            trunks: [4000]
            Interface "tap2"
        Port "tap0"
            tag: 4000
            trunks: [4000]
            Interface "tap0"
    ovs_version: "2.10.1"

My first guess was that migrating had somehow screwed up the network config, but that's probably not what happened, as the issue occurred again shortly afterwards while I was running upgrades. So:

My current working theory is that the following happened:

  • In the morning, once automatically and once manually, we ran package upgrades.
  • Today this included an openssl update, and openvswitch is linked against openssl.
  • needrestart restarted openvswitch.
  • restarting openvswitch does not restore the dynamically added VM taps into the bridge.
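
One way to check this theory on a node is to compare the tap devices that exist on the host with the ports openvswitch currently has on the bridge. The following is only a minimal sketch: the helper names `missing_ports` and `host_taps` are made up for illustration, and it assumes that instance interfaces are named `tap*` and that the bridge is `br0`, as in the output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of openvswitch or ganeti.

# Print every name from the first list that is absent from the second.
# Both arguments are whitespace-separated lists of interface names.
missing_ports() {
    # Normalise the second list to " name name name " for whole-word matching.
    have=" $(printf '%s ' $2)"
    for ifname in $1; do
        case "$have" in
            *" $ifname "*) ;;          # already a port on the bridge
            *) echo "$ifname" ;;       # exists on the host, missing from the bridge
        esac
    done
}

# List the tap devices the kernel currently knows about.
# Assumption: instance interfaces are named tap0, tap1, ...
host_taps() {
    for path in /sys/class/net/tap*; do
        if [ -e "$path" ]; then
            basename "$path"
        fi
    done
}
```

On an affected node, `missing_ports "$(host_taps)" "$(ovs-vsctl list-ports br0)"` should then print the taps of running instances that are no longer attached to the bridge. If the theory is right, that list would be empty before an openvswitch restart and contain every instance tap after it.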

I propose we blacklist openvswitch from being restarted by needrestart.
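
As a rough sketch of what that could look like: needrestart reads Perl snippets from `/etc/needrestart/conf.d/*.conf`, and its `$nrconf{override_rc}` hash maps service-name regexes to 0 (do not restart by default) or 1. The function name `write_ovs_override`, the file name `openvswitch.conf` and the regex are assumptions for illustration. In particular, the exact unit names needrestart reports for openvswitch on our nodes should be checked before relying on the pattern.

```shell
#!/bin/sh
# Hypothetical helper: write a needrestart drop-in that stops openvswitch
# from being selected for automatic restarts.
# $1 is the target directory (defaults to needrestart's conf.d).
write_ovs_override() {
    conf_dir=${1:-/etc/needrestart/conf.d}
    # Quoted heredoc delimiter so the shell leaves the Perl "$nrconf" alone.
    cat > "$conf_dir/openvswitch.conf" <<'EOF'
# Restarting openvswitch drops the dynamically added VM taps from the
# bridge, so never restart it automatically.
# Assumption: the affected units all start with "openvswitch" or "ovs".
$nrconf{override_rc}{qr(^(openvswitch|ovs))} = 0;
EOF
}
```

With `override_rc` the service is still reported as needing a restart, so we would notice and could restart it by hand after migrating instances away. `$nrconf{blacklist_rc}` would instead make needrestart ignore the service completely.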
