fsn VMs lost connectivity this morning
This morning several of our VMs at fsn were without network.
The instances were still running, and gnt-console still gave me a console that I could log into, but the machines were not reachable from the network, nor could they reach the network. Running tcpdump on the bridge interface on the node did not show any network traffic for the instance.
Migrating the instances brought them back online (tried with vineale, for instance). Rebooting also helped (tried with everything else).
The running openvswitch config on a node whose instances had no network looked like this:
root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "br0"
            Interface "br0"
                type: internal
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
    ovs_version: "2.10.1"
When it's working, it should look more like this:
root@fsn-node-04:~# ovs-vsctl show
ce[...]
    Bridge "br0"
        Port "tap3"
            tag: 4000
            trunks: [4000]
            Interface "tap3"
        Port vlan-gntinet
            tag: 4000
            Interface vlan-gntinet
                type: internal
        Port "eth0"
            Interface "eth0"
        Port "tap4"
            tag: 4000
            trunks: [4000]
            Interface "tap4"
        Port "br0"
            Interface "br0"
                type: internal
        Port "tap5"
            tag: 4000
            trunks: [4000]
            Interface "tap5"
        Port "tap1"
            tag: 4000
            trunks: [4000]
            Interface "tap1"
        Port vlan-gntbe
            tag: 4001
            Interface vlan-gntbe
                type: internal
        Port "tap2"
            tag: 4000
            trunks: [4000]
            Interface "tap2"
        Port "tap0"
            tag: 4000
            trunks: [4000]
            Interface "tap0"
    ovs_version: "2.10.1"
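In the broken state the tapN ports are missing from br0. A rough check for that condition could look like the sketch below. It assumes the VM tap devices still exist on the host while the instance is running and that they follow the tap* naming seen above; `missing_taps` is a hypothetical helper name, not something Ganeti or Open vSwitch provides.

```shell
#!/bin/sh
# Print every interface from the first list that is absent from the second.
# $1: newline-separated tap interfaces present on the host
# $2: newline-separated ports currently attached to the bridge
missing_taps() {
    for tap in $1; do
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        printf '%s\n' "$2" | grep -qxF "$tap" || printf '%s\n' "$tap"
    done
}

# Usage on a node (bridge name br0 assumed from the listing above):
#   host_taps=$(ls /sys/class/net | grep '^tap')
#   bridge_ports=$(ovs-vsctl list-ports br0)
#   missing_taps "$host_taps" "$bridge_ports"
```

Any name it prints is a tap that exists on the host but is not a port on the bridge, which matches the symptom described here.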
My first guess was that migrating had somehow screwed up the network config, but that's probably not what happened, as the issue happened again shortly afterwards when I was running upgrades.
My current working theory is that the following happened:
- In the morning, once automatically and once manually, we ran package upgrades.
- Today this included an openssl update, and openvswitch is linked against openssl.
- needrestart restarted openvswitch.
- Restarting openvswitch does not restore the dynamically added VM taps into the bridge.
I propose we blacklist openvswitch from being restarted by needrestart.
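One possible shape for this, as an untested sketch: needrestart's configuration is Perl, and on Debian the stock /etc/needrestart/needrestart.conf also reads drop-ins from /etc/needrestart/conf.d/*.conf. The helper below writes such a drop-in using the `$nrconf{blacklist_rc}` list. The function name, the file name openvswitch.conf and the `^openvswitch` pattern are assumptions; the pattern needs to be checked against the unit names needrestart actually reports on the nodes.

```shell
#!/bin/sh
# Write a needrestart drop-in that blacklists Open vSwitch services.
# $1: target directory (on a real node: /etc/needrestart/conf.d)
write_needrestart_blacklist() {
    # Quoted heredoc delimiter: the shell must not expand $nrconf.
    cat > "$1/openvswitch.conf" <<'EOF'
# Restarting openvswitch drops the dynamically added VM taps from the bridge,
# so needrestart should leave it alone.
# Assumption: the affected units match ^openvswitch; adjust as needed.
push @{$nrconf{blacklist_rc}}, qr(^openvswitch);
EOF
}

# Usage on a node:
#   write_needrestart_blacklist /etc/needrestart/conf.d
```

The stock config comments suggest preferring `$nrconf{override_rc}` over `blacklist_rc`, so a gentler variant would set the same pattern to 0 there instead.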