On the morning of Sunday, 5 June 2016, we noticed that several VMs in the "control plane"
part of SWITCHengines had become unresponsive during the night. (This
later turned out to be due to a conflict between TCP port use of our
Ceph distributed storage system and some internal access lists.)
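For context: Ceph daemons listen on a fairly wide TCP port range; by default the monitors use port 6789 and the OSD daemons bind somewhere in 6800-7300, so an access list that only covers part of that range can silently cut off storage traffic. As a hedged illustration only (our actual access lists are managed differently and are not shown here), iptables rules admitting the full default range might look like:

```shell
# Allow Ceph monitor traffic (default port 6789).
iptables -A INPUT -p tcp --dport 6789 -j ACCEPT

# Allow the default OSD daemon port range (6800-7300). An access list
# that covers only part of this range can block storage I/O for some
# daemons while others keep working, which makes the failure hard to spot.
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT
```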
All those VMs were running on the same hypervisor host, and given that
we also saw scary kernel backtraces on that host, we decided around
0905 UTC that rebooting it would be the best option to get things
running smoothly again.
Unfortunately, because of a latent error in that host's boot
configuration, we didn’t manage to bring it back up. We had to start
the VMs in question on another hypervisor that mirrors disks with the
one we lost. We got them running again by 0940 UTC. Full network
connectivity was re-established by 0958 UTC.
We have already fixed the access list configuration that had caused
the initial problem. The broken configuration that prevented the
hypervisor from booting has been fixed (and tested) on that host; we
still need to propagate those changes to the rest of the fleet.
Finally, we are in the process of improving our alerting so that we
will react more quickly should such problems recur.