What is VENOM?
On 13 May 2015 (this Wednesday) a security vulnerability was publicized under the name VENOM. In the Common Vulnerabilities and Exposures system, this is registered as CVE-2015-3456.
This vulnerability concerns certain hypervisors such as QEMU/KVM. Hypervisors are used in virtualized IT environments, such as SWITCHengines, to run virtual machines. On the one hand, the hypervisor emulates an environment that looks to the "guest" like a real machine where the guest has full control over the (virtual) hardware. On the other hand, the hypervisor must ensure that different "guests" (VMs) are isolated from each other and from the "real" (physical) host that those VMs run on.
VENOM is a bug (essentially a buffer overflow caused by insufficient bounds checking) in the emulation part, specifically in the emulation of floppy disk drives. The bug allows the guest to write into the hypervisor process's memory space. This can easily be used to crash the hypervisor. Past experience shows that, with some additional work, this type of vulnerability can be used to "break out" of the hypervisor process and perform actions with the privilege level of that process. Hypervisor processes run as root (the super-user) on most systems, including ours. So a clever exploit of this bug could allow an attacker to take complete control of the host.
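To make the idea of "insufficient bounds checking" concrete, here is a toy model in Python. It is only an illustration and not the actual QEMU code: the class name, the buffer size and the memory layout are all assumptions made for the example. The emulated device keeps a fixed-size buffer for bytes sent by the guest, and other hypervisor state sits right behind that buffer in memory.

```python
FIFO_SIZE = 512      # assumed buffer size, for illustration only
ADJACENT_SIZE = 16   # stand-in for unrelated hypervisor state


class ToyController:
    """Hypothetical emulated device; not modelled on real QEMU structures."""

    def __init__(self, checked):
        # One flat region: the FIFO bytes, followed by unrelated state.
        self.memory = bytearray(FIFO_SIZE + ADJACENT_SIZE)
        self.pos = 0
        self.checked = checked

    def write_byte(self, value):
        if self.checked:
            # The fix in this toy model: keep the index inside the FIFO.
            self.pos %= FIFO_SIZE
        self.memory[self.pos] = value
        self.pos += 1

    def adjacent_state(self):
        return bytes(self.memory[FIFO_SIZE:])


def guest_floods(controller, count):
    """The guest sends more bytes than the FIFO can hold."""
    for _ in range(count):
        controller.write_byte(0x41)
```

With `checked=False`, a guest that sends `FIFO_SIZE + 8` bytes overwrites the first 8 bytes of the adjacent state. With `checked=True`, the same guest only ever touches the FIFO itself.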
Fixing the Problem: Reboot all VMs?
In order to protect SWITCHengines users (and our infrastructure) against these possible attacks, we installed fixed software versions of the hypervisor we use (the Ubuntu qemu-system-x86 package). Fortunately these fixed packages became available on the same day that the vulnerability was published, and we had them installed on all our systems the next morning (14 May 2015).
But what to do with the hypervisor processes that were already running? There were more than 500 of those, one for each active customer instance/VM. In an article about VENOM and OpenStack, the authors claim that "a reboot or suspend/resume [of all customer VMs] is required in order for this fix to apply". Ouch! So do we need to reboot or interrupt all 500 of our customers' VMs?
If you have a long-running VM on SWITCHengines, you will notice that it was not rebooted or suspended this week. And yet, it now runs under a fresh (and patched) hypervisor. You may have noticed a short period where the VM was running slowly and/or with limited network connectivity.
Live Migration to the Rescue
SWITCHengines is an installation of OpenStack that uses shared storage (implemented using Ceph RBD) for all instances by default. This allows us to transparently move running VMs from one hypervisor host to another. That process is called "live migration". Live migration starts a new hypervisor process on the destination host, transfers the complete state (virtual RAM contents, network interface configuration etc.) from the still-running source hypervisor to the new process, and eventually switches over to the new process. If this seems complicated, yes, it is, but modern hypervisors handle this surprisingly well.
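The state transfer described above can be sketched as a small simulation. This is a heavily simplified model of the iterative "pre-copy" approach, not the hypervisor's real implementation: RAM is a dictionary of pages, and the function name, parameters and thresholds are assumptions chosen for the example. The point is that the guest keeps running (and keeps changing memory) while pages are copied, so changed pages have to be sent again until only a few are left.

```python
def live_migrate(source_ram, guest_writes, max_final_pages=2, max_rounds=10):
    """Copy source_ram to a new dict while the guest keeps writing.

    source_ram:   dict mapping page number -> content (mutated by the guest)
    guest_writes: iterable of dicts; each one holds the pages the guest
                  changes while one copy round is in progress
    Returns (dest_ram, rounds), where rounds counts the copy rounds
    performed while the guest was still running.
    """
    dest_ram = {}
    to_send = set(source_ram)      # first round: every page
    writes = iter(guest_writes)
    rounds = 0

    while len(to_send) > max_final_pages and rounds < max_rounds:
        for page in to_send:
            dest_ram[page] = source_ram[page]
        rounds += 1
        # The guest was not paused, so some pages changed in the meantime.
        dirtied = next(writes, {})
        source_ram.update(dirtied)
        to_send = set(dirtied)

    # Switch-over: the guest is paused briefly, the last pages are copied,
    # and execution continues in the new hypervisor process.
    for page in to_send:
        dest_ram[page] = source_ram[page]
    return dest_ram, rounds
```

For example, with 8 pages and a guest that changes 3 pages during the first round and 1 page during the second, the destination ends up with an exact copy after two rounds plus the short final step. That final step would correspond to the brief slowdown a VM may experience.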
Although live migration is not a function that is directly accessible to SWITCHengines users, our customers do benefit from it, because it allows us (the SWITCHengines operators) to perform various kinds of maintenance on the infrastructure without noticeable impact to them. This has been described quite well—at the time of "Heartbleed", another recent large-scale vulnerability—in a blog post from Google, who also use live migration in their IaaS offering, Google Compute Engine.
Anyway, the VENOM patch is an excellent example of this. We "just" needed to perform a live migration on each of the 500 machines after the new QEMU/KVM packages were installed, and the security problem was fixed completely.
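The logic behind this can be sketched in a few lines of Python. This is a hypothetical outline, not our actual tooling: the function names, the `migrate` callback and the timestamps are assumptions for illustration. The idea is that a VM is still exposed as long as its hypervisor process was started before the patched packages were installed, and that a live migration replaces that process with a fresh one.

```python
from datetime import datetime


def needs_migration(process_started, patched_at):
    """Return the VMs whose hypervisor process predates the patch."""
    return {vm for vm, started in process_started.items()
            if started < patched_at}


def migrate_all(process_started, patched_at, migrate, now):
    """Live-migrate every affected VM, one after the other.

    migrate: hypothetical callback that live-migrates a single VM
    now:     callable returning the current time; the new hypervisor
             process is recorded as having started at that moment
    """
    migrated = []
    for vm in sorted(needs_migration(process_started, patched_at)):
        migrate(vm)
        process_started[vm] = now()
        migrated.append(vm)
    return migrated
```

Running `needs_migration` again after `migrate_all` has finished should return an empty set, which is a simple way to check that no old hypervisor process was left behind.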
If you noticed any impact due to our live migration orgy this Thursday (14 May), please let us know! Unfortunately we have to expect that similar vulnerabilities will be found in the future, and we want to make sure that SWITCHengines users aren't affected more than absolutely necessary.