We use Ceph as the basis of our storage cluster.
Ceph is a distributed storage system that bundles multiple servers and disks into one pool of “software-defined storage” that can be accessed in various ways. A Ceph cluster consists of servers, each containing multiple disks. Each disk is controlled by a process called an “OSD” (Object Storage Daemon), which is responsible for storing and retrieving objects. Ceph stores all data as objects, and those objects can be accessed in several ways: as pure object storage (natively or via an S3-compatible interface), as block devices (RBD), or via an (experimental) filesystem (CephFS).
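The core idea can be sketched in a few lines of Python. This is a toy model with hypothetical names (`MiniObjectStore`, `put`, `get`), not the real Ceph API: objects are named blobs, and each object is deterministically assigned to an OSD by hashing its name, so any client can compute where an object lives without asking a central server.

```python
import hashlib

class MiniObjectStore:
    """Toy model of a cluster of OSDs storing named objects."""

    def __init__(self, num_osds):
        self.osds = [dict() for _ in range(num_osds)]  # one dict per OSD

    def _osd_for(self, name):
        # Deterministic placement: hash the object name onto an OSD.
        h = int(hashlib.sha256(name.encode()).hexdigest(), 16)
        return h % len(self.osds)

    def put(self, name, data):
        self.osds[self._osd_for(name)][name] = data

    def get(self, name):
        return self.osds[self._osd_for(name)][name]

store = MiniObjectStore(num_osds=4)
store.put("vm-image-1", b"...disk bytes...")
assert store.get("vm-image-1") == b"...disk bytes..."
```

Real Ceph adds an indirection through placement groups and replicates each object, but the principle of computing placement instead of looking it up is the same.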
Ceph has no single point of control and therefore no single point of failure. It bills itself as self-healing and can grow or shrink more or less automatically. When a disk or a server dies, the data it contained is redistributed to the remaining servers. New hardware can be integrated while the system is running and is incorporated without any downtime.
Ceph and OpenStack
We use Ceph RBD as the basis for the disks in our OpenStack cluster. Every virtual machine gets a virtual block device that is backed by an RBD volume in Ceph. Additional volumes of arbitrary sizes can be created through the OpenStack GUI and attached to virtual machines.
That gives us a tremendous amount of flexibility. Normally, a hypervisor has local disks and its VMs run from those disks. That makes it impossible to live-migrate running VMs from one hypervisor to another (for example, when we need to perform security-related maintenance on the OpenStack cluster). With Ceph, this is trivial, as the storage is equally available to all hypervisors.
Creating new volumes is also very fast, and it is possible to create a “copy-on-write” clone of a disk. When you start a new virtual machine, such a clone is created from the master image (for example, an Ubuntu or CentOS image). The clone is ready in less than a second, as only a few data structures need to be created and no data is copied.
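The copy-on-write mechanism can be sketched as follows. This is a simplified model (the `Image` class and its methods are invented for illustration, not RBD's actual interface): a clone starts out as nothing more than a reference to its parent, reads fall through to the parent until a block has been written locally, and writes land only in the clone.

```python
class Image:
    """Toy block image: block index -> bytes, with an optional parent."""

    def __init__(self, blocks=None, parent=None):
        self.blocks = blocks if blocks is not None else {}
        self.parent = parent

    def clone(self):
        # A clone only records a reference to its parent; no data is
        # copied, which is why it is ready almost instantly.
        return Image(parent=self)

    def read(self, idx):
        # Reads fall through to the parent until the block is written
        # locally in the clone.
        if idx in self.blocks:
            return self.blocks[idx]
        if self.parent is not None:
            return self.parent.read(idx)
        return b"\0"  # never-written block reads as zeroes

    def write(self, idx, data):
        # Copy-on-write: the write lands in the clone; the master image
        # is never modified.
        self.blocks[idx] = data

master = Image(blocks={0: b"ubuntu-boot", 1: b"rootfs"})
vm_disk = master.clone()
assert vm_disk.read(0) == b"ubuntu-boot"   # served from the master
vm_disk.write(0, b"patched-boot")
assert vm_disk.read(0) == b"patched-boot"  # clone sees its own write
assert master.read(0) == b"ubuntu-boot"    # master is unchanged
```

Every new VM can therefore share the unmodified parts of the master image, and only the blocks it actually changes consume additional space.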
There are, however, drawbacks. When the performance of the Ceph cluster degrades, all VMs feel it: they run slowly and, in the worst case, grind to a halt for several seconds.
Here is what happens…
Ceph stores data multiple times (three times by default) on separate disks in separate servers. A clever algorithm (CRUSH) calculates the placement of data across the whole infrastructure and makes sure that the data is as safe as it can be (for example, two copies of the same data are never stored on the same server, as the death of that server would take down too many copies at once). Every time the topology of the storage cluster changes (a disk dies or is added, a server is taken away or added), the CRUSH algorithm computes the new optimal layout for the data and starts a process called “rebalancing” that makes sure the data is stored optimally again. When a disk is lost, for example, certain data exists in only two copies instead of three, so Ceph immediately creates a new copy on another disk to make sure the data stays safe.
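The placement and rebalancing behavior can be illustrated with a deterministic hash-based scheme (rendezvous hashing here, as a simple stand-in for CRUSH, which additionally weights disks and models the full failure-domain hierarchy). Each object gets three distinct servers; when a server disappears, recomputing the placement yields mostly the same servers plus one new home for the lost copy. All names below are invented for the sketch.

```python
import hashlib

def place(obj, servers, replicas=3):
    """Pick `replicas` distinct servers for an object, in a deterministic
    order derived from hashing the (object, server) pair. Because the
    ranking of surviving servers never changes, removing one server only
    moves the copies that lived on it."""
    ranked = sorted(
        servers,
        key=lambda s: hashlib.sha256(f"{obj}:{s}".encode()).hexdigest(),
    )
    return ranked[:replicas]  # no two copies on the same server

servers = ["srv-a", "srv-b", "srv-c", "srv-d"]
before = place("object-42", servers)

# One of the object's servers dies: recompute the layout with the
# survivors, then copy data to restore three replicas.
after = place("object-42", [s for s in servers if s != before[0]])

lost = set(before) - set(after)  # the copy that died with the server
new = set(after) - set(before)   # where the replacement copy must go
```

Restoring the lost replica means physically copying that object's data to the new server, and when a whole disk or server fails, this happens for every affected object at once, which is exactly the rebalancing traffic described below.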
The problem, in a nutshell, is that a lot of data potentially has to be moved when the topology changes. Today, we had a faulty disk that we took out of the Ceph cluster at around 11:15. Around 15:30 we added another disk to the cluster to take over the duties of the failed disk. You can see the write traffic of our cluster in the following graph.
Unfortunately, this amount of read/write traffic can cause the VMs whose data lives on those disks to stall and hang until the cluster has sufficiently calmed down.
We are still working on tweaking various settings to make this process less resource-intensive and to minimize the impact on running VMs.
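Ceph does expose knobs for throttling recovery and backfill traffic in favor of client I/O. The options below are real Ceph OSD settings, but the values shown are illustrative assumptions for a sketch, not our production configuration:

```ini
# /etc/ceph/ceph.conf -- illustrative values, not production settings
[osd]
# Limit concurrent backfill operations per OSD; lowering this reduces
# the impact of rebalancing on client I/O.
osd max backfills = 1
# Limit concurrent recovery operations per OSD.
osd recovery max active = 1
# Lower the scheduling priority of recovery I/O relative to client I/O.
osd recovery op priority = 1
```

The trade-off is that throttling recovery makes rebalancing take longer, which extends the window in which some data has fewer than three copies.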