On Saturday, 23.5.2015 SWITCHdrive was offline for most of the day. Here's the post mortem and what we learnt in the process.
Around 2 weeks before the 23.5 we noticed that our Ceph storage cluster was reporting several disks that were almost full. Ceph is a distributed storage system, that in our case consists of 16 storage nodes with a 102 disks, each 4 TB in size. This gives us a raw capacity of 408 TB. Ceph stores each set of data three times, the actual capacity therefore was around 136 TB. The Ceph cluster is uses as the backend storage for the whole OpenStack cluster - not only SWITCHdrive but other services as well (SWITCHtube for example stores its videos in Ceph as well)
Ceph uses a deterministic pseudo-random algorithm (CRUSH) to distribute data across the drives and tries to store data evenly across the disks. However, some disks get more data than others. This will lead to problems when a disk becomes full, as Ceph no longer will write to it.
We had four disks that were above 85% full and heading towards 90%. Our team decided to add another 22 disks to the Ceph cluster on Friday, 22.5. As already mentioned in the "Ceph storage cluster and VM performance" post, adding disks to the cluster will severely impact the performance of the running virtual machines. We decided to start the onboarding process Friday evening, so that most performance problems were expected during the night.
The process of adding more disks to the Ceph cluster ran normally. Performance was impacted but the system were running. Then at midnight something happened:
The disk utilization of one of the data disks of SWITCHdrive (the one that contains symlinks to all other drives) spiked at 100%. Immediately, all of our web servers became unresponsive and were removed by the load balancer.
At that point in time, SWITCHdrive was gone.
The case of the crashing NFS server
When we discovered this in the morning, we rebooted the web servers (as this usually takes care of any problems) and the NFS server and indeed, SWITCHdrive started working again - until it failed again, after a few minutes.
By looking at
syslog, we saw that the NFS server had crashed hard. We found this in
/var/log/syslog (yes - this is from a later date, but the original crash was the same):
May 23 21:44:10 drive-nfs1 kernel: [ 165.264648] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [ 173.880092] NFSD: starting 90-second grace period (net ffffffff81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd: Version 1.2.8 starting
May 23 21:51:26 drive-nfs1 kernel: [ 600.776177] INFO: task nfsd:1696 blocked for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [ 600.778090] Not tainted 3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [ 600.779507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 21:51:26 drive-nfs1 kernel: [ 600.781504] nfsd D ffff88013fd93180 0 1696 2 0x00000000
May 23 21:51:26 drive-nfs1 kernel: [ 600.781508] ffff8800b2391c50 0000000000000046 ffff8800b22f9800 ffff8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [ 600.781511] 0000000000013180 0000000000013180 ffff8800b22f9800 ffff880035f48240
May 23 21:51:26 drive-nfs1 kernel: [ 600.781513] ffff880035f48244 ffff8800b22f9800 00000000ffffffff ffff880035f48248
May 23 21:51:26 drive-nfs1 kernel: [ 600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [ 600.781523] [<ffffffff81727749>] schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [ 600.781526] [<ffffffff817295b5>] __mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781528] [<ffffffff8172964f>] mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [ 600.781557] [<ffffffffa03b1761>] nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781568] [<ffffffffa03b044b>] ? fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781591] [<ffffffffa03b1bb9>] nfsd_lookup+0x69/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781613] [<ffffffffa03be90a>] nfsd4_lookup+0x1a/0x20 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781628] [<ffffffffa03c055a>] nfsd4_proc_compound+0x56a/0x7d0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781638] [<ffffffffa03acd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781662] [<ffffffffa028762d>] svc_process_common+0x46d/0x6d0 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781678] [<ffffffffa0287997>] svc_process+0x107/0x170 [sunrpc]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781687] [<ffffffffa03ac71f>] nfsd+0xbf/0x130 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781696] [<ffffffffa03ac660>] ? nfsd_destroy+0x80/0x80 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [ 600.781702] [<ffffffff8108b6b2>] kthread+0xd2/0xf0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781707] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
May 23 21:51:26 drive-nfs1 kernel: [ 600.781712] [<ffffffff81733868>] ret_from_fork+0x58/0x90
May 23 21:51:26 drive-nfs1 kernel: [ 600.781717] [<ffffffff8108b5e0>] ? kthread_create_on_node+0x1c0/0x1c0
Reading the logs, it seemed that the NFS server was waiting to read from a disk (and waited more than 120 seconds). Our first hunch was that one of the volumes attached to the NFS servers virtual machine had gone bad. The timing (midnight) seemed to point to a problem with snapshotting (We snapshot all NFS volumes for backup purposes)
We tried to determine if a specific volume had gone bad, recreated several volumes from snapshots, attached them to the NFS server, rebooted everything for good measures. Each time, SWITCHdrive came back online - was running for anywhere between 5 and 50 minutes until the NSF server crashed.
We then suspected a problem with the hypervisor, that the NFS server was running on, so we live migrated all running VMs from the hypervisor to a different one and tried to see if things got better. Every time, the NFS server crashed after a variable amount of time. This trial and error process took most of Saturday and we were running out of options.
Our engineers checked the complete Ceph cluster (which had finished rebalancing late in the morning) but saw no signs of any abnormal behavior.
Slowly, we were running out of ideas...
As mentioned, we have 3 NFS servers. Only one crashed regularly, the others were running just fine. The one crashing had attached to it 6 2TB volumes for user data, and one volume for the symlinks. In addition it also had the old data volume attached - the one with 100 TB size. As mentioned in the previous post mortem we had started to distribute the data to more and smaller disks. We still had a few users that had their data on this old volume, as we couldn't migrate them away safely (we have a migration script that starts to migrate a user if she hasn't been active for a couple of minutes. As those users were online all the time, their data wasn't migrated). We used the downtime to migrate those and could then remove that volume from the NFS server.
In the meantime, we also rebooted one of the Ceph storage nodes that was complaining about a bad disk.
Lo and behold, the NFS server didn't crash during 60 minutes. It was around 22:30 on Saturday night now, and we hoped that we had a system that would continue to work. As there was almost no load on SWITCHdrive by that time, we started an instance of SmashBox to at least do a bit of stress testing during the night.
The NFS server continued to run both on Sunday and Monday, and SWITCHdrive was available and picked up more and more users. The question now was: Was it the removal of one disk from the NFS server or the rebooting of the Ceph node that caused that?
Our engineers created a new test VM and attached 9 virtual disks to it. As soon as he started to access all of those disks, the access hung for 120 seconds - exactly the cause that led to the NFS servers crash. This leads to the question: Why did this work before the disk upgrade?
We are now investigating the uses of threads / processes on the hypervisor to the various OSD (Object Storage Daemon) processes - it seems that there are many and that there are limits... We will keep you updated.
Update: It seems that we are hitting the 1024 file descriptors limits of the QEMU process. For every mounted disk in the VM the QEMU process creates a TCP connection to all physical disks (via the OSD processes running on the storage servers). Going from 102 to 125 disks with 9 mounted disks exceeded the 1024 limit.