
SWITCHdrive Upgrade 6.0.6 to 7.0.5 Post Mortem



This is the post-mortem of our ownCloud 6.0.6 to 7.0.5 upgrade, performed on March 27th, 2015.

SWITCH runs a Sync&Share service called SWITCHdrive for Swiss academia. The software used is a very lightly branded version of ownCloud, open-source software with server components (written in PHP) and various native clients for different operating systems and mobile devices. SWITCH runs a complex, multi-virtual-machine setup on its own IaaS (Infrastructure as a Service) cloud offering, SWITCHengines. SWITCHdrive is set up in a classic web-stack architecture with a load balancer (HAProxy 1.5.2) that separates traffic based on client type (web user or desktop client) across 10 web servers, each running Nginx 1.4.6 and 60 PHP5-FPM processes that run the actual application.
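
For illustration, a minimal haproxy.cfg sketch of this kind of split. The names, addresses and TLS details are placeholders, not our production configuration; it assumes the sync clients can be recognized by the WebDAV path they request (matching on the User-Agent header would be an alternative):

  frontend drive_front
      bind *:443 ssl crt /etc/haproxy/certs/drive.pem
      # desktop and mobile sync clients talk WebDAV, browsers use the web UI
      acl is_sync path_beg /remote.php/webdav
      use_backend sync_servers if is_sync
      default_backend web_servers

  backend web_servers
      balance roundrobin
      server web01 192.0.2.11:80 check
      server web02 192.0.2.12:80 check

  backend sync_servers
      balance roundrobin
      server web03 192.0.2.13:80 check
      server web04 192.0.2.14:80 check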

More details on our architecture can be found here: (BlackBook)

Upgrade Attempt No 1

We had planned to upgrade to ownCloud 7 for quite some time and did a test upgrade (with production data) with the final release candidate of ownCloud 7 Enterprise in October or November 2014. Various other issues delayed the upgrade, but our system was creaking and we expected to gain a lot of performance by finally going to version 7. Based on the test results from October, we decided to do a spontaneous upgrade from 6.0.6 to 7.0.4 during the working week. How hard could that be?

It turned out to be a lot harder. The upgrade process ran for over 2 hours with no indication of what it was doing or whether it had crashed. We used strace to snoop into what the process was doing and saw that it was hitting the filesystem and the database and generally working.

After the upgrade had finished, we started one web server and made it available internally for tests. When we loaded the application, all the test users' files were gone. As we had already been offline for more than two hours, we decided to roll back to the previous version. We managed to do that in 30 minutes and were back online.

Cause for failed upgrade

While diagnosing the failed system, ownCloud engineering eventually determined that the 7.0.4 upgrade didn't load the LDAP user module. It therefore didn't recognize our 6000 users as valid and created new storage entries in the database for them. There were no files associated with these storage entries, so the ownCloud server showed an empty folder to the users.

It took over a month before we received a fixed version of the server software.

Testing the upgrade

Not wanting to run into the same problem another time, we duplicated our live environment with live data (around 7000 users with 12 TB of data) and ran the upgrade on that. It worked flawlessly. The only problem was that the upgrade process ran for over 16 hours. We instrumented the code and found that for every user a new directory ('Shared') was being created in the filesystem. Our creaking NFS server (which ran on virtualized storage) didn't like that and took around 5 seconds per user to create the directory - at 5 seconds for each of roughly 7000 users, directory creation alone accounts for about 10 hours.

After directory creation, ownCloud ran a very expensive database query to remove entries from the oc_share table that were no longer in the oc_filecache table. The query iterated over all entries in oc_share and did a full table scan on oc_filecache for each one. 6000 full table scans on a table with 12 million rows take a while…
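
For reference, the same clean-up can be expressed as a single anti-join, so that PostgreSQL can use the index on oc_filecache instead of scanning it once per share. This is a sketch based on our reading of the schema at the time (the file_source / fileid column names may differ between ownCloud versions), not the query ownCloud actually ships:

  -- delete shares whose file no longer exists in the file cache
  DELETE FROM oc_share s
  WHERE NOT EXISTS (
      SELECT 1
      FROM oc_filecache f
      WHERE f.fileid = s.file_source
  );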

With much wailing and gnashing of teeth we decided to go ahead with the planned upgrade and take a maintenance window of 24 hours. This was communicated to the IT departments of the institutions that had opted in to use SWITCHdrive, but not to the individual users.

The LDAP load balancing

In the meantime, we had noticed that our LDAP server was totally overloaded and hitting a CPU limit. It seemed unable to handle the number of requests that the ownCloud software was throwing at it. We looked at various ways of improving performance, compared notes with other NRENs (GARR) and, based on what we heard and what ownCloud engineering recommended, created 2 additional LDAP slave servers and hid all three servers behind our trusty HAProxy load balancer. Immediately, the load on the LDAP master dropped, and we were even able to squeeze a bit more performance out of the entire system. Happy days!
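
The balancing itself is plain TCP; roughly like this haproxy.cfg sketch (addresses and server names are placeholders):

  listen ldap
      bind 127.0.0.1:389
      mode tcp
      balance roundrobin
      option tcpka
      server ldap-master 192.0.2.21:389 check
      server ldap-slave1 192.0.2.22:389 check
      server ldap-slave2 192.0.2.23:389 check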

Upgrade Number 2

March 27th - March 29th were the dates for the maintenance window. We took SWITCHdrive down on the evening of the 27th and backed up the database and the filesystem. Virtualisation and snapshots of the virtual disks made this relatively straightforward. Around 22:00 the actual upgrade process was started, and we expected it to be done by late afternoon on the 28th.

When we checked the status on the 28th, the upgrade process had already finished. Because SWITCHdrive itself wasn't running, the shared storage was a lot less busy and the directory creation went much faster than expected. We started an internal web server and verified that the application behaved normally. Everything looked good, so we went ahead and put the service back online - happy to have shaved 16 hours off the maintenance window.

The lost files

An hour later, around 8:00, the first trouble tickets rolled in: users told us that their files were gone - not only from the web GUI but, much more worryingly, from their local machines as well! We switched off the servers immediately to contain the damage and made the first of many phone calls to the engineer designated to help us with the upgrade process. After some prodding and looking, we had identified the first problem: the server software had created additional entries for around 2300 users in the oc_storages table - the same problem that had plagued us in the first failed upgrade. We deleted those entries, and files started to appear again and to be synced back to the clients. First sighs of relief - the data itself wasn't lost, just the pointers to it were wrong.
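
A sketch of the kind of query that finds these bogus rows - storage entries that have no files at all in oc_filecache. Column names follow the ownCloud 7 schema as we understand it; anything found this way should be verified before it is deleted:

  SELECT s.numeric_id, s.id
  FROM oc_storages s
  LEFT JOIN oc_filecache f ON f.storage = s.numeric_id
  WHERE f.fileid IS NULL;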

Unfortunately, those wrong database entries came back. As soon as we started the service, they popped up in the database again, seemingly at random. We spent most of Saturday hunting down what was happening. We found the place where these entries were created in the database easily enough, but had no idea why they were being created. There are multiple code paths that eventually lead to that code, and numerous if/else decisions in it made it difficult to pinpoint the root problem. Eventually we gave up searching for the root cause and instead disabled the creation of the wrong storage entries - hoping that this would at least make the system halfway usable.

Sunday saw abysmal performance and, in some cases, the return of those entries. We switched off the synchronization servers and only let people work through the web GUI. A number of small issues with the web GUI were fixed as well - mostly things that had already been worked out but had failed to make it into the released patch. The worst offender was a wrong Nginx configuration that caused all file downloads through the web interface to fail.

On Monday, we started to work with three ownCloud developers on finding and fixing the root cause of the problem. After 10 hours of debugging, we were able to pinpoint the problem to a wrong or nonexistent answer from the LDAP servers. As noted above, we had put our 3 LDAP servers behind the HAProxy load balancer. It seemed that either HAProxy or the LDAP server behind it was not responding properly to all LDAP requests from the application. We assume that roughly one in every several thousand requests failed.

The fundamental problem was how the ownCloud code handled that failure: instead of properly catching and handling the exception, it silently swallowed it and returned a NULL object instead of a user object.

The code that ran afterwards received a NULL user object and assumed that it had to create a new entry in the oc_storages table. When the desktop clients connected to the server, the server would assume that the user had no data and issue deletion requests to the local clients.

Unknown unknowns

We asked ownCloud engineering about other customers that had done the upgrade to ownCloud 7 and what their experiences were. They pointed us towards our colleagues at TU Berlin, who had done an upgrade from 7.0.4 to 7.0.5 on the same weekend as we did. We talked to the people at TU Berlin and learned that their upgrade from version 6 to 7.0.4 had been as troublesome as ours, that they had also seen the duplicate storage entries, and that they had resorted to deleting them manually.

Had we known that they had experienced these problems with their upgrade, or had the ownCloud engineer who helped us on Saturday realized that we were experiencing exactly the same error in our upgrade, we would either have postponed the upgrade or at least rolled back on Saturday immediately after receiving the trouble reports.

This communication and information failure cost us an enormous amount of work and goodwill from our customers.

Taking care of users

During these days, there was a near-constant stream of user tickets to handle. Some users had actually lost data that we weren't able to recover (see below for details), others had lost data that was in our backups and could be restored. Many were complaining about the bad performance and the constant 502 and 504 errors that plagued the service.

We switched off access for the desktop clients and only let people use the web interface to minimize the impact. We added a link to an explanation page in the SWITCHdrive web UI to help inform people.

Lost data

In at least one case, a user completely lost data that was written just before the upgrade. As far as we can reconstruct the chain of events, this is what happened:

  • User works on their Bachelor thesis and syncs it with SWITCHdrive
  • User writes some more, but no sync takes place
  • The SWITCHdrive upgrade happens and a wrong storage entry for this user is created
  • User syncs and all data is deleted from the local computer
  • The storage entry is removed from the database
  • User syncs again and gets the data that was still on our servers

The data that was written between the last sync and the upgrade is lost and could not be restored.

Rollback?

As we still had problems with the system, we asked ownCloud on Monday evening what they would recommend. The engineers/developers suggested a rollback to 6.0.6.

We rebuilt a complete infrastructure for Drive on our OpenStack cluster on Tuesday, 31.3.2015, and copied both the database and the filesystem data to that infrastructure. Thanks to virtualization everywhere, this task was easy. We had snapshots of the volumes that contained the database and the 13 TB of user data. Creating new volumes and moving them around was a matter of a few minutes.

The preparations took around one day, after which we had a running 6.0.6 system. We decided to open it to internal SWITCH users first to see whether users that had synced with the new version would be able to go back to the state of a week earlier without losing any data. The tests - unfortunately - showed that we would lose data that had been generated between 27.3 and the date of the rollback.

Meanwhile, ownCloud engineering was busy working on a patch that would prevent the data corruption problem from happening again and promised us the patch by Wednesday afternoon, 1.4.

The SWITCH team (including the admins, the product owner, marketing and management) debated which direction to go: roll back with a guaranteed data loss of unknown proportions, or go forward with the risk of the patch not working.

We got some feedback on a working installation of ownCloud 7.0.5 from David Jericho at AARNet that sounded promising (the performance of version 7 was much, much better than that of ownCloud 6).

Based on the data we had available, we decided to pursue the option of going forward with ownCloud 7 and the new (untested) patch. It was put into place on Wednesday evening and looked like it had solved the problem - until we opened one of the sync servers to the public. The duplicate storage problem manifested itself almost immediately, and we informed ownCloud engineering that there were more code paths that needed checking.

Our management had set a deadline for going live - one way or another - of Thursday 2.4, 12:00. On Thursday morning we implemented another version of the patch and disabled the LDAP load balancing. During the morning we started to slowly open our sync servers to public traffic and monitored the performance of the system. By 11:45 we were completely live and serving our users again.

In the afternoon, an email was sent out to all users with apologies and instructions on how to proceed.

As Easter was approaching, we hoped that the load on the system would be moderate in the beginning, giving us time to adapt our understanding of how the system performed.

Over Easter, the admin team kept working to keep the system up and running.

Performance - Part 0 - Options

Based on the input from AARNet, we changed a number of configuration options for ownCloud. Notably, these two options were set:

  'enable_previews' => false,
  'filesystem_check_changes' => 0,

By disabling the generation of previews, we could reduce the load on our NFS server. ownCloud tries to generate a preview for every file that is uploaded (even if it doesn't make sense for that file type to be rendered into a preview image). In the case of AARNet, this yielded significant performance improvements. We also saw an increase in overall performance.

The filesystem_check_changes flag tells ownCloud that the data in the file storage doesn't change outside of ownCloud's control, so there is no need to constantly check for new or changed files. Again, this leads to less load on the file storage.

These options slipped through the cracks when we prepared the upgrade - even though we had already argued with ownCloud that the second one should be implemented.

Performance - Part 1 - DB Server

After we had the data situation under control, we started to look into the performance problems our users were reporting (and that we could see as well). We saw that the DB server was under high load - it was writing to disk constantly and saturating the disk I/O it was given by the shared Ceph storage. We already had plans to move it to a different machine with local SSD storage instead. The hardware was available, but not yet installed. Our previous OpenStack cluster had the servers, but they weren't yet in the datacenter where our production cluster lived, and we hadn't been able to move them earlier because of some production VMs that were still running on it. The SSDs were in our office, waiting for the servers.

On 7.4.2015, the 8 servers were moved to the new datacenter and two of them were outfitted with SSDs. In the evening, one of them was running as a hypervisor in our OpenStack cluster.
We created a new database server on this machine on 8.4 and made it the primary server during the morning. As expected, the performance of the DB server increased by more than an order of magnitude. We could also disable the write barriers on the filesystem housing the actual database, which increases IOPS dramatically (at the cost of safety - however, the SSDs can survive a power loss and will persist any data in flight, making this a non-issue).
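
For the record, disabling write barriers is just a mount option on the filesystem that holds the PostgreSQL data directory. An /etc/fstab sketch, assuming ext4 (the device and mount point are placeholders):

  # nobarrier is only safe here because the SSDs have power-loss protection
  /dev/sdb1  /var/lib/postgresql  ext4  defaults,noatime,nobarrier  0  2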

Performance - Part 2 - Load Balancer

On the recommendation of ownCloud, we made one code change and one policy change.

ownCloud uses a PHP session during a web request. The session information is stored in a file on a RAM disk on each web server. ownCloud 7 releases this file very early in the request, which improves the handling of parallel requests from the same user or client. However, this can lead to problems (duplicated/lost data) when a user renames a folder. In order to minimize the chance of this happening, ownCloud engineering suggested we keep the session locked for the duration of each request. This would mean lower performance but more safety.

In addition, it was recommended that the desktop clients be sticky to one web server. Previously, we had distributed the clients round-robin across our web servers.

When we put these changes into production, our load balancer started to remove web servers from rotation, seemingly at random. This meant the same load was distributed among fewer web servers, which in turn caused more servers to become unresponsive and be removed from the load balancer.

It took us a while to understand the problem. We guessed that the load on the web servers was too high, or that there were problems connecting to the NFS servers. When we investigated one of the servers with New Relic RPM, we saw that there were many extremely long-running PHP requests. They spent insane amounts of time in the function that constructs a session. Effectively, one client could now take down an entire app server by not responding to a request fast enough, with new requests piling up on the remaining PHP5-FPM processes and starving the server's resources.

Removing the lock, distributing the requests round-robin again, implementing load ramping as described at https://www.igvita.com/2008/05/13/load-balancing-qos-with-haproxy/ and increasing the queue depth on HAProxy to 5000 allowed us to keep the system running.
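
The relevant HAProxy knobs look roughly like this (the values are illustrative, not our exact settings): a per-server connection cap so that excess requests wait in HAProxy's queue, a generous queue timeout, and slowstart for the load ramping described in the linked article:

  backend sync_servers
      balance roundrobin
      timeout queue 5m
      server web01 192.0.2.11:80 check maxconn 60 maxqueue 5000 slowstart 60s
      server web02 192.0.2.12:80 check maxconn 60 maxqueue 5000 slowstart 60s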

Even though it takes a long time for requests to be handled, clients are now held in the HAProxy queue instead of being answered with a 502/504 error. This has allowed clients to finish their sync processes.

Performance - Part 3 - Dead Tuples

The next issue we saw was that the database server was overloaded and got stuck. A restart of the database server cured the problem for a while. We saw that the oc_filecache table was filling up with so-called dead tuples very quickly.

A dead tuple is a database row that is no longer valid and needs to be cleaned up eventually. Postgres does not overwrite a row when it is UPDATEd by a SQL query. It writes a new version of the row, marks the old one as dead and then updates any indexes as necessary. A process called AUTOVACUUM periodically goes through tables with many dead rows and cleans them up. If a table is too busy (i.e. too many updates happen), this process never gets a chance to run and dead tuples accumulate. We noticed that the table was collecting dead tuples at a rate of 10-40 million per day, on 15 million entries. As a first measure to get the system back into shape, we ran a manual VACUUM FULL ANALYZE on the database. This physically rewrites the complete table, discarding any dead rows. The downside is that the database is essentially locked for the duration of the vacuum - between 10 and 20 minutes.
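
Both the monitoring and the manual clean-up are standard PostgreSQL, roughly:

  -- how many dead tuples has oc_filecache accumulated, and when did
  -- autovacuum last manage to run?
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname = 'oc_filecache';

  -- the manual clean-up: rewrites the table and holds an exclusive lock
  -- for the 10-20 minutes mentioned above
  VACUUM FULL ANALYZE oc_filecache;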

We tried several measures to mitigate this problem:

  • Postgres can set a fillfactor on a table. This tells the database how densely to pack rows into disk pages. For heavily updated tables, it is recommended to lower this value (the default is 100). We set it to 50, leaving half of each page empty for updates. The advantage is that updated rows are written into the same page and the indexes don't have to be adjusted (see the sketch below). We saw no immediate benefits from this setting

Postgres Fillfactor
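
A minimal sketch of that change (it only affects newly written pages until the table is rewritten, for example by the next VACUUM FULL):

  -- leave half of every new page free so that updated row versions
  -- can stay on the same page
  ALTER TABLE oc_filecache SET (fillfactor = 50);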

  • We made the autovacuum process more aggressive by lowering the threshold (the ratio of dead to valid rows) and allowing it to interrupt running database processes more often. We saw the autovacuum process start more often, but the oc_filecache table still grew out of control during the day
  • We patched the central update clause in the ownCloud code to check whether the values in the database actually differ from the ones being written - assuming that many updates were in essence superfluous (sketched below). This seemed to reduce the number of updates a bit.

don’t update identical values
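
The idea behind that patch, expressed as SQL rather than the actual ownCloud PHP code (column and parameter names are illustrative): the extra condition turns a no-op write into no write at all, so no dead tuple is produced.

  UPDATE oc_filecache
     SET etag = :new_etag
   WHERE fileid = :fileid
     AND etag IS DISTINCT FROM :new_etag;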

None of these settings, however, resulted in a major breakthrough in performance. We still had to force the VACUUM process to run twice a day (at 22:30 and 05:00). We also noticed that there is massive DB activity during the night - strange, because the number of clients is drastically reduced compared to daytime.

We then noticed another issue: some PHP5-FPM processes kept hammering the database even after the server they were running on had been removed from the load balancer pool. We are still investigating what these zombie processes are doing, because they seem to be the main culprit behind our database performance problems. Restarting the PHP5-FPM service kills those zombies and immediately drops the load on the database.

The same effect can be achieved by restarting the database server. We are now restarting the DB server twice a day. This drops the connections and allows the AUTOVACUUM process to run and clean up the tables. The downtime for a DB restart is around 30 seconds, compared to 15 minutes for a VACUUM.

DISCARD ALL on PgBouncer

While researching the DB performance problems, we noticed that around 50% of the queries on the database server were DISCARD ALL. These queries are issued by the PgBouncer session pool after a client session ends. The command cleans internal data structures on the database server (like temporary data and caches) and its use is generally advised in a normal setting. We are, however, only connecting the same application to the database server, so the different sessions can use the same basic parameters. We switched the clearing of sessions off in the PgBouncer config and saw an immediate doubling of the number of DB transactions handled per second. Interestingly enough, this jump was not reflected in any increase in handled HTTP sessions or other performance improvements. We are still investigating this behavior.

This discussion https://github.com/rails/rails/issues/14037 convinced us that it’s ok to disable DISCARD ALL.
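
In PgBouncer this is a one-line change; a sketch of the relevant line in pgbouncer.ini (the default reset query is DISCARD ALL):

  ; all pooled connections are used by the same application with the same
  ; settings, so we skip the per-session reset
  server_reset_query =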

Reasons for many UPDATEs

In our discussions with ownCloud engineering, we learned that ownCloud updates the ETags of all directories above any file that is changed on the server. An update to a file that is 5 directories deep will therefore trigger at least 4 UPDATEs for directories. This is done so that the clients can efficiently query the server for any changes in the filesystem. When a file (or directory) is shared with other users, these updates have to be propagated to those users as well, resulting in even more updates. For a file that was stored in one directory and shared with one person, we counted 42 UPDATEs to the oc_filecache table.
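
To illustrate the write amplification: a single change to a file like /alice/files/projects/thesis/chapter1.odt ends up as a chain of statements along these lines (the fileids and ETag values are made up):

  UPDATE oc_filecache SET etag = '55a1b2c3d4e01' WHERE fileid = 40123;  -- chapter1.odt
  UPDATE oc_filecache SET etag = '55a1b2c3d4e02' WHERE fileid = 40120;  -- .../thesis
  UPDATE oc_filecache SET etag = '55a1b2c3d4e03' WHERE fileid = 40110;  -- .../projects
  UPDATE oc_filecache SET etag = '55a1b2c3d4e04' WHERE fileid = 40100;  -- .../files
  UPDATE oc_filecache SET etag = '55a1b2c3d4e05' WHERE fileid = 40099;  -- /alice
  -- ...and the same again for every user the file or folder is shared with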

Performance - Part 4 - NFS Server

We store our users' data on one NFS server, on a virtual 100 TB Ceph-backed RBD volume that we snapshot daily for backup purposes. Disk utilization on this volume was more or less constantly at 100% and the NFS server was really, really busy. We knew (from the start of the project, actually) that we would eventually have to change this simple solution to something that scales better.

After our installation started to work halfway decently, we began to migrate the users' file data to more, smaller volumes on 2 NFS servers. The reasoning was this:

Most of the time, ownCloud performs metadata lookups, something NFS servers are notoriously bad at. By splitting the load across two NFS servers, we buy some time for now.

Using smaller volumes makes it easier for us to wrangle these volumes in the Ceph cluster. We discovered the hard way that deleting a snapshot or a volume that is 100 TB big can take multiple hours. During this time, no other volume deletions can take place. This is really bad. By making smaller volumes and distributing the data among them, we hope to improve on that.

The end goal, however, is to eventually move to a fully parallel file system. Tests are underway, but nothing is in production yet.

Performance - Part 5 - Zombie Database Clients

Another possible explanation for the DB performance problems (and for why restarting the DB server helps) could be what we call “zombie PHP processes”.

We regularly see the app servers get into a state where they have multiple PHP processes that seem to be dead but still hold open DB connections. Queries are being executed and data is transferred, but the clients don't seem to use the results at all.

Restarting the app server cures that, as does restarting the DB server. (Restarting the DB server is currently simpler for us.)

Maybe this blog post from Zalando shows a similar behavior?

We have started looking at locked requests and see cases where one update statement locks oc_filecache (objects, rows, tuples?) and other DB queries block on it:

blocking_pid | blocked_pid |     age
-------------+-------------+-------------
        5379 |        6698 | 00h:01m:04s
        5379 |        5896 | 00h:01m:04s
        5379 |        6681 | 00h:01m:04s
        5379 |        5743 | 00h:00m:49s
        5379 |        5390 | 00h:00m:49s
        5379 |        5902 | 00h:00m:48s
        5379 |        5899 | 00h:00m:35s
        5379 |        5904 | 00h:00m:35s
        5379 |        5377 | 00h:00m:25s
        5804 |        6696 | 00h:00m:04s
        6696 |        5468 | 00h:00m:03s
        5804 |        5760 | 00h:00m:00s
        5760 |        5459 | 00h:00m:00s

These locks go away after several seconds or a few minutes (we have only just started to investigate this). This is the query we use to find blocked and blocking processes:

SELECT
    kl.pid AS blocking_pid,
    bl.pid AS blocked_pid,
    to_char(age(now(), a.query_start), 'HH24"h":MI"m":SS"s"') AS age
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a
    ON bl.pid = a.pid
JOIN pg_catalog.pg_locks kl
    ON bl.locktype = kl.locktype
    AND bl.database IS NOT DISTINCT FROM kl.database
    AND bl.relation IS NOT DISTINCT FROM kl.relation
    AND bl.page IS NOT DISTINCT FROM kl.page
    AND bl.tuple IS NOT DISTINCT FROM kl.tuple
    AND bl.virtualxid IS NOT DISTINCT FROM kl.virtualxid
    AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid
    AND bl.classid IS NOT DISTINCT FROM kl.classid
    AND bl.objid IS NOT DISTINCT FROM kl.objid
    AND bl.objsubid IS NOT DISTINCT FROM kl.objsubid
    AND bl.pid <> kl.pid
JOIN pg_catalog.pg_stat_activity ka
    ON kl.pid = ka.pid
WHERE kl.granted AND NOT bl.granted
ORDER BY a.query_start;

The real reason

Observing the code and talking to ownCloud engineering led us to the true root problem: deletions and renames of directories in ownCloud are done within a single database transaction. The actual removal or moving of files takes a long time, as a lot of superfluous work is done for each file: for example, after each file is deleted, the complete directory structure is scanned and updated (the size of the directory is recalculated - an expensive operation), and the same operations are repeated for each user the directory is shared with. In addition, each file is copied (not moved) to the trashbin directory. For large directories with a couple of ten thousand files, this process takes hours.

During that time, the database is internally inconsistent (as the transaction hasn't been committed) and the VACUUM process can't run. Also, no updates are sent to the DB slave. Over time, this renders the database unusable and we have to restart it (in effect aborting the transaction).
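
A standard PostgreSQL query (nothing ownCloud-specific) that helps us spot such long-running transactions before they bring the database down:

  SELECT pid, now() - xact_start AS xact_age, state, query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY xact_start
  LIMIT 10;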

While this gets performance back up, the directories are not deleted, so users will usually try to delete them again - with the same result… We have had quite a few users ask us to manually delete their directory structures because the deletion process never finished.

We are in contact with ownCloud engineering about ways to improve this process. We hope that the relevant changes will make it into version 8.1.

Our immediate future

As of mid-May 2015, we are in a situation where SWITCHdrive mostly works. We will soon require our users to use newer sync clients (ownCloud engineering strongly recommends limiting access to versions >= 1.7.1).

We are also starting to test the upgrade to version 8.1, which we are told is mostly a bug-fix release for 8, with lots of improvements that are directly relevant for us.

In addition, we are thinking about breaking up our monolithic ownCloud installation into around 30 smaller ones - one for each institution that has signed up for the SWITCHdrive service. We have an idea of how to do that with exactly the same number of virtual machines and a bit more configuration. The advantages we see are in the DB area: instead of one huge DB with 20 million rows, we'd have 30 smaller ones. All operations would perform faster, there would be less likelihood of the whole database being locked for long periods of time, etc. The drawback is that people have a bit more work to do to share files between institutions. ownCloud version 7 and above support federated sharing, and that has worked nicely in our tests so far. Also, most users share within their own institution, so the impact is relatively minor.

We are currently setting up two complete production copies of our SWITCHdrive installation and will test them to determine the best way to proceed.

Takeaways for ownCloud

  • Improve the verbosity of the upgrade process - especially for long-running tasks like creating directories or running expensive DB queries
  • Test on a “real” system: several thousand users, a few TB of data and a few million files
  • Re-think the architecture - it does not scale and it does not work reliably
  • Supervise junior programmers more closely.
  • The quality of the code is abysmal in places - swallowing exceptions is but one symptom. Methods are big and complicated (high cyclomatic complexity) and are hard to understand and debug.
  • There are code paths that make no sense: a check for user information in LDAP should not result in a storage entry being created “by accident” - a) because the user is known, as the code is querying for a shared user, and b) because of separation of concerns in the code

Takeaways for SWITCH

  • We have heard loud and clear that we need to be better at communicating both planned upgrades and problems (should they occur)

  • We will schedule major upgrades in semester breaks, and not during the semester. SWITCHdrive is being used in various ways and not having it available was a major source of frustration for many of our users

  • As many problems only seem to manifest themselves under heavy load, we are working on setting up automated load testing - the CERN smashbox initiative is a good starting point for that

  • We have intensified our connections to other academic and educational institutions across Europe (and the world) that are using ownCloud, in order to have a better feel for the problems they have encountered. We don't want to run into a situation again where we are hit by a known bug

  • We will be much more likely to roll back to the previous version after an update, should we encounter problems

