When the Storage System Goes Down – A Retrospective on Our Ceph Incident

tl;dr

On September 3rd, 2025, we experienced a major outage. The cause was a bug in Ceph, the storage software that forms the backbone of our hosting infrastructure.

As a result, websites and emails for many of our customers were only partially available or completely inaccessible for several days. What made the situation especially critical was the initial uncertainty about whether all data could be fully restored. While we did have backups, they were not fully up to date.

Our top priority was to restore essential functions, such as sending and receiving emails, as quickly as possible. We managed to achieve that on the evening of September 5th. A few days later, we were able to recover and secure the full data sets step by step. By September 9th, it was clear: no data had been lost.

For those interested in the technical details, how the outage occurred and what lessons we learned, you’ll find all the background information below.

Details of the Outage

On the evening of September 3rd, we performed an upgrade of our production Ceph cluster from version 18 (Reef) to version 19 (Squid). This cluster has been the foundation of our hosting infrastructure for years and is used both for virtual machines (via RBD disks) in combination with Proxmox and for our web and mail hosting through CephFS.

Throughout this article, you’ll come across a few key components that play an important role:

  • MONs (Monitors): Manage the cluster’s state and configuration. They form a quorum, which must always be maintained.
  • MDS (Metadata Servers): Responsible for managing metadata within CephFS.

A Brief Overview of the Affected Infrastructure

Our affected setup consisted of six physical servers, each equipped with twelve drives (HDDs or SSDs) and running one CephFS metadata server (MDS), distributed across two data centers. In addition, a virtual server was running solely as a Ceph monitor.

Ceph is a distributed storage system that stores data redundantly across multiple physical servers. Its foundation is RADOS (Reliable Autonomic Distributed Object Store), an object-based storage layer where data is stored as “blobs” together with their metadata.
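
To make the idea of object-based storage a bit more concrete, here is a minimal sketch using the rados command-line tool. The pool and object names are purely illustrative and not taken from our setup:

    # Store a local file as an object in a RADOS pool (pool name is hypothetical)
    rados -p example-pool put website-archive.tar /tmp/website-archive.tar

    # Read the object back into a local file
    rados -p example-pool get website-archive.tar /tmp/restored.tar

    # List all objects in the pool
    rados -p example-pool ls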

We used Ceph in two main ways:

  • Block storage (RBD): Often used in virtualization environments where Ceph provides virtual “disks” that can be used as standard disks with standard file systems (such as ext4, XFS, or NTFS). In our case, Ceph works hand in hand with Proxmox to provide storage for our virtual machines.
  • CephFS: A filesystem that can be mounted directly and allows multiple systems to access data in parallel (similar to NFS). We use CephFS to store all our web and mail hosting data. Both access paths are sketched below.
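
As a rough illustration of these two access paths, the following sketch uses generic rbd and mount commands. Pool, host, and mount-point names are hypothetical, and Proxmox itself usually talks to RBD through librbd rather than a kernel mapping:

    # Block storage: create an RBD image and map it as a local block device
    rbd create example-pool/vm-disk-0 --size 20G
    rbd map example-pool/vm-disk-0       # shows up as /dev/rbdX and can be formatted with ext4/XFS

    # CephFS: mount the shared filesystem so that many hosts can work on it in parallel
    mount -t ceph mon1.example.net:6789:/ /mnt/cephfs \
        -o name=hosting,secretfile=/etc/ceph/hosting.secret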

CephFS manages an enormous amount of data within a single directory tree that’s accessed concurrently by many systems. It’s important to note that this directory tree is handled by several of the MDS servers mentioned above, a feature known as “Dynamic Subtree Partitioning” that is essential for maintaining the filesystem’s responsiveness under heavy load. In our setup, we ran five of these MDS servers in parallel.
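
The number of active MDS daemons is controlled per filesystem via the max_mds setting. A minimal sketch, with the hypothetical filesystem name hostingfs standing in for our real one:

    # Allow up to five active MDS ranks so the directory tree can be split across them
    ceph fs set hostingfs max_mds 5

    # Show which MDS daemons are active and which are standing by
    ceph fs status hostingfs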

The Beginning of the Problem

The upgrade was meant to follow the official Ceph documentation. After a few preparation steps, the process requires reducing the number of active MDS servers to one before proceeding with the package updates on the hardware nodes.
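
That scale-down step uses the same max_mds setting mentioned above. Roughly sketched, again with the hypothetical filesystem name hostingfs:

    # Reduce the filesystem to a single active MDS rank, as the upgrade procedure requires
    ceph fs set hostingfs max_mds 1

    # Wait until only rank 0 remains active before upgrading packages on the nodes
    watch ceph fs status hostingfs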

However, that’s where things started to go wrong. Around 5:30 p.m., we noticed issues with CephFS. Some of the MDS servers had crashed or failed to start again after the update. As a result, parts of the CephFS became inaccessible.

Our initial response was to restart the affected MDS servers, assuming they were the root cause. When that failed, we attempted to remove them from the configuration, but to our surprise, this caused three of the seven MONs to crash as well. The remaining monitors lost their quorum, which meant that no interaction with the Ceph cluster was possible anymore.
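
Without a monitor quorum, the regular ceph client commands no longer get an answer and simply hang. What still works is inspecting a single monitor locally through its admin socket; a sketch, with the hypothetical monitor ID mon1:

    # Cluster-wide commands block when there is no quorum (timeout is in seconds)
    ceph --connect-timeout 10 status

    # Query an individual monitor locally via its admin socket, independent of quorum
    ceph daemon mon.mon1 mon_status
    ceph daemon mon.mon1 quorum_status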

At that point, it was unclear whether the RBD disks would also be affected. A complete failure would have meant that our entire virtualization environment was unusable. Thankfully, none of the RBD disks failed during the incident, and no data was lost.

Of course, INWX maintains redundant backups of all customer data. However, a full restoration of the entire infrastructure would have taken several days.

Standstill and Uncertainty

Needless to say, the atmosphere was tense. We had expected to perform a routine update of software we’ve been using for many years. In the past, such upgrades had never been anything out of the ordinary. The sudden realization of how quickly a total outage had occurred, and how severe the situation was, left us momentarily in shock.

It quickly became clear that the failure of the monitors was linked to the fact that the MONs also store the states of the MDS servers. This meant that the issue affected not only CephFS but the entire cluster.

We soon discovered that the monitors had ended up in a state that, for us, was irreparable. We suspected that one of the unsuccessful repair attempts had caused them to persist a corrupted view of the cluster’s state.

Despite our many years of experience with Ceph, we had reached a point where we needed outside help.

Support from croit and the First Steps Toward Recovery

A few hours later, we brought in external support from croit GmbH. Their specialists analyzed the situation and eventually discovered that we had indeed encountered a bug in Ceph.

During the attempt to reduce the number of active MDS servers, the Ceph monitor accessed an empty internal data structure and crashed. The exact same issue was reported by croit to the Ceph project just one day later as Pull Request #65413. If there’s one good thing to come out of this incident, it’s that it directly contributed to improving Ceph itself.

At the same time, our team at INWX was working to get hosting services back up and running.

We had a backup of the hosting data, but due to technical constraints it had been completed 48 hours before the outage. Creating such a large backup takes about eight hours, meaning that the dataset wasn’t fully “atomic”: files captured at the start and at the end of the run reflect different points in time. Fortunately, our web hosting databases were stored outside of CephFS and remained unaffected. Still, we hesitated to restore from backup right away, as mismatches between website files and databases could easily have led to inconsistencies.

By the evening of September 4th, the croit team had provided us with patched monitor binaries that allowed the cluster to become quorate again, meaning it could reach consensus and resume proper operation. This first milestone was a relief: our RBD-based storage was fully functional again, and we no longer had to fear downtime for our virtual machines.

However, CephFS was still inaccessible. Starting early the next morning, Friday, September 5th, croit continued debugging the Ceph software. That evening, they delivered another newly patched version, which we tested together with them on Saturday. Unfortunately, it didn’t yet bring the success we were hoping for.

Interim Solutions for Our Customers

The days leading up to the full restoration were no less stressful for our customers than they were for us.

Above all, our customers needed to be able to send and receive emails again. That’s why, on Friday, we focused our efforts on restoring at least this core functionality.

By Friday evening, September 5th, we managed to bring mail hosting back online in a basic form using a ZFS- and NFS-based storage setup. Mailboxes and addresses were once again accessible, and new emails could be sent and received. Older messages were still missing at that point, but the experience quickly showed how important it was to at least get this partial service running again.
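
To give a rough idea of what such a ZFS-and-NFS setup involves, here is a simplified sketch. All pool, dataset, host, and path names are hypothetical and not our actual configuration:

    # Create a dataset for the mail data on an existing ZFS pool
    zfs create tank/mailstore

    # Export it over NFS using ZFS's built-in sharenfs property
    # (access restrictions can be passed as export options in the same property)
    zfs set sharenfs=on tank/mailstore

    # On a mail server, mount the export
    mount -t nfs storage1.example.net:/tank/mailstore /var/mail-storage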

We also set a firm deadline: by Monday at midnight, we would restore the data from our backups no matter what, and we began preparing for that recovery in advance.

In addition, we decided to provide emergency support throughout the weekend and, upon request, offer manual restoration of hosting packages from backups for customers who needed it sooner.

Restoring from Backup

By Monday, it was already becoming apparent that we might soon regain access to CephFS, though it was still unclear in what state and to what extent that access would be possible. When our self-imposed deadline arrived, we decided to proceed with restoring the backup.

Fortunately, the web hosting restoration went mostly smoothly, with only a few minor issues.

For mail hosting, we merged the restored backup data with the emails that had arrived since Friday. The subsequent re-indexing of all mailboxes then served as the first real load test for our new storage infrastructure.

CephFS Back Online

It was only through a session reset of the MDS servers, based on a potentially destructive disaster recovery procedure combined with custom-patched Ceph binaries, that croit managed to bring the filesystem back to life. An extraordinary achievement!

On September 9th, six days after the beginning of the outage, we were finally able to mount CephFS again and back it up using rsync. The result was clear: no data had been lost.
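
Sketched with hypothetical monitor, client, and path names, that step amounted to mounting the recovered filesystem read-only and copying everything off with rsync while preserving hard links, ACLs, and extended attributes:

    # Mount the recovered CephFS read-only so that nothing can be changed by accident
    mount -t ceph mon1.example.net:6789:/ /mnt/cephfs-recovery \
        -o name=recovery,secretfile=/etc/ceph/recovery.secret,ro

    # Copy the data to separate storage; -aHAX preserves hard links, ACLs and xattrs
    rsync -aHAX --info=progress2 /mnt/cephfs-recovery/ /backup/cephfs-dump/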

We immediately began restoring the missing emails as well. This was done in two steps:

  • The data created between the end of the backup and the start of the outage could be clearly identified based on modification timestamps.
  • The data created between the start and end of the backup could theoretically contain duplicates. We used the tool rdfind to detect and remove these before cleanly reintegrating the messages into the dataset (see the sketch below).
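
A simplified sketch of both steps; the paths and cutoff timestamps are made up for illustration:

    # Step 1: files changed after the backup finished but before the outage began
    find /recovered/mail -type f \
        -newermt "2025-09-01 18:00" ! -newermt "2025-09-03 17:30" -print

    # Step 2: rdfind treats the first path as the "original" set and flags duplicates in later paths
    rdfind -dryrun true /restored/mail /recovered/mail       # review the results file first
    rdfind -deleteduplicates true /restored/mail /recovered/mail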

This process continued through the following weekend, but in the end, it was confirmed: not a single email was lost.

Communication with Customers

Throughout the incident, we kept our customers informed via regular status updates on https://is.inwx.online, starting on Wednesday evening. Since not all customers were familiar with the status page, we also sent out our first email update on Friday.

Our communication was at times quite reserved. From a customer’s perspective, it would have been more helpful to explain the root cause of the outage in greater detail, even if that meant diving a bit deeper into the technical background.

Looking ahead, we plan to communicate upcoming maintenance work more proactively. In the past, such announcements were posted only on the status page, but we’ll now make greater use of the INWX Tech Newsletter to share infrastructure maintenance updates directly with our customers.

Lessons Learned

We learned a great deal during this week. Cluster systems like Ceph inherently carry the risk that a single configuration change can bring the entire environment to a halt. We were aware of this in theory, but we had clearly underestimated how significant that risk could be in practice. Going forward, we’ll place a much stronger focus on redundancy by operating multiple independent clusters.

One particularly critical takeaway is that Ceph offers very limited recovery options if the monitors fail. For systems where maximum availability is essential, we’ll therefore rely on other storage technologies. Ceph will continue to play a role in our infrastructure, but we’ve already decided to retire CephFS and have migrated our hosting platform to ZFS and NFS.

We also realized that when running such a central and complex system as Ceph, having a strong partner like croit is absolutely invaluable. Without the support of the Ceph developers, data loss would have been nearly impossible to avoid.

And finally: backups and snapshots. With ZFS, we can create snapshots and perform backups far more frequently than was previously feasible. Our goal now is to complement our daily backups with additional hourly data snapshots, providing an extra layer of safety.
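
Because ZFS snapshots are copy-on-write, taking them hourly is cheap. A minimal sketch with a hypothetical dataset name, as it could be run from cron or a systemd timer:

    # Take a timestamped snapshot of the hosting dataset
    zfs snapshot tank/hosting@hourly-$(date +%Y%m%d-%H%M)

    # List existing snapshots
    zfs list -t snapshot -r tank/hosting

    # Individual files can be read back from the hidden .zfs directory
    ls /tank/hosting/.zfs/snapshot/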

Conclusion

The outage of our Ceph cluster was one of the most critical incidents we’ve faced in the past ten years. It cost us nerves, time, and quite a few gray hairs—but it also resulted in a bug fix for Ceph itself and provided us with invaluable insights.

The fact that no data was lost is entirely thanks to the croit team. Only through their tireless effort and deep understanding of Ceph was it possible to bring the cluster back to life. The level of technical depth required to achieve this can hardly be captured within the scope of this article. Members of the Ceph Steering Committee and component leads of the affected software were directly involved through croit.

It was a tough week for our customers and for us. The fact that we came out of it without any data loss is thanks to the dedication of our team, the patience of our customers, and the unwavering support of croit and the Ceph community.