# Experiments with a high-availability /home

I was recently experimenting with ways to configure my computing setup for high availability of my personal data, which is stored in a Btrfs-formatted partition on my SSD. When my workstation is booted into Windows, however, I want to be able to access my data with minimal effort. Since there’s no way to access a Btrfs volume natively from within Windows, I had to find another approach. It seemed like automatically syncing files out to my NAS was the best solution, since that’s always available and independent of most other things I would be doing at any time.

# Candidates

The obvious first option for syncing files to the NAS is the ever-common rsync. It’s great at periodic file transfers, but real-time syncing of modifications is rather beyond the ken of rsync.  lsync provides a reasonable way to keep things reasonably in-sync, but it’s far from an elegant solution.  Were I so motivated, it would be reasonable to devise a similar rsync wrapper using inotify (or similar mechanisms) to only handle modified files and possibly even postpone syncing changes until some change threshold is exceeded.  With existing software, however, rsync is a rather suboptimal solution.

From a cursory scan, cluster filesystems such as ceph or lustre seem like good options for tackling this problem.  The main disadvantage of the cluster filesystem approach, however, is rather high complexity. Most cluster filesystem implementations require a few layers of software, generally both a metadata server and storage server. In large deployments that software stack makes sense, but it’s needless complexity for me.  In addition, ensuring that data is correctly duplicated across both systems at any given time may be a challenge.  I didn’t end up trying this route so ensuring data duplication may be easier than it seems, but a cluster filesystem ultimately seemed like needless complexity for what I wanted to do.

While researching cluster filesystems, I discovered xtreemfs, which has a number of unique features, such as good support for wide-area storage networks, and is capable of operating securely even over the internet.  Downsides of xtreemfs are mostly related to the technology it’s built on, since the filesystem itself is implemented with Linux’s FUSE (Filesystem in USErspace) layer and is implemented in Java.  Both those properties make it rather clunky to work with and configure, so I ended up looking for another solution after a little time spent attempting to build and configure xtreemfs.

The solution I ultimately settled upon was DRBD, which is a block-level replication tool.  Unlike the other approaches, DRBD sits at the block level (rather than the filesystem level), so any desired filesystem can be run on top of it.  This was a major advantage to me, because Btrfs provides a few features that I find important (checksums for data, and copy-on-write snapshotting). Handling block-level syncing is necessarily somewhat more network-intensive than running at the file level, but since I was targeting use over a gigabit LAN, network usage was a peripheral concern.

# Implementation

From the perspective of normal operation, a DRBD volume looks like RAID 1 running over a network.  One host is marked as the primary, and any changes to the volume on that host are propagated to the secondary host.  If the primary goes offline for whatever reason, the secondary system can be promoted to the new primary, and the resource stays available. In the situation of my designs for use of DRBD, my workstation machine would be the primary in order to achieve normal I/O performance while still replicating changes to the NAS. Upon taking the workstation down for whatever reason (usually booting it into another OS), all changes should be on the NAS, which remains active as a lone secondary.

DRBD doesn’t allow secondary volumes to be used at all (mainly since that would introduce additional concerns to ensure data integrity), so in order to mount the secondary and make it accessible (such as via a Samba share) the first step is to mark the volume as primary. I was initially cautious about how bringing the original primary back online would affect synchronization, but it turned out to handle such a situation gracefully. When the initial primary (workstation) comes back online following promotion of the secondary (NAS), the former primary is demoted back to secondary status, which also ensures that any changes while the workstation was offline are correctly mirrored back. While the two stores are resyncing, it is possible to mark the workstation as primary once more and continue normal operation while the NAS’ modifications sync back.

Given that both my NAS and workstation machines run Arch Linux, setup of DRBD for this scheme was fairly simple. First order of business was to create a volume to base DRBD on. The actual DRBD driver is part of mainline Linux since version 2.6.33, so having the requisite kernel module loaded was easy. The userspace utilities are available in the AUR, so it was easy to get those configured and installed. Finally, I created a resource configuration file as follows:

resource home {
device /dev/drbd0;
meta-disk internal;

protocol A;
startup {
become-primary-on Nakamura;
}

on Nakamura {
disk /dev/Nakamura/home;
}
on Nero {
disk /dev/loop0;
}

}

The device option specifies what name the DRBD block device should be created with, and meta-disk internal specifies that the DRBD metadata (which contains such things as the dirty bitmap for syncing modified blocks) should be stored within the backing device, rather than in some external file. The protocol line specifies asynchronous operation (don’t wait for a response from the secondary before returning saying a write is complete), which helps performance but makes the system less robust in the case of a sudden failure. Since my use case is less concerned with robustness and more with simple availability and maintaining performance as much as possible, I opted for the asynchronous protocol. The startup block specifies that Nakamura (the workstation) should be promoted to primary when it comes online.

The two on blocks specify the two hosts of the cluster. Nakamura’s volume is backed by a Linux logical volume (in the volume group ‘Nakamura’), while Nero’s is hosted on a loop device. I chose to use a loop device on Nero simply because the machine has a large amount of storage (6TB in RAID5), but no unallocated space, so I had to use a loop device. In using a loop device I ended up ignoring a warning in the DRBD manual about running it over loop block devices causing deadlocks– this ended up being a poor choice, as described later.

It was a fairly simple matter of bringing the volumes online once I had written the configuration. Load the relevant kernel module, and use the userland utilities to set up the backing device. Finally, bring the volume up. Repeat this series of steps again on the other host.

# modprobe drbd