Web history archival and WARC management

I’ve been a sort of ‘rogue archivist’ along the lines of the Archive Team for some time, but generally lack the combination of motivation and free time to directly take part in their activities. That said, I do sometimes go on bursts of archival since these things do concern me; it’s just a question of when I’ll get manic enough to be useful and latch onto an archival task as the one to do. An earlier public example is when I mirrored ticalc.org.

The historical record contains plenty of instances where people maintained copies of their communications or other documentation which has proven useful to study, and in the digital world the same is likely to be true. With the ability to cheaply store large amounts of data, it is also relatively easy to generate collections in the hope of their future utility.

Something I first played with back in 2014 was extracting lists of web pages to archive from web browser history. From a public perspective this may not be particularly interesting, but if maintained over a period of time this data could be interesting as a snapshot of a typical-in-some-fashion individual’s daily life, or for purposes I can’t foresee.

Today I’m going to write a little about how I collect this data and reduce the space requirements. The products of this work that are source code can be found on Bitbucket.

High-availability /home revisited

About a month ago, I wrote about my experiments in ways to keep my home directory consistently available. I ended up concluding that DRBD is a neat solution for true high-availability systems, but it’s not really worth the trouble for what I want to do, which is keeping my home directory available and in-sync across several systems.

Considering the problem more, I determined that I really value a simple setup. Specifically, I want something that uses very common software, and is resistant to network failures. My local network going down is an extremely rare occurence, but it’s possible that my primary workstation will become a portable machine at some point in the future- if that happens, anything that depends on a constant network connection becomes hard to work with.

If an always-online option is out of the question, I can also consider solutions which can handle concurrent modification (which DRBD can do, but requires using OCFS, making that solution a no-go).

Rsync

rsync is many users’ first choice for moving files between computers, and for good reason: it’s efficient and easy to use.  The downside in this case is that rsync tends to be destructive, because the source of a copy operation is taken to be the canonical version, any modifications made in the destination will be wiped out.  I already have regular cron jobs running incremental backups of my entire /home so the risk of rsync permanently destroying valuable data is low.  However, being forced to recover from backup in case of accidental deletions is a hassle, and increases the danger of actual data loss.

In that light, a dumb rsync from the NAS at boot-time and back to it at shutdown could make sense, but carries undesirable risk.  It would be possible to instruct rsync to never delete files, but the convenience factor is reduced, since any file deletions would have to be done manually after boot-up.  What else is there?

Unison

I eventually decided to just use Unison, another well-known file synchronization utility.  Unison is able to handle non-conflicting changes between destinations as well as intelligently detect which end of a transfer has been modified.  Put simply, it solves the problems of rsync, although there are still situations where it requires manual intervention.  Those are handled with reasonable grace, however, with prompting for which copy to take, or the ability to preserve both and manually resolve the conflict.

Knowing Unison can do what I want and with acceptable amounts of automation (mostly only requiring intervention on conflicting changes), it became a simple matter of configuration.  Observing that all the important files in my home directory which are not already covered by some other synchronization scheme (such as configuration files managed with Mercurial) are only in a few subdirectories, I quickly arrived at the following profile:

root = /home/tari
root = /media/Caring/sync/tari

path = incoming
path = pictures
path = projects
path = wallpapers

Fairly obvious function here, the two sync roots are /home/tari (my home directory) and /media/Caring/sync/tari (the NAS is mounted via NFS at /media/Caring), and only the four listed directories will be syncronized. An easy and robust solution.

I have yet to configure the system for automatic syncronization, but I’ll probably end up simply installing a few scripts to run unison at boot and when shutting down, observing that other copies of the data are unlikely to change while my workstation is active.  Some additional hooks may be desired, but I don’t expect configuration to be difficult.  If it ends up being more complex, I’ll just have to post another update on how I did it.

Update Jan. 30: I ended up adding a line to my rc.local and rc.shutdown scripts that invokes unison:

su tari -c "unison -auto home"

Note that the Unison profile above is stored as ~/.unison/home.prf, so this handles syncing everything I listed above.

rtorrent scripting considered harmful

As best I can tell, whomever designed the scripting system for rtorrent did so in a manner contrived to make it as hard to use as possible.  It seems that = is the function application operator, and precedence is stated by using a few levels of distinct escaping. For example:

# Define a method 'tnadm_complete', which executes 'baz' if both 'foo' and 'bar' return true.
system.method.insert=tnadm_complete,simple,branch={and="foo=,bar=",baz=}

With somewhat more sane design, it might look more like this:

system.method.insert(tnadm_complete, simple, branch(and(foo(),bar()),baz()))

That still doesn’t help the data-type ambiguity problems (‘tnadm_complete’ is a string here, but not obviously so), but it’s a bit better in readability. I haven’t tested whether the escaping with {} can be nested, but I’m not confident that it can.

In any case, that’s just a short rant since I just spent about two hours wrapping my brain around it. Hopefully that work turns into some progress on a new project concept, otherwise it was mostly a waste. As far as the divergence meter goes, I’m currently debugging a lack of communication between my in-circuit programmer and the microcontroller.

Incidentally, the rtorrent community wiki is a rather incomplete but still useful reference for this sort of thing, while gi-torrent provides a reasonably-organized overview of the XMLRPC methods available (which appear to be what the scripting exposes), and the Arch wiki has a few interesting examples.

Experiments with a high-availability /home

I was recently experimenting with ways to configure my computing setup for high availability of my personal data, which is stored in a Btrfs-formatted partition on my SSD. When my workstation is booted into Windows, however, I want to be able to access my data with minimal effort. Since there’s no way to access a Btrfs volume natively from within Windows, I had to find another approach. It seemed like automatically syncing files out to my NAS was the best solution, since that’s always available and independent of most other things I would be doing at any time.

Candidates

The obvious first option for syncing files to the NAS is the ever-common rsync. It’s great at periodic file transfers, but real-time syncing of modifications is rather beyond the ken of rsync.  lsync provides a reasonable way to keep things reasonably in-sync, but it’s far from an elegant solution.  Were I so motivated, it would be reasonable to devise a similar rsync wrapper using inotify (or similar mechanisms) to only handle modified files and possibly even postpone syncing changes until some change threshold is exceeded.  With existing software, however, rsync is a rather suboptimal solution.

From a cursory scan, cluster filesystems such as ceph or lustre seem like good options for tackling this problem.  The main disadvantage of the cluster filesystem approach, however, is rather high complexity. Most cluster filesystem implementations require a few layers of software, generally both a metadata server and storage server. In large deployments that software stack makes sense, but it’s needless complexity for me.  In addition, ensuring that data is correctly duplicated across both systems at any given time may be a challenge.  I didn’t end up trying this route so ensuring data duplication may be easier than it seems, but a cluster filesystem ultimately seemed like needless complexity for what I wanted to do.

While researching cluster filesystems, I discovered xtreemfs, which has a number of unique features, such as good support for wide-area storage networks, and is capable of operating securely even over the internet.  Downsides of xtreemfs are mostly related to the technology it’s built on, since the filesystem itself is implemented with Linux’s FUSE (Filesystem in USErspace) layer and is implemented in Java.  Both those properties make it rather clunky to work with and configure, so I ended up looking for another solution after a little time spent attempting to build and configure xtreemfs.

The solution I ultimately settled upon was DRBD, which is a block-level replication tool.  Unlike the other approaches, DRBD sits at the block level (rather than the filesystem level), so any desired filesystem can be run on top of it.  This was a major advantage to me, because Btrfs provides a few features that I find important (checksums for data, and copy-on-write snapshotting). Handling block-level syncing is necessarily somewhat more network-intensive than running at the file level, but since I was targeting use over a gigabit LAN, network usage was a peripheral concern.

Implementation

From the perspective of normal operation, a DRBD volume looks like RAID 1 running over a network.  One host is marked as the primary, and any changes to the volume on that host are propagated to the secondary host.  If the primary goes offline for whatever reason, the secondary system can be promoted to the new primary, and the resource stays available. In the situation of my designs for use of DRBD, my workstation machine would be the primary in order to achieve normal I/O performance while still replicating changes to the NAS. Upon taking the workstation down for whatever reason (usually booting it into another OS), all changes should be on the NAS, which remains active as a lone secondary.

DRBD doesn’t allow secondary volumes to be used at all (mainly since that would introduce additional concerns to ensure data integrity), so in order to mount the secondary and make it accessible (such as via a Samba share) the first step is to mark the volume as primary. I was initially cautious about how bringing the original primary back online would affect synchronization, but it turned out to handle such a situation gracefully. When the initial primary (workstation) comes back online following promotion of the secondary (NAS), the former primary is demoted back to secondary status, which also ensures that any changes while the workstation was offline are correctly mirrored back. While the two stores are resyncing, it is possible to mark the workstation as primary once more and continue normal operation while the NAS’ modifications sync back.

Given that both my NAS and workstation machines run Arch Linux, setup of DRBD for this scheme was fairly simple. First order of business was to create a volume to base DRBD on. The actual DRBD driver is part of mainline Linux since version 2.6.33, so having the requisite kernel module loaded was easy. The userspace utilities are available in the AUR, so it was easy to get those configured and installed. Finally, I created a resource configuration file as follows:

resource home {
device /dev/drbd0;
meta-disk internal;

protocol A;
startup {
become-primary-on Nakamura;
}

on Nakamura {
disk /dev/Nakamura/home;
}
on Nero {
disk /dev/loop0;
}

}

The device option specifies what name the DRBD block device should be created with, and meta-disk internal specifies that the DRBD metadata (which contains such things as the dirty bitmap for syncing modified blocks) should be stored within the backing device, rather than in some external file. The protocol line specifies asynchronous operation (don’t wait for a response from the secondary before returning saying a write is complete), which helps performance but makes the system less robust in the case of a sudden failure. Since my use case is less concerned with robustness and more with simple availability and maintaining performance as much as possible, I opted for the asynchronous protocol. The startup block specifies that Nakamura (the workstation) should be promoted to primary when it comes online.

The two on blocks specify the two hosts of the cluster. Nakamura’s volume is backed by a Linux logical volume (in the volume group ‘Nakamura’), while Nero’s is hosted on a loop device. I chose to use a loop device on Nero simply because the machine has a large amount of storage (6TB in RAID5), but no unallocated space, so I had to use a loop device. In using a loop device I ended up ignoring a warning in the DRBD manual about running it over loop block devices causing deadlocks– this ended up being a poor choice, as described later.

It was a fairly simple matter of bringing the volumes online once I had written the configuration. Load the relevant kernel module, and use the userland utilities to set up the backing device. Finally, bring the volume up. Repeat this series of steps again on the other host.

# modprobe drbd
# drbdadm up home

With the module loaded and a volume online, status information is visible in /proc/drbd, looking something like the following (shamelessly taken from the DRBD manual):

Back on topic

Returning from that digression, the point of switching this site back to a WordPress backend is to get data out to the public more reliably and faster, in order to preserve the information more permanently.  What finally pushed me back was a sudden realization that there’s nothing stopping me from customizing WordPress in a similar fashion to what I did on the Hyde-based site- it simply requires a bit of experience with the backend code.  While PHP is one language I tend to loathe, the immediate utility of a working system is more valuable than the potential utility of a system I need to program myself.

There’s another lesson I can derive from this experience, too: building a flexible system is good, but you should distribute it ready-to-go for a common use case.  Reducing the barrier to entry for a tool can make or break it, and tools that go unused are of no use- getting people using a new creation is the primary barrier to progress

Btrfs

I recently converted the root filesystem on my netbook, a now rather old Acer Aspire One with an incredibly slow 1.8″ Flash SSD, from the ext3 I had been using for quite a while to the shiny new btrfs, which becomes more stable every time the Linux kernel gets updated. As I don’t keep any data of particular importance on there, I had no problem with running an experimental filesystem on it.

Not only was the conversion relatively painless, but the system now performs better than it ever did with ext3/4.

Conversion

Btrfs supports a nearly painless conversion from ext2/3/4 due to its flexible design. Because btrfs has almost no fixed locations for metadata on the disc, it is actually possible to allocate btrfs metadata inside the free space in an ext filesystem. Given that, all that’s required to convert a filesystem is to run btrfs-convert on it- the only requirement is that the filesystem not be mounted.

As the test subject of this experiment was just my netbook, this was easy, since I keep a rather simple partition layout on that machine. In fact, before the conversion, I had a single 8GB ext4 partition on the system’s rather pathetic SSD, and that was the extent of available storage. After backing up the contents of my home directory to another machine, I proceeded to decimate the contents of my home directory and drop the amount of storage in-use from about 6GB to more like 3GB, a healthy gain.

Linux kernel

To run a system on Btrfs, there must, of course, be support for it in the kernel. Because I customarily build my own kernels on my netbook, it was a simple matter of enabling Btrfs support and rebuilding my kernel image. Most distribution kernels probably won’t have such support enabled since the filesystem is still under rather heavy development, so it was fortunate that my setup made it so easy.

GRUB

The system under consideration runs GRUB 2, currently version 1.97, which has no native btrfs support. That’s a problem, as I was hoping to only have a single partition. With a little research, it was easy to find that no version of GRUB currently supports booting from btrfs, although there is an experimental patchset with provides basic btrfs support in a module. Unfortunately, to load a module, GRUB needs to be able to read the partition in which the module resides. If my /boot is on btrfs, that’s a bit troublesome. Thus, the only option is for me to create a separate partition for /boot, containing GRUB’s files and my Linux kernel image to boot, formatted with some other file system. The obvious choice was the tried-and-true ext3.

This presents a small problem, in that I need to resize my existing root partition to make room on the disc for a small /boot partition. Easily remedied, however, with application of the Ultimate Boot CD, which includes the wonderful Parted Magic. GParted, included in Parted Magic, made short work of resizing the existing partition and its filesystem, as well as moving that partition to the end of the disc, which eventually left me with a shiny new ext3 partition filling the first 64MB of the disc.

Repartitioning

After creating my new /boot partition, it was a simple matter of copying the contents of /boot on the old partition to the new one, adjusting the fstab, and changing my kernel command line in the GRUB config file to mount /dev/sda2 as root rather than sda1.

Move the contents of /boot:

$mount /dev/sda1 /mnt/boot$ cp -a /boot /mnt/boot
\$ rm -r /boot

Updated fstab:

/dev/sda1       /boot   ext3    defaults    0 1
/dev/sda2       /       btrfs   defaults    0 1

Finishing up

Finally, it was time to actually run btrfs-convert. I booted the system into the Arch Linux installer (mostly an arbitrary choice, since I had that image laying around) and installed the btrfs utilities package (btrfs-progs-unstable) in the live environment. Then it was a simple matter of running btrfs-convert on /dev/sda2 and waiting about 15 minutes, during which time the disc was being hit pretty hard. Finally, a reboot.

..following which the system failed to come back up, with GRUB complaining loudly about being unable to find its files. I booted the system from the Arch installer once again and ran grub-install on sda1 in order to reconfigure GRUB to handle the changed disc layout. With another reboot, everything was fine.

With my new file system in place, I took some time to tweak the mount options for the new partition. Btrfs is able to tune itself for solid-state devices, and will set those options automatically. From the Btrfs FAQ:

There are some optimizations for SSD drives, and you can enable them by mounting with -o ssd. As of 2.6.31-rc1, this mount option will be enabled if Btrfs is able to detect non-rotating storage.

However, there’s also a ssd_spread option:

Mount -o ssd_spread is more strict about finding a large unused region of the disk for new allocations, which tends to fragment the free space more over time. Mount -o ssd_spread is often faster on the less expensive SSD devices

That sounds exactly like my situation- a less expensive SSD device which is very slow when doing extensive writes to ext3/4. In addition to ssd_spread, I turned on the noatime option for the filesystem, which cuts down on writes at the expense of not recording access times for files and directories on the file system. As I’m seldom, if ever, concerned with access times, and especially so on my netbook, I lose nothing from such a change and gain (hopefully) increased performance.

Thus, my final (optimized) fstab line for the root filesystem:

/dev/sda2       /       btrfs   defaults,noatime,ssd_spread    0

Results

After running with the new setup for about a week and working on normal tasks with it, I can safely say that on my AA1, Btrfs with ssd_spread is significantly more responsive than ext4 ever was. While running Firefox, for example, the system would sometimes stop responding to input while hitting the disc fairly hard.

With Btrfs, I no longer have any such problem- everything remains responsive even under fairly high I/O load (such as while Firefox is downloading data from Firefox Sync, or when I’m applying updates).