# Considering my backup systems

With the recent news that Crashplan was doing away with its “Home” offering, I had reason to reconsider my choice of online backup provider. Since I haven’t written anything here lately, and since the results of my exploration (plus a description of everything else I do to ensure data longevity) might be of interest to others looking to set up backup systems for their own data, a version of my notes from that process follows.

## The status quo

I run a Linux-based home server for all of my long-term storage, currently 15 terabytes of raw storage with btrfs RAID on top. The choice of btrfs and RAID allows me some degree of robustness against local disk failures and accidental damage to data.

If a disk fails I can replace it without losing data, and btrfs’ RAID support makes it possible to use heterogeneous disks. When I need more capacity, I can remove one disk (putting the volume into a degraded state), add a new (larger) one, and rebalance onto the new disk.
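A disk upgrade, then, looks roughly like the sketch below. The device names and mount point are placeholders, and the exact steps depend on the RAID profile in use, so treat this as an outline rather than a recipe:

```
# Hypothetical disk-upgrade flow on a btrfs RAID1 volume mounted at /srv/storage.
# 1. Physically swap the old disk for the new, larger one, then mount degraded:
mount -o degraded /dev/sda /srv/storage
# 2. Add the new disk to the filesystem:
btrfs device add /dev/sdb /srv/storage
# 3. Drop the now-missing old disk (relocating its data) and rebalance:
btrfs device remove missing /srv/storage
btrfs balance start /srv/storage
```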

btrfs’ ability to take copy-on-write snapshots of subvolumes at any time makes it reasonable to take regular snapshots of everything, providing a first line of defense against accidental damage to data. I use Snapper to automatically create rolling snapshots of each of the major subvolumes:

• Synchronized files (mounted to other machines over the network) have 8 hourly, 7 daily, 4 weekly and 3 monthly snapshots available at any time.
• Staging items (for sorting into other locations) have a snapshot for each of the last two hours only, because those items change frequently and are of low value until considered further.
• Everything else keeps one snapshot from the last hour and each of the last 3 days.

This configuration strikes a balance between my need for accident recovery and the storage and performance costs of keeping snapshots. The frequently-changed items (synchronized with other machines and containing active projects) have a lot of snapshots because most individual files are small but may change frequently, so even a large number of snapshots tends to have modest storage needs. In addition, the chances of accidental data destruction are highest there. The other subvolumes are either more static or lower-value, so I feel little need to keep many snapshots of them.
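For the curious, the retention policy above maps directly onto Snapper’s timeline settings. A minimal sketch of what the configuration for the synchronized-files subvolume might look like (the config name and subvolume path are placeholders, not my actual layout):

```
# /etc/snapper/configs/sync -- hypothetical config for the synchronized subvolume
SUBVOLUME="/srv/sync"        # placeholder path
TIMELINE_CREATE="yes"        # take a snapshot every hour
TIMELINE_CLEANUP="yes"       # prune old snapshots per the limits below
TIMELINE_LIMIT_HOURLY="8"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="4"
TIMELINE_LIMIT_MONTHLY="3"
TIMELINE_LIMIT_YEARLY="0"
```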

I use Crashplan to back up the entire system to their “cloud” service for $5 per month. The rate at which I add data to the system is usually lower than the rate at which it can be uploaded to Crashplan as a backup, so in most cases new data is backed up remotely within hours of being created.

Finally, I have a large USB-connected external hard drive as a local offline backup. It is formatted with btrfs like the server (but with the entire disk encrypted), so I can use btrfs send to write incremental backups to this external disk, even without the ability to send information from the external disk back. In practice, this means I can store the external disk somewhere else completely (possibly without an Internet connection) and occasionally shuttle diffs to it to bring it up to a more recent version. I always unplug this disk from power and its host computer when it is not being updated, so it should only be vulnerable to physical damage and not accidental modification of its contents.

### Synchronization and remotes

For synchronizing current projects between my home server (which I treat as the canonical repository for everything) and other machines, the tools vary according to the constraints of the remote system. Systems that rarely or never leave my network mount volumes from the server over NFS or SMB. For portable devices (laptop computers), Syncthing (running on both the server and the portable device) makes bidirectional synchronization easy without requiring that both machines always be on the same network.

I keep very little data on portable devices that is not synchronized back to the server, but because it is (or, was) easy to set up, I used Crashplan’s peer-to-peer backup feature to back up my portable computers to the server. Because the Crashplan application is rather heavyweight (it’s implemented in Java!) and it refuses to include peer-to-peer backups in uploads to their storage service (reasonably so; I can’t really complain about that policy), my remote servers back up to my home server with Borg.

I also have several Android devices that aren’t always on my home network- these aren’t covered very well by backups, unfortunately. I use FolderSync to automatically upload things like photos to my server, which covers most of the data I create on those devices, but it seems difficult to make a backup of an Android device that includes things like preferences and per-app data without rooting the device (which I don’t wish to do for various reasons).

### Summarizing the status quo

• btrfs snapshots offer quick access to recent versions of files.
• btrfs RAID provides resilience against single-disk failures and easy growth of total storage in my server.
• Remote systems synchronize or back up most of their state to the server.
• Everything on the server is continuously backed up to Crashplan’s remote servers.
• A local offline backup can be easily moved and is rarely even connected to a computer, so it should be robust against even catastrophic failures.

## Evaluating alternatives

Now that we know how things were, we can consider alternative approaches to solve the problem of Crashplan’s $5-per-month service no longer being available. The primary factors for me are cost and storage capacity. Because most of my data changes rarely but none of it is strictly immutable, I want a system that makes it possible to do incremental backups.
This will of course also depend on software support, but it means that I will tend to prefer services with straightforward pricing because it is difficult to estimate how many operations (read or write) are necessary to complete an incremental backup.

Widely known consumer services like Dropbox or Google Drive might be appropriate for some users, but I won’t consider them. As consumer-oriented services positioned for the use case of “make these files available whenever I have Internet access,” they’re optimized for applications very different from the needs of my backups and tend to be significantly more expensive at the volumes I need.

So, the contenders:

• Crashplan for Small Business: just like Crashplan Home (which was going away), but costs $10/mo for unlimited storage and doesn’t support peer-to-peer backup. Can migrate existing Crashplan Home backup archives to Small Business as long as they are smaller than 5 terabytes.
• Backblaze: $50 per year for unlimited storage, but their client only runs on Mac and Windows.
• Google Cloud Storage: four flavors available, where the interesting ones for backups are Nearline and Coldline. Low cost per gigabyte stored, but costs are incurred for each operation and transfer of data out.
• Backblaze B2: very low cost per gigabyte, but incurs costs for download.
• Online.net C14: very low cost per gigabyte, no cost for operations or data transfer in the “intensive” flavor.
• AWS Glacier: lowest cost for storage, but very high latency and cost for data retrieval.

The pricing is difficult to consume in this form, so I’ll make some estimates with an 8 terabyte backup archive. This somewhat exceeds my current needs, so should be a useful if not strictly accurate guide. The following table summarizes expected monthly costs for storage, addition of new data and the hypothetical cost of recovering everything from a backup stored with that service.

| Service | Storage cost | Recovery cost | Notes |
|---|---|---|---|
| Crashplan | $10 | 0 | “Unlimited” storage, flat fee. |
| Backblaze | $4.17 | 0 | “Unlimited” storage, flat fee. |
| GCS Nearline | $80 | ~$80 | Negligible but nonzero cost per operation. Download $0.08 to $0.23 per gigabyte depending on total monthly volume and destination. |
| GCS Coldline | $56 | ~$80 | Higher but still negligible cost per operation. All items must be stored for at least 90 days (kind of). |
| B2 | $40 | $80 | Flat fee for storage and transfer per-gigabyte. |
| C14 | €40 | 0 | “Intensive” flavor. Other flavors incur per-operation costs. |
| Glacier | $32 | $740 | Per-gigabyte retrieval fees plus Internet egress. Reads may take up to 12 hours for data to become available. Negligible cost per operation. Minimum storage 90 days (like Coldline). |

Note that for Google Cloud and AWS I’ve used the pricing quoted for the cheapest regions: Iowa on GCP and US East on AWS.
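For transparency, the arithmetic behind the table is nothing more sophisticated than multiplying a per-gigabyte rate by 8000 gigabytes. A quick sketch, using the rates implied by the table rather than any current price list:

```
# Back-of-envelope check of the table above for an 8 TB (8000 GB) archive.
# Per-gigabyte rates are the ones implied by the table, not current pricing.
SIZE_GB=8000
echo "B2 storage/month:       \$$(echo "$SIZE_GB * 0.005" | bc)"   # ~ $40
echo "B2 full download:       \$$(echo "$SIZE_GB * 0.01"  | bc)"   # ~ $80
echo "Nearline storage/month: \$$(echo "$SIZE_GB * 0.01"  | bc)"   # ~ $80
echo "Coldline storage/month: \$$(echo "$SIZE_GB * 0.007" | bc)"   # ~ $56
echo "Glacier storage/month:  \$$(echo "$SIZE_GB * 0.004" | bc)"   # ~ $32
```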

### Analysis

Backblaze is easily the most attractive option, but the restriction of their client (which is required to use the service) to Windows and Mac makes it difficult to use. It may be possible to run a Windows virtual machine on my Linux server to make it work, but that sounds like a lot of work for something that may not be reliable. Backblaze is out.

AWS Glacier is inexpensive for storage, but extremely expensive and slow when retrieving data. The pricing structure is complex enough that I’m not comfortable depending on this rough estimate for the costs, since actual costs for incremental backups would depend strongly on the details of how they were implemented (since the service incurs charges for reads and writes). The extremely high latency on bulk retrievals (up to 12 hours) and the higher cost for lower-latency reads make it questionable whether it’s even reasonable to do incremental backups on Glacier. Not Glacier.

C14 is attractively priced, but because they are not widely known I expect backup packages will not (yet?) support it as a destination for data. Unfortunately, that means C14 won’t do.

Google Cloud is fairly reasonably-priced, but Coldline’s storage pricing is confusing in the same ways that Glacier is. Either flavor is better pricing-wise than Glacier simply because the recovery cost is so much lower, but there are still better choices than GCS.

B2’s pricing for storage is competitive and download rates are reasonable (unlike Glacier!). It’s worth considering, but Crashplan still wins in cost. Plus I’m already familiar with software for doing incremental backups on their service (their client!) and wouldn’t need to re-upload everything to a new service.

## Fallout

I conclude that the removal of Crashplan’s “Home” service effectively means a doubling of the monthly cost to me, but little else. There are a few extra things to consider, however.

First, my backup archive at Crashplan was larger than 5 terabytes so could not be migrated to their “Business” version. I worked around that by removing some data from my backup set and waiting a while for those changes to translate to “data is actually gone from the server including old versions,” then migrating to the new service and adding the removed data back to the backup set. This means I probably lost a few old versions of the items I removed and re-added, but I don’t expect to ever need any of them.

Second, and more concerning in general, is the newfound inability to do peer-to-peer backups from portable (and otherwise) computers to my own server. For Linux machines that are always Internet-connected, Borg continues to do the job, but I needed a new package that works on Windows. I eventually chose Duplicati, which can connect to my server the same way Borg does (over SSH/SFTP) and will in general work over arbitrarily-restricted Internet connections in the same way that Crashplan did.
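For reference, the Borg side of this is simple enough to sketch. The repository location, paths, and retention numbers below are placeholders for illustration rather than my actual configuration:

```
# Hypothetical Borg setup for a machine backing up to the home server over SSH.
# One-time repository creation (encrypted, keyed by passphrase):
borg init --encryption=repokey ssh://backup@home-server/srv/backups/thismachine

# Periodic incremental backup; only new or changed chunks cross the network:
borg create --stats --compression lz4 \
    ssh://backup@home-server/srv/backups/thismachine::'{hostname}-{now}' \
    /etc /home /srv

# Thin out old archives so the repository doesn't grow without bound:
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
    ssh://backup@home-server/srv/backups/thismachine
```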

## Concluding

I’m still using Crashplan, but converting to their more expensive service was not quite trivial. Backing up to their service is still much less expensive than the alternatives, which means they have significant freedom to raise prices before I would consider some other way to back up my data remotely.

As something of a corollary, it’s pretty clear that my high storage use on Crashplan is subsidized by other customers who store much less on the service; this is just something they must recognize when deciding how to price the service!

# Web history archival and WARC management

I’ve been a sort of ‘rogue archivist’ along the lines of the Archive Team for some time, but generally lack the combination of motivation and free time to directly take part in their activities. That said, I do sometimes go on bursts of archival since these things do concern me; it’s just a question of when I’ll get manic enough to be useful and latch onto an archival task as the one to do. An earlier public example is when I mirrored ticalc.org.

The historical record contains plenty of instances where people maintained copies of their communications or other documentation which has proven useful to study, and in the digital world the same is likely to be true. With the ability to cheaply store large amounts of data, it is also relatively easy to generate collections in the hope of their future utility.

Something I first played with back in 2014 was extracting lists of web pages to archive from web browser history. From a public perspective this may not be particularly interesting, but if maintained over a period of time this data could be interesting as a snapshot of a typical-in-some-fashion individual’s daily life, or for purposes I can’t foresee.

Today I’m going to write a little about how I collect this data and reduce its space requirements. The source code produced by this work can be found on Bitbucket.
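As a taste of what the collection step involves, here is a minimal sketch of pulling recently-visited URLs out of a Firefox profile’s history database; the profile path and one-day window are assumptions for illustration, not necessarily what my actual scripts do:

```
# Hypothetical extraction of the last day's visited URLs from Firefox history.
# Copy the database first, since Firefox keeps places.sqlite locked while running.
cp ~/.mozilla/firefox/*.default/places.sqlite /tmp/places.sqlite
sqlite3 /tmp/places.sqlite "
  SELECT DISTINCT moz_places.url
  FROM moz_historyvisits
  JOIN moz_places ON moz_places.id = moz_historyvisits.place_id
  WHERE moz_historyvisits.visit_date > (strftime('%s','now','-1 day') * 1000000);
" > urls-to-archive.txt
```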

# Copyright is broken

I got a.. “fun” e-mail from mediafire a few weeks ago, saying that one of my files had been suspended due to suspected copyright infringement.

fb-hitler? Oh, that’s some Free Software I wrote. I disputed the claim, simply stating that fb-hitler.tar.bz2 is a piece of software that I created (and thus own the copyright to). As of tonight, I’ve heard nothing back about it, and the file is still inaccessible. Here’s the link to it, for future reference:

http://www.mediafire.com/?mhnmnjztyn3 (.tar.bz2, 477 KB)

And here’s the complete message I got. Notice it somehow got pulled in by somebody looking to get links to Dragonball Z downloads removed, and that the link to fb-hitler itself isn’t even in the (absurdly long) list of URLs given.

Dear MediaFire User:
MediaFire has received notification under the provisions of the Digital Millennium Copyright Act (“DMCA”) that your usage of a file is allegedly infringing on the file creator’s copyright protection.

The file named fb-hitler.tar.bz2 is identified by the key (mhnmnjztyn3).

As a result of this notice, pursuant to Section 512(c)(1)(C) of the DMCA, we have suspended access to the file.

The reason for suspension was:

Information about the party that filed the report:

Company Name: LeakID
Contact Address: 15 bis rue de chateaudun, 02250 La garenne colombes, France
Contact Name: Hervé Lemaire
Contact Phone:
Contact Email: herve.lemaire@leakid.com

Copyright infringement violates MediaFire’s Terms of Service. MediaFire accounts that experience multiple incidents of alleged copyright infringement without viable counterclaims may be terminated.

If you feel this suspension was in error, please submit a counterclaim by following the process below.

Step 1. Click on the following link to open the counterclaim webpage.

Step 2. Use this PIN on the counterclaim webpage to begin the process:

[removed by Tari]

Step 3. Fill in the fields on the counterclaim form with as much detail as possible.

This is a post-only mailing. Replies to this message are not monitored or answered.

So, what of it? In theory, the DMCA is pretty reasonable (discounting the criminalization of DRM circumvention). The safe harbor provisions for hosts are worthwhile, and the takedown process (that is, sending a request to the host) is reasonable. Problem is, it’s been twisted- there’s supposed to be a penalty for requesting the takedown of items that the requestor does not own copyright to, in order to deter trolls. In practice, there is no penalty and content creators go around freely demanding the removal of just about anything, with no repercussions. The automated systems on most service providers now just worsen the problem (though understandably, because the hosts have little ability to fight against the underlying policy that results in these things), because rightsholders can spew all kinds of takedown demands with minimal effort.

For those subject to these takedown demands, it’s unfair because many hosts will deactivate users’ accounts when they receive too many demands for removal of items uploaded by a given user, even if the user proves they have the right to upload the content. For example, YouTube suspends accounts after three copyright-related incidents, no matter the outcome.

To my mind, this situation is unacceptable, and it reeks of the “old media” clutching at straws to prop up an outdated business model, to the detriment of everyone else. As recent history has shown, legal force has little effect on the economic problem of media piracy, and utterly fails to address the economics that lead to this phenomenon (sounds a bit like the US government’s war on drugs, actually..).

Going forward, I would support greatly reducing the length of copyright terms (to somewhere around 20 years, perhaps). While I can’t comment much on what exactly that means to rightsholders and their profits (though I have little sympathy for them, whatever the situation is, due to such things as seen in this post), it would be hugely useful to anybody concerned with preserving history (whom I count myself among), because the time required before something can legally be reproduced without the creator’s consent is greatly reduced. With shorter time spans, it is much less likely that any given piece of content will be lost forever, which is the ultimate result that should be avoided.

Enough of a rant regarding my position on copyright, though. The real point here is that I was annoyed by a spurious copyright claim on something I created, and I will be avoiding mediafire for my future file storage needs (not that I ever used them for much).

# A few small projects

Going through some of my old projects this evening, I came across a couple little tools I wrote.  I’ve uploaded them here in the hope that others will find them useful.  They are the GCNClient GUI and RX BRR calculator.

I make no guarantees of the utility of these pieces of software, but they may be useful as examples in how to perform some task in the .NET framework (both are written in C# for .NET), or just for performing the very specific tasks which they are designed to perform.

# Back to wordpress

After about a year of running a purely static site here, I finally decided it would be worthwhile to move the site backend back to WordPress.

I moved away from WordPress early this year primarily because I was dissatisfied with the theming situation.  While Lightword is certainly a well-designed piece of software and markup, I wanted a system that would be easier to customize.  Because WordPress is written and configured in PHP (a language I don’t know and have little interest in learning), I decided it didn’t offer the easy customizability that I wanted in a web publishing platform, and made the switch to generating the site as a set of static pages with Hyde.  I’ve now decided to make the switch back to WordPress, and the rest of this post outlines my thought process in doing so.

# Archiving

One of the things that I am most concerned about in life is the preservation of information.  To me, destruction of information, no matter the content, is a deeply regrettable action.  Deliberate destruction of data is fortunately rare, but too often it may still be lost, often through simple neglect.  For example, [science fiction author] Charlie Stross, in a recent discussion, noted that the web site belonging to Robert Bradbury had become inaccessible at some point after his death, and Mr. Stross was thus unable to find Bradbury’s original article on the subject of Matrioshka brains.

That comment led me to realize quickly that this great distributed repository of our age (the world wide web) is a frighteningly ethereal thing- what exists on a server at one moment may disappear without warning for reasons ranging from legal intervention (perhaps because some party asserts the information is illegal to distribute) to the death of the author (in which case one’s web hosting may be suspended due to unpaid bills).  Whatever the reason, it is impossible to guarantee that some piece of data will not be lost forever if the author’s copy disappears.

How can we preserve information on the web?  Historically, libraries have filled that role, and in that respect, things haven’t changed that much in the Internet age.  The Internet Archive is a nonprofit organization that works to be like a digital library, and they specifically note that huge swaths of cultural (and other) data might be lost to the depths of time if we do not take steps to preserve it now.  The Internet Archive’s wayback machine (which will probably be familiar to many readers who have needed to track down no-longer-online data) is a continually-updated archive of snapshots of the web.

Crawling is fairly slow, but most pages are eventually found by the wayback machine’s crawlers, so the challenge of data preservation is greatly reduced for site owners: in most cases content only needs to stay online for a short time (probably less than a year) before it is permanently archived.  For non-textual content, unfortunately, the wayback machine is useless, since it will only mirror web pages and not images or other non-textual content.  To ensure preservation of non-textual content, however, the solution is also rather easy: upload it to the Internet Archive.  It’s not automatic like the wayback machine, but the end result is the same.

# Back to WordPress

This brings me back to my choice of using WordPress to host this web site, rather than a solution that I develop and maintain.  Quite simply, I decided that it is more important to get information I produce out in public so it can be disseminated and archived, rather than maintain fine-grained control over the presentation of the information.

While with Hyde I was able to easily control every aspect of the site design and layout, it also meant that I had to write much of the software to drive any additional features that might improve searchability or structure of the content.  When working with WordPress (or any out-of-the-box CMS, really), however, I can concern myself with the things that are of real importance- the data- and let the presentation mostly take care of itself.

While Hyde put up barriers to disseminating information (the source being decoupled from presentation and requiring offline editing, for example), my new-old out-of-the-box CMS solution in WordPress makes it extremely easy to publicize information without getting tied up in details which are ultimately irrelevant.

## Filtering oneself

With ease of putting information out in public comes the challenge of searching it.  I try to be selective about what I make public, partially because I tend to be somewhat introverted, but also in order to ensure that the information I generate and publicize is that which is of interest to people in the future (although it seems I was only doing the latter subconsciously prior to now).  There are platforms to fill with drivel and the day-to-day artifacts of life, but a site like this is not one of them- Twitter, Facebook, and numerous other ‘social’ web sites fill that niche admirably, but can never replace more carefully curated collections.

## Ephemera

Preservation of ephemera is at the core of some of the large privacy concerns in today’s world.  Companies such as Facebook host huge amounts of arguably irrelevant content generated by their users, and mine the data to generate profiles for their users.  On its surface, this is an amazing piece of work, because these companies have effectively constructed automated systems to document the lives of everybody currently alive.  Let that sink in for a moment: Facebook is capable of generating a moderately detailed biography for each of this planet’s 7 billion people (provided they each were to provide Facebook with some basic data).

What would you do with a biography of someone distilled from advertising data (advertising data because that’s what Facebook exists to do- sell information about what you might like to buy to advertisers)?  I don’t know, but the future has a way of finding interesting ways to use existing data.  In some distant future, maybe a project might seek to reconstruct (even resurrect, by a way of thinking) everybody who ever lived.  There are innumerable possibilities for what might be done with the data (this goes for anything, not just biographical data like this), but it becomes impossible to use it if it gets destroyed.

The historical bane of all archives has been capacity.  With digital archives, this is a significantly smaller problem.  With multi-terabyte hard disks costing on the order of $0.10 per gigabyte and solid-state memory continuing to follow the pace of Moore’s law (although probably not for much longer), it is easier than ever to store huge amounts of information, dwarfing the largest collections of yesteryear.  As long as storage capacity continues to grow (we’ve only recently scratched the surface of using quantum phenomena (holography) for data storage, for example), the sheer amount of data generated by nearly any process is not a concern.

# Back on topic

Returning from that digression, the point of switching this site back to a WordPress backend is to get data out to the public more reliably and faster, in order to preserve the information more permanently.  What finally pushed me back was a sudden realization that there’s nothing stopping me from customizing WordPress in a similar fashion to what I did on the Hyde-based site- it simply requires a bit of experience with the backend code.  While PHP is one language I tend to loathe, the immediate utility of a working system is more valuable than the potential utility of a system I need to program myself.

There’s another lesson I can derive from this experience, too: building a flexible system is good, but you should distribute it ready-to-go for a common use case.  Reducing the barrier to entry for a tool can make or break it, and tools that go unused are of no use- getting people to use a new creation is the primary barrier to progress.