This blog post by Mark Pilgrim, along with a few other incidents, got me thinking about backups again. I'm doing OK, but I can do better.
The thing that makes it all work is that the cost-per-byte of hard disk storage is declining at a constant rate of 50%/year -- and has done so pretty consistently for the last half-century. That's not to say that we won't hit a hard limit at some point, but I'd guess we're good for another 15 years at least. My data requirements aren't growing nearly that fast. So whenever I start getting low on disk, I buy two or three of whatever's cheap at Fry's. The "let's try to move this obsolete thing" price seems to be about $89 (presently applied to drives in the neighborhood of 250GB). (There's a second tier of "there's just a few of these left and we need room in the cage" pricing, at $69. I'll often buy my third drive at that price.)
The effect of this pricing is that, if I buy new disks every few years for a roughly constant average annual investment, I can make a clean copy of everything I've accumulated so far and have room for twice as much going forward.
Everything in the house worth backing up is on the fileserver. This contains two identical disks, and I have a cron job that copies the main one to the backup every night using rsync. The backup drive stays spinning, but it's mounted read-only except when I'm writing to it. It's partitioned identically to the main drive, so it's basically a hot spare.
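The nightly job amounts to something like the following sketch; the mount point, the partitions being mirrored, and the rsync flags are my own illustration, not the actual script.

    #!/bin/sh
    # nightly-mirror (hypothetical): remount the spare read-write, mirror the
    # main drive's partitions onto it, then drop it back to read-only.
    # run from /etc/crontab, e.g.:  30 3 * * * root /usr/local/sbin/nightly-mirror
    mount -o remount,rw /backup
    rsync -aH --delete /home/ /backup/home/
    rsync -aH --delete /etc/  /backup/etc/
    mount -o remount,ro /backup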
Once a week (I have a calendar entry to remind me on Sunday morning), I back up everything I care about to an external USB drive (on my desktop machine) that's turned off except when I'm writing to it. This longer-term backup is just the home directories, /etc from the servers, and the other shared filesystems. Everything that can be easily re-installed (root, /usr, most of /var, and things like my collection of Linux distribution ISOs) is left off, so it's pretty small.
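The weekly run is just another rsync pass, limited to the directories worth keeping; roughly like this, with a mount point, hostname, and paths that are placeholders of mine:

    #!/bin/sh
    # weekly-backup (hypothetical): the USB drive is powered up and mounted
    # only for the duration of the run.
    mount /mnt/usb
    rsync -aH --delete /home/            /mnt/usb/weekly/current/home/
    rsync -aH --delete /shared/          /mnt/usb/weekly/current/shared/
    rsync -aH --delete fileserver:/etc/  /mnt/usb/weekly/current/etc/
    umount /mnt/usb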
Before backing up onto the external drive I make a copy of the previous weekly backup, using cp -l, which makes hard links to ordinary files instead of copying them. I save four of these. After making the backup, if it's the first backup of the month, I make yet another copy and save it "forever".
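In script form the rotation looks roughly like this; the directory names are mine, and cp gets -a along with -l so that it recurses and preserves ownership and timestamps:

    # shuffle the old weekly snapshots, then hard-link a copy of last week's
    # backup; unchanged files cost no extra space
    cd /mnt/usb/weekly
    rm -rf week.4
    mv week.3 week.4
    mv week.2 week.3
    mv week.1 week.2
    cp -al current week.1
    # ...then run the weekly rsync onto current/ as above; rsync re-creates
    # any file that changed, so the links in week.1 keep last week's versions
    # first run of the month (day of month 7 or less): keep a copy "forever"
    [ "$(date +%d)" -le 7 ] && cp -al current "../monthly/$(date +%Y-%m)"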
Every few years a drive starts acting up, at which point I replace it (and its mirror, if it was the main drive) with whatever's available that year at the usual price.
The current setup, with "hot" daily mirroring and "cold" weekly backups and monthly archives, works pretty well. It isn't disaster-proof, though.
The next thing that has to be done is to disaster-proof the system with off-site backups. An archive copy in the vault sounds good, until you stop to wonder just how many hard drives you can cram into a safe-deposit box -- and what shape they'd be in after a decade or so of disuse. No, the only safe thing is to put the offsite backups on another live system. Preferably one that someone else is backing up.
Unfortunately, network bandwidth is not doubling every year. The only way to bootstrap the process is to hand-carry another hard drive. OK, I can do that. After that, compressed incremental backups will probably work, at least for a few years. I may still have to hand-carry the occasional large file.
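Once that first copy is in place, keeping it current should be a single compressed rsync over ssh; the host and paths here are invented for the example:

    # incremental offsite sync: only changed data crosses the wire, compressed
    rsync -azH --delete /home/ offsite-box:/backups/my-home/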
There are three natural places to put the offsite backups: my ISP (which has limited space), a hosting service (which I don't have), or an external drive on my desk at work. The security problems are obvious: do I really want to trust somebody else with my private files, some of which are very private? And work has some potential legal problems.
The solution is encryption. I'm going to use the time-tested technique of encrypting every file (plus a header) separately, using its MD5 hash as the key. I'll use the SHA-1 hash of the resulting encrypted "blob" as the filename. Each directory will be represented by a list of filenames, metadata (permissions, owner, group, and mod date), key, and blob ID, stored in a file (blob) of its own. The astute reader may notice a distinct resemblance to the Linux kernel-hackers' version-control system, git. That's deliberate, though it's not entirely clear at this point just how much of git I'll want to use. I will probably use the same headers.
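A back-of-the-envelope version of the blob store, in shell, looks like this. The cipher choice is mine, and the header that gets prepended before encryption is left out; the real thing will presumably follow git's conventions.

    #!/bin/sh
    # store-blob (hypothetical sketch): encrypt one file with a key derived
    # from its own contents, and file it away under the hash of the ciphertext.
    f="$1"
    key=$(md5sum "$f" | cut -d' ' -f1)           # 128-bit key: MD5 of the plaintext
    openssl enc -aes-128-cbc -K "$key" -iv 00000000000000000000000000000000 \
        -in "$f" -out /tmp/blob.$$
    id=$(sha1sum /tmp/blob.$$ | cut -d' ' -f1)   # blob ID: SHA-1 of the ciphertext
    mkdir -p blobs
    mv /tmp/blob.$$ "blobs/$id"
    echo "$f $key $id"    # the directory blob would also record the metadata

Because the key and IV are fixed by the file's contents, identical files always encrypt to the identical blob, so duplicates store themselves only once.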
Currently a handful of personal configuration files, bookmarks, public keys, and so on are backed up in multiple places because I mirror them between home and work, and occasionally copy them to laptops and flash drives. I need to go through my home directories and make sure I move all of the sensitive ones to the encrypted blob system, at which point they'll stay encrypted all the time.
A lot of what I work on is public, or at least semi-public: websites, recorded songs, and the like. That gets offsite "backups" automatically, but it needs a little more work.
Public stuff is already "backed up" pretty well, just by being published. The problem is that the published copies carry no history. The sources are under version control (presently CVS), though some of the larger files, notably the audio files, aren't; that isn't much of a problem. The bigger problem is that the repository itself isn't mirrored off-site.
The obvious solution is to move to a version-control system that handles large binary files gracefully, and that makes the entire repository accessible via HTTP. Not surprisingly, git has precisely these two characteristics. So that's where I'm headed.
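Making a repository fetchable over plain HTTP is mostly a matter of putting a bare copy under the web server's document root and keeping its index files current; a sketch, with paths of my own choosing:

    # publish a bare repository under the web root
    git clone --bare myproject /var/www/git/myproject.git
    cd /var/www/git/myproject.git
    git update-server-info                       # writes info/refs for HTTP clients
    echo 'exec git update-server-info' > hooks/post-update
    chmod +x hooks/post-update                   # refresh it after every push

After that, the offsite machine (or anyone else) can simply clone or pull from http://wherever/git/myproject.git.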
Audio files will mostly stay out of version control, mainly because I want the ability to withdraw them in favor of better versions. And there will be some additional complications because I keep my own copies of other people's lyrics, cached copies of copyrighted articles, and the like. The solution is probably to make them, and their corresponding blobs, inaccessible from the web by changing their permission bits appropriately.
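Since the web server reads files as an unprivileged user, that probably amounts to no more than clearing the world-read bit (directory names here are invented):

    # keep cached copies of other people's material away from the web server
    chmod o-r lyrics/* cached-articles/*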