Hi all,

I'm currently working on a method for making snapshots of directories. It copies a given directory into a snapshots directory, under a subdirectory named according to the date and time, which is useful for making periodic backups. The cunning thing is that if you snapshot the same directory twice, the second snapshot takes up little or no extra space. It's a lot like, and inspired by, the 'rsync snapshots' technique (see Google).
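The space saving comes from hard links: a second snapshot of an unchanged file is just another directory entry pointing at the same inode. Here's a minimal sketch in Python (illustrative only, not part of the scripts; all file names are made up) of what that means on disk:

```python
import os
import tempfile

# Illustrative only: hard-linked snapshot entries share one inode, so
# a second snapshot of an unchanged file costs no extra data blocks.
tmp = tempfile.mkdtemp()
content = os.path.join(tmp, "content")     # the single stored copy
with open(content, "w") as f:
    f.write("some file data\n")

snap1 = os.path.join(tmp, "snap1")
snap2 = os.path.join(tmp, "snap2")
os.link(content, snap1)   # entry in the first snapshot
os.link(content, snap2)   # entry in the second snapshot

st = os.stat(content)
print(st.st_nlink)                          # 3 names, one inode
print(os.stat(snap1).st_ino == st.st_ino)   # True
print(os.stat(snap2).st_ino == st.st_ino)   # True
```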

This method uses a couple of Perl scripts and improves slightly on rsync snapshots by using SHA1 hashes of file contents to ensure that there is only ever one stored copy of a given file's content. There's more information here on my webpage, including downloads of the scripts.
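The de-duplication works by naming each stored copy after the SHA1 of its content, and the scripts fan the 'hashes' directory out into two levels of subdirectories so no single directory accumulates a bazillion files. A rough Python equivalent of that path scheme (the function name is mine, not from the scripts):

```python
import hashlib

def path_to_hash(hexdigest):
    # Split the hex digest the way the scripts do: first two chars,
    # next two chars, then the remainder, giving a two-level fan-out,
    # e.g. "abcd1234..." -> "ab/cd/1234...".
    return f"{hexdigest[:2]}/{hexdigest[2:4]}/{hexdigest[4:]}"

h = hashlib.sha1(b"example content").hexdigest()
print(path_to_hash(h))
```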

At the moment I'm trying to think of a good way of handling MacOS X resource forks and of doing the snapshots to a remote (Linux) machine. I can transfer files using xtar, but that wouldn't be as efficient as rsync for doing updates...

Here's the code: catalog...

#!/usr/bin/perl
use Digest::SHA1 qw(sha1_hex sha1);

# Return the SHA1 hex digest of a file's contents.
sub digest {
    my $line = shift;
    my $sha = new Digest::SHA1;
    open IN, "<", $line or die "\nfailed to open $line for reading\n";
    binmode IN;
    $sha->addfile(*IN);
    my $hex = $sha->hexdigest;
    close IN;
    return $hex;
}

# For each filename on STDIN, print its stat() fields, content hash
# (or '-' for non-plain files) and name.
while (<STDIN>) {
    chomp;
    $line = $_;
    @a = stat($line);
    if (-f $line) {
        $hex = digest($line);
    } else {
        $hex = '-';
    }
    print "@a $hex $line\n";
}

snap... (the main script)

#!/usr/bin/perl
# Snapshot a list of files. We must be passed a location for the
# snapshots. The catalog data are read from STDIN and written to the
# CATALOG file in the snapshot.

sub makeDir {
    my $path = shift;
    my $u = umask(0);
    my $ret = mkdir $path;
    umask($u);
    return $ret;
}

# This function returns the path to the file's content hash (ie the
# file in 'hashes' containing the same data) given the hash. Path is
# relative to 'hashes'. We may want to subdivide these you see, so we
# don't end up with a bazillion files in that directory.
sub pathToHash {
    my $h = shift;
    my $path = shift;
    $h =~ /(..)(..)(.*)/;
    makeDir("$path/$1") or die "Couldn't write to hashes dir $path/$1!"
        unless -d "$path/$1";
    makeDir("$path/$1/$2") or die "Couldn't write to hashes dir $path/$1/$2!"
        unless -d "$path/$1/$2";
    return "$path/$1/$2/$3";
}

$snaps = $ARGV[0];
print STDERR "Snapshotting to $snaps...\n";

# make the snapshot dir
die "Can't find snapshot dir $snaps" unless -d $snaps;
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime;
$year += 1900;
$mon++;
makeDir "$snaps/$year";
makeDir "$snaps/$year/$mon";
makeDir "$snaps/$year/$mon/$mday-$hour-$min-$sec";
$snappath = "$snaps/$year/$mon/$mday-$hour-$min-$sec";

die "Can't find hashes dir" unless -d "$snaps/hashes";

# The catalog is important as it records correct permissions and
# owners. These are not necessarily correct in the files since they
# are linked to the hashes.
open OUT, ">$snappath/CATALOG" or die "failed to open $snappath/CATALOG";

while (<STDIN>) {
    # dump the catalog entry to the catalog.
    print OUT $_;

    # Parse the catalog entry: 13 stat() fields, the hash, the name.
    @a = /(\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) (\d+) ([\da-z]+|-) (.*)/;
    print "Snap: $a[14]\n";
    $name = $a[14];
    # Trim leading dot
    $name =~ s/^\.//;

    # now, what we do depends on the file type...
    if (-d $a[14]) {
        # make dir in the snapshot directory
        makeDir "$snappath/$name";
    } elsif (-f $a[14]) {
        # See if its hash file exists; store the content if not
        $pth = pathToHash($a[13], "$snaps/hashes");
        system "cp", $a[14], $pth unless -f $pth;
        # system "ln", $pth, "$snappath/$name";
        link $pth, "$snappath/$name";
    } else {
        # Just copy it across using the cp command (it must be a
        # device or some other special file)
        system "cp", $a[14], "$snappath/$name";
    }
}
close OUT;

# After generating the snapshot we should mark everything read-only.
system "chmod", "-R", "-w", "$snappath";

# If we are passed a second argument it's a label and we do this...
if ($ARGV[1]) {
    $label = "$snaps/$ARGV[1]--$year--$mon--$mday--$hour-$min-$sec";
    symlink $snappath, $label;
}

And a shell script to tie it all together (snapshot)...

#!/bin/sh
# Snapshot current directory.
# This takes 1 or 2 args. The first one is the path to the snapshots DIR.
# The 2nd (optional) is a label for the snapshot if desired, which will
# create a symbolic link at the root of the snapshots DIR.

# a simple pipeline. Find is used to generate a filename list
find . | catalog | snap $1 $2

UPDATE: I've modified it to use the link and symlink Perl functions instead of "system 'ln'...". It should use File::Copy too.

-- David

Replies are listed 'Best First'.
Re: Efficient Directory snapshots
by belg4mit (Prior) on Aug 19, 2004 at 14:39 UTC
    Why not some simplifications, like File::Find instead of a shell script with find, and using built-ins such as symlink instead of system calls to ln? If you're feeling ambitious you could even use File::Find with the built-in chmod to replace chmod -R. There's also File::Copy, etc. etc.

    --
    I'm not belgian but I play one on TV.