Sync via FTP

Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:

Ideally, I'd be doing this via rsync, but my webhost doesn't have sshd or rsyncd running. I contemplated setting up an rsync daemon in perl, but I think that that's probably more work than the alternative... which is to write a synchronisation tool in perl that syncs via FTP.

Unlike the rsync project, I don't think that "minimal transfer" (as in, parts of files) is really that important. Largely, I just need to check if a file locally has been added or changed (and thus need to be uploaded), or deleted (and thus needs to be deleted on the server). I think comparing timestamps should be sufficient here, as long as I can remember to convert the remote time stamp to GMT in its own timezone (which is different from mine). I don't think this will be hard. It may upload a bit more than needed, but I'd rather upload an extra meg of data than all 50MB (pics, PDFs, etc.) each time. And I want it automated just because I'm absolutely terrible about remembering what I've changed, whether that's new files, modifications, or deletions. I figure that if I can teach the computer how to do it, then I don't have to remember the minutiae that I know I'm horrible at. First, of course, I have to teach myself how to do it ;-)

The algorithm I'm trying to wrap my head around at the moment is how best to approach the recursive matching on two machines in parallel - one local, one remote. Including deleting entire trees from the remote system, e.g., if I have a /www/pics/2005/01 directory, and I delete the www/pics/2005 directory locally (no longer want to show those pics), I need to go into all the subdirs of www/pics/2005 and delete their contents so I can delete the directories, recursively, but in reverse order (I know there are terms for pre- and post-something, but I don't remember the terms, only that they'd apply here in describing how I'd scan to the bottom and only do the deletion after returning from the lower level of recursion). Obviously, handling local directories is easy with File::Find - but I'm not really sure how to best approach it when I'm trying to sync two different directories.

Perhaps I should practice just by syncing to another local directory - translating the "destination" from readdir/glob to Net::FTP is probably easier than this - at least to me.

Note: in my case, the local is considered the master for all directories with some exceptions (e.g., software that gets installed by the webhost's control panel). Any directory on the webhost that is neither local nor in the exception list (hash) will be removed.

Something tells me that I'm just missing something simple to make the whole thing "click" in my head... either that, or I'm just completely out to lunch and am missing the entirety of the concept. Hopefully another monk will be able to tell me which ;-)

Comment on Sync via FTP

Replies are listed 'Best First'.
Re: Sync via FTP by bart (Canon) on Mar 10, 2007 at 20:15 UTC
e.g., if I have a /www/pics/2005/01 directory, and I delete the www/pics/2005 directory locally (no longer want to show those pics), I need to go into all the subdirs of www/pics/2005 and delete their contents so I can delete the directories, recursively, but in reverse order Just as a starting point, I'd like to point out that you can implement the logic for the filesystem, with File::Find, if you just use the function `finddepth`. That function will call the callbacks first for the deepest nesting level, and handle the parent directory later. So, perhaps you could reimplement File::Find but for working over Net::FTP, instead of on the local filesystem?	[reply] [d/l]
Re: Sync via FTP by Rhandom (Curate) on Mar 11, 2007 at 04:34 UTC
I needed something like this the other day. I used to use a program that came with KDE - but I couldn't find it. So I whipped up the following program. Perhaps I should CPAN it - but I don't know if it is worthy of going to that length. Anyway - it should be pretty easy to setup - just change the SRC_DR, DEST_DIR, DEST_HOST, and DEST_USER variables. You will be prompted for a password. Enjoy. #!/usr/bin/perl =head1 NAME ftp_sync.pl - Sync a directory with an ftp server =cut use strict; use warnings; use Net::FTP; use IO::Prompt qw(prompt); use File::Find qw(find); #use CGI::Ex::Dump qw(debug); sub debug { require Data::Dumper; print Data::Dumper::Dumper(@_) } ### times to run the daemon our $SRC_DIR = '/home/paul/somesite.com'; # source directory - full + path - no trailing slash our $DEST_DIR = ''; # destination directory - fu +ll path - no trailing slash our $DEST_HOST = 'somesite.com'; # remote hostname our $DEST_USER = 'someuser'; # remote username our $EXCLUDE = [ # list of files to exclude ( +that work with rsync patterns) ]; ###----------------------------------------------------------------### main(); sub main { ### setup config my $config = { src_dir => $SRC_DIR, dest_dir => $DEST_DIR, host => $DEST_HOST, user => $DEST_USER, exclude => $EXCLUDE, @ARGV, }; ### these must end with slash $config->{'dest_dir'} =~ s\|/+$\|\|; $config->{'src_dir'} =~ s\|/+$\|\|; $\| = 1; my $time = time; my $pass = $config->{'pass'} \|\| prompt("Please enter the password +for $config->{'user'}\@$config->{'host'}: ", -e => ''); print "Connect in...\n"; my $f = Net::FTP->new($config->{'host'}, Passive => 1, Debug => 0) + \|\| die "Couldn't ftp to $config->{'host'}: $@"; print "Logging in...\n"; $f->login($config->{'user'}, $pass) \|\| die "Couldn't login to $con +fig->{'host'}: ".$f->message; $f->binary; my $files = {}; ### grab local file list find(sub { my $file = $File::Find::name; $file =~ s\|^\Q$config->{'src_dir'}\E/\|\|; return if ! length($file); return if $file =~ /\bCVS\b/; my ($size, $mtime) = (stat $_)[7, 9]; $files->{$file} = { local_is_dir => (-d _ ? 1 : 0), local_size => $size, local_mtime => $mtime, }; }, $config->{'src_dir'}); #debug $files; #return; ### get remote files print "Getting files from \"$config->{'dest_dir'}\"\n"; my $get_files; $get_files = sub { my ($dir) = @_; my $old_dir = $f->pwd; $f->cwd($dir); my @files = $f->dir; foreach my $line (@files) { #drwx------ 3 ftpuser www 4096 Jun 9 2006 pe +rl5lib", #-rw------- 1 ftpuser www 4697 Dec 18 09:17 ph +oto2.html", next if $line =~ /^total\s+\d+$/; my ($perm, $i, $user, $grp, $size, $jon, $day, $yearhour, +$file) = split(/\s+/, $line, 9); next if $file =~ /^\.\.?$/; debug $line if ! $file; my $is_dir = $perm =~ /^d/ ? 1 : 0; my $full = "$dir/$file"; $full =~ s\|^\Q$config->{'dest_dir'}\E\|\|; $full =~ s\|^/+\|\|g; print "$full \r +"; my $ref = $files->{$full} \|\|= {}; $ref->{'remote_size'} = $size; $ref->{'remote_is_dir'} = $is_dir; if ($ref->{'local_size'} # only bother to get the mdtm if +the sizes are the same && ! $ref->{'remote_is_dir'} && $ref->{'local_size'} == $ref->{'remote_size'}) { $ref->{'remote_mtime'} = $f->mdtm($file); } if ($is_dir) { $get_files->("$dir/$file"); } } $f->cwd($old_dir); }; $get_files->($config->{'dest_dir'}); print "\n"; #debug $files; #return; ###--------------------------------------------------------------- +-### print "Syncing files...\n"; ### remove files that don't belong foreach my $full (reverse sort keys %$files) { # longest first wil +l remove files in a dir before the dir my $info = $files->{$full}; my $needs_update; my ($path, $file) = $full =~ m\|^(.?)([^/]+)$\| ? ($1, $2) : do + { debug $full; die "Bad file \"$full\"" }; $path =~ s\|/+$\|\|; my $cd = $config->{'dest_dir'} .'/'. $path; if ($info->{'remote_is_dir'} && ! $info->{'local_is_dir'}) { print "Removing dir $full\n"; $f->cwd($cd) \|\| die "Couldn't cwd to $cd: ".$f->messag +e; $f->rmdir($file) \|\| die "Couldn't remove $full: ".$f->mess +age; } elsif (! $info->{'remote_is_dir'} && defined $info->{'remote_size'} && ! defined $info->{'local_size'}) { print "Removing file $full\n"; $f->cwd($cd) \|\| die "Couldn't cwd to $cd: ".$f->messa +ge; $f->delete($file) \|\| die "Couldn't delete $full: ".$f->mes +sage; } } ### add or update files that are out of date foreach my $full (sort keys %$files) { # shortest first will creat +e dirs before files my $info = $files->{$full}; my $needs_update; my ($path, $file) = $full =~ m\|^(.?)([^/]+)$\| ? ($1, $2) : do + { debug $full; die "Bad file \"$full\"" }; $path =~ s\|/+$\|\|; my $cd = $config->{'dest_dir'} .'/'. $path; if ($info->{'local_is_dir'} && ! $info->{'remote_is_dir'}) { print "Creating dir $full\n"; $f->cwd($cd) \|\| die "Couldn't cwd to $cd: ".$f->messag +e; $f->mkdir($file) \|\| die "Couldn't mkdir $full: ".$f->messa +ge; } elsif (! $info->{'local_is_dir'}) { my $src = $config->{'src_dir'} .'/'. $full; if (defined $info->{'local_size'} && $info->{'local_size'} != ($info->{'remote_size'} \|\| + 0)) { print "". ($info->{'remote_size'} ? "Updating" : "Crea +ting")." file $full (because of size difference)\n"; $f->cwd($cd) \|\| die "Couldn't cwd to $cd: ".$f +->message; $f->put($src, $file) \|\| die "Couldn't put $full: ".$f- +>message; } elsif (defined $info->{'local_mtime'} && defined $info->{'remote_mtime'} && $info->{'local_mtime'} > $info->{'remote_mtime +'}) { print "Updating file $full (because of modtime)\n"; $f->cwd($cd) \|\| die "Couldn't cwd to $cd: ".$f +->message; $f->put($src, $file) \|\| die "Couldn't put $full: ".$f- +>message; } } } my $elapsed = time - $time; print "Done in $elapsed seconds\n"; } ###----------------------------------------------------------------### [download] Sorry there isn't a little more commenting - I mainly was just using it for myself - but I figured it may help you out also. my @a=qw(random brilliant braindead); print $a[rand(@a)];	[reply] [d/l]
Re: Sync via FTP by graff (Chancellor) on Mar 11, 2007 at 07:45 UTC
I'd be doing this via rsync, but my webhost doesn't have sshd or rsyncd running. I'm guessing you don't have login shell access on the remote server, since this would most likely mean that sshd would be available. Maybe you've got the wrong webhost, unless they can tell you specifically what sort of secure login / file-transfer protocol they support -- i.e. one that is not anonymous and does not involve the transmission of login transactions (username / password) as clear text over the public network. Maybe there's a way that this is done without using sshd? (I don't know -- everything I've come to be familiar with, ever since login sniffers became a known problem, has used sshd one way or another.) It's not so much a matter of protecting your data; it's a matter of protecting access to the storage -- i.e. the service provider should be at least as worried about this as any customer would be. If they don't care about it, they might not be worth whatever you may be paying for their service. (If the service is free, it might just be worth what you're paying for it.) That said, I'm wondering how closely you've looked at the problem you've described. In terms of using file dates as listed by the remote ftp server (and doing time-zone conversions on that), bear in mind that, depending on the remote OS, the Net::FTP "dir" method might return different strings for the same file over time. Unix(-like) machines will often show different date information (year vs. hour:minute) depending on whether the file is more than 6 months old. So today (March 11 2007, say) Net::FTP might show a file date as "Oct 24 14:45" (meaning Oct. 2006 -- but if you just pass that quoted string to a module like Date::Manip, it'll treat it as Oct. 2007). A month or so from now, it will be shown as "Oct 24 2006". On top of that, ftp transfers cannot force a copied file to assume the modification date of the original -- when you upload a file to the server, the file date on that copy reflects the time of transfer, not the modification time of the file on your local machine. This, combined with what appears to be lax security on the part of the service provider, suggests that when the remote file is more recent than the local, this might not be sufficient evidence that the remote version is identical to your local file. (Having the same size might not be sufficient proof, either.) So maybe you need to maintain a list of paths, names, dates and checksums, for both local and remote data sets. I would start by using the standard unix "find" utility to scan the local tree, list all the data files, and produce md5 checksums for them -- best done with a single command line: `find top_level -type f -print0 \| xargs -0 md5sum \| perl -lape 's/ /$d=(-M $F[1])360024;sprintf(" %9d %s ",-s _,sc +alar(localtime($^T-$d)))/e' > master.list` [download] (update: fixed error involving the return value of -M) (If you're local box is ms-windows and you haven't got unix tools installed yet, just download cygwin or uwin to get "find" and "xargs".) Note that the perl one-liner there is adding date and file size information, and doing the date string that way will always produce a consistent date format. (You could elaborate that a little, adding Date::Calc or some such to output dates that reflect the timezone of the remote server.) Next you would need a perl script to load that list and the list of "known exceptions", and then uses Net::FTP to scan the the remote directory tree recursively. Assuming the lists are hashes keyed by path/file.name, you can easily identify files on the server that shouldn't be there (not found in the hash lists), and local files that don't exist on the master (in the hash list, but not on the remote machine). For files with versions on both machines, you have more work to do. Using the ftp "dir" output to compare dates and sizes might suffice, especially if you've kept track of dates and times when you uploaded each of the data files (so you can see if any file date changes after you upload it). But I won't go into that, because this reply is already too long, and you really should just work on getting access to rsync or sshd on the server, even if it means switching providers.	[reply] [d/l]
Re: Sync via FTP - wget by varian (Chaplain) on Mar 11, 2007 at 07:56 UTC
Although a nice challenge for a Perl program this problem has been solved before. You may want to have a look at the open source program wget and its option -mirror that is able to mirror a directory tree located on a remote system provided that system has an ftp server running. Most distributions ship wget as a standard utility so it's likely on your system already. As usual you can embed wget in a Perl program to feed it directory trees to mirror etc. ;-)	[reply]
Re: Sync via FTP by clinton (Priest) on Mar 11, 2007 at 10:19 UTC
I haven't tried it, but there is an article on Linux Journal with a perl script called ftpsync, which may save you an hour or two.	[reply]
Re: Sync via FTP by Daneel S. Yaitskov (Initiate) on Aug 29, 2011 at 10:19 UTC
turbo-ftp-sync perl script (cpan Net::FTPTurboSync) is close to your demands. It doesn't reload entire folder tree every time. It is stable before often network denials because it remembers info about already uploaded files and folders.	[reply]