Re: Best practices for file synchronization? (Mod time vs. contents compare)
by samtregar (Abbot) on Jun 12, 2006 at 17:52 UTC
|
A combination of modification-time and file-size comparison is pretty strong, but it won't notice all potential changes. Files can be back-dated and changes to file attributes may or may not trigger a modification-time increase.
I think you should at least try using MD5s before concluding that they're too slow. Digest::MD5 is pretty fast, in my experience.
-sam | [reply] |
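For anyone who wants to try this before ruling it out, here is a minimal sketch of hashing a file with Digest::MD5. The helper names file_md5 and same_contents are made up for illustration; new, addfile and hexdigest are the module's own methods.

```perl
use strict;
use warnings;
use Digest::MD5;

# file_md5 is a hypothetical helper name, not something from this thread.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;    # matters on Windows, where text mode translates line endings
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# Two files are treated as identical when their digests match.
sub same_contents {
    my ( $file_a, $file_b ) = @_;
    return file_md5($file_a) eq file_md5($file_b);
}
```

Note that this reads every byte of each file, so the cost is dominated by disk or network reads rather than by the hashing itself.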
He's doing "one-way synchronization of directory trees In Windows". If that means he's doing archiving, he only needs to compute the MD5 of the archived file once. That means that MD5 requires that one file be read fully, whereas File::Compare requires two files to be read in part or in full.
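A rough sketch of that difference, assuming the digest of each archived file is kept in a hash keyed by source path (how that hash gets saved between runs is left out). The helper names and the hash layout are hypothetical; compare() is File::Compare's function and returns 0 for equal files, 1 for different files and -1 on error.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Compare qw(compare);

# Hypothetical helper: reads only the source file and checks it against
# the digest recorded when the file was last archived.
sub needs_archiving {
    my ( $path, $known ) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    my $old = $known->{$path};
    my $changed = ( !defined $old || $old ne $digest ) ? 1 : 0;
    return ( $changed, $digest );
}

# For contrast: File::Compare has to open both files.
sub differs_on_disk {
    my ( $src, $dst ) = @_;
    my $result = compare( $src, $dst );
    die "Cannot compare $src and $dst: $!\n" if $result < 0;
    return $result == 1 ? 1 : 0;
}
```

The caller would store the returned digest in the hash only after the copy has succeeded, so a failed copy is retried on the next run.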
| [reply] |
OK, so I benchmarked File::Compare vs. MD5 digests vs. modification times.
After four runs, I got an average time of about 42 seconds using full comparison, about 48 seconds for MD5 (calculating for just the source, as the destination presumably already has one), and about 11 seconds for just modification times using stat().
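For anyone wanting to repeat a comparison like this, here is a rough sketch using the core Benchmark module. It is not the code that produced the numbers above; the function name bench_strategies and the per-file setup are assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use Digest::MD5;
use File::Compare qw(compare);

# Hypothetical helper: times the three strategies on one source/destination
# pair and returns the hash of Benchmark objects from timethese().
sub bench_strategies {
    my ( $src, $dst, $count ) = @_;
    return timethese(
        $count,
        {
            compare => sub { compare( $src, $dst ) },
            md5     => sub {
                open my $fh, '<', $src or die "Cannot open $src: $!\n";
                binmode $fh;
                Digest::MD5->new->addfile($fh)->hexdigest;
            },
            mtime => sub { ( stat $src )[9] > ( stat $dst )[9] },
        },
        'none'    # suppress timethese's own output
    );
}

# Usage (paths are placeholders):
# cmpthese( bench_strategies( 'G:/contingency/a.doc', 'D:/contingency/a.doc', 200 ) );
```

Repeated runs over the same files mostly hit the operating system's file cache, so numbers from a loop like this tend to flatter the two approaches that read file contents.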
I think I am just going to go with modification time.
Thanks,
T.
_________________________________________________________________________________
I like computer programming because it's like Legos for the mind.
| [reply] |
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by runrig (Abbot) on Jun 12, 2006 at 18:09 UTC
|
I wrote something just based on modification time and file size. I was comparing (over 2500) files over a (slow) VPN, so I didn't want to read a file if I didn't have to, and I even avoided perl's -X file test operators because they were slow to run on every file. Instead I used Win32 API functions. | [reply] |
runrig++! This is exactly what I am doing (replacement for Briefcase!)
I'm going to give benchmarking MD5 vs File::Compare a try, and then look into your solution!
_________________________________________________________________________________
I like computer programming because it's like Legos for the mind.
| [reply] |
Win32::API doesn't come with the standard install of ActiveState Perl, but it is easily installed with the ppm utility.
| [reply] |
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by sgifford (Prior) on Jun 12, 2006 at 19:00 UTC
|
You should be able to get rsync installed without all of Cygwin; it should just need the executable and the Cygwin DLL. This site's cwRsync appears to be just that (though I haven't tried it). That is probably the most robust solution; rsync tries hard to minimize data communications while maximizing accuracy.
Using just modification time and size may mostly work OK if your clocks are in sync. If you want to try this, you should at least make sure everybody's running NTP. There's still a potential for conflicts if there's more than one person updating data, though.
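If you do go with time and size, one way to soften small clock or file system differences is to allow a tolerance when comparing timestamps. This is only a sketch: the helper name needs_copy is made up, and the 2-second default is an assumption chosen to cover the 2-second timestamp granularity of FAT file systems. It does nothing for clocks that are badly out of sync.

```perl
use strict;
use warnings;

# Hypothetical helper. Indexes 7 and 9 of stat's return list are size and mtime.
sub needs_copy {
    my ( $src, $dst, $tolerance ) = @_;
    $tolerance = 2 unless defined $tolerance;
    my @s = stat($src) or die "Cannot stat $src: $!\n";
    my @d = stat($dst) or return 1;    # nothing at the destination yet
    return 1 if $s[7] != $d[7];        # sizes differ
    # Source counts as newer only if it leads by more than the tolerance.
    return ( $s[9] - $d[9] > $tolerance ) ? 1 : 0;
}
```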
| [reply] [d/l] [select] |
I agree - rsync is the way to go - the next feature you will want is bandwidth limiting, then after that you will want maybe ssh tunnelling with key authentication for security, and then after that you will want permission synchronisation, and then... well, you get my drift.
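To give an idea of how those features map onto rsync options, here is a small sketch that builds the command line from Perl. The helper name rsync_command, its hash keys and the example paths are made up; -a, --delete, --bwlimit and -e are rsync's own options. The /cygdrive paths assume a Cygwin-based rsync build.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the argument list for system().
sub rsync_command {
    my (%opt) = @_;
    my @cmd = ( 'rsync', '-a' );    # archive mode: recurse, keep times and permissions
    push @cmd, '--delete' if $opt{delete};                   # remove orphans at the destination
    push @cmd, "--bwlimit=$opt{bwlimit}" if $opt{bwlimit};   # bandwidth cap
    push @cmd, '-e', 'ssh' if $opt{ssh};                     # tunnel over ssh
    push @cmd, $opt{source}, $opt{dest};
    return @cmd;
}

my @cmd = rsync_command(
    source  => '/cygdrive/g/contingency/',
    dest    => '/cygdrive/d/contingency/',
    delete  => 1,
    bwlimit => 200,
);
print "@cmd\n";
# To actually run it:
# system(@cmd) == 0 or die "rsync failed: $?\n";
```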
| [reply] |
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by jdtoronto (Prior) on Jun 12, 2006 at 21:05 UTC
|
With the packaging tools around now you don't need to worry about what is in the core distribution. In fact, for most applications it is better that Perl is NOT INSTALLED at all on the target machines and the application is packaged using perl2exe, PAR or ActiveState's PerlApp so that you can have a precisely known set of modules distributed as part of your application. I use PerlApp and I find it very helpful to be able to pack even text files, config files and selected necessary DLLs in the exe that I distribute.
jdtoronto
| [reply] |
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by bart (Canon) on Jun 12, 2006 at 21:21 UTC
|
On my own computer, I use a combination of file size, modification time, and ctime (creation on Windows, inode change time on Unix). Unless people are particularly trying to hide changes, that should suffice. And you can get all three using just one stat call.
In fact, on Unix, I have the impression that just checking ctime is enough, as any change to the file changes that timestamp, and it cannot be changed manually. Unfortunately, that's not the case on Windows. | [reply] |
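A minimal sketch of the single-stat idea follows. The helper name file_signature is made up; indexes 7, 9 and 10 of stat's return list are size, mtime and ctime.

```perl
use strict;
use warnings;

# Hypothetical helper: one stat call, three fields joined into a string.
sub file_signature {
    my ($path) = @_;
    my @st = stat($path) or die "Cannot stat $path: $!\n";
    return join ':', @st[ 7, 9, 10 ];
}
```

Because a copy gets its own ctime, a signature like this is only meaningful when compared against an earlier signature of the same file (for example one saved at the last sync), not against the signature of the copy.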
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by OfficeLinebacker (Chaplain) on Jun 12, 2006 at 20:59 UTC
|
use strict;
use warnings;
use File::Copy;
use File::Find;
use Time::HiRes qw(time);

my $source   = "G:\\contingency\\";
my $maindest = "D:\\";

( -r $source ) || die "Error: $source not readable\n";
( -e $maindest )
    || die "Error: $maindest (memory stick) does not exist--is your memory stick attached?\n";

print "Beginning scan of source directory for updated files...\n\n";

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name  = *File::Find::name;
*dir   = *File::Find::dir;
*prune = *File::Find::prune;

my $newdirs        = 0;
my $newfiles       = 0;
my $changedfiles   = 0;
my $unchangedfiles = 0;
my $orphans        = 0;

# Pass 1: copy new and updated files from source to memory stick
my $startcompcopy = time;
find( \&wantedtime, $source );
my $endcompcopy = time;
printf "Synchronization took %f seconds.\n\n", $endcompcopy - $startcompcopy;

my $totalfiles = $newdirs + $newfiles + $changedfiles + $unchangedfiles;
print "Summary report: \n";
print "$totalfiles entries scanned.\n";
print "$newdirs new directories found and created.\n";
print "$newfiles new files found and copied.\n";
print "$changedfiles updated files copied.\n";
print "$unchangedfiles unchanged files ignored.\n\n";

# Pass 2: take care of orphans (files on the stick with no source file)
print "searching for orphans:\n";
my $dest1 = $source;
$dest1 =~ s/^G:/D:/;

# use finddepth here because we're deleting: contents go before their directory
finddepth( \&wantedorphans, $dest1 );
print "$orphans orphans deleted from memory stick.\n\n";
print "This program will now pause for 10 seconds, then exit.\n";
sleep 10;

# Called by find() for every entry under $source
sub wantedtime {
    my $dest = $name;
    $dest =~ s/^G:/D:/;    # map the source path onto the memory stick
    if ( not -e $dest ) {
        if ( -d $name ) {
            print "Making new directory $dest\n";
            mkdir $dest or warn "WARNING: mkdir of $dest failed: $!\n";
            ++$newdirs;
        }
        else {
            print "Copying previously nonexistent file $dest\n";
            my $cstatus = copy( $name, $dest );

            # copy returns 1 on success
            ++$newfiles;
            ($cstatus)
                || warn "WARNING: Copy of $name failed with status $cstatus: $!\n";
        }
    }
    else {
        # Not doing this for directories as empty dirs are taken care of above
        # and copying an entire directory won't work with copy()
        if ( -d $name ) {
            print "scanning $name...\n";
        }
        else {
            # element 9 of stat's return list is the modification time
            my $smtime = ( stat($name) )[9];
            my $dmtime = ( stat($dest) )[9];
            if ( $smtime > $dmtime ) {
                print "Updated file found!\n Copying $name\n to memory stick \n";
                my $cstatus = copy( $name, $dest );

                # copy returns 1 on success
                ($cstatus)
                    || warn "WARNING: Copy of $name failed with status $cstatus: $!\n";
                ++$changedfiles;
            }
            else {
                ++$unchangedfiles;
            }
        }
    }
}

# Called by finddepth() for every entry on the memory stick
sub wantedorphans {
    my $status = 1;

    # use a local variable so the global $source is not overwritten
    my $src = $name;
    $src =~ s/^D:/G:/;
    if ( not -e $src ) {
        ++$orphans;
        print "$name is an orphan; deleting...\n";
        if ( -d $name ) {
            $status = rmdir $name;
        }
        else {
            $status = unlink $name;
        }

        # rmdir returns true on success; unlink returns number of files deleted (should be 1)
        ($status)
            || warn "WARNING: Status of removal of $name is $status : $!\n";
    }
}
_________________________________________________________________________________
I like computer programming because it's like Legos for the mind.
| [reply] [d/l] |
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by OfficeLinebacker (Chaplain) on Jun 15, 2006 at 17:14 UTC
|
Hey guys, thanks so much for your continued responses. The ending of this story is pretty funny; the project is really for my boss, and he was hoping that the other people who have to do this syncing thing (all above him in rank) would be stoked over just having to click on one file instead of going through Windows Explorer (seems like the higher people get in an organization the less they want to have to interact with computers LOL). My boss also didn't want me spending much time on it, so I kept it as simple as possible, and it gets the job done. Sure, it could be more robust, but speed is also a factor.
Anyway, the person who does my job (sort of resident tech person for a group of 8-10 people) for the rest of the people that do this syncing wants nothing to do with it. I think that person feels threatened or lazy. One of the guys that my boss showed this to actually said to me, "Hey, I really like that thing you set up for X, that's neat, can you set that up for me?" when he saw me in the hall. I said sure and we planned to meet in 10 minutes. When I ran into this other person (who basically has the same relationship to this guy as I have to my boss) in the meantime, I ran the idea by her as a courtesy and she totally nixed the idea because she felt she didn't understand it. It kind of sucked because I basically had to break a promise to a very important person, but that was my fault.
Well, at least my boss knows what's up, and he has the handy-dandy script he can run. I think it was still worth it, even for one user. It was also a learning experience for Perl and to continue to try to figure out this other person with whom I have to work closely.
_________________________________________________________________________________
I like computer programming because it's like Legos for the mind.
| [reply] |