OfficeLinebacker has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, esteemed monks!

I am writing a Perl script that does a one-way synchronization of directory trees on Windows, and I am wondering what the industry considers "good enough" for comparison purposes.

For the record, I'd like this to be a single clickable icon on the desktop, something that can simply be emailed to people. Cygwin is not installed on our desktops, so rsync is out. Also, any modules that aren't in core Perl are out. The above is based on my understanding, and if you know of a fairly simple way to combine those things into a single clickable script, please educate me.

However, I would like this discussion to be more general: how do you weigh robustness against efficiency (speed) in a production syncing program?

I think most tools work on modification time. My current tool uses full file comparison, and of course suffers in the performance department.

Is there a happy medium between full file comparison and mod-time comparison? To that end, two thoughts I have had are: a) do most comparisons by mod time, but do a full comparison on a small randomly selected subset each time (the sync will occur daily, hopefully); b) use MD5 sums.

I know I could code the first fairly well, but is it worth it? I don't know much about MD5 sums. I am working with files averaging about 123 KB, with a maximum of about 3 MB. A third option would be to use modification time alone, since a later time on the server would automatically trigger a copy to the synced media without going through the comparison process. However, that approach can produce false positives if the files are regenerated automatically every day but their contents don't necessarily change. Also, I'm not sure what proportion of the files on the server is actually updated every 24 hours, so I don't know how much it would save.
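If it helps, option (a) could be sketched roughly like this, using only core modules; the `needs_copy` helper name and the 2% sampling rate are made up for illustration:

```perl
use strict;
use warnings;
use File::Compare qw(compare);   # core module

# Sketch of option (a): trust mtime for most files, but promote a small
# random sample of "unchanged" files to a full byte-by-byte comparison.
# Returns true when the source should be copied to the destination.
sub needs_copy {
    my ( $src, $dst, $sample_rate ) = @_;
    $sample_rate = 0.02 unless defined $sample_rate;   # made-up default

    my $dst_mtime = ( stat $dst )[9];
    return 1 unless defined $dst_mtime;                # no destination yet
    my $src_mtime = ( stat $src )[9];
    return 1 if $src_mtime > $dst_mtime;               # source is newer

    # Spot-check: full comparison on a random subset. compare() returns
    # 0 when the files are byte-identical.
    return compare( $src, $dst ) != 0 if rand() < $sample_rate;
    return 0;
}
```

This keeps the daily run cheap (one stat per file pair) while the random full comparisons give some ongoing evidence that mtime alone isn't missing changes.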

Code available upon request (I feel we're at the algorithm stage rather than the actual coding stage--again, correct me if you think I should post code regardless).

TIA,

T

UPDATE:

I understand this node has been considered. In my mind it's Perl-centric, as the discussion is only about how to implement solutions in Perl; perhaps I didn't make myself clear, in which case I apologize. Please let me know if this node does not belong on PM.

_________________________________________________________________________________

I like computer programming because it's like Legos for the mind.


Replies are listed 'Best First'.
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by samtregar (Abbot) on Jun 12, 2006 at 17:52 UTC
    A combination of modification-time and file-size comparison is pretty strong, but it won't notice all potential changes. Files can be back-dated and changes to file attributes may or may not trigger a modification-time increase.

    I think you should at least try using MD5s before concluding that they're too slow. Digest::MD5 is pretty fast, in my experience.

    -sam

      Thanks, I'll give MD5 a try, assuming it's available. Also, as I understand it, File::Compare stops comparing once a difference is found, whereas calculating an MD5 sum requires reading the entire file. It will be interesting to see how the two bench out.
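For what it's worth, both checks can be done with core modules along these lines; the `file_md5` helper name is mine, not from any module:

```perl
use strict;
use warnings;
use File::Compare qw(compare);   # core; byte-by-byte, stops at the first difference
use Digest::MD5;                 # core since Perl 5.8; must read the whole file

# Hypothetical helper: hex MD5 digest of a file, read in binary mode.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# compare() returns 0 when the files match, 1 when they differ, and -1
# on error -- note that 0 means "identical", the opposite of a boolean test.
sub files_match {
    my ( $src, $dst ) = @_;
    return compare( $src, $dst ) == 0;
}
```

The digest's advantage is that it can be cached: once the destination's MD5 is stored, later runs only have to read the source.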

      Thanks,

      T


        He's doing "one-way synchronization of directory trees in Windows". If that means he's doing archiving, he only needs to compute the MD5 of the archived file once. In that case MD5 requires that one file be read fully, whereas File::Compare requires two files to be read, in part or in full.
      OK, so I benched File::Compare vs. MD5 digests vs. modification times.

      After four runs, I got an average time of about 42 seconds using full comparison, about 48 seconds for MD5 (calculated for just the source, as the destination presumably already has one), and about 11 seconds for just modification times using stat().

      I think I am just going to go with modification time.
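For anyone repeating the experiment, the core Benchmark module can drive this kind of timing; the file list and iteration count below are placeholders:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);     # core module
use File::Compare qw(compare);
use Digest::MD5;

# Placeholder file list; the real script would gather these with File::Find.
my @files = glob "*.txt";

timethese( 5, {
    mtime => sub {
        my @mtimes = map { ( stat $_ )[9] } @files;
    },
    full_compare => sub {
        # Self-comparison as a stand-in for comparing source to destination.
        compare( $_, $_ ) for @files;
    },
    md5 => sub {
        for my $f (@files) {
            open my $fh, '<', $f or next;
            binmode $fh;
            Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
        }
    },
} );
```

timethese() prints wall-clock and CPU time for each labelled sub, which makes the relative cost of the three strategies easy to compare.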

      Thanks,

      T.


Re: Best practices for file synchronization? (Mod time vs. contents compare)
by runrig (Abbot) on Jun 12, 2006 at 18:09 UTC
    I wrote something based just on modification time and file size. I was comparing over 2,500 files over a slow VPN, so I didn't want to read a file if I didn't have to, and I even avoided Perl's -X file test operators because testing every file with them was slow. Instead I used Win32 API functions.
      runrig++! This is exactly what I am doing (a replacement for Briefcase!)

      I'm going to try benchmarking MD5 vs. File::Compare, and then look into your solution!


      Argh, Win32::API is not present on the desktops. I'll post benchmarks on the other approaches in a bit.


        Win32::API doesn't come with the standard install of ActiveState Perl, but it is easily installed with the ppm utility.
Re: Best practices for file synchronization? (Mod time vs. contents compare)
by sgifford (Prior) on Jun 12, 2006 at 19:00 UTC
    You should be able to get rsync installed without all of Cygwin; it should need just the executable and the Cygwin DLL. cwRsync appears to be exactly that (though I haven't tried it). That is probably the most robust solution; rsync tries hard to minimize data transfer while maximizing accuracy.

    Using just modification time and size may mostly work OK if your clocks are in sync. If you want to try this, you should at least make sure everybody's running NTP. There's still a potential for conflicts if there's more than one person updating data, though.

      What he said.

      To do anything other than use rsync would be really stupid. Spend your valuable time on something more productive than reinventing this nasty wheel.

      I agree - rsync is the way to go. The next feature you will want is bandwidth limiting, then ssh tunnelling with key authentication for security, then permission synchronisation, and then... well, you get my drift.

      Regards,

      Jeff

Re: Best practices for file synchronization? (Mod time vs. contents compare)
by jdtoronto (Prior) on Jun 12, 2006 at 21:05 UTC
    With the packaging tools around now, you don't need to worry about what is in the core distribution. In fact, for most applications it is better that Perl is NOT INSTALLED at all on the target machines and the application is packaged using perl2exe, PAR, or ActiveState's PerlApp, so that you can have a precisely known set of modules distributed as part of your application. I use PerlApp, and I find it very helpful to be able to pack even text files, config files, and selected necessary DLLs into the exe that I distribute.

    jdtoronto

Re: Best practices for file synchronization? (Mod time vs. contents compare)
by bart (Canon) on Jun 12, 2006 at 21:21 UTC
    On my own computer, I use a combination of file size, modification time, and ctime (creation time on Windows, inode change time on Unix). Unless people are particularly trying to hide changes, that should suffice. And you can get all three using just one stat call.

    In fact, on Unix, I have the impression that just checking ctime is enough, as any change to the file changes that timestamp, and it cannot be changed manually. Unfortunately, that's not the case on Windows.
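One way to sketch that single-stat-call check; the helper names and the signature format are made up, and it assumes you compare against values recorded on a previous run:

```perl
use strict;
use warnings;

# A single stat() yields size ([7]), mtime ([9]) and ctime ([10]).
# Joining them gives a cheap per-file "signature" that can be stored
# and compared against on the next run.
sub stat_signature {
    my ($path) = @_;
    my @st = stat $path or return;
    return join ':', @st[ 7, 9, 10 ];    # size:mtime:ctime
}

sub looks_changed {
    my ( $path, $old_signature ) = @_;
    my $now = stat_signature($path);
    return 1 unless defined $now;        # a vanished file counts as changed
    return $now ne $old_signature;
}
```

Since all three fields come from one stat call, this costs no more than checking mtime alone.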

      > In fact, on Unix, I have the impression that just checking ctime is enough, as any change to the file changes that timestamp, and it cannot be changed manually.

      I believe this is incorrect, but I'm not sure how it is done.

        I've looked at some freely available C source code that changes the ctime of files, and what it does is change the system clock, set the file's ctime to the "current time", and change the clock back. (I'll link to it if I can find it again.)

        If this hack is the only way to achieve it, I think you can safely say it can't be done.

Re: Best practices for file synchronization? (Mod time vs. contents compare)
by OfficeLinebacker (Chaplain) on Jun 12, 2006 at 20:59 UTC
    Greetings, esteemed monks!

    Here's what I've come up with:

    use strict;
    use File::Copy;
    use File::Find;
    use File::Compare;
    use Time::HiRes qw(time);

    my $source   = "G:\\contingency\\";
    my $maindest = "D:\\";

    ( -r $source )   || die "Error: $source not readable\n";
    ( -e $maindest ) || die "Error: $maindest (memory stick) does not exist--is your memory stick attached?\n";

    print "Beginning scan of source directory for updated files...\n\n";

    # for the convenience of &wanted calls, including -eval statements:
    use vars qw/*name *dir *prune/;
    *name  = *File::Find::name;
    *dir   = *File::Find::dir;
    *prune = *File::Find::prune;

    my $newdirs        = 0;
    my $newfiles       = 0;
    my $changedfiles   = 0;
    my $unchangedfiles = 0;
    my $orphans        = 0;

    my $startcompcopy = time;
    my $status1 = find( \&wantedtime, $source );
    my $endcompcopy = time;
    printf "Synchronization took %f seconds.\n\n", $endcompcopy - $startcompcopy;

    my $totalfiles = $newdirs + $newfiles + $changedfiles + $unchangedfiles;
    print "Summary report: \n";
    print "$totalfiles entries scanned.\n";
    print "$newdirs new directories found and created.\n";
    print "$newfiles new files found and copied.\n";
    print "$changedfiles updated files copied.\n";
    print "$unchangedfiles unchanged files ignored.\n\n";

    # ...and then take care of orphans
    print "searching for orphans:\n";
    my $dest1 = $source;
    $dest1 =~ s/G:/D:/;
    my @filesizes;

    # use finddepth here because we're deleting
    my $status2 = finddepth( \&wantedorphans, $dest1 );
    print "$orphans orphans deleted from memory stick.\n\n";
    print "This program will now pause for 10 seconds, then exit.\n";
    sleep 10;

    sub wantedtime {
        my $dest = $name;
        $dest =~ s/G:/D:/;
        if ( not -e $dest ) {
            if ( -d $name ) {
                print "Making new directory $dest\n";
                mkdir $dest;
                ++$newdirs;
            }
            else {
                print "Copying previously nonexistent file $dest\n";
                my $cstatus = copy( $name, $dest );    # copy returns 1 on success
                ++$newfiles;
                ($cstatus)
                  || warn "WARNING: Copy of $name failed with status $cstatus: $!\n";
            }
        }
        else {
            # Not doing this for directories, as empty dirs are taken care of above
            # and copying an entire directory won't work with copy()
            if ( -d $name ) {
                print "scanning $name...\n";
            }
            else {
                my $smtime = ( stat($name) )[9];
                my $dmtime = ( stat($dest) )[9];
                if ( $smtime > $dmtime ) {
                    print "Updated file found!\n Copying $name\n to memory stick\n";
                    my $cstatus = copy( $name, $dest );    # copy returns 1 on success
                    ($cstatus)
                      || warn "WARNING: Copy of $name failed with status $cstatus: $!\n";
                    ++$changedfiles;
                }
                else {
                    ++$unchangedfiles;
                }
            }
        }
    }

    sub wantedorphans {
        my $status = 1;
        $source = $name;
        $source =~ s/D:/G:/;
        if ( not -e $source ) {
            ++$orphans;
            print "$name is an orphan; deleting...\n";
            if ( -d $name ) {
                $status = rmdir $name;
            }
            else {
                $status = unlink $name;
            }
            # rmdir returns true on success; unlink returns the number of files deleted (should be 1)
            ($status)
              || warn "WARNING: Status of removal of $name is $status: $!\n";
        }
    }


Re: Best practices for file synchronization? (Mod time vs. contents compare)
by OfficeLinebacker (Chaplain) on Jun 15, 2006 at 17:14 UTC
    Hey guys, thanks so much for your continued responses. The ending of this story is pretty funny. The project is really for my boss, and he was hoping that the other people who have to do this syncing (all above him in rank) would be stoked about just having to click on one file instead of going through Windows Explorer (it seems the higher people get in an organization, the less they want to interact with computers, LOL). My boss also didn't want me spending much time on it, so I kept it as simple as possible, and it gets the job done. Sure, it could be more robust, but speed is also a factor. Anyway, the person who does my job (sort of the resident tech person for a group of 8-10 people) for the rest of the people who do this syncing wants nothing to do with it. I think that person feels threatened, or is lazy. One of the guys my boss showed this to actually said to me, "Hey, I really like that thing you set up for X--that's neat. Can you set that up for me?" when he saw me in the hall. I said sure, and we planned to meet in 10 minutes. When I ran into this other person (who basically has the same relationship to this guy as I have to my boss) in the meantime, I ran the idea by her as a courtesy, and she totally nixed it because she felt she didn't understand it. It kind of sucked, because I basically had to break a promise to a very important person, but that was my fault.

    Well, at least my boss knows what's up, and he has the handy-dandy script he can run. I think it was still worth it, even for one user. It was also a learning experience, both for Perl and for continuing to figure out this other person with whom I have to work closely.
