File Diff'ing

dimes has asked for the wisdom of the Perl Monks concerning the following question:

Re: File Diff'ing by Corion (Patriarch) on Jun 12, 2002 at 15:07 UTC
Even though you talk about "Diff'ing", you don't want the (minimal) set of changes to get from one file to the second; you only want to know whether two files are identical or not (or that's how I read your words). There are several ways to achieve what you want. The easiest way would be to use Digest::MD5, which is in the core as of Perl 5.8 and is on CPAN for older versions. If the two files have an identical MD5 hash, they are most likely the same. If your version of Perl doesn't have Digest::MD5, you might want to do the check manually: first check whether the two files have the same size (via `tell` after seeking to the end of the file, or via the `-s` file test, see `perldoc -f -X`), and then slurp the two files into memory and do an `eq` comparison on them. If the files are too large to be held in memory at one time, you might want to compare small chunks of the two files one at a time, starting from either the beginning or the end of the file, whichever part is more likely to differ.
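A minimal sketch of the Digest::MD5 approach described above, assuming both files exist and are readable. The names `file_md5` and `same_by_md5` are made up for illustration and are not part of the module.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's contents.
# (file_md5 is an illustrative helper name, not a Digest::MD5 function.)
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # hash the raw bytes, no newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# True if the two files hash to the same MD5 value.
sub same_by_md5 {
    my ($path_a, $path_b) = @_;
    return file_md5($path_a) eq file_md5($path_b) ? 1 : 0;
}
```

Calling `same_by_md5($old, $new)` with two paths of your own returns 1 when the digests match and 0 otherwise. Equal digests mean "almost certainly identical", not a byte-for-byte proof.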
Re: File Diff'ing by robobunny (Friar) on Jun 12, 2002 at 15:11 UTC
I don't know of anything that comes with the standard distribution, but you probably want to do a checksum instead of a diff. It will be much faster if you don't care at that point what the actual differences are. You can store the checksum when you download the file, so that you don't have to recompute it when you download the next one.
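One way the "store the checksum" idea might look, as a rough sketch only: the digest of each download is written to a small side file, so the next run only has to hash the new download. The helper names and the side-file layout (one hex digest on one line) are assumptions made for this example.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file (illustrative helper name).
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Compare a fresh download against the checksum saved on the previous
# run, then save the new checksum. Returns 1 if the content changed
# (or if no checksum was stored yet), 0 otherwise.
sub changed_since_last {
    my ($download, $sum_file) = @_;
    my $new_sum = file_md5($download);

    my $old_sum = '';
    if (open my $in, '<', $sum_file) {
        my $line = <$in>;
        if (defined $line) {
            chomp $line;
            $old_sum = $line;
        }
        close $in;
    }

    open my $out, '>', $sum_file or die "Cannot write $sum_file: $!";
    print {$out} "$new_sum\n";
    close $out or die "Cannot close $sum_file: $!";

    return $new_sum ne $old_sum ? 1 : 0;
}
```

The old file itself is never read again; only its stored digest is.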
Re: File Diff'ing by Abigail-II (Bishop) on Jun 12, 2002 at 15:13 UTC
What's this problem with using a module outside of the standard distribution? As you said, there's Algorithm::Diff. If you have a xenomodule phobia, I doubt the license prevents you from just pasting the content of Algorithm::Diff into your file. Abigail
Re: Re: File Diff'ing by dimes (Novice) on Jun 12, 2002 at 15:25 UTC
I am not particularly phobic of modules, but for something that I "see" as pretty basic, it seemed that I shouldn't have to go outside stock Perl just to test whether two files are the same or not. Thanks all for the tips. I was leaning towards "fingerprinting" the files via MD5 et al., and now it seems pretty clear that it is the way to go. Thanks again, Dimes
Re: File Diff'ing by kvale (Monsignor) on Jun 12, 2002 at 19:22 UTC
Depending on the exact nature of the problem, even Digest::MD5 might be overkill. If your problem is just to compare an old file to a new file once to see if they differ, then MD5 is unnecessary. Simply compare their sizes. If they differ, you are done. If they are the same, open the two files and compare line by line, breaking out of the loop at the first difference. Easy, and faster than hashing both files first. If your problem is to compare a new file against many old files (to reject duplicates) or to compare many new files to an old file (sample and do something when a file is updated), then hashing to an MD5 signature is the fastest approach. Although it depends on differing files generating differing signatures, the chance of a collision is, well, you should live so long :) -Mark
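The one-off comparison described above could be sketched roughly as follows. Note one deliberate swap: it reads fixed-size blocks rather than lines, so binary files without newlines are handled the same way. The name `files_identical` and the 64 KB default block size are arbitrary choices for this example.

```perl
use strict;
use warnings;

# Compare two files: sizes first, then contents block by block,
# returning at the first block that differs.
# (files_identical and the default block size are illustrative choices.)
sub files_identical {
    my ($path_a, $path_b, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;

    my @stat_a = stat $path_a or die "Cannot stat $path_a: $!";
    my @stat_b = stat $path_b or die "Cannot stat $path_b: $!";
    return 0 if $stat_a[7] != $stat_b[7];    # element 7 is size in bytes

    open my $fh_a, '<', $path_a or die "Cannot open $path_a: $!";
    open my $fh_b, '<', $path_b or die "Cannot open $path_b: $!";
    binmode $fh_a;
    binmode $fh_b;

    while (1) {
        my $read_a = read($fh_a, my $buf_a, $chunk_size);
        my $read_b = read($fh_b, my $buf_b, $chunk_size);
        die "Read error: $!"
            unless defined($read_a) && defined($read_b);
        return 0 if $buf_a ne $buf_b;    # first difference: stop here
        return 1 if $read_a == 0;        # both at end of file, no difference
    }
}
```

When the sizes differ, neither file is opened at all, which is the cheap case the size check is there for.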