Even though you talk about "diffing", you don't want
the (minimal) set of changes to get from one file to the other;
you only want to know whether two files are identical or not (or that's how
I read your words).
There are several ways to achieve what you want. The easiest
way would be to use Digest::MD5, which has been in the core
since Perl 5.8. If the two files have an identical
MD5 hash, they are most likely the same.
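A minimal sketch of the MD5 comparison, assuming both files are readable and should be compared byte for byte. The sub names md5_of_file and same_md5 are made up for illustration; only Digest::MD5's new, addfile and hexdigest are the module's own.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # hash the raw bytes, no newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# True if both files hash to the same MD5 value.
sub same_md5 {
    my ($file_a, $file_b) = @_;
    return md5_of_file($file_a) eq md5_of_file($file_b);
}

# Usage: perl same_md5.pl file1 file2
if (@ARGV == 2) {
    print same_md5(@ARGV) ? "most likely identical\n" : "different\n";
}
```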
If your version of Perl doesn't have Digest::MD5, you might
want to do the check manually, first checking whether the two
files have the same size (via the -s file test operator, see perldoc -f -X,
or by seeking to the end of each file and calling tell), and then slurping the two
files into memory and doing an eq comparison on them.
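The manual check could look roughly like this, assuming both files fit comfortably in memory. The names slurp and same_by_slurp are hypothetical helpers, not anything from the core.

```perl
use strict;
use warnings;

# Read a whole file into one scalar.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;    # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Cheap size test first, full comparison only when the sizes agree.
sub same_by_slurp {
    my ($file_a, $file_b) = @_;
    my $size_a = (-s $file_a) || 0;    # -s is false for an empty file
    my $size_b = (-s $file_b) || 0;
    return 0 if $size_a != $size_b;
    return slurp($file_a) eq slurp($file_b) ? 1 : 0;
}

# Usage: perl same_by_slurp.pl file1 file2
if (@ARGV == 2) {
    print same_by_slurp(@ARGV) ? "identical\n" : "different\n";
}
```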
If the files are too large to be held in memory at one time,
you might want to compare little chunks of the two files one
at a time, starting either from the beginning or the end of the file,
whichever part is more likely to differ.
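For files too large to slurp, a chunked comparison might be sketched as follows. It starts from the beginning of the files; the sub name same_by_chunks and the 64 KB default chunk size are my own assumptions, so pick whatever suits your data.

```perl
use strict;
use warnings;

# Compare two files block by block, stopping at the first difference.
sub same_by_chunks {
    my ($file_a, $file_b, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, tune as needed

    # Different sizes mean different files, no reading required.
    return 0 if ((-s $file_a) || 0) != ((-s $file_b) || 0);

    open my $fh_a, '<', $file_a or die "Cannot open $file_a: $!";
    open my $fh_b, '<', $file_b or die "Cannot open $file_b: $!";
    binmode $fh_a;
    binmode $fh_b;

    while (1) {
        my ($buf_a, $buf_b) = ('', '');
        my $read_a = read($fh_a, $buf_a, $chunk_size);
        my $read_b = read($fh_b, $buf_b, $chunk_size);
        die "Read error: $!" unless defined($read_a) && defined($read_b);
        return 0 if $buf_a ne $buf_b;    # first differing chunk
        return 1 if $read_a == 0;        # both files exhausted
    }
}

# Usage: perl same_by_chunks.pl file1 file2
if (@ARGV == 2) {
    print same_by_chunks(@ARGV) ? "identical\n" : "different\n";
}
```

Only one chunk per file is held in memory at a time, and the filehandles are closed when they go out of scope on return.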
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
I don't know of anything that comes with the standard distribution, but you probably want to do a checksum instead of a diff. It will be much faster if you don't care at that point what the actual differences are. You can store the checksum when you download the file, so that you don't have to recompute it when you download the next one.
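Storing the checksum could be sketched like this, assuming Digest::MD5 is installed and that the checksum lives in a small side file next to the download. The side file and the names file_md5 and changed_since_last are illustrative assumptions only.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file's raw bytes.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Compare a fresh download against the checksum saved last time,
# then save the new checksum. Returns true if the content changed
# (or if no checksum had been stored yet).
sub changed_since_last {
    my ($path, $sum_file) = @_;
    my $new = file_md5($path);
    my $old = '';
    if (open my $in, '<', $sum_file) {
        $old = <$in>;
        $old = '' unless defined $old;
        chomp $old;
        close $in;
    }
    open my $out, '>', $sum_file or die "Cannot write $sum_file: $!";
    print {$out} "$new\n";
    close $out or die "Cannot close $sum_file: $!";
    return $new ne $old ? 1 : 0;
}

# Usage: perl changed.pl downloaded_file checksum_file
if (@ARGV == 2) {
    print changed_since_last(@ARGV) ? "changed\n" : "unchanged\n";
}
```

This way only the new file is hashed on each download; the old one never has to be read again.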
What's this problem with using a module outside of the
standard distribution? As you said, there's Algorithm::Diff.
If you have a xenomodule phobia, I doubt the license
prevents you from just pasting the content of Algorithm::Diff
into your file.
Abigail
I am not particularly phobic of modules, but for something that I see as pretty basic, it seemed that I shouldn't have to go outside stock Perl just to test whether two files are the same or not.
Thanks all for the tips. I was leaning towards "fingerprinting" the files via MD5 et al., and now it seems pretty clear that it is the way to go.
Thanks again
Dimes
Depending on the exact nature of the problem, even Digest::MD5
might be overkill.
If your problem is just to compare an old file to a new file
once to see if they differ, then MD5 is unnecessary. Simply
compare their sizes. If they differ, you are done. If they
are the same, open the two files and compare line by line,
breaking out of the loop at the first difference. Easy and
faster than hashing both files first.
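Something along these lines would do for the one-off case, assuming plain text files. The sub name same_by_lines is invented for the example.

```perl
use strict;
use warnings;

# Size check first, then line by line, bailing out at the first difference.
sub same_by_lines {
    my ($old_file, $new_file) = @_;
    return 0 if ((-s $old_file) || 0) != ((-s $new_file) || 0);

    open my $old, '<', $old_file or die "Cannot open $old_file: $!";
    open my $new, '<', $new_file or die "Cannot open $new_file: $!";

    while (1) {
        my $line_old = <$old>;
        my $line_new = <$new>;
        return 1 if !defined $line_old && !defined $line_new;    # both at EOF
        return 0 if !defined $line_old || !defined $line_new;    # one ended early
        return 0 if $line_old ne $line_new;                      # first difference
    }
}

# Usage: perl same_by_lines.pl old_file new_file
if (@ARGV == 2) {
    print same_by_lines(@ARGV) ? "identical\n" : "different\n";
}
```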
If your problem is to compare a new file against many old
files (to reject duplicates) or to compare many new files to
an old file (sample and do something when a file is updated) then
hashing to an MD5 signature is the fastest approach. Although
it depends on differing files generating differing signatures,
the chance of a collision is, well, you should live so long :)
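For the many-files case, the duplicate rejection might be sketched with a hash keyed on the MD5 signature. It assumes Digest::MD5 is available; file_md5 and find_duplicates are hypothetical names.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hex MD5 digest of a file's raw bytes.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hash each file once; a signature seen before marks a duplicate.
# Returns a list of [duplicate, first_file_with_same_signature] pairs.
sub find_duplicates {
    my (@paths) = @_;
    my (%seen, @dupes);
    for my $path (@paths) {
        my $sum = file_md5($path);
        if (exists $seen{$sum}) {
            push @dupes, [ $path, $seen{$sum} ];
        }
        else {
            $seen{$sum} = $path;
        }
    }
    return @dupes;
}

# Usage: perl find_dupes.pl file1 file2 file3 ...
for my $pair (find_duplicates(@ARGV)) {
    print "$pair->[0] duplicates $pair->[1]\n";
}
```

Each file is read exactly once, which is where the win over pairwise comparison comes from.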
-Mark