Re^2: Fingerprinting text documents for approximate comparison

That isn't going to be useful. The MD5 algorithm is expressly designed to detect differences, not similarity:

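use strict;
use warnings;
use Digest::MD5 qw[md5_hex];

my $s = 'the quick brown fox jumps over the lazy dog';

# Digest of the original string, then with one extra character appended.
print md5_hex($s), "\n";          # 77add1d5f41223d5582fca736a5cb335
print md5_hex($s . 's'), "\n";    # 5e48a737eaff799917707b2815af10fc
print md5_hex($s . 'S'), "\n";    # d02763729a741eed14417a1051ec228c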

Adding even a single character, or changing a single bit, produces a (numerically) completely unrelated digest. That is exactly as it should be for the purposes for which MD5 is designed, but completely wrong for this application.
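To put a rough number on "completely unrelated", here is a minimal sketch that counts how many of the 128 digest bits differ between the original string and each one-character variant. The helper name bit_diff is made up for this illustration. For a well-behaved hash one would expect somewhere around half the bits to flip, though the exact counts depend on the input.

```perl
use strict;
use warnings;
use Digest::MD5 qw[md5];

# Hypothetical helper: count the bits that differ between two
# equal-length byte strings.
sub bit_diff {
    my ($x, $y) = @_;
    # String XOR of the raw bytes, then count the set bits using
    # unpack's checksum prefix.
    return unpack '%32b*', $x ^ $y;
}

my $s = 'the quick brown fox jumps over the lazy dog';

for my $suffix (qw[s S]) {
    my $bits = bit_diff(md5($s), md5($s . $suffix));
    printf "appending '%s': %d of 128 bits differ\n", $suffix, $bits;
}
```

Note that 's' and 'S' differ from each other by a single input bit, so the same helper can be pointed at those two digests as well.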

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco.

Rule 1 has a caveat! -- Who broke the cabal?

Re^3: Fingerprinting text documents for approximate comparison by gam3 (Curate) on Mar 25, 2005 at 03:11 UTC
The MD5 is only turning a list of words into a number. It is the list of words that is the fingerprint of the file. You could just compare the words. The MD5 is just being used as a checksum.
-- gam3 A picture is worth a thousand words, but takes 200K.
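One possible reading of "just compare the words", as a rough sketch only: reduce each text to its set of distinct lower-cased words and score the overlap as shared words divided by total distinct words (Jaccard similarity). That measure is my choice for illustration, not something specified in the thread, and the names word_set and similarity are made up here.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a text to its set of distinct
# lower-cased words.
sub word_set {
    my ($text) = @_;
    my %seen;
    $seen{ lc $_ } = 1 for $text =~ /\w+/g;
    return \%seen;
}

# Jaccard similarity: shared words / total distinct words, in [0, 1].
sub similarity {
    my ($x, $y) = map { word_set($_) } @_;
    my $shared = grep { exists $y->{$_} } keys %$x;
    my $total  = scalar(keys %$x) + scalar(keys %$y) - $shared;
    return $total ? $shared / $total : 1;    # two empty texts count as identical
}

my $s = 'the quick brown fox jumps over the lazy dog';
printf "%.2f\n", similarity($s, $s . 's');    # prints 0.78 (7 shared of 9 distinct words)
```

Unlike the digests, the score moves only slightly for a one-character change: 'dog' becoming 'dogs' costs one word out of eight.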