in reply to Re: Fingerprinting text documents for approximate comparison
in thread Fingerprinting text documents for approximate comparison

That isn't going to be useful. The MD5 algorithm is expressly designed to detect differences, not similarity:

    use Digest::MD5 qw[md5_hex];
    my $s = 'the quick brown fox jumps over the lazy dog';

    print md5_hex $s;
    # 77add1d5f41223d5582fca736a5cb335

    print md5_hex $s . 's';
    # 5e48a737eaff799917707b2815af10fc

    print md5_hex $s . 'S';
    # d02763729a741eed14417a1051ec228c

Adding even a single character, or flipping a single bit, produces a (numerically) completely unrelated digest. That is exactly as it should be for the purposes MD5 was designed for, but completely wrong for this application.
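
To put a rough number on "completely unrelated", one option is to count how many of the 128 digest bits differ between two inputs. This is only a sketch: the helper name digest_bit_diff is made up for illustration, and it relies on Perl's string XOR plus the %32b* checksum form of unpack to count set bits.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: number of bits that differ between the raw
# 16-byte MD5 digests of two strings (0 = identical, 128 = all differ).
sub digest_bit_diff {
    my ($x, $y) = @_;
    # XOR the two binary digests; a 1 bit marks a position where they differ.
    my $xor = md5($x) ^ md5($y);
    # The '%32b*' template sums the bits of the string, i.e. counts the 1s.
    return unpack '%32b*', $xor;
}

my $s = 'the quick brown fox jumps over the lazy dog';
printf "%d of 128 bits differ\n", digest_bit_diff($s, $s . 's');
printf "%d of 128 bits differ\n", digest_bit_diff($s, $s . 'S');
```

For a hash that mixes well, the counts should hover somewhere around half of the bits no matter how small the change to the input, which is why the distance between two digests says nothing about how similar the texts are.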


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

Re^3: Fingerprinting text documents for approximate comparison
by gam3 (Curate) on Mar 25, 2005 at 03:11 UTC
    The MD5 is only turning a list of words into a number. It is the list of words that is the fingerprint of the file. You could just compare the words. The MD5 is just being used as a checksum.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
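
The suggestion to "just compare the words" could be sketched roughly as follows. The choice of Jaccard similarity (shared words divided by total distinct words) is an assumption made here for illustration and is not something the reply specifies. The helper names word_set and jaccard, and the \w+ tokenizer, are likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: the set of lowercased \w+ runs in the text,
# returned as a hash reference whose keys are the words.
sub word_set {
    my ($text) = @_;
    my %set;
    $set{lc $_} = 1 for $text =~ /\w+/g;
    return \%set;
}

# Jaccard similarity of two word sets: |intersection| / |union|,
# ranging from 0 (no words in common) to 1 (same set of words).
sub jaccard {
    my ($x, $y) = @_;
    my %union  = (%$x, %$y);
    my $total  = keys %union;
    my $shared = grep { exists $y->{$_} } keys %$x;
    return $total ? $shared / $total : 1;   # two empty sets count as identical
}

my $s = 'the quick brown fox jumps over the lazy dog';
printf "%.2f\n", jaccard(word_set($s), word_set($s . 's'));   # prints 0.78
```

Here the extra 's' turns "dog" into "dogs", so 7 of the 9 distinct words are shared and the score stays high, whereas the two MD5 digests have nothing visibly in common.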