Re: Fingerprinting text documents for approximate comparison

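Here is a very simplistic approach you might try: reduce each document to the sorted list of words that occur at least twice (minus a stoplist), then checksum that list.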

#!/usr/bin/perl
use strict;

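# Words to ignore. A value of 0 means "always ignore"; a positive value
# means "ignore unless the word occurs at least that many times".
our $stoplist = {
  and => 0,
  or => 0,
  but => 0,
};

use Digest::MD5 qw(md5_base64);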

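sub main {
    my $line;
    my $words = {};

    # Count every token of two or more characters, case-insensitively.
    while (defined($line = shift)) {
        chomp $line;    # a bare chomp would act on $_, not $line
        my @data = split(/\b/, $line);
        for my $word (@data) {
            $word =~ s/\s+//g;
            next if length($word) < 2;
            $words->{lc($word)}++;
        }
    }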

    my @out;

    for my $key (keys %$words) {
        next if $words->{$key} < 2;
        if (defined  $stoplist->{$key}) {
            next if $stoplist->{$key} == 0;
            next if $stoplist->{$key} > $words->{$key};
        }
        push @out, $key;
    }

    print join('', sort @out), "\n";
    print md5_base64(join(' ', sort @out)), "\n";
}

main(<<EOP);
The "Digest::MD5" module allows you to use the RSA Data Security Inc. MD5 Message Digest
algorithm from within Perl programs.  The algorithm takes as input a message of arbitrary
length and produces as output a 128-bit "fingerprint" or "message digest" of the input.
EOP

main(<<EOP);
The "digest::MD5" Module allows you to use the RSA Data Security Inc.
MD5 Message Digest algorithm from within Perl programs. The algorithm
takes as input a message of arbitrary length and produces as output a
128-bit "fingerprint" and "message digest" of the input. EOP
EOP

The output:

algorithmasdigestinputmd5messageofthe
PyVzoLxidA4SklaM0RsrhQ
algorithmasdigestinputmd5messageofthe
PyVzoLxidA4SklaM0RsrhQ

But looking at how spam-filtering code does this is probably the way to go.

Update: While the idea was sound, the code did not run correctly.

-- gam3
A picture is worth a thousand words, but takes 200K.

Re^2: Fingerprinting text documents for approximate comparison by BrowserUk (Patriarch) on Mar 25, 2005 at 02:01 UTC
That isn't going to be useful. The MD5 algorithm is expressly designed to detect differences, not similarity:

use Digest::MD5 qw[md5_hex];
my $s = 'the quick brown fox jumps over the lazy dog';

print md5_hex $s;
77add1d5f41223d5582fca736a5cb335

print md5_hex $s . 's';
5e48a737eaff799917707b2815af10fc

print md5_hex $s . 'S';
d02763729a741eed14417a1051ec228c

Even the addition of a single character, or the change of a single bit, produces a (numerically) completely unrelated digest. That is exactly as it should be for the purposes for which MD5 is designed, but completely wrong for this application.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?

Re^3: Fingerprinting text documents for approximate comparison by gam3 (Curate) on Mar 25, 2005 at 03:11 UTC
The MD5 is only turning a list of words into a number. It is the list of words that is the fingerprint of the file. You could just compare the words. The MD5 is just being used as a checksum.

-- gam3
A picture is worth a thousand words, but takes 200K.
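
As a rough illustration of comparing the word lists themselves rather than a checksum of them, here is a minimal sketch. The helper names word_set and jaccard, the sample sentences, and the choice of Jaccard similarity (shared words divided by total distinct words) are all assumptions made for illustration; nothing in this thread prescribes them. The minimum count is set to 1 here only because the sample texts are so short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a text to the set of lowercased words of
# two or more characters that occur at least $min times.
sub word_set {
    my ($text, $min) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w{2,})/g;
    return { map { $_ => 1 } grep { $count{$_} >= $min } keys %count };
}

# Hypothetical helper: Jaccard similarity of two word sets,
# |A intersect B| / |A union B|, ranging from 0 (disjoint) to 1 (equal).
sub jaccard {
    my ($x, $y) = @_;
    my %union = (%$x, %$y);
    return 1 unless %union;    # two empty sets are treated as identical
    my $common = grep { exists $y->{$_} } keys %$x;
    return $common / scalar(keys %union);
}

my $doc1 = 'the cat sat on the mat and the cat slept';
my $doc2 = 'the cat sat on the rug and the cat slept';
my $doc3 = 'a dog ran in a park and a dog barked';

printf "%.2f\n", jaccard(word_set($doc1, 1), word_set($doc2, 1));  # 0.75
printf "%.2f\n", jaccard(word_set($doc1, 1), word_set($doc3, 1));  # 0.08
```

Unlike a checksum, which only says "same" or "different", a score like this degrades gradually as the word lists diverge, so a threshold can be chosen to decide what counts as "approximately the same".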