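Here is a very simplistic approach you might try: count the words, keep those that occur at least twice and are not on a stop list, sort them, and take an MD5 digest of the result.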
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Stop words: a value of 0 means "always ignore this word".
our $stoplist = {
    and => 0,
    or  => 0,
    but => 0,
};

sub main {
    my $words = {};
    while (defined(my $line = shift)) {
        chomp $line;
        # Split on word boundaries, then strip whitespace and skip
        # tokens shorter than two characters.
        for my $word (split /\b/, $line) {
            $word =~ s/\s+//g;
            next if length($word) < 2;
            $words->{lc $word}++;
        }
    }
    my @out;
    for my $key (keys %$words) {
        next if $words->{$key} < 2;    # keep only words seen at least twice
        if (defined $stoplist->{$key}) {
            next if $stoplist->{$key} == 0;
            next if $stoplist->{$key} > $words->{$key};
        }
        push @out, $key;
    }
    print join('', sort @out), "\n";
    print md5_base64(join(' ', sort @out)), "\n";
}

main(<<EOP);
The "Digest::MD5" module allows you to use the RSA Data Security Inc.
MD5 Message Digest algorithm from within Perl programs. The algorithm
takes as input a message of arbitrary length and produces as output a
128-bit "fingerprint" or "message digest" of the input.
EOP
main(<<EOP);
The "digest::MD5" Module allows you to use the RSA Data
Security Inc. MD5 Message Digest algorithm from within
Perl programs. The algorithm takes as input a message of
arbitrary length and produces as output a 128-bit
"fingerprint" and "message digest" of the input.
EOP
The output (both texts produce the same word list and the same digest):
algorithmasdigestinputmd5messageofthe
PyVzoLxidA4SklaM0RsrhQ
algorithmasdigestinputmd5messageofthe
PyVzoLxidA4SklaM0RsrhQ
But looking at spam-filtering code is probably the way to go.

Update: While the idea was sound, the code did not run correctly.
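
For anyone who wants to experiment, here is a minimal sketch of the same idea packaged as a function that returns the key words and the digest instead of printing them. The name fingerprint, the %stop hash and the two sample sentences are made up for illustration. It also matches words with \w+ rather than splitting on \b, which is a small change from the code above, so treat it as one possible variant rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Illustrative stop list (an assumption, not a tuned list).
my %stop = map { $_ => 1 } qw(and or but);

# Return (array ref of sorted key words, base64 MD5 digest) for a text.
sub fingerprint {
    my ($text) = @_;
    my %count;

    # \w+ picks out runs of word characters, so punctuation and
    # whitespace never become tokens.
    for my $word ($text =~ /(\w+)/g) {
        next if length($word) < 2;
        $count{ lc $word }++;
    }

    # Keep words seen at least twice that are not stop words.
    my @keys = sort { $a cmp $b }
               grep { $count{$_} >= 2 && !$stop{$_} } keys %count;

    return (\@keys, md5_base64(join ' ', @keys));
}

my ($words_a, $digest_a) = fingerprint("The cat sat. The cat ran, and and the dog sat.");
my ($words_b, $digest_b) = fingerprint("the CAT sat; the cat ran or the dog sat!");

print "@$words_a\n";
print $digest_a eq $digest_b ? "same fingerprint\n" : "different fingerprint\n";
```

Two texts get the same fingerprint only when their sets of repeated, non-stop words are identical, so a single extra repeated word changes the digest completely. Comparing the word lists themselves would give a softer measure of similarity.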

-- gam3
A picture is worth a thousand words, but takes 200K.

In reply to Re: Fingerprinting text documents for approximate comparison by gam3
in thread Fingerprinting text documents for approximate comparison by Mur
