in reply to Fingerprinting text documents for approximate comparison
    #!/usr/bin/perl
    use strict;

    # Words listed with a value of 0 are always dropped from the fingerprint.
    our $stoplist = {
        and => 0,
        or  => 0,
        but => 0,
    };

    use Digest::MD5 qw(md5 md5_hex md5_base64);

    sub main {
        my $line;
        my $words = {};

        # Count every lowercased token of two or more characters.
        while ($line = shift) {
            chomp $line;
            my @data = split(/\b/, $line);
            for my $word (@data) {
                $word =~ s/\s*//g;
                chomp($word);
                next if length $word < 2;
                $words->{lc($word)}++;
            }
        }

        # Keep only tokens that occur at least twice and pass the stoplist.
        my @out;
        for my $key (keys %$words) {
            next if $words->{$key} < 2;
            if (defined $stoplist->{$key}) {
                next if $stoplist->{$key} == 0;
                next if $stoplist->{$key} > $words->{$key};
            }
            push @out, $key;
        }

        # Print the sorted tokens, then the MD5 digest of the sorted list.
        print join('', sort @out), "\n";
        print md5_base64(join(' ', sort @out)), "\n";
    }

    main(<<EOP);
    The "Digest::MD5" module allows you to use the RSA Data Security Inc.
    MD5 Message Digest algorithm from within Perl programs. The algorithm
    takes as input a message of arbitrary length and produces as output a
    128-bit "fingerprint" or "message digest" of the input.
    EOP

    main(<<EOP);
    The "digest::MD5" Module allows you to use the RSA Data Security Inc.
    MD5 Message Digest algorithm from within Perl programs. The algorithm
    takes as input a message of arbitrary length and produces as output a
    128-bit "fingerprint" and "message digest" of the input.
    EOP

The output:

    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ
    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ

But looking at spam code is probably the way to go.
Update: While the idea was sound, the code did not run correctly.
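For anyone who wants to compare two documents directly, here is a minimal sketch of the same idea with the fingerprint returned rather than printed. It is not the code from this post: the names `fingerprint_words` and `fingerprint`, the `%stop` hash, and the `$min` threshold are illustrative assumptions, and tokenizing with `\w+` is a swapped-in simplification of the `split(/\b/)` approach.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Illustrative stoplist (an assumption): words that never count.
my %stop = map { $_ => 1 } qw(and or but);

# Return the sorted, lowercased words that occur at least $min times.
sub fingerprint_words {
    my ($text, $min) = @_;
    $min = 2 unless defined $min;
    my %count;
    for my $word (map { lc($_) } $text =~ /(\w+)/g) {
        next if length($word) < 2 or $stop{$word};
        $count{$word}++;
    }
    my @words = sort grep { $count{$_} >= $min } keys %count;
    return @words;
}

# Return the base64 MD5 digest of the space-joined word list.
sub fingerprint {
    my ($text, $min) = @_;
    return md5_base64(join ' ', fingerprint_words($text, $min));
}

# Two texts that differ only in a stoplisted word.
my $doc1 = 'The cat saw the cat and the dog';
my $doc2 = 'The cat saw the cat or the dog';

my $same    = fingerprint($doc1) eq fingerprint($doc2);
my $verdict = $same ? 'match' : 'differ';
print "fingerprints $verdict\n";    # prints "fingerprints match"
```

Returning the word list separately from the digest makes it possible to inspect which words drove a match. Note that an exact digest comparison only detects documents whose repeated-word sets are identical; it gives no measure of how close two differing documents are.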
Re^2: Fingerprinting text documents for approximate comparison
by BrowserUk (Patriarch) on Mar 25, 2005 at 02:01 UTC
by gam3 (Curate) on Mar 25, 2005 at 03:11 UTC