in reply to Fingerprinting text documents for approximate comparison
Most people who do this sort of thing do it in terms of some sort of cosine or other trig-like metric, but for a bone-head approach that I'm making up from scratch, that's a bit beyond my ken (I never got that far in math -- or if I did, I've forgotten). So instead, I'll just assign a distinct "distance" (a distinct power of 2) to each letter, so that if two documents contain the same set of letters, they traverse the same distance.
Because it seems sensible, I'll assign distances to the letters in order of their likelihood of occurrence. Then, for each document, I just build up a character histogram and sum up a distance score by multiplying each character's frequency of occurrence by its assigned magnitude.
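For example, the most frequent letters get the lowest powers -- "e" is 1 (2^0), "a" is 2 (2^1), "t" is 4 (2^2) -- and the rarest letter ends up at 2^25. A document consisting of nothing but "eat tea" would then score 2*1 + 2*2 + 2*4 = 14, and so would any other shuffling of those six letters.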
To make things more efficient, I'll prepare the input data by building a lookup table for the docs: a simple list that gives a file name, a doc-ID string, and a starting and ending byte offset in the file. (Whether a given file contains one doc or many is not relevant -- I just need to make sure that each doc has a unique ID.)
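For instance, a doctable might look like this (the file names, IDs and offsets here are made up, just to show the four whitespace-separated fields the script expects):

apw_9812.txt    APW19981201.0001       0    2345
apw_9812.txt    APW19981201.0002    2345    5120
reuters_01.txt  REU20010103.0017       0    1876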
#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 < doctable > sig.list\n" if ( @ARGV );

# a simple-minded attempt to generate a "signature" value
# that could be used as a means of numerically measuring
# document similarity

# @letters is ordered according to relative letter frequency in
# English news text (based on a 1-month sample of AP newswire):

my @letters = qw/e a t i n o s r h l d c u m p g f w y b v k j x z q/;

# %vectors stores a distinct power of 2 for each letter;
# more frequent letters are assigned lower powers:

my $i = 0;
my %vectors = map { $_ => 1 << $i++ } @letters;

my ( $openfn, $offset ) = ( '', 0 );

# Input on stdin is one doc entry per line:
#   file.name  doc_id  begin_offset  end_offset

while (<>) {
    my ( $fn, $id, $so, $eo ) = split;
    if ( $fn ne $openfn ) {
        open IN, $fn or die "$fn: $!";
        $openfn = $fn;
        $offset = 0;
    }
    seek( IN, $so, 0 ) if ( $so != $offset );
    read( IN, $_, $eo - $so );
    $offset = $eo;

    tr/A-Z/a-z/;    # fold to lower case
    tr/a-z//cd;     # delete everything that isn't a letter

    my %c_hist = ();
    $c_hist{$_}++ for ( split // );

    my $sig = 0;
    $sig += $c_hist{$_} * $vectors{$_} for ( keys %c_hist );
    printf "%s\t%f\n", $id, $sig;
}
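Save that as (say) sig.pl and run it like so:

perl sig.pl < doctable > sig.list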
The output will be such that shorter docs will have smaller signature numbers and longer docs will have larger numbers; docs with equal scores will usually have the exact same inventory of letters (but see the caveats below), and ones with small differences in their scores will differ by only a few letters.
(Well, there is a possibility that two completely unrelated docs might be anagrams of one another. And of course, a document with 2 "e"s will score the same as one with a single "a", other things being equal, and likewise for other equivalences. But for a first-pass discriminator, it works pretty well. I also tried it with prime numbers instead of powers of 2, and found that the primes yielded more "collisions" -- fewer distinct values on the same set of about 25K documents.)
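Once you have sig.list, one way to exploit the "small differences in score" property (just a sketch of how I'd follow up, not part of the script above -- the threshold is arbitrary and needs tuning against your data) is to sort by signature and report neighboring pairs whose scores are close:

#!/usr/bin/perl
use strict;
use warnings;

# read sig.list (doc_id <tab> signature) on stdin, sort by signature,
# and report adjacent pairs whose scores differ by less than a threshold
my $threshold = 8;    # arbitrary; tune to taste

my @sigs;
while (<>) {
    chomp;
    my ( $id, $sig ) = split /\t/;
    push @sigs, [ $id, $sig ];
}
@sigs = sort { $a->[1] <=> $b->[1] } @sigs;

for my $n ( 1 .. $#sigs ) {
    my $diff = $sigs[$n][1] - $sigs[$n-1][1];
    printf "%s\t%s\t%.0f\n", $sigs[$n-1][0], $sigs[$n][0], $diff
        if ( $diff < $threshold );
}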
Another thing to point out: this would be easy enough to implement in C, and doing so would be worthwhile for the run-time you'd save.
But that's the simple part. The hard part, especially if you're handling docs that are being pulled off of web sites, is to make sure that you only compare the parts that matter.
I presume you are already taking care to strip out markup, HTML comments, javascript, yadda yadda. But once you reduce the pages down to visible text content, you still need to strip away or simply avoid all the site-specific framework (page headers, footers, side bars, nav. bars, ad nauseam). If you're handling pages from multiple web sites, you need filters adapted to each one (and each filter is likely to have to change over time, as the web sites change their formats at whim). Good luck with that.
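For what it's worth, the generic markup-stripping part can start out as crude as the sketch below (regexes only -- a real HTML parser module is more robust, and the site-specific filtering still has to be layered on top of it):

#!/usr/bin/perl
use strict;
use warnings;

# crude first pass at reducing an HTML page (on stdin) to visible text
local $/;                                 # slurp the whole page
my $text = <>;
defined $text or exit;

$text =~ s/<script\b.*?<\/script>//gis;   # drop javascript
$text =~ s/<style\b.*?<\/style>//gis;     # drop stylesheets
$text =~ s/<!--.*?-->//gs;                # drop HTML comments
$text =~ s/<[^>]*>/ /gs;                  # drop remaining tags
$text =~ s/&[#\w]+;/ /g;                  # drop character entities
$text =~ s/\s+/ /g;                       # squeeze whitespace

print $text, "\n";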