in reply to Fingerprinting text documents for approximate comparison

I would create a fingerprint file for each document; you will need to tune the parameters you use.

In this file I would put perhaps:
the number of significant words
the average length, in letters, of the 5 most common words
the three least common significant words (alphabetized)
the three most common significant words (alphabetized)
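
As a rough sketch of what such a fingerprint might look like, here is one way to compute the four fields in Python. The stop-word list, the minimum word length, the field names, and the function name `fingerprint` are all my own assumptions; "significant" is whatever you decide it means for your documents, and here the word count is the total number of significant word occurrences.

```python
import re
from collections import Counter

# Hypothetical stop-word list: an assumption, tune it for your own documents.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "was", "at", "by"}

def fingerprint(text, top_n=5, edge_n=3):
    """Build a small fingerprint dict for one document (illustrative only)."""
    # Assumption: a word is "significant" if it is not a stop word
    # and is longer than two letters.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS and len(w) > 2]
    counts = Counter(words)
    # Rank by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    top = ranked[:top_n]
    avg_len = sum(len(w) for w in top) / len(top) if top else 0.0
    return {
        "significant_words": len(words),          # total occurrences
        "avg_top_len": avg_len,                   # mean length of top words
        "least_common": sorted(ranked[-edge_n:]), # alphabetized
        "most_common": sorted(ranked[:edge_n]),   # alphabetized
    }
```

Breaking frequency ties alphabetically keeps the fingerprint deterministic, so the same document always produces the same file.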

You can either use your current checksum, or create a checksum on the fingerprint files.
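
If you go the second route, a checksum over the fingerprint could be sketched like this. The function name `fingerprint_checksum` and the choice of JSON plus SHA-256 are assumptions on my part, not anything required by the idea.

```python
import hashlib
import json

def fingerprint_checksum(fp):
    """Checksum a fingerprint dict (hypothetical helper, illustrative only)."""
    # Serialize with sorted keys so the same fields always give the same bytes.
    canonical = json.dumps(fp, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keep in mind that a hash like this is only equal when the fingerprints are identical, so it groups exact fingerprint matches; near matches still need the field-by-field comparison.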

Use similar checksums to select which fingerprint files to compare; fingerprints that fall within a tolerance you set would be deemed matches.
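
The tolerance test could be sketched along these lines. The field names in the dicts, the function name `is_match`, and the default thresholds are all hypothetical; you would pick tolerances by experimenting on your own documents.

```python
def is_match(fp_a, fp_b, count_tol=0.10, len_tol=0.5, min_shared=3):
    """Decide whether two fingerprint dicts are close enough (illustrative)."""
    # Relative difference in significant-word counts.
    bigger = max(fp_a["significant_words"], fp_b["significant_words"], 1)
    count_diff = abs(fp_a["significant_words"] - fp_b["significant_words"]) / bigger
    # Absolute difference in average length of the top words.
    len_diff = abs(fp_a["avg_top_len"] - fp_b["avg_top_len"])
    # Words shared between the most-common lists and the least-common lists.
    shared = (len(set(fp_a["most_common"]) & set(fp_b["most_common"]))
              + len(set(fp_a["least_common"]) & set(fp_b["least_common"])))
    return count_diff <= count_tol and len_diff <= len_tol and shared >= min_shared

doc1 = {"significant_words": 100, "avg_top_len": 5.0,
        "least_common": ["x", "y", "z"], "most_common": ["a", "b", "c"]}
doc2 = {"significant_words": 105, "avg_top_len": 5.2,
        "least_common": ["q", "x", "y"], "most_common": ["a", "b", "d"]}
print(is_match(doc1, doc2))
```

Requiring all three conditions at once is a deliberately strict choice; loosening any one threshold trades false negatives for false positives.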

Just my 2 cents' worth, good luck!

Enjoy!
Dageek
