in reply to Fingerprinting text documents for approximate comparison

Keep the small words. If you want to drop stuff, then get a table of the N most common words in your target language (or corpus) and drop those. But you probably needn't bother.

Pick some fairly short "chunk size". It'll need to be a little longer if you don't drop common words, but you'll probably just need to tune it experimentally.

Hash every substring of that chunk size. That'll give you one hash per token in your input (approximately -- actually, a few less). If you use a rolling hash function, you can do this fast. For now, ignore the problem that this will be about four times larger than your original input.

What you do next depends on whether you want to do this pairwise, or you want to find all of the most similar documents in a large corpus.

Pairwise: hash the two documents and count up the percentage of hashes in common.

Corpus: Make a giant hash table mapping each of these hashes to some id representing the document they came from. Hash the incoming document and look up each of the hashes in the table. Generate another table mapping document id => count of matching hashes. Sort this by count.

This is still impractical because of the number of hashes. So randomly drop some proportion of them. Make the drop decision based only upon the value of the hash, so that you drop the same hashes from all the documents. (Instead of comparing entire elephants to elephants, you want to compare elephant legs and trunks to elephant legs and trunks. You do not want to compare legs to sometimes legs, sometimes tails.)

There is a further refinement that eliminates degenerate cases, but it's patented and so you're probably better off not knowing about it. Perhaps you'll rediscover it on your own, but then you won't have this post sitting here to prove willful infringement.

Search for "MOSS" and "agrep" for more algorithmic details. Useful names are "Alex Aiken" and "Udi Manber".

Except that your name looks very familiar, so you probably already know all this. I think you may be the guy who asked me to call you a few months back. Sad how I'll happily reply to questions posted on this site, but never get around to replying to my personal email...

  • Comment on Re: Fingerprinting text documents for approximate comparison