Keep the small words. If you want to drop stuff, then get a table of the N most common words in your target language (or corpus) and drop those. But you probably needn't bother.

Pick some fairly short "chunk size". It'll need to be a little longer if you don't drop common words, but you'll probably just need to tune it experimentally.

Hash every substring of that chunk size. That'll give you one hash per token in your input (approximately -- actually, a few less). If you use a rolling hash function, you can do this fast. For now, ignore the problem that this will be about four times larger than your original input.

What you do next depends on whether you want to do this pairwise, or you want to find all of the most similar documents in a large corpus.

Pairwise: hash the two documents and count up the percentage of hashes in common.

Corpus: Make a giant hash table mapping each of these hashes to some id representing the document they came from. Hash the incoming document and look up each of the hashes in the table. Generate another table mapping document id => count of matching hashes. Sort this by count.

This is still impractical because of the number of hashes. So randomly drop some proportion of them. Make the drop decision based only upon the value of the hash, so that you drop the same hashes from all the documents. (Instead of comparing entire elephants to elephants, you want to compare elephant legs and trunks to elephant legs and trunks. You do not want to compare legs to sometimes legs, sometimes tails.)

There is a further refinement that eliminates degenerate cases, but it's patented and so you're probably better off not knowing about it. Perhaps you'll rediscover it on your own, but then you won't have this post sitting here to prove willful infringement.

Search for "MOSS" and "agrep" for more algorithmic details. Useful names are "Alex Aiken" and "Udi Manber".

Except that your name looks very familiar, so you probably already know all this. I think you may be the guy who asked me to call you a few months back. Sad how I'll happily reply to questions posted on this site, but never get around to replying to my personal email...


In reply to Re: Fingerprinting text documents for approximate comparison by sfink
in thread Fingerprinting text documents for approximate comparison by Mur

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.