in reply to Fingerprinting text documents for approximate comparison
Nilsimsa seems useless if you can't compare a numerical distance.
Levenshtein seems useless if you can only compare relative distances by processing each pair of pages.
When I hear "distance" I think Pythagoras.
Some folks mentioned making a massive bit vector for nontrivial unique words, but having done something like this before, you quickly end up with bit vectors of thousands of words. And the number of times that 'vector' appears in a page may be important, so then you end up with count-vectors of thousands of words.
You're on the right track. My method is closer to the way DNA gets compared, in comparing the prevalence of smaller and smaller substring fragments. You only need a small vector (e.g., 26 ints) for each page's "absolute" position in the global space. You then can compare two absolute positions in O(1) time.
# for each page,
#   romanize anything you can romanize
#   remove subjective words if possible
#   canonicalize the page, strip formatting
#   remove all low-value words
#   dissolve the page into letters
#   form an absolute 26-space vector from remaining letters
#   do NOT normalize the 26-space vector
#   save the 26-space vector for this page
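The steps above can be sketched in Python. This is a minimal illustration, not a finished pipeline: the stop-word list is a tiny placeholder, and "romanize" and "remove subjective words" are left out since they depend on your corpus.

```python
from collections import Counter
import re

# Placeholder stop-word list; a real "low-value words" list would be larger.
LOW_VALUE_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is"}

def fingerprint(page):
    """Return the page's absolute 26-space vector (raw letter counts)."""
    # canonicalize: lowercase, strip everything that isn't a letter or space
    text = re.sub(r"[^a-z\s]", " ", page.lower())
    # remove low-value words
    words = [w for w in text.split() if w not in LOW_VALUE_WORDS]
    # dissolve into letters and count; do NOT normalize
    counts = Counter("".join(words))
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]
```

Note the vector stays as raw counts, so page length is preserved, which matters below.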
Then, the "distance" or "difference" between two pages is just a simple comparison of two 26-space vectors. You may scale the importance of some letters higher than others, and you may base your distance function on the Manhattan or Euclidean (Pythagorean) metric.
Don't normalize the 26-space vector, or you'll find that any non-trivial page matches an encyclopedia in terms of overall letter distribution.
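For concreteness, here is a sketch of the O(1) comparisons, both unweighted and with an optional per-letter weight vector (the weighting scheme itself is up to you):

```python
import math

def manhattan(v1, v2):
    """Sum of absolute per-letter differences."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

def euclidean(v1, v2):
    """Pythagorean distance in 26-space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def weighted_manhattan(v1, v2, weights):
    """Manhattan distance with per-letter importance scaling,
    e.g. discounting very common letters."""
    return sum(w * abs(a - b) for a, b, w in zip(v1, v2, weights))
```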
You can also do this on a chain approach: each natural section of a page (sentence, paragraph, title-to-title, etc.) can be fingerprinted separately, depending on the granularity you can afford to store and search. You can then search for fragments as well as whole pages. For example, one paragraph may be plagiarized from a different source.
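The chain approach might look like this, assuming blank-line-separated paragraphs as the fragment granularity and a bare letter-count fingerprint standing in for the full cleanup pipeline:

```python
from collections import Counter

def letter_vector(text):
    # minimal stand-in fingerprint: raw a-z counts, no word filtering
    counts = Counter(c for c in text.lower() if "a" <= c <= "z")
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]

def fragment_fingerprints(page):
    """One fingerprint per paragraph, so a single plagiarized
    paragraph can be matched without matching the whole page."""
    return [letter_vector(p) for p in page.split("\n\n") if p.strip()]
```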
Naive searches are O(n) with n fingerprints, if you count the distance formula as O(1). I would think you could use a b-tree, octree, or 26-dimensional analogue (duodecasexigesima-tree?) to partition the vector search space for faster exclusion of very different fingerprints, and cut that search down considerably.
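Even without a tree, cheap pruning helps: the Manhattan distance between two vectors is at least the difference of their component sums, so you can skip most of the corpus with one precomputable comparison. A sketch, with hypothetical names:

```python
def search(query, corpus, threshold):
    """Scan (page_id, vector) pairs, pruning candidates whose total
    letter count alone rules them out of the threshold."""
    qsum = sum(query)
    hits = []
    for page_id, vec in corpus:
        # lower bound: |sum(vec) - sum(query)| <= manhattan(vec, query)
        if abs(sum(vec) - qsum) > threshold:
            continue  # cannot possibly be within threshold
        d = sum(abs(a - b) for a, b in zip(query, vec))
        if d <= threshold:
            hits.append((page_id, d))
    return sorted(hits, key=lambda t: t[1])
```

In practice you would store each vector's sum alongside it rather than recomputing it per query.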
--
[ e d @ h a l l e y . c c ]