The crux of your problem is you want to compare a numerical distance from page A to page B without a lot of processing.

Nilsimsa seems useless if you can't compare a numerical distance.

Levenshtein seems useless if you can only compare relative distances by processing each pair of pages.

When I hear "distance" I think Pythagoras.

Some folks mentioned making a massive bit vector of nontrivial unique words, but having done something like this before, I can tell you that you quickly end up with bit vectors thousands of words long. And since the number of times 'vector' appears in a page may be important, you then end up with count vectors of thousands of words.

You're on the right track. My method is closer to the way DNA gets compared: look at the prevalence of smaller and smaller substring fragments. You only need a small vector (e.g., 26 ints) for each page's "absolute" position in the global space, and you can then compare two absolute positions in O(1) time.

# for each page,
#   romanize anything you can romanize
#   remove subjective words if possible
#   canonicalize the page, strip formatting
#   remove all low-value words
#   dissolve the page into letters
#   form an absolute 26-space vector from remaining letters
#   do NOT normalize the 26-space vector
#   save the 26-space vector for this page
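
A minimal Perl sketch of that recipe, assuming plain English text. The stop-word list and the markup-stripping regex are illustrative placeholders, and the romanization and subjective-word steps are skipped here:

use strict;
use warnings;

# Hypothetical stop list; a real "low-value word" list would be longer.
my %STOP = map { $_ => 1 } qw(the a an and or of to in is it that);

sub fingerprint {
    my ($text) = @_;
    $text = lc $text;                    # canonicalize case
    $text =~ s/<[^>]*>//g;               # strip simple markup/formatting
    my @counts = (0) x 26;               # the absolute 26-space vector
    for my $word (split /\W+/, $text) {
        next if $word eq '' or $STOP{$word};     # drop low-value words
        for my $letter (split //, $word) {       # dissolve into letters
            $counts[ ord($letter) - ord('a') ]++ if $letter =~ /[a-z]/;
        }
    }
    return \@counts;                     # deliberately NOT normalized
}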

Then the "distance" or "difference" between two pages is just a simple comparison of two 26-space vectors. You may weight some letters more heavily than others, and you may base your distance function on the Manhattan or Euclidean (Pythagorean) metric.
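
For example, a comparison routine might look like this; the optional per-letter weight array is an illustrative assumption:

sub distance {
    my ($u, $v, $weights) = @_;
    my ($manhattan, $sumsq) = (0, 0);
    for my $i (0 .. 25) {
        my $w = $weights ? $weights->[$i] : 1;   # optional per-letter weight
        my $d = $w * ($u->[$i] - $v->[$i]);
        $manhattan += abs $d;
        $sumsq     += $d * $d;
    }
    # list context gives (Manhattan, Euclidean); scalar context, Euclidean
    return wantarray ? ($manhattan, sqrt $sumsq) : sqrt $sumsq;
}

my $d = distance(fingerprint($page_a), fingerprint($page_b));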

Don't normalize the 26-space vector, or you'll find that any non-trivial page matches an encyclopedia in terms of overall letter distribution.

You can also take a chained approach: fingerprint each natural section of a page (sentence, paragraph, title-to-title span, etc.) separately, depending on the granularity you can afford to store and search. You can then search for fragments as well as whole pages; for example, a single paragraph may be plagiarized from a different source.
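
A sketch of the chained variant, reusing fingerprint() from above; splitting on blank lines is just one possible section boundary:

sub fingerprint_sections {
    my ($text) = @_;
    # one fingerprint per paragraph; swap in whatever natural
    # section boundary fits your documents
    return [ map { fingerprint($_) } split /\n\s*\n/, $text ];
}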

Naive searches are O(n) over n fingerprints, counting the distance formula as O(1). I would think you could b-tree, octree, or duodecasexigesima-tree (?) the vector search space to exclude very different fingerprints quickly, and cut that search down considerably.
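
Even a linear scan can exclude very different fingerprints cheaply, since the difference in total letter counts is a lower bound on the Manhattan distance. A sketch, reusing distance() from above:

sub nearest {
    my ($probe, @fingerprints) = @_;
    my $probe_sum = 0;
    $probe_sum += $_ for @$probe;
    my ($best, $best_dist) = (undef, 9**9**9);   # start at "infinity"
    for my $fp (@fingerprints) {
        my $sum = 0;
        $sum += $_ for @$fp;
        # |sum(u) - sum(v)| <= Manhattan(u, v), so this skip is safe
        next if abs($sum - $probe_sum) >= $best_dist;
        my ($manhattan) = distance($probe, $fp);  # list context: Manhattan
        ($best, $best_dist) = ($fp, $manhattan) if $manhattan < $best_dist;
    }
    return $best;
}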

--
[ e d @ h a l l e y . c c ]

