understood...
What about compressing the document with something like a word-level Huffman encoding? It replaces words with keys, the most frequently repeated words getting the shortest keys, so that would really compress the text. You could then compare the 'keys' each document uses as replacements to find like text. Going further (though this might be pushing it), you could store just the keys it uses, i.e. the header from the compression. Since the keys are assigned based on frequency of use, you could also eliminate all of the short words at that point.
Just a thought.. =)
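
To make the idea a bit more concrete, here is a minimal sketch in Python. It is not a real Huffman coder: it only builds the word-frequency table that such a header would be derived from, drops the short words, and keeps the most frequent ones as the 'keys'. The function names, the `top_n` and `min_len` parameters, and the use of Jaccard overlap to compare two signatures are all assumptions of mine for illustration.

```python
import re
from collections import Counter

def signature(text, top_n=10, min_len=4):
    """Return the top_n most frequent words that have at least min_len letters."""
    words = re.findall(r"[a-z]+", text.lower())
    # Dropping short words here stands in for "eliminate all of the short words".
    counts = Counter(w for w in words if len(w) >= min_len)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

def similarity(sig_a, sig_b):
    """Jaccard overlap of two signatures: 1.0 means same keys, 0.0 means none shared."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two documents about the same subject should share most of their keys and score close to 1.0, while unrelated documents should score near 0.0. Where to put the cut-off for "like text" would need tuning on real documents.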

Regards Paul

Update:
You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around this with this method is to take the header that is generated, pick the top 10 most used chunks (they can be words or phrases), and sort them alphabetically. Then store your documents in directories and subdirectories based on each word, e.g.

\moon\rocks\stars\
has RocksoftheMoon.doc and GanymedeGeology.doc
for instance....
This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory, or at least nearby.
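
A rough sketch of the directory idea in Python, assuming you already have the list of most used words for a document. The function name `bucket_path` and the `root` directory are made up for the example.

```python
from pathlib import PurePosixPath

def bucket_path(top_words, root="docs"):
    """Build a directory path from a document's most used words."""
    # Sorting alphabetically makes the path independent of frequency rank,
    # so the same set of words always maps to the same directory.
    parts = sorted({w.lower() for w in top_words})
    return PurePosixPath(root, *parts)

# Words can arrive in any order; the resulting path is the same.
print(bucket_path(["stars", "moon", "rocks"]))
```

One thing to watch: documents only land in the same directory when their word sets match exactly. A single differing word changes the path from that level down, so how "nearby" two documents end up depends on where the differing word falls alphabetically.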

Regards Paul

In reply to Re^3: Fingerprinting text documents for approximate comparison by thekestrel
in thread Fingerprinting text documents for approximate comparison by Mur
