I am going to assume the documents differ wildly: that you have Excel sheets, HTML files, PDFs, images, and plain text documents.
I would suggest, and this is a hack, first weeding out candidates with much less specific and less CPU-intensive methods.
How about:
- When comparing A to B, first get the file size of each. If the smaller file is more than 90 percent smaller than the larger one, decide that the documents are much too different and make no further tests.
- If the filenames are close enough (using String::Similarity), could you give that some weight for or against similarity?
- If you have many file types, could you use a MIME type lookup (MIME::Types, for example) to deem that a PDF and a PPT file are not alike at all?
- Could you run similarity on a *portion* of the text first, and on the rest only if that portion looks close?
- If all the simple conditions you conjure up still suggest the documents could be similar, then run your expensive Text::Compare procedure.
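To make the idea concrete, here is a rough sketch of such a cascade using core Perl only. Everything named in it is my own invention for illustration: the subs `size_gap`, `extension`, `bigram_similarity` and `worth_comparing`, the hash keys `name`, `size` and `head`, and the 0.30 score threshold (only the 90 percent size figure comes from the list above). The bigram score is a crude stand-in for String::Similarity, and the file extension is a crude stand-in for a real MIME type lookup. The filename weighting is left out to keep it short.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Relative size gap: 0 means equal sizes, values near 1 mean one file dwarfs the other.
sub size_gap {
    my ($size_a, $size_b) = @_;
    my ($small, $large) = sort { $a <=> $b } ($size_a, $size_b);
    return 0 if $large == 0;    # two empty files: no gap
    return ($large - $small) / $large;
}

# Lower-cased extension, used here instead of a real MIME type lookup (assumption).
sub extension {
    my ($path) = @_;
    my (undef, undef, $suffix) = fileparse($path, qr/\.[^.]*/);
    return lc $suffix;
}

# Dice coefficient over character bigrams, in the range 0..1.
# Hypothetical stand-in for String::Similarity so no CPAN module is needed.
sub bigram_similarity {
    my ($s, $t) = @_;
    return 1 if $s eq $t;
    my @s_grams = map { substr($s, $_, 2) } 0 .. length($s) - 2;
    my @t_grams = map { substr($t, $_, 2) } 0 .. length($t) - 2;
    return 0 unless @s_grams && @t_grams;
    my %seen;
    $seen{$_}++ for @s_grams;
    my $shared = 0;
    for my $gram (@t_grams) {
        next unless $seen{$gram};
        $seen{$gram}--;
        $shared++;
    }
    return 2 * $shared / (@s_grams + @t_grams);
}

# Returns 1 if the pair survives every cheap check and so deserves the
# expensive comparison, 0 if any cheap check rules it out.
sub worth_comparing {
    my ($doc_a, $doc_b, %opt) = @_;
    my $max_gap   = $opt{max_gap}   // 0.90;    # from the 90 percent rule
    my $min_score = $opt{min_score} // 0.30;    # arbitrary, tune per archive

    return 0 if size_gap($doc_a->{size}, $doc_b->{size}) > $max_gap;
    return 0 if extension($doc_a->{name}) ne extension($doc_b->{name});
    return 0 if bigram_similarity($doc_a->{head}, $doc_b->{head}) < $min_score;
    return 1;
}

# In real use, size would come from -s $path and head from the first
# couple of kilobytes of extracted text.
my %first  = (name => 'notes-2009.txt', size => 1000, head => 'the quick brown fox');
my %second = (name => 'notes-2010.txt', size => 900,  head => 'the quick brown dog');
print worth_comparing(\%first, \%second) ? "run Text::Compare\n" : "skip\n";
```

The checks run cheapest first, so most pairs that get rejected never have any text read at all. Whether each check is strict enough to trust is something only your own archive can tell you.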
Like I said, this is a total hack. If all your documents *were* similar, the extra checks would only slow the whole process down, since every pair would still reach the expensive comparison. However, if some of these simple conditions *can* be deemed authoritative for your document archive, then this could be what you need.