Re: Comparing text documents

I am going to assume the documents differ wildly: that you have Excel sheets, HTML files, PDFs, images, and simple text documents.

I would suggest, and this is a hack, first weeding out unlikely pairs with much less specific and less CPU-intensive checks. How about:

When comparing A to B, first get the file size of A and the file size of B. If the smaller file is less than 10 percent of the size of the larger one (that is, a difference of more than 90 percent), you decide these documents are much too different and make no further tests.
Could you also say that if the filenames are close enough (using String::Similarity), you give that some weight for or against similarity?
If you have many file types, could you use a MIME type lookup (MIME::Types, for example) to deem that a PDF and a PPT file are not alike at all?
Could you run the similarity test on a *portion* of the text first, and only then on the rest?
If all these simple conditions you conjure up still suggest the documents could be similar, then you run your expensive Text::Compare procedure.
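
To make the order of the checks concrete, here is a rough sketch of that cascade. It is in Python rather than Perl, purely as an illustration: difflib stands in for String::Similarity, mimetypes (which only guesses from the file extension) stands in for a MIME lookup, and the final expensive comparison is left out. All the function names and thresholds below are my own assumptions, not anything from those modules, and the prefix check assumes you can already read the files as text.

```python
import difflib
import mimetypes
import os

# All thresholds are arbitrary illustrative values; tune them for your archive.
SIZE_RATIO_MIN = 0.10   # smaller file must be at least 10% of the larger one
PREFIX_CHARS = 2000     # how much of each file the cheap text check reads
PREFIX_MIN = 0.30       # minimum similarity of the two prefixes
NAME_BONUS = 0.10       # relax PREFIX_MIN by this much when names are close


def size_ratio(size_a, size_b):
    """Return smaller/larger in [0, 1]; two empty files count as equal."""
    larger = max(size_a, size_b)
    return 1.0 if larger == 0 else min(size_a, size_b) / larger


def text_similarity(a, b):
    """Similarity in [0, 1]; autojunk is off so repetitive text still matches."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def name_similarity(path_a, path_b):
    """Compare the base filenames, ignoring directory, extension and case."""
    stem_a = os.path.splitext(os.path.basename(path_a))[0].lower()
    stem_b = os.path.splitext(os.path.basename(path_b))[0].lower()
    return text_similarity(stem_a, stem_b)


def types_conflict(path_a, path_b):
    """True only when both types can be guessed and they differ."""
    type_a, _ = mimetypes.guess_type(path_a)
    type_b, _ = mimetypes.guess_type(path_b)
    return type_a is not None and type_b is not None and type_a != type_b


def read_prefix(path):
    """Read the first PREFIX_CHARS characters, tolerating bad bytes."""
    with open(path, "r", errors="replace") as handle:
        return handle.read(PREFIX_CHARS)


def worth_full_compare(path_a, path_b):
    """Cheap checks first; True means the expensive comparison is justified."""
    if types_conflict(path_a, path_b):
        return False
    sizes = os.path.getsize(path_a), os.path.getsize(path_b)
    if size_ratio(*sizes) < SIZE_RATIO_MIN:
        return False
    # Close filenames count in the pair's favour by lowering the bar a little.
    threshold = PREFIX_MIN
    if name_similarity(path_a, path_b) >= 0.8:
        threshold -= NAME_BONUS
    return text_similarity(read_prefix(path_a), read_prefix(path_b)) >= threshold
```

You would call worth_full_compare(a, b) on each pair and run the expensive comparison only on the pairs that come back true. The checks are ordered cheapest first, so most rejected pairs never have their contents read at all.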

Like I said, this is a total hack. If all your documents *were* similar, every pair would pass the cheap checks and still need the full comparison, so the extra checks would just slow the whole process down. However, if some of these simple conditions *can* be deemed authoritative for your document archive, then it could be what you need.
