Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for advice on how to compare documents for similarity. I've been checking out Text::Compare, and it seems to do the job quite nicely. However, my plea for advice and knowledge is a bit more "abstract". :)

First off, I want to describe my approach:
    # Create a list of files/documents that haven't been
    # compared to the rest of the documents
    my @new_docs = $foo->get_new_documents();
    my @old_docs = $foo->get_old_documents();

    # Process
    my $Compare = Text::Compare->new( memoize => 1 );
    foreach my $new_doc ( @new_docs ) {
        $Compare->first( $new_doc );
        foreach my $old_doc ( @old_docs ) {
            $Compare->second( $old_doc );
            $foo->save_similarity( $new_doc, $old_doc,
                                   $Compare->similarity() );
        }
    }
This works quite well, but you all understand that if there are 100,000 documents to compare against, it will take a lot of time. Is it possible to do something "smart" to make it go faster?

My other question is: what is the best way to store the similarity info? Right now I have a document_similarity table with doc1_id and doc2_id columns (int, combined primary key) and a similarity field (float). This works, but are there smarter ways here as well?
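For concreteness, the table and save routine look roughly like this (SQLite syntax here just for illustration); I also canonicalise the id order so each (doc1, doc2) pair is stored only once:

    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=similarity.db', '', '',
                            { RaiseError => 1 } );

    $dbh->do(<<'SQL');
    CREATE TABLE IF NOT EXISTS document_similarity (
        doc1_id    INTEGER NOT NULL,
        doc2_id    INTEGER NOT NULL,
        similarity REAL    NOT NULL,
        PRIMARY KEY ( doc1_id, doc2_id )
    )
    SQL

    sub save_similarity {
        my ( $dbh, $id_a, $id_b, $score ) = @_;
        # Canonical order: always doc1_id < doc2_id, so the pair
        # (a,b) can never also be stored as (b,a).
        ( $id_a, $id_b ) = ( $id_b, $id_a ) if $id_a > $id_b;
        # INSERT OR REPLACE is SQLite-specific; other databases have
        # their own upsert syntax.
        $dbh->do( 'INSERT OR REPLACE INTO document_similarity VALUES (?,?,?)',
                  undef, $id_a, $id_b, $score );
    }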

Thanks in advance for any input!

Replies are listed 'Best First'.
Re: Comparing text documents
by leocharre (Priest) on Jul 02, 2007 at 16:34 UTC

    I am going to assume the documents differ wildly: that you have Excel sheets, HTML files, PDFs, images, simple text documents.

    I would suggest, and this is a hack, first weeding files out with much less specific and less CPU-intensive methods (a rough sketch follows the list). How about:

    • Comparing A to B, first get the filesize of A and the filesize of B; if the difference between the larger and the smaller file is greater than 90 percent (i.e. the smaller is less than a tenth of the larger), decide these documents are much too different and make no further tests.
    • Could you also give some weight to filename similarity (using String::Similarity) to weed a file in or out?
    • If you have many file types, could you use MIME::Types to deem that a PDF and a PPT file are not alike at all?
    • Could you run similarity on a *portion* of the text first, and only then the rest?
    • If all these simple conditions you conjure up still suggest the documents could be similar, then you run your expensive Text::Compare procedure.
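    Roughly, the weeding pass could look like this (a sketch only: the thresholds and the veto logic are made-up illustrations, but String::Similarity and MIME::Types are real modules):

        use String::Similarity;   # exports similarity()
        use MIME::Types;

        my $mime = MIME::Types->new;

        sub roughly_comparable {
            my ( $file_a, $file_b ) = @_;

            # 1. File size: if the smaller file is less than 10% of
            #    the larger, assume the documents are too different.
            my ( $small, $large ) = sort { $a <=> $b } -s $file_a, -s $file_b;
            return 0 if $large && $small / $large < 0.1;

            # 2. MIME type: a PDF and a PowerPoint file are not alike.
            my $type_a = $mime->mimeTypeOf( $file_a );
            my $type_b = $mime->mimeTypeOf( $file_b );
            return 0 if $type_a && $type_b && $type_a->type ne $type_b->type;

            # 3. Filename similarity can add a little weight; here it
            #    is only reported, not used as a veto.
            return { name_similarity => similarity( $file_a, $file_b ) };
        }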

    Like I said, this is a total hack. Overall, if all your documents *were* similar, this would greatly slow down the whole process. However, if some of these kinds of simple conditions *can* be deemed authoritative for your document archive, then it could be what you need to do.

Re: Comparing text documents
by BrowserUk (Patriarch) on Jul 02, 2007 at 13:36 UTC
      By choosing one document, try to find the X most relevant documents. ...ie. the most similar documents based on Text::Compare's vector-based rating (which looks quite good so far).

        Ignoring the performance issue for now, how happy are you with the 'relevance' of the documents selected this way?

        The reason for asking is that the mechanisms involved in Text::Compare and its underlying module Search::VectorSpace are complex, and therefore almost inevitably slow.

        In brief,

        1. Open the document, break it into words, excluding 'stopwords', and build a vector of numbers representing the remaining words, having 'stemmed' them.
        2. Now repeat the process for the second document.
        3. Now perform some math and reduce the result to a single number.
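        For illustration, a toy version of those three steps, using plain term-frequency hashes and cosine similarity in place of S::VS's stemmed, PDL-backed vectors:

            use List::Util qw(sum);

            my %STOP = map { $_ => 1 } qw( the a an and or of to in is it );

            # Steps 1/2: break a document's text into a term-frequency
            # "vector" (a hash of word => count).
            sub vectorise {
                my ( $text ) = @_;
                my %tf;
                for my $word ( map { lc } $text =~ /(\w+)/g ) {
                    next if $STOP{$word};
                    $tf{$word}++;        # no stemming in this toy version
                }
                return \%tf;
            }

            # Step 3: cosine similarity between two term-frequency hashes,
            # reduced to a single number between 0 and 1.
            sub cosine {
                my ( $v1, $v2 ) = @_;
                my $dot = sum( 0, map { ( $v2->{$_} || 0 ) * $v1->{$_} } keys %$v1 );
                my $n1  = sqrt( sum( 0, map { $_ ** 2 } values %$v1 ) );
                my $n2  = sqrt( sum( 0, map { $_ ** 2 } values %$v2 ) );
                return ( $n1 && $n2 ) ? $dot / ( $n1 * $n2 ) : 0;
            }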

        The math is done using PDL, but everything else is done using standard Perl: regexes, hashes, etc. That is fine for comparing two documents, but repeating it for the O(N²) process of comparing 100,000 docs one against the other is crazy.

        You are repeating the entire indexing of all the files in @old_docs for every file in @new_docs.

        If you dropped the Text::Compare wrapper and moved to using the underlying Search::VectorSpace directly, then you could build up an in-memory database of vectors for all of the old documents, and then compare each new document against all of the old in turn: an O(N) process rather than an O(N²) one. That would produce a significant time saving. Of course, whether you have enough memory to hold it all is another concern.
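        With the toy vectorise()/cosine() helpers from the sketch above, the restructured loop would look something like this (assuming all the old vectors fit in memory):

            # Vectorise each old document exactly once...
            my %old_vec;
            $old_vec{$_} = vectorise( slurp( $_ ) ) for @old_docs;

            # ...then each new document costs one vectorisation plus N
            # cheap vector-v-vector comparisons, instead of N full
            # re-indexings.
            for my $new_doc ( @new_docs ) {
                my $new_vec = vectorise( slurp( $new_doc ) );
                for my $old_doc ( @old_docs ) {
                    $foo->save_similarity( $new_doc, $old_doc,
                                           cosine( $new_vec, $old_vec{$old_doc} ) );
                }
            }

            sub slurp {              # trivial file-to-string helper
                my ( $file ) = @_;
                open my $fh, '<', $file or die "$file: $!";
                local $/;
                return <$fh>;
            }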

        However, Search::VectorSpace is intended to search and rate its database of documents for a single phrase or set of keywords, not entire other documents. I am really curious to know how much 'relevance' can be derived from comparing entire documents against entire documents based upon a single number representing the words they contain.

        There is also the issue that each time you do this processing, even if you moved to using S::VS directly, you would still need to re-index every document even if it hasn't changed. That's also crazy. PDL supports the reading/writing of piddles to/from disk.

        It would be much better (faster) to have one program that you point at a set of files; it loads each file in turn, generates its vector, and saves it to disk. Once you have vectorised all the documents (only re-vectorising them if they change), you can perform doc-v-doc comparisons by manipulating the vectors alone. And as the vectors will be on disk, you will not be bound by memory limitations.
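        For example, using plain Storable rather than raw piddles (the '.vec' cache-file convention is just an assumption; vectorise() and slurp() are the toy helpers above):

            use Storable qw(store retrieve);

            # Return the cached vector for a document, rebuilding it
            # only when the document is newer than its cache file.
            sub cached_vector {
                my ( $doc ) = @_;
                my $cache = "$doc.vec";                    # assumed naming scheme

                # -M is age in days, so a smaller -M means newer; the
                # cache is valid only if it is newer than the document.
                if ( -e $cache && -M $cache < -M $doc ) {
                    return retrieve( $cache );
                }
                my $vec = vectorise( slurp( $doc ) );
                store( $vec, $cache );
                return $vec;
            }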

        However, I think that the vectors produced by S::VS are based upon the composite dictionary generated from all of the words (minus stop words) in all of the documents in the in-memory database. I.e. it loads and tokenises all of the documents and builds up its dictionary before it can produce the first vector. That means the dictionaries for different sets of documents would be different, so the numbers in the vectors representing the words would be different. And that means that the vectors produced from different runs of the module would not be comparable.

        To get O(N) rather than O(N²) for the very expensive, slow parts of the process described above, you would need to generate a common dictionary from the entire corpus of documents before producing the vectors. Or you could move to using a 'standard' dictionary. This idea has some merit, especially if you are working with scientific or technical documents, because it means that you could specifically tailor that dictionary to contain (and possibly only contain) specialist words and terms of your field. However, S::VS (and much less Text::Compare) has no provision for such mechanisms and is not architected to accommodate external dictionaries and disk-based databases. Essentially, you would have to steal the techniques used internally by these modules, but write your own code to put those techniques together in a way that suits your application requirements.
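        In outline, the common-dictionary approach might look like this (all names hypothetical; vectorise() and slurp() are the toy helpers above):

            # Build one shared dictionary over the whole corpus, then
            # map each document onto a fixed-length array of counts.
            my %dict;        # word => column index
            my $next = 0;

            for my $doc ( @all_docs ) {
                my $tf = vectorise( slurp( $doc ) );
                for my $word ( keys %$tf ) {
                    $dict{$word} = $next++ unless exists $dict{$word};
                }
            }

            # Because every document uses the same word => index mapping,
            # vectors produced in different runs (against the same saved
            # dictionary) stay directly comparable.
            sub to_fixed_vector {
                my ( $tf ) = @_;
                my @v = (0) x $next;
                $v[ $dict{$_} ] = $tf->{$_} for grep { exists $dict{$_} } keys %$tf;
                return \@v;
            }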

        The benefit you would gain from doing so is a huge speed-up in your processing time and a much finer granularity in your corpus. I.e. when a new document arrives, you only need to vectorise it against the standard dictionary and then compare it against each of the other stored vectors, rather than having to go through the entire process each time a word changes in any document.

        The bottom line is, CPAN modules are great if they do exactly what you need, but if the author's requirements diverge from yours even quite slightly, you may well pay a very high price for the reuse. And you have to look inside the modules and see how they do what they do to understand whether you are making best use of the underlying algorithms. That's especially true when you have one module layered on top of another, as here.

        The underlying S::VS module has the inherent ability to search a database of documents quickly. T::C doesn't make use of that, because it only seeks to compare two documents. You come along and want to compare 100,000 documents each against the other, but by using the T::C wrapper over S::VS, you end up re-vectorising your documents over and over again.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.