Ignoring the performance issue for now, how happy are you with the 'relevance' of the documents selected this way?
The reason for asking this question is that the mechanisms involved in Text::Compare and its underlying module Search::VectorSpace are complex, and therefore almost inevitably slow.
In brief,
- Open the document, break it into words, exclude 'stopwords', 'stem' the remaining words, and build a vector of numbers representing them.
- Now repeat the process for the second document.
- Now perform some math and reduce the result to a single number.
The match is done using PDL, but everything else is done using standard Perl: regexes, hashes, etc. It's fine for comparing two documents, but repeating all of this for the O(N²) process of comparing 100,000 docs one against the other is crazy.
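To make that concrete, here is a rough sketch of the work involved in a single pairwise comparison. It is not the modules' actual code: the stopword list is a toy, stemming is omitted and the names are mine, but the shape (tokenise both documents, build a dictionary and two term-frequency vectors, then one cosine calculation) is the point:

    use strict;
    use warnings;
    use PDL;                    # only the final similarity step uses PDL

    my %stop = map { $_ => 1 } qw( a an and of or the to );   # toy stopword list

    sub tokenise {              # split into words, lowercase, drop stopwords; stemming omitted
        my( $text ) = @_;
        return grep { !$stop{ $_ } } map { lc } $text =~ /(\w+)/g;
    }

    sub compare_pair {
        my( $text1, $text2 ) = @_;
        my @w1 = tokenise( $text1 );
        my @w2 = tokenise( $text2 );

        ## Build a dictionary from just these two documents: word => vector position.
        my %dict;
        my $next = 0;
        for my $word ( @w1, @w2 ) {
            $dict{ $word } = $next++ unless exists $dict{ $word };
        }

        ## Term-frequency vectors over that dictionary.
        my $v1 = zeroes( scalar keys %dict );
        my $v2 = zeroes( scalar keys %dict );
        $v1->set( $dict{ $_ }, $v1->at( $dict{ $_ } ) + 1 ) for @w1;
        $v2->set( $dict{ $_ }, $v2->at( $dict{ $_ } ) + 1 ) for @w2;

        ## Cosine similarity: reduce the two vectors to a single number.
        my $denom = sqrt( inner( $v1, $v1 ) * inner( $v2, $v2 ) );
        return $denom ? ( inner( $v1, $v2 ) / $denom )->sclr : 0;
    }

Everything above the last three lines is plain Perl string and hash work, and all of it is repeated for every pair you compare.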
You are repeating the entire indexing of all the files in @old_docs for every file in @new_docs.
If you dropped the Text::Compare wrapper and moved to using the underlying Search::VectorSpace directly, then you could build up an in-memory database of vectors for all of the old documents, and then compare each new document against all of the old in turn. I.e. an O(N) process rather than an O(N²) one. That would produce a significant time saving. Of course, whether you have enough memory to hold it all is another concern.
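In rough outline, and glossing over the dictionary issue I come back to below, that restructuring looks something like this. Here vectorise() and similarity() are assumed helpers along the lines of the sketch above, except that vectorise() fills its vector from one shared dictionary rather than a per-pair one; slurp() just reads a whole file, and the 0.8 threshold is arbitrary:

    sub slurp { local( @ARGV, $/ ) = @_; <> }      # read a whole file into one string

    ## Index the old documents once, keeping their vectors in memory.
    my %old_vec;
    $old_vec{ $_ } = vectorise( slurp( $_ ) ) for @old_docs;

    ## Each new document is vectorised once, then compared against every stored vector.
    for my $new ( @new_docs ) {
        my $nvec = vectorise( slurp( $new ) );
        for my $old ( keys %old_vec ) {
            my $score = similarity( $nvec, $old_vec{ $old } );
            print "$new vs $old : $score\n" if $score > 0.8;
        }
    }

The inner loop still runs once per old document, but it is now just a couple of inner products in PDL; the expensive tokenise/stem/vectorise work happens once per document instead of once per pair.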
However, Search::VectorSpace is intended to search and rate its database of documents for a single phrase or set of keywords, not entire other documents. I am really curious to know how much 'relevance' can be derived from comparing entire documents against entire documents on the basis of a single number representing the words they contain.
There is also the issue that each time you do this processing, even if you moved to using S::VS directly, you would still need to re-index every document even if it hasn't changed. That's also crazy. PDL supports the reading/writing of piddles to/from disk.
It would be much better (faster) to have one program that you point at a set of files; it loads each file in turn, generates its vector and then saves it to disk. Once you have vectorised all the documents (only re-vectorising them if they change), you can perform doc-v-doc comparisons by manipulating the vectors alone. And as the vectors will be on disk, you will not be bound by memory limitations.
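A sketch of that vectorise-and-save step, using PDL::IO::FastRaw to store the piddles on disk. The '.vec' naming scheme, the freshness test and the @docs list are just illustrative, and vectorise()/slurp()/similarity() are the assumed helpers from above:

    use PDL;
    use PDL::IO::FastRaw;                   # writefraw() / readfraw()

    ## Vectorise each document, but only if its saved vector is missing or stale.
    for my $file ( @docs ) {
        my $vecfile = "$file.vec";
        next if -e $vecfile and -M $vecfile < -M $file;      # vector is newer than the doc
        writefraw( vectorise( slurp( $file ) ), $vecfile );  # writes $vecfile and $vecfile.hdr
    }

    ## Later, comparisons work on the stored vectors alone; the text is never re-read.
    my $score = similarity( readfraw( 'docA.txt.vec' ), readfraw( 'docB.txt.vec' ) );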
However, I think that the vectors produced by S::VS are based upon a composite dictionary generated from all of the words (minus stop words) in all of the documents in its in-memory database. I.e. it loads and tokenises all of the documents and builds up its dictionary before it can produce the first vector. That means the dictionaries for different sets of documents would be different, therefore the numbers in the vectors representing the words would be different. And that means that the vectors produced from different runs of the module would not be comparable.
To get the O(N) rather than O(N²) behaviour for the very expensive, slow parts of the process described above, you would need to generate a common dictionary from the entire corpus of documents before producing the vectors. Or, you could move to using a 'standard' dictionary. This idea has some merit, especially if you are working with scientific or technical documents, because it means that you could specifically tailor that dictionary to contain (and possibly only contain) the specialist words and terms of your field. However, S::VS (and still less Text::Compare) has no provision for such mechanisms and is not architected to accommodate external dictionaries or disk-based databases. Essentially, you would have to steal the techniques used internally by these modules, but write your own code to put those techniques together in a way that suits your application requirements.
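For what it's worth, generating that common or standard dictionary is not much code in itself; it is the surrounding plumbing, and keeping the dictionary stable between runs, that the modules do not give you. Both subs below are illustrative sketches, and the word-list filename is made up:

    ## Option 1: one common dictionary built from the whole corpus, up front.
    sub dict_from_corpus {
        my @files = @_;
        my %dict;
        my $next = 0;
        for my $file ( @files ) {
            for my $word ( tokenise( slurp( $file ) ) ) {
                $dict{ $word } = $next++ unless exists $dict{ $word };
            }
        }
        return \%dict;
    }

    ## Option 2: a 'standard' dictionary read from an external word list,
    ## e.g. one tailored to the specialist terms of your field.
    sub dict_from_wordlist {
        my( $listfile ) = @_;
        open my $fh, '<', $listfile or die "$listfile: $!";
        my %dict;
        my $next = 0;
        while ( my $word = <$fh> ) {
            chomp $word;
            $dict{ lc $word } = $next++;
        }
        return \%dict;
    }

Either way, every vector produced against that dictionary has the same length and the same word-to-position mapping, so vectors from different runs (and different document sets) remain directly comparable.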
The benefit you would gain from doing so is a huge speed-up in your processing time and much finer-grained handling of your corpus. I.e. when a new document arrives, you only need to vectorise it against the standard dictionary and then compare it against each of the other stored vectors, rather than having to go through the entire process each time a word changes in any document.
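Put together, handling a newly arrived document then reduces to something like this (again assuming the helpers sketched above, with vectorise() taking the shared dictionary as an extra argument, and with illustrative filenames):

    my $dict = dict_from_wordlist( 'specialist_terms.txt' );  # fixed, shared dictionary

    my $nvec = vectorise( slurp( $new_doc ), $dict );         # the only text processing needed

    for my $vecfile ( glob '*.vec' ) {                        # vectors already stored on disk
        my $score = similarity( $nvec, readfraw( $vecfile ) );
        print "$new_doc vs $vecfile : $score\n" if $score > 0.8;
    }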
The bottom line is, CPAN modules are great if they do exactly what you need to do, but if the author's requirements diverge from yours even quite slightly, you may well pay a very high price for the reuse. And you have to look inside the modules and see how they do what they do to understand whether you are making best use of the underlying algorithms. That's especially true when you have one module layered on top of another, as here.
The underlying S::VS module has the inherent ability to search a database of documents quickly. T::C doesn't make use of that, because it only seeks to compare two documents. You come along and want to compare 100,000 documents each against the other, but by using the T::C wrapper over S::VS, you end up re-vectorising your documents over and over again.