in reply to Re^2: Comparing text documents
in thread Comparing text documents

Ignoring the performance issue for now, how happy are you with the 'relevance' of the documents selected this way?

The reason for asking is that the mechanisms involved in Text::Compare and its underlying module Search::VectorSpace are complex, and therefore almost inevitably slow.

In brief,

  1. Open the first document, break it into words, exclude 'stopwords', 'stem' the remaining words, and build a vector of numbers representing them.
  2. Now repeat the process for the second document.
  3. Now perform some math and reduce the result to a single number.

The matching math is done using PDL, but everything else is done using standard Perl: regexes, hashes and so on. That's fine for comparing two documents, but repeating it for the O(N²) process of comparing 100,000 docs one against the other is crazy.
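
To make that cost concrete, here is a rough sketch of the work done for every single pairing. This is not Text::Compare's actual code: the file names, the tiny stopword list and the helper names words_of() and count_vector() are mine, and real stemming (Search::VectorSpace uses Lingua::Stem) is omitted.

  # Roughly what happens for EVERY pair of documents; with Text::Compare
  # all of this is repeated for every pairing.
  use strict;
  use warnings;
  use PDL;

  my %stop = map { $_ => 1 } qw( the a an and or of to in is it );

  sub words_of {                      # split a file into non-stop words
      my( $file ) = @_;
      open my $fh, '<', $file or die "$file: $!";
      local $/;
      return grep { length > 2 and !$stop{ $_ } }    # drop short words and stopwords
             map  { lc }
             split /\W+/, <$fh>;
  }

  sub count_vector {                  # word counts, in a fixed word order
      my( $dict, @words ) = @_;
      my %freq;  $freq{ $_ }++ for @words;
      return pdl( map { $freq{ $_ } || 0 } @$dict );
  }

  my @wordsA = words_of( 'docA.txt' );               # hypothetical file names
  my @wordsB = words_of( 'docB.txt' );

  # per-pair dictionary, built from just these two documents
  my %seen;
  my @dict = grep { !$seen{ $_ }++ } @wordsA, @wordsB;

  my( $vA, $vB ) = map { count_vector( \@dict, @$_ ) } \@wordsA, \@wordsB;

  my $cosine = inner( $vA, $vB )
             / ( sqrt( inner( $vA, $vA ) ) * sqrt( inner( $vB, $vB ) ) );
  print "similarity: $cosine\n";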

You are repeating the entire indexing of all the files in @old_docs for every file in @new_docs.

If you dropped the Text::Compare wrapper and moved to using the underlying Search::VectorSpace directly, you could build up an in-memory database of vectors for all of the old documents, and then compare each new document against all of the old in turn. Ie. an O(N) process rather than an O(N²) one. That would produce a significant time saving. Of course, whether you have enough memory to hold it all is another concern.
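
In outline it might look like this. Hypothetical code only: it re-uses the words_of() and count_vector() helpers from the sketch above, and it assumes a single common @dict built from the old documents (the wrinkle with that assumption is discussed further down).

  # Every document is split, stemmed and vectorised exactly once,
  # instead of once per pairing.
  use strict;
  use warnings;
  use PDL;

  my @old_docs = glob 'old/*.txt';    # hypothetical locations
  my @new_docs = glob 'new/*.txt';

  # one common dictionary built from the old documents
  my %seen;
  my @dict = grep { !$seen{ $_ }++ } map { words_of( $_ ) } @old_docs;

  # in-memory database of vectors for the old documents
  my %old_vec;
  $old_vec{ $_ } = count_vector( \@dict, words_of( $_ ) ) for @old_docs;

  for my $new ( @new_docs ) {
      my $vNew = count_vector( \@dict, words_of( $new ) );
      my $nNew = sqrt( inner( $vNew, $vNew ) );

      for my $old ( @old_docs ) {
          my $vOld = $old_vec{ $old };
          my $cos  = inner( $vNew, $vOld )
                   / ( $nNew * sqrt( inner( $vOld, $vOld ) ) );
          printf "%s vs %s : %.3f\n", $new, $old, $cos->sclr;
      }
  }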

However, Search::VectorSpace is intended to search and rate its database of documents for a single phrase or set of keywords, not entire other documents. I am really curious to know how much 'relevance' can be derived from the comparison of entire documents against entire documents based upon a single number representing the words they contain?

There is also the issue that each time you do this processing, even if you moved to using S::VS directly, you would still need to re-index every document even if it hasn't changed. That's also crazy. PDL supports the reading/writing of piddles to/from disk.

It would be much better (faster) to have one program that you point at a set of files; it loads each file in turn, generates its vector and then saves it to disk. Once you have vectorised all the documents (only re-vectorising them if they change), you can perform doc-v-doc comparisons by manipulating the vectors alone. And as the vectors will be on disk, you will not be bound by memory limitations.
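
For example, PDL::IO::FastRaw is one way to persist the piddles. A sketch only: the docs/*.txt paths are hypothetical, and words_of(), count_vector() and @dict are the helpers and dictionary sketched above.

  use strict;
  use warnings;
  use PDL;
  use PDL::IO::FastRaw;           # writefraw()/readfraw(): raw data plus a small .hdr file

  my @docs = glob 'docs/*.txt';   # hypothetical locations

  # vectorise once and save the piddle next to the document...
  for my $doc ( @docs ) {
      my $vec = count_vector( \@dict, words_of( $doc ) );
      writefraw( $vec, "$doc.vec" );
  }

  # ...later, reload without re-splitting, re-stemming or re-counting anything
  my $vec = readfraw( $docs[ 0 ] . '.vec' );

PDL::IO::FastRaw also provides mapfraw(), which memory-maps the saved file rather than reading it all in, which may help once the vector files get large.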

However, I think that the vectors produced by S::VS are based upon the composite dictionary generated from all of the words (minus stop words) in all of the documents in the in-memory doc-set. Ie. it loads and tokenises all of the documents and builds up its dictionary before it can produce the first vector. That means the dictionaries for different sets of documents would be different, therefore the numbers in the vectors representing the words would be different. And that means that the vectors produced from different runs of the module would not be comparable.

To be able to get the O(N) rather than O(N²) for the very expensive, slow parts of the process described above, you would need to generate a common dictionary from the entire corpus of documents before producing the vectors. Or you could move to using a 'standard' dictionary. This idea has some merit, especially if you are working with scientific or technical documents, because it means that you could specifically tailor that dictionary to contain (and possibly only contain) the specialist words and terms of your field. However, S::VS (and still less Text::Compare) has no provision for such mechanisms and is not architected to accommodate external dictionaries or disk-based databases. Essentially, you would have to steal the techniques used internally by these modules, but write your own code to put those techniques together in a way that suits your application requirements.
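
A minimal sketch of building that one corpus-wide dictionary and fixing its word ordering on disk. Again hypothetical: the paths are made up, words_of() is the split/de-stop/stem helper from above, and a hand-tailored list of technical terms could be dropped in here instead of (or as well as) the scanned corpus.

  use strict;
  use warnings;

  my @all_docs = ( glob 'old/*.txt', glob 'new/*.txt' );   # the whole corpus

  my %seen;
  my @dict;
  for my $doc ( @all_docs ) {
      push @dict, grep { !$seen{ $_ }++ } words_of( $doc );
  }

  # fix the word ordering by writing it out; later runs re-read this file
  # instead of rebuilding it, so vectors from different runs stay comparable
  open my $out, '>', 'corpus.dict' or die "corpus.dict: $!";
  print { $out } "$_\n" for @dict;
  close $out;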

The benefit you would gain from doing so is a huge speed-up in your processing time and a much finer granularity in your corpus. Ie. when a new document arrives, you only need to vectorise it against the standard dictionary and then compare it against each of the other stored vectors, rather than having to go through the entire process each time a word changes in any document.

The bottom line is, CPAN modules are great if they do exactly what you need, but if the author's requirements diverge from yours even quite slightly, you may well pay a very high price for the reuse. And you have to look inside the modules and see how they do what they do to understand whether you are making best use of the underlying algorithms. That's especially true when you have one module layered on top of another, as here.

The underlying S::VS module has the inherent ability to search a database of documents quickly. T::C doesn't make use of that because it only seeks to compare two documents. You come along and want to compare 100,000 documents each against the other, but by using the T::C wrapper over S::VS, you end up re-vectorising your documents over and over again.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^4: Comparing text documents
by Anonymous Monk on Jul 02, 2007 at 15:26 UTC
    I have to admit that I as of now have just briefly read through your posting, but I have to say that I'm impressed. This seems to be exactly the "guidelines" I was looking for, and that you took the time to write such a thorough reply is fantastic.

    I will evaluate my options based on your posting and eventually get back to you in this thread.

    Thanks a lot!
Re^4: Comparing text documents
by Anonymous Monk on Jul 08, 2007 at 17:30 UTC
    I've been playing around with this one a little, but there is no way in h... I'm gonna be able to cope with the number of documents we have in a traditional way. For the record, I'm quite happy with the similarity score that Text::Compare produces.

    I would love to have some more information about this dictionary approach. Does it mean that I create one vector for e.g. 100K documents, and then compare each document against that "dictionary record", or is it something else?
      I would love to have some more information about this dictionary approach. Does it mean that I create one vector for e.g. 100K documents, and then compare each document against that "dictionary record", [...]?

      Almost. One vector for each of your 100k docs. Then you can produce the one number from each of the vectors using the same mechanism as S::VS currently uses. Something to do with cosines, but I won't pretend to understand that bit--that you're happy with the results is all that really matters. Once you have that single numeric value for each doc, you can sort them and so group similar documents together. And when new documents arrive, you can perform the same process and produce a number that can be compared against the existing ones directly without re-processing everything else.

      The way Search::VectorSpace works is to build up a dictionary of all the words, and their frequencies, across all the documents supplied to the constructor. It then builds a 'vector' for each document, which is an array (actually a PDL piddle) where each position represents one of the words in the combined dictionary and the value at each position is the occurrence count for that word in the document being vectorised. It then performs some math on that array of word/occurrence counts and comes up with a number that represents that document's contents relative to all the other documents in the original document set.

      The problem is, the vectors (and therefore the numeric values) are only valid for comparing documents within that doc-set. As soon as you add, remove or change any document, you have to go through the whole process over again. Using the Text::Compare wrapper, the doc-set consists of only two documents each time, so you have to re-split, index, vectorise and calculate every document against every other document.

      Theoretically, it would be far more efficient to bypass Text::Compare and supply your entire doc-set to S::VS in one go. It would then build a single dictionary from all the words in all the docs, and the vectorisation and calculation steps would only be performed once for each document. The problem is the size of your doc-set would almost certainly blow your memory long before you got around to trying to use it. Even if it didn't, you would still have to go through the entire process again every time a document was added, removed or changed.

      The idea I was alluding to is to break the processing S::VS does out into separate steps; a rough sketch of the first two follows the list below.

      1. Process each of the documents in turn, splitting, de-stopping and stemming it, and adding the resulting words to a permanently stored dictionary.

        A Berkeley DB would be ideal.

      2. A second pass would reprocess each document and perform the vectorisation against the permanent dictionary.

        Again, the vectors would be placed into a permanent DB, keyed by document name. Again, Berkeley seems a natural fit.

        It is important that the ordering of the words does not change when new words are added. See later.

        Further time savings can be achieved by storing the split, stemmed, uniqued word list from step 1 for re-use in step 2.

      3. Having stored the vectors, a 3rd pass can perform the math magic on each of the vectors and produce a 3rd DB of numeric values representing each of the documents.

        The result of this third pass is a list of numeric values that, once sorted, will 'categorise' the documents. Those containing similar words, and similar counts of those words, will be grouped together in the sorted list. Finding the documents similar to any given document is then just a case of looking the document up in the list and picking the group of 5 or 10 documents positioned either side of it.
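
      Very roughly, and purely as a sketch of the shape of the first two passes (the file names, the reserved "\0NWORDS" key and the choice of DB_File rather than BerkeleyDB proper are my own assumptions; words_of() stands in for the split/de-stop/stem step, as in S::VS):

        # Pass 1: grow a permanent word => position dictionary on disk.
        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        my @docs = glob 'docs/*.txt';                 # hypothetical locations

        tie my %dict, 'DB_File', 'corpus.dict.db', O_CREAT|O_RDWR, 0644, $DB_HASH
            or die "corpus.dict.db: $!";

        # the dictionary size is kept under a reserved key; "\0" can never
        # appear in a word produced by the splitting step
        my $next = $dict{ "\0NWORDS" } || 0;

        for my $doc ( @docs ) {
            for my $word ( words_of( $doc ) ) {
                $dict{ $word } = $next++ unless exists $dict{ $word };
            }
        }
        $dict{ "\0NWORDS" } = $next;

        # Pass 2: vectorise each document against that permanent dictionary and
        # store the counts, keyed by document name, in a second permanent DB.
        tie my %vecs, 'DB_File', 'corpus.vecs.db', O_CREAT|O_RDWR, 0644, $DB_HASH
            or die "corpus.vecs.db: $!";

        for my $doc ( @docs ) {
            my %freq;
            $freq{ $_ }++ for words_of( $doc );       # step 1's stored word list could be re-used here

            my @counts = ( 0 ) x $next;
            $counts[ $dict{ $_ } ] = $freq{ $_ } for keys %freq;

            $vecs{ $doc } = join ' ', @counts;
        }

        # Pass 3 would then walk %vecs, rebuild each piddle with
        # pdl( split ' ', $vecs{ $doc } ), and apply the same weighting/cosine
        # math S::VS already uses to get the one number per document.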

      The reduction to a numeric value could be done at the second stage, but if you discarded the vectors at that point, the entire vectorisation would need to be re-done whenever a new document was added; or at least, whenever a new document contained a word that didn't previously appear in the dictionary.

      By segregating the two steps, when a new document is processed at stage one, any new words it contains will be added to the end of the dictionary, and therefore to the end of vectors produced using that dictionary. That means previously generated vectors for documents that don't contain those words do not need to be re-processed.
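
      In PDL terms that accommodation is cheap. A sketch, with $vOld and $vNew as toy piddles standing in for vectors rebuilt from the stored counts: the older, shorter vector is simply zero-padded up to the current dictionary length before the cosine is taken.

        use strict;
        use warnings;
        use PDL;

        sub pad_to {
            my( $v, $len ) = @_;
            return $v->nelem >= $len ? $v : $v->append( zeroes( $len - $v->nelem ) );
        }

        my $vOld = pdl( 1, 0, 2 );          # vectorised against a 3-word dictionary
        my $vNew = pdl( 1, 1, 2, 3, 0 );    # two new words have since been added

        my $len = $vNew->nelem > $vOld->nelem ? $vNew->nelem : $vOld->nelem;
        my $cos = inner( pad_to( $vNew, $len ), pad_to( $vOld, $len ) )
                / ( sqrt( inner( $vNew, $vNew ) ) * sqrt( inner( $vOld, $vOld ) ) );

        print 'cosine: ', $cos->sclr, "\n";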

      Most of the operations for each of these steps can remain pretty much identical to the existing functions and methods within S::VS, and could be stolen from that module for use. The real work involved in coding the idea is the saving of the intermediate results to DBs at each stage. HTH.

