in reply to Re^4: Comparing text documents
in thread Comparing text documents
I would love to have some more information about this dictionary approach. Does it mean that I create one vector for i.e. 100K documents, and then compare each document against that "dictionary record", [...]?
Almost. One vector for each of your 100k docs. Then you can reduce each of those vectors to a single number using the same mechanism S::VS currently uses. Something to do with cosines, but I won't pretend to understand that bit--that you're happy with the results is all that really matters. Once you have that single numeric value for each doc, you can sort them and so group similar documents together. And when new documents arrive, you can perform the same process on them, producing a number that can be compared against the existing ones directly, without re-processing everything else.
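As a toy illustration of that sort-and-group step (the scores here are invented; in practice each one would come out of S::VS's reduction):

    use strict;
    use warnings;

    # One score per document; sorting the scores groups similar
    # documents together.
    my %score  = ( 'a.txt' => 0.91, 'b.txt' => 0.12, 'c.txt' => 0.88 );
    my @ranked = sort { $score{ $a } <=> $score{ $b } } keys %score;

    print "@ranked\n";    # b.txt c.txt a.txt

    # Documents similar to a given one are its neighbours in @ranked;
    # a newly scored document just merges into the sorted order.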
The way Search::VectorSpace works is to build a dictionary of all the words, and their frequencies, across all the documents supplied to the constructor. It then builds a 'vector' for each document: an array (actually a PDL piddle) in which each position represents one of the words in the combined dictionary, and the value at each position is the occurrence count for that word in the document being vectorised. It then performs some math on that array of word/occurrence counts and comes up with a number that represents that document's contents relative to all the other documents in the original document set.
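For concreteness, here is a minimal Perl sketch of that dictionary-and-vector mechanism. It is not S::VS's actual code: the word splitting is crude, and the stemming, stop-word removal and PDL parts are omitted:

    use strict;
    use warnings;

    my %docs = (
        'doc1.txt' => 'the cat sat on the mat',
        'doc2.txt' => 'the dog sat on the log',
    );

    # Pass 1: build the combined dictionary; each distinct word gets a
    # fixed slot number in order of first appearance.
    my( %dict, %count );
    for my $name ( sort keys %docs ) {
        for my $word ( map { lc } split /\W+/, $docs{ $name } ) {
            next unless length $word;
            $count{ $name }{ $word }++;
            $dict{ $word } = scalar keys %dict
                unless exists $dict{ $word };
        }
    }

    # Pass 2: one vector per document; position i holds the occurrence
    # count of the word that owns slot i in the combined dictionary.
    my $dict_size = keys %dict;
    my %vector;
    for my $name ( keys %docs ) {
        my @v = ( 0 ) x $dict_size;
        $v[ $dict{ $_ } ] = $count{ $name }{ $_ }
            for keys %{ $count{ $name } };
        $vector{ $name } = \@v;
    }

Because every vector is built against the one combined dictionary, they all share the same length and word-to-position mapping, which is what makes them directly comparable.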
The problem is that the vectors (and therefore the numeric values) are only valid for comparing documents within that doc-set. As soon as you add, remove or change any document, you have to go through the whole process again. Using the Text::Compare wrapper, the doc-set consists of only two documents each time, so you have to re-split, index, vectorise and calculate every document against every other document.
Theoretically, it would be far more efficient to bypass Text::Compare and supply your entire doc-set to S::VS in one go. It would then build a single dictionary from all the words in all the docs, and the vectorisation and calculation steps would only be performed once per document. The problem is that the size of your doc-set would almost certainly blow your memory long before you got around to using it. Even if it didn't, you would still have to go through the entire process again every time a document was added, removed or changed.
The idea I was alluding to is to break the processing S::VS does out into separate, persistent steps (sketched in code after this list):

1. Split, stem and unique the words of each document, and merge them into a single combined dictionary held in a permanent DB. A Berkeley DB would be ideal. It is important that the ordering of the words does not change when new words are added; see later.

2. Vectorise each document against that dictionary. Again, the vectors would be placed into a permanent DB, keyed by document name. Again, Berkeley seems a natural fit. Further time saving can be achieved by storing the split, stemmed, uniqued word list from step 1 for re-use in step 2.

3. Reduce each vector to a single numeric value. The result of this third pass is a list of numeric values that, once sorted, will 'categorise' the documents: those containing similar words, and similar counts of similar words, will be grouped together in the sorted list. Finding the documents similar to any given document is then just a case of looking that document up in the list and picking the group of 5 or 10 documents positioned either side of it.
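Here is a rough sketch of how those three stages might hang together, using DB_File tied hashes as the permanent Berkeley DB stores. The file names, the crude splitting (no stemming or stop-words), and the placeholder reduction in stage 3 are my own inventions for illustration, not anything S::VS or Text::Compare defines:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw( O_CREAT O_RDWR );

    # Stage 1 store: word => fixed slot number. Slots are only ever
    # appended, so numbers already handed out never change.
    tie my %dict, 'DB_File', 'dict.db', O_CREAT | O_RDWR, 0666, $DB_HASH
        or die "Cannot tie dict.db: $!";

    # Stage 2 store: document name => its vector, flattened to a string
    # here (Storable would serve equally well).
    tie my %vectors, 'DB_File', 'vectors.db', O_CREAT | O_RDWR, 0666, $DB_HASH
        or die "Cannot tie vectors.db: $!";

    sub index_document {
        my( $name, $text ) = @_;

        # Stage 1: split (and, in real code, stem) the words and grow
        # the dictionary with any it hasn't seen before. (Counting keys
        # each time is slow; real code would keep a stored counter.)
        my %count;
        for my $word ( map { lc } split /\W+/, $text ) {
            next unless length $word;
            $count{ $word }++;
            $dict{ $word } = scalar keys %dict
                unless exists $dict{ $word };
        }

        # Stage 2: vectorise against the current dictionary and persist.
        my $dict_size = keys %dict;
        my @v = ( 0 ) x $dict_size;
        $v[ $dict{ $_ } ] = $count{ $_ } for keys %count;
        $vectors{ $name } = join ' ', @v;

        # Stage 3: reduce the vector to the single comparable number.
        # Placeholder math only; S::VS's cosine-based reduction would
        # slot in here instead.
        my $sum = 0;
        $sum += $_ ** 2 for @v;
        return sqrt $sum;
    }

The stage 3 numbers could likewise be stored keyed by document name, so that sorting and neighbour lookup never need to revisit the raw text.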
The reduction to a numeric value could be done at the second stage, but if you discarded the vectors at that point, the entire vectorisation would need to be re-done whenever a new document was added--or at least, whenever a new document contained a word that didn't previously appear in the dictionary.
By segregating the two steps, when a new document is processed at stage one, any new words it contains are added to the end of the dictionary, and therefore appear at the end of any vectors subsequently produced using that dictionary. That means previously generated vectors, for documents that didn't contain those words, do not need to be re-processed; as the snippet below shows, the only accommodation needed is padding older vectors when they are compared.
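Continuing the sketch above (this assumes the %dict and %vectors tied hashes from that code), an older, shorter stored vector is still correct; it just needs trailing zero counts for the words added since it was built:

    # Hypothetical helper: load a stored vector and pad it to the
    # current dictionary size. Words added after this vector was
    # built can only occupy higher slots, so zero counts are correct.
    sub load_vector {
        my( $name ) = @_;
        my @v = split ' ', $vectors{ $name };
        push @v, ( 0 ) x ( keys( %dict ) - @v );
        return \@v;
    }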
Most of the operations for each of these steps can remain pretty much identical to the existing functions and methods within S::VS, and could be stolen from that module for use. The real work involved in coding the idea is the saving of the intermediate results to DBs at each stage. HTH.