Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for advice on how to compare documents for similarity. I've been checking out Text::Compare, and it seems to do the job quite nicely. However, my plea for advice and knowledge is a bit more "abstract". :)

First off, I want to describe my approach:
    # Create a list of files/documents that haven't been
    # compared to the rest of the documents
    my @new_docs = $foo->get_new_documents();
    my @old_docs = $foo->get_old_documents();

    # Process
    my $Compare = Text::Compare->new( memoize => 1 );
    foreach my $new_doc ( @new_docs ) {
        $Compare->first( $new_doc );
        foreach my $old_doc ( @old_docs ) {
            $Compare->second( $old_doc );
            $foo->save_similarity( $new_doc, $old_doc,
                                   $Compare->similarity() );
        }
    }
This works quite well, but you all understand that if there are 100,000 documents to compare against, it will take a lot of time. Is it possible to do something "smart" to make it go faster?

My other question is: what is the best way to store the similarity info? Right now I have a document_similarity table with doc1_id and doc2_id columns (int, combined primary key) and a similarity field (float). This works, but are there smarter ways here as well?
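For concreteness, the table and save routine look roughly like this (SQLite syntax here just for illustration); I also canonicalise the id order so each (doc1, doc2) pair is stored only once:

    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=similarity.db', '', '',
                            { RaiseError => 1 } );

    $dbh->do(<<'SQL');
    CREATE TABLE IF NOT EXISTS document_similarity (
        doc1_id    INTEGER NOT NULL,
        doc2_id    INTEGER NOT NULL,
        similarity REAL    NOT NULL,
        PRIMARY KEY ( doc1_id, doc2_id )
    )
    SQL

    sub save_similarity {
        my ( $dbh, $id_a, $id_b, $score ) = @_;
        # Canonical order: always doc1_id < doc2_id, so the pair
        # (a,b) can never also be stored as (b,a).
        ( $id_a, $id_b ) = ( $id_b, $id_a ) if $id_a > $id_b;
        # INSERT OR REPLACE is SQLite-specific; other databases have
        # their own upsert syntax.
        $dbh->do( 'INSERT OR REPLACE INTO document_similarity VALUES (?,?,?)',
                  undef, $id_a, $id_b, $score );
    }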

Thanks in advance for any input!

Replies are listed 'Best First'.
Re: Comparing text documents
by leocharre (Priest) on Jul 02, 2007 at 16:34 UTC

    I am going to assume the documents differ wildly: that you have Excel sheets, HTML files, PDFs, images, simple text documents.

    I would suggest, and this is a hack, first weeding files out with much less specific and less CPU-intensive methods (a rough sketch follows the list). How about:

    • Comparing A to B, first get the filesize of A and the filesize of B; if the difference between the larger and the smaller file is greater than 90 percent (i.e. the smaller is less than a tenth of the larger), decide these documents are much too different and make no further tests.
    • Could you also give some weight to filename similarity (using String::Similarity) to weed a file in or out?
    • If you have many file types, could you use MIME::Types to deem that a PDF and a PPT file are not alike at all?
    • Could you run similarity on a *portion* of the text first, and only then the rest?
    • If all these simple conditions you conjure up still suggest the documents could be similar, then you run your expensive Text::Compare procedure.
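    Roughly, the weeding pass could look like this (a sketch only: the thresholds and the veto logic are made-up illustrations, but String::Similarity and MIME::Types are real modules):

        use String::Similarity;   # exports similarity()
        use MIME::Types;

        my $mime = MIME::Types->new;

        sub roughly_comparable {
            my ( $file_a, $file_b ) = @_;

            # 1. File size: if the smaller file is less than 10% of
            #    the larger, assume the documents are too different.
            my ( $small, $large ) = sort { $a <=> $b } -s $file_a, -s $file_b;
            return 0 if $large && $small / $large < 0.1;

            # 2. MIME type: a PDF and a PowerPoint file are not alike.
            my $type_a = $mime->mimeTypeOf( $file_a );
            my $type_b = $mime->mimeTypeOf( $file_b );
            return 0 if $type_a && $type_b && $type_a->type ne $type_b->type;

            # 3. Filename similarity can add a little weight; here it
            #    is only reported, not used as a veto.
            return { name_similarity => similarity( $file_a, $file_b ) };
        }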

    Like I said, this is a total hack. Overall, if all your documents *were* similar, this would greatly slow down the whole process. However, if some of these kinds of simple conditions *can* be deemed authoritative for your document archive, then it could be what you need to do.

Re: Comparing text documents
by BrowserUk (Patriarch) on Jul 02, 2007 at 13:36 UTC
      By choosing one document, try to find the X most relevant documents. ...ie. the most similar documents based on Text::Compare's vector-based rating (which looks quite good so far).

        Ignoring the performance issue for now, how happy are you with the 'relevance' of the documents selected this way?

        The reason for asking is that the mechanisms involved in Text::Compare and its underlying module Search::VectorSpace are complex, and therefore almost inevitably slow.

        In brief,

        1. Open the document, break it into words, excluding 'stopwords', and build a vector of numbers representing the remaining words, having 'stemmed' them.
        2. Now repeat the process for the second document.
        3. Now perform some math and reduce the result to a single number.
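        For illustration, a toy version of those three steps, using plain term-frequency hashes and cosine similarity in place of S::VS's stemmed, PDL-backed vectors:

            use List::Util qw(sum);

            my %STOP = map { $_ => 1 } qw( the a an and or of to in is it );

            # Steps 1/2: break a document's text into a term-frequency
            # "vector" (a hash of word => count).
            sub vectorise {
                my ( $text ) = @_;
                my %tf;
                for my $word ( map { lc } $text =~ /(\w+)/g ) {
                    next if $STOP{$word};
                    $tf{$word}++;        # no stemming in this toy version
                }
                return \%tf;
            }

            # Step 3: cosine similarity between two term-frequency hashes,
            # reduced to a single number between 0 and 1.
            sub cosine {
                my ( $v1, $v2 ) = @_;
                my $dot = sum( 0, map { ( $v2->{$_} || 0 ) * $v1->{$_} } keys %$v1 );
                my $n1  = sqrt( sum( 0, map { $_ ** 2 } values %$v1 ) );
                my $n2  = sqrt( sum( 0, map { $_ ** 2 } values %$v2 ) );
                return ( $n1 && $n2 ) ? $dot / ( $n1 * $n2 ) : 0;
            }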

        The math is done using PDL, but everything else is done using standard Perl: regexes, hashes, etc. That is fine for comparing two documents, but repeating it for the O(N²) process of comparing 100,000 docs one against the other is crazy.

        You are repeating the entire indexing of all the files in @old_docs for every file in @new_docs.

        If you dropped the Text::Compare wrapper and moved to using the underlying Search::VectorSpace directly, then you could build up an in-memory database of vectors for all of the old documents, and then compare each new document against all of the old in turn: an O(N) process rather than an O(N²) one. That would produce a significant time saving. Of course, whether you have enough memory to hold it all is another concern.
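        With the toy vectorise()/cosine() helpers from the sketch above, the restructured loop would look something like this (assuming all the old vectors fit in memory):

            # Vectorise each old document exactly once...
            my %old_vec;
            $old_vec{$_} = vectorise( slurp( $_ ) ) for @old_docs;

            # ...then each new document costs one vectorisation plus N
            # cheap vector-v-vector comparisons, instead of N full
            # re-indexings.
            for my $new_doc ( @new_docs ) {
                my $new_vec = vectorise( slurp( $new_doc ) );
                for my $old_doc ( @old_docs ) {
                    $foo->save_similarity( $new_doc, $old_doc,
                                           cosine( $new_vec, $old_vec{$old_doc} ) );
                }
            }

            sub slurp {              # trivial file-to-string helper
                my ( $file ) = @_;
                open my $fh, '<', $file or die "$file: $!";
                local $/;
                return <$fh>;
            }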

        However, Search::VectorSpace is intended to search and rate its database of documents for a single phrase or set of keywords, not entire other documents. I am really curious to know how much 'relevance' can be derived from comparing entire documents against entire documents based upon a single number representing the words they contain.

        There is also the issue that each time you do this processing, even if you moved to using S::VS directly, you would still need to re-index every document even if it hasn't changed. That's also crazy. PDL supports the reading/writing of piddles to/from disk.

        It would be much better (faster) to have one program that you point at a set of files; it loads each file in turn, generates its vector, and saves it to disk. Once you have vectorised all the documents (only re-vectorising them if they change), you can perform doc-v-doc comparisons by manipulating the vectors alone. And as the vectors will be on disk, you will not be bound by memory limitations.
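        For example, using plain Storable rather than raw piddles (the '.vec' cache-file convention is just an assumption; vectorise() and slurp() are the toy helpers above):

            use Storable qw(store retrieve);

            # Return the cached vector for a document, rebuilding it
            # only when the document is newer than its cache file.
            sub cached_vector {
                my ( $doc ) = @_;
                my $cache = "$doc.vec";                    # assumed naming scheme

                # -M is age in days, so a smaller -M means newer; the
                # cache is valid only if it is newer than the document.
                if ( -e $cache && -M $cache < -M $doc ) {
                    return retrieve( $cache );
                }
                my $vec = vectorise( slurp( $doc ) );
                store( $vec, $cache );
                return $vec;
            }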

        However, I think that the vectors produced by S::VS are based upon the composite dictionary generated from all of the words (minus stop words) in all of the documents in the in-memory database. I.e. it loads and tokenises all of the documents and builds up its dictionary before it can produce the first vector. That means the dictionaries for different sets of documents would be different, so the numbers in the vectors representing the words would be different. And that means that the vectors produced from different runs of the module would not be comparable.

        To get O(N) rather than O(N²) for the very expensive, slow parts of the process described above, you would need to generate a common dictionary from the entire corpus of documents before producing the vectors. Or you could move to using a 'standard' dictionary. This idea has some merit, especially if you are working with scientific or technical documents, because it means that you could specifically tailor that dictionary to contain (and possibly only contain) specialist words and terms of your field. However, S::VS (and much less Text::Compare) has no provision for such mechanisms and is not architected to accommodate external dictionaries and disk-based databases. Essentially, you would have to steal the techniques used internally by these modules, but write your own code to put those techniques together in a way that suits your application requirements.
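        In outline, the common-dictionary approach might look like this (all names hypothetical; vectorise() and slurp() are the toy helpers above):

            # Build one shared dictionary over the whole corpus, then
            # map each document onto a fixed-length array of counts.
            my %dict;        # word => column index
            my $next = 0;

            for my $doc ( @all_docs ) {
                my $tf = vectorise( slurp( $doc ) );
                for my $word ( keys %$tf ) {
                    $dict{$word} = $next++ unless exists $dict{$word};
                }
            }

            # Because every document uses the same word => index mapping,
            # vectors produced in different runs (against the same saved
            # dictionary) stay directly comparable.
            sub to_fixed_vector {
                my ( $tf ) = @_;
                my @v = (0) x $next;
                $v[ $dict{$_} ] = $tf->{$_} for grep { exists $dict{$_} } keys %$tf;
                return \@v;
            }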

        The benefit you would gain from doing so is a huge speed-up in your processing time and a much finer granularity in your corpus. I.e. when a new document arrives, you only need to vectorise it against the standard dictionary and then compare it against each of the other stored vectors, rather than having to go through the entire process each time a word changes in any document.

        The bottom line is, CPAN modules are great if they do exactly what you need, but if the author's requirements diverge from yours even quite slightly, you may well pay a very high price for the reuse. And you have to look inside the modules and see how they do what they do to understand whether you are making best use of the underlying algorithms. That's especially true when you have one module layered on top of another, as here.

        The underlying S::VS module has the inherent ability to search a database of documents quickly. T::C doesn't make use of that, because it only seeks to compare two documents. You come along and want to compare 100,000 documents each against the other, but by using the T::C wrapper over S::VS, you end up re-vectorising your documents over and over again.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.