in reply to Fingerprinting text documents for approximate comparison

I did a quick search and came up with String::Approx, but its documentation states:

NOTE: String::Approx has been designed to work with strings, not with text. In other words, when you want to compare things like text or source code, consisting of words or tokens and phrases and sentences, or expressions and statements, you should probably use some other tool than String::Approx, like for example the standard UNIX diff(1) tool, or the Algorithm::Diff module from CPAN, or if you just want the Levenshtein edit distance (explained below), the Text::Levenshtein module from CPAN. See also Text::WagnerFischer and Text::PhraseDistance.

So that might give you some other ideas.
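
For reference, the edit distance that note mentions can be sketched in a few lines. This is only an illustration in Python of the standard Wagner-Fischer dynamic program, not the code of any of the CPAN modules named above, and the function name `edit_distance` is my own. It accepts any two sequences, so it can compare word lists as well as character strings:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tokens).

    Wagner-Fischer dynamic programming, keeping only two rows of the table.
    Time is proportional to len(a) * len(b).
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute, or match
        prev = curr
    return prev[-1]


# Character level and word level use the same function.
chars = edit_distance("kitten", "sitting")
words = edit_distance("the quick brown fox".split(),
                      "the slow brown fox".split())
```

Note that the running time grows with the product of the two lengths, so comparing whole documents character by character gets expensive quickly; splitting into words first shrinks the problem considerably.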

Re^2: Fingerprinting text documents for approximate comparison
by Mur (Pilgrim) on Mar 24, 2005 at 18:31 UTC
    Hmm. Again, I want an "absolute" fingerprint (like a checksum), rather than a way to compare two given documents. LevenshteinXS ran for over two minutes comparing two documents.
    --
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
     http://www.nexcerpt.com
    ...Nexcerpt...Connecting People With Expertise
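
One way to get a per-document fingerprint of the kind described above is a similarity-preserving hash such as SimHash. The sketch below is in Python and is an assumption about what might fit, not something proposed in this thread: the names `simhash` and `hamming`, the 64-bit width, the three-word shingles, and the use of MD5 as the underlying hash are all illustrative choices. Each document is fingerprinted once on its own, and two fingerprints are then compared by counting differing bits:

```python
import hashlib
import re


def simhash(text, bits=64, shingle=3):
    """Return a `bits`-wide fingerprint; similar texts tend to share most bits."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < shingle:
        shingles = [" ".join(words)]
    else:
        # Overlapping runs of `shingle` consecutive words.
        shingles = [" ".join(words[i:i + shingle])
                    for i in range(len(words) - shingle + 1)]

    counts = [0] * bits
    for s in shingles:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc = "Fingerprinting text documents for approximate comparison"
fp = simhash(doc)
```

The fingerprint can be stored alongside the document like a checksum. Unlike a checksum, a small Hamming distance between two stored fingerprints suggests the documents are near-duplicates, though any threshold would have to be tuned on real data.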