in reply to Fingerprinting text documents for approximate comparison

++! Fascinating project, so please tolerate (or skip) this un-Perlish reply (i.e., algorithmic comments only) and request for comments:

For articles in which a newspaper's local editors or reporters have inserted a paragraph or so of text into an (otherwise unchanged) piece of wirecopy, the fingerprint matching will FAIL (which may be what's desired if, for example, failed matches are flagged for human attention or handed off to more sophisticated processing).

Further study of failed matches can be a waste -- if the local insert merely "localizes" (in the newsroom sense of that term) the story without adding anything significant, as in, for example, "Sheriff Numnutz of our_own_county said he had instituted an even better program last duration_entity...."

Update - OTOH, such study could be highly valuable if the local insert into a gushing, bloviating wire-service piece on some hyper-hyped tech story were to say "...person_in_main_story was convicted of releasing buggy software in local_court last duration_entity."

Also a "waste" would be instances where a story gets flagged as non-matching because the jump line ("Continued on page 24" or "see 'Numnutz reacts'", etc.) occurs at a different location in the story (i.e., paper A ran 5 paragraphs on its webpage before linking to the "full text" while paper B ran only 3).

So the "nearness" test has a lot to commend it... but it seems to me that the plan by which you determine WHICH SITES to sample, and the vagaries of their layouts, is going to overwhelm the "nearness" test -- at least in a good many cases.

And, pursuing the good Monks' suggestions above, the CPAN synopsis re Digest::Nilsimsa says, in part, "A nilsimsa signature is a statistic of n-gram occurrence in a piece of text. It is a 256 bit value usually represented in hex." Does that 256-bit limit reduce its effectiveness as the length of the text increases?
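
For concreteness, here is a minimal sketch of generating one of those 256-bit signatures (the new/text2digest calls follow my reading of the module's synopsis, so verify them against your installed version; the sample text is invented):

    use strict;
    use warnings;
    use Digest::Nilsimsa;

    my $nilsimsa = Digest::Nilsimsa->new;

    # 64 hex characters == 256 bits, no matter how long the input text is
    my $digest = $nilsimsa->text2digest(
        "Sheriff Numnutz said he had instituted an even better program last year."
    );
    print length($digest), " hex chars: $digest\n";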

(Updated definition: "An N-gram is a sequence of length N. For example, sequences of words within texts are N-grams, but the idea is much more general."
...and

"An N-gram is like a moving window over a text, where N is the number of words in the window.
igrams: two consecutive words
Trigrams: three consecutive words
Quadrigrams: four consecutive words
And so on.
N-gram analysis calculates the probability of a given sequence of words occurring: if you know what the first word of a pair is, how confidently can you predict the next one, using the conditional bigram probability $ P(w_n \vert w_{n-1})$."
).
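
To make that notation concrete, here is a toy Perl sketch (sample sentence invented) that estimates P(w_n | w_{n-1}) as count(w_{n-1} w_n) / count(w_{n-1}):

    use strict;
    use warnings;

    my @words = split ' ', lc "the sheriff said the program said the sheriff";

    my (%left, %bigram);
    for my $i (0 .. $#words - 1) {
        $left{ $words[$i] }++;                    # counts of w_{n-1} as left context
        $bigram{ "$words[$i] $words[$i+1]" }++;   # counts of the pair w_{n-1} w_n
    }

    # P("sheriff" | "the") = count("the sheriff") / count("the")
    printf "P(sheriff|the) = %.2f\n", $bigram{'the sheriff'} / $left{'the'};   # 0.67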

Possibly supporting my naive hypothesis that the worth of the nilsimsa value goes down as the length of the source text increases is this, from a site referenced in Digest::Nilsimsa:
"The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated." because ( arguing against own previous) comparing two large and similar texts, one of which is distinguished only by a relatively small insert (and possible formatting changes which OP is seeking to exclude from the testing), would seem likely to yield a nilsims of well over 24.


Replies are listed 'Best First'.
Re^2: Fingerprinting text documents for approximate comparison
by Mur (Pilgrim) on Mar 24, 2005 at 19:13 UTC
    I downloaded Digest::Nilsimsa and started to toy around with it. What I don't understand, and the referenced site does not explain, is how to get from two 256-bit values to the scale of -128 to +128. Anyone have any thoughts? (I emailed the module author, but since this code is a couple of years old I have no expectation that I will reach him.)
    --
    Jeff Boes
    Database Engineer
    Nexcerpt, Inc.
    vox 269.226.9550 ext 24
    fax 269.349.9076
     http://www.nexcerpt.com
    ...Nexcerpt...Connecting People With Expertise
      Jeff
      back to a doc I don't fully grok... but...

      The scale of -128 to +128 appears to be arbitrary and could just as well be 0 to 256.

      That inference comes from the author's statement re his example output from two slightly variant sources: "The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated."
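
      One reading that fits those numbers (my inference only; the module's docs don't spell it out): count the bits on which the two 256-bit digests agree and subtract 128, so 220 matching bits gives 220 - 128 = 92. A quick sketch in Perl, assuming the two digests are the 64-character hex strings the module produces:

          use strict;
          use warnings;

          sub nilsimsa_score {
              my ($hex_a, $hex_b) = @_;
              my $xor       = pack('H*', $hex_a) ^ pack('H*', $hex_b);  # differing bits set to 1
              my $differing = unpack('%32b*', $xor);                    # count the 1 bits
              return (256 - $differing) - 128;                          # matching bits, shifted to -128..+128
          }

          # identical digests score +128; 36 differing bits would score 92
          print nilsimsa_score('a' x 64, 'a' x 64), "\n";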

      BTW, I attach high value to the observations (below) from BrowserUK and sfink, but am not sure I'm ready to buy (no offense intended, BrowserUK!) BUK's "no easy way" as (1) gospel, nor (2, and more important) as any reason not to search for a way.

        "no offense intended, BrowserUK!"

        None taken :)

        My statement "no easy way" derives from several attempts I have made at doing this. I've achieved a degree of success, but it is very easily thrown off.

        The problem with reducing your signature to a numerical value is that, given two identical pieces of text except that one has a spelling error in an otherwise insignificant word, that typo promotes the misspelt word to the realm of 'rare' words. When you then derive your numeric value, the typo can completely change the value derived, and so the two pieces of text no longer compare favourably.

        My best attempt at working around that was to effectively spell check the document. That is to say, I only used words that appeared in my dictionary file when building my lists of N-rare words.

        Whilst that prevents typos of common words from being counted as rare words, it also excludes many proper nouns and the like that would otherwise form very good components of a signature.

        For example, people's names and place names often form a very strong correlative indicator for news stories, but if you only accept "correctly spelt words" as candidates, then you exclude many names that will likely not appear in your dictionary.
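
        In rough outline, that filtering step looks something like the sketch below (a sketch only, not my actual code; the dictionary source, the tokenising regex, and the "N rarest" cutoff are all placeholders):

            use strict;
            use warnings;

            # %$dictionary: words we trust, e.g. loaded from a system word list.
            # Filtering on it drops typos of common words, but (as noted above)
            # also drops names like "Numnutz" that would have been strong
            # signature components.
            sub n_rare_words {
                my ($text, $dictionary, $n) = @_;

                my %count;
                $count{$_}++ for grep { exists $dictionary->{$_} }   # the spell-check step
                                 map  { lc }
                                 $text =~ /([A-Za-z']+)/g;

                # keep the N least frequent dictionary words as the signature
                my @by_rarity = sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count;
                return @by_rarity <= $n ? @by_rarity : @by_rarity[0 .. $n - 1];
            }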


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco.
        Rule 1 has a caveat! -- Who broke the cabal?
        Yeah ... I don't get where he gets "92" from. Like I said, I thought of ANDing the two 256-bit numbers together and counting 1s.
        --
        Jeff Boes
        Database Engineer
        Nexcerpt, Inc.
        vox 269.226.9550 ext 24
        fax 269.349.9076
         http://www.nexcerpt.com
        ...Nexcerpt...Connecting People With Expertise