in reply to Fingerprinting text documents for approximate comparison

To do this with any accuracy you need a pre-existing corpus of "typical data".

The best signature of a given piece of text is the N rarest words it contains, where 'rarest' is defined in terms of the frequency with which each word appears in your corpus of typical data.
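As a rough illustration of that idea, here is a minimal Python sketch. The function names (`tokenize`, `build_corpus_counts`, `rarest_words`), the crude alphabetic tokenizer, and the alphabetical tie-break are all my own assumptions for the example, not anything prescribed above. Words that never appear in the corpus are treated as the rarest of all.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_corpus_counts(documents):
    """Count how often each word occurs across the corpus of 'typical data'."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def rarest_words(text, corpus_counts, n=5):
    """Return the n distinct words of `text` that are rarest in the corpus.

    A word missing from the corpus has a count of 0, so it sorts first.
    Ties are broken alphabetically to keep the result deterministic.
    """
    words = set(tokenize(text))
    return sorted(words, key=lambda w: (corpus_counts[w], w))[:n]

# Tiny made-up corpus, purely for demonstration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]
counts = build_corpus_counts(corpus)
print(rarest_words("the cat sat on the zebra", counts, n=3))
```

In practice the corpus would need to be far larger than this toy one for the frequencies to mean anything, and you would probably want a better tokenizer.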

However, there is no easy way to convert that to a single numerical value that would support your 'likelihood' approximation. Even if you took two otherwise identical pieces of text that each carried the addition of, say, the name of the transmitting body--e.g. 'Reuters' and 'CNN'--those additions would likely affect any reduction to a numerical value in a way that makes comparison very hard.
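To make the problem concrete, here is a small Python sketch. Everything in it is assumed for illustration: the toy corpus, the `signature` and `signature_digest` helpers, and the choice of a truncated SHA-256 as the "single number". Two texts with the same body but different source tags end up with rare-word signatures that differ by one word, yet the numbers derived from them have nothing visibly in common.

```python
import hashlib
import re
from collections import Counter

def signature(text, corpus_counts, n=4):
    """The n distinct words of `text` that are rarest in the corpus (ties alphabetical)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words, key=lambda w: (corpus_counts[w], w))[:n]

def signature_digest(sig):
    """Reduce a signature to one integer: the first 8 bytes of a SHA-256 hash.

    This is just one possible reduction, chosen for the demonstration.
    """
    joined = " ".join(sig).encode("utf-8")
    return int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")

# Toy corpus of "typical data"; the source names never appear in it, so they are rare.
corpus_counts = Counter(
    re.findall(r"[a-z]+", "the markets fell the markets rose the banks said the markets")
)

body = "the markets fell sharply after the banks said rates would rise"
sig_a = signature("reuters " + body, corpus_counts)
sig_b = signature("cnn " + body, corpus_counts)

print(sig_a, signature_digest(sig_a))
print(sig_b, signature_digest(sig_b))
print("words not shared:", set(sig_a) ^ set(sig_b))
```

The source tag is rare by construction, so it lands in the signature and changes the derived number completely, even though the two bodies are identical.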

