in reply to Fingerprinting text documents for approximate comparison
To do this with any accuracy you need a pre-existing corpus of "typical data".
The best signature of a given piece of text is the N rarest words it contains, where 'rarest' is defined in terms of the frequency with which each word appears in your corpus of typical data.
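As a rough illustration of the rarest-words idea, here is a minimal Python sketch. The function names (`tokenize`, `corpus_frequencies`, `rarest_words`), the crude regex tokenizer, and the alphabetical tie-breaking are all my own assumptions for the example, not part of any particular library. Words that never appear in the corpus are treated as the rarest of all.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def corpus_frequencies(corpus_docs):
    # Count how often each word occurs across the corpus of "typical data".
    counts = Counter()
    for doc in corpus_docs:
        counts.update(tokenize(doc))
    return counts

def rarest_words(text, corpus_counts, n=5):
    # Rank the distinct words of the text by corpus frequency, rarest first.
    # Words missing from the corpus count as frequency 0; ties are broken
    # alphabetically so the result is deterministic.
    words = set(tokenize(text))
    ranked = sorted(words, key=lambda w: (corpus_counts.get(w, 0), w))
    return ranked[:n]

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]
counts = corpus_frequencies(corpus)
signature = rarest_words("the zebra sat on the mat", counts, n=3)
print(signature)
```

With a corpus this small the ranking is dominated by ties; a realistic corpus would need to be large enough that word frequencies are meaningful.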
However, there is no easy way to convert that to a single numerical value that will support your 'likelihood' approximation. Even if you took two identical pieces of text that each carried the addition of, say, the name of the transmitting body (e.g. 'Reuters' and 'CNN'), those additions would likely affect any reduction to a numerical value in a way that makes comparison very hard.
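To make that concrete, here is a small Python sketch under stated assumptions: the two word lists are invented signatures for the same story as carried by two agencies, `signature_hash` is just one hypothetical way of collapsing a signature to a single number (a SHA-1 digest of the sorted words), and `overlap` is a plain set-overlap ratio that I have swapped in for contrast, not something proposed above. The single numbers come out unrelated even though the signatures differ by only one word.

```python
import hashlib

def signature_hash(words):
    # Hypothetical reduction to one number: hash the sorted signature words.
    joined = " ".join(sorted(words))
    return int(hashlib.sha1(joined.encode("utf-8")).hexdigest(), 16)

def overlap(a, b):
    # Shared words divided by total distinct words (set overlap ratio).
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Invented signatures: identical story, different transmitting body.
sig_reuters = ["earthquake", "magnitude", "reuters", "seismologists", "tremor"]
sig_cnn = ["cnn", "earthquake", "magnitude", "seismologists", "tremor"]

# The two hashes share nothing useful, so they cannot express "nearly the same".
print(signature_hash(sig_reuters) == signature_hash(sig_cnn))

# Comparing the word sets directly still shows that most of the signature matches.
print(overlap(sig_reuters, sig_cnn))
```

The point is only that the comparison has to be made on the sets of words themselves; any one-number digest of the set throws away the information you need for an approximate match.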