in reply to Re^2: Brainstorming session: detecting plagiarism
in thread Brainstorming session: detecting plagiarism
One trivial method for finding statistically improbable word pairs: take the product of the word frequencies for each consecutive pair of words, and look for the smallest results. For example, "statistically" at 0.0004 and "improbable" at 0.0003 would give a very statistically improbable 0.00000012, and yet this posting uses that phrase more than once. Pairs that are rare in general but recur within a single work are a pretty good indicator of that work's overall topics and themes.
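Here is a rough sketch of that idea in Python, just to make it concrete. Some assumptions of mine: word frequencies are estimated from the text itself (you would probably want a table built from a large reference corpus instead), the tokenizer is a crude regex, and the names `improbable_pairs`, `top` and `min_count` are made up for illustration.

```python
import re
from collections import Counter


def improbable_pairs(text, top=5, min_count=1):
    """Return up to `top` consecutive word pairs with the smallest
    frequency products, as (product, first, second, count) tuples."""
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 2:
        return []

    # Relative frequency of each word, estimated from this text alone.
    total = len(words)
    freq = {w: c / total for w, c in Counter(words).items()}

    # Count how often each consecutive pair occurs.
    pair_counts = Counter(zip(words, words[1:]))

    # Score each pair by the product of its two word frequencies.
    scored = [
        (freq[a] * freq[b], a, b, count)
        for (a, b), count in pair_counts.items()
        if count >= min_count
    ]
    scored.sort()  # smallest product (most improbable) first
    return scored[:top]


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat ran"
    for product, a, b, count in improbable_pairs(sample, top=3):
        print(f"{a} {b}: product={product:.4f} count={count}")
```

Passing `min_count=2` keeps only pairs that occur more than once, which is the interesting case described above: an improbable pair that the work keeps repeating anyway.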
--
[ e d @ h a l l e y . c c ]