in reply to Re^2: Brainstorming session: detecting plagiarism
in thread Brainstorming session: detecting plagiarism

There are many lexicons out there, and they often include word-frequency rankings drawn from a large corpus such as the Bible or the New York Times. One popular lexicon for English is the Moby Project, which includes two such rankings. Google will give you hints there.

To find statistically improbable word pairs, one method is trivial: take the product of the word frequencies for each consecutive pair of words, and look for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet this posting uses that phrase more than once. Improbable pairs that keep recurring are a pretty good indicator of a work's overall topics and themes.
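Here is a rough sketch of that product-of-frequencies idea in Python, just to make it concrete. The function name improbable_pairs, the freq dictionary (word to relative frequency, which you would load from whatever lexicon you pick), and the default value used for words missing from the lexicon are all my own assumptions for illustration. Treating the two words as independent is also a simplification.

```python
import re
from collections import Counter

def improbable_pairs(text, freq, default=1e-6, top=5):
    """Rank consecutive word pairs by the product of their word frequencies.

    freq maps a lowercase word to its relative frequency in some lexicon.
    default is an assumed stand-in frequency for words the lexicon lacks.
    Returns up to `top` tuples of (product, times_seen, first, second),
    least probable first; ties go to the pair seen more often.
    """
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Count how often each consecutive pair occurs in the text.
    pair_counts = Counter(zip(words, words[1:]))
    scored = [
        (freq.get(a, default) * freq.get(b, default), count, a, b)
        for (a, b), count in pair_counts.items()
    ]
    scored.sort(key=lambda item: (item[0], -item[1]))
    return scored[:top]
```

Pairs with a tiny product but a times_seen above one are the interesting ones. Note that any word missing from the lexicon gets the default frequency, so a small default will push unknown words to the top of the list; tune it or filter those out as you see fit.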

--
[ e d @ h a l l e y . c c ]
