Re^3: Brainstorming session: detecting plagiarism

You might also want to look at Ted Pedersen's Ngram Statistics Package (NSP) for the problem of improbable word pairs. Its output can easily be sorted to bring the least likely pairs to the top. Of course, you would want to compare against a reference corpus (of written English, say) to get a reasonable idea of what "normal" values look like.
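
To make the idea concrete, here is a rough sketch in Python of scoring the word pairs in a suspect text against bigram counts from a reference corpus. It does not use NSP or reproduce its measures; the function names, the tokenizer, and the add-one smoothing are all my own assumptions for illustration, and a real run would need a far larger reference corpus than a toy string.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(tokens):
    # Count adjacent word pairs.
    return Counter(zip(tokens, tokens[1:]))


def rank_unlikely_pairs(suspect_text, reference_text):
    """Return (w1, w2, log_prob) for each distinct pair in the suspect text,
    least likely first, using add-one smoothed P(w2 | w1) estimated from
    the reference text."""
    ref_tokens = tokenize(reference_text)
    ref_bigrams = bigram_counts(ref_tokens)
    # How often each word starts a bigram in the reference text.
    first_counts = Counter(ref_tokens[:-1])
    # Vocabulary size, plus one slot for words never seen in the reference.
    vocab = len(set(ref_tokens)) + 1

    scored = []
    for w1, w2 in set(bigram_counts(tokenize(suspect_text))):
        p = (ref_bigrams[(w1, w2)] + 1) / (first_counts[w1] + vocab)
        scored.append((math.log(p), w1, w2))

    scored.sort()  # most negative log-probability first
    return [(w1, w2, lp) for lp, w1, w2 in scored]


if __name__ == "__main__":
    reference = "the cat sat on the mat the cat sat on the mat"
    suspect = "the cat sat on the zebra"
    for w1, w2, lp in rank_unlikely_pairs(suspect, reference):
        print(f"{lp:8.3f}  {w1} {w2}")
```

The pairs at the top of the list are the ones the reference corpus finds most surprising, and those are the ones worth a closer look.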

Good luck, and keep us posted, please!

planetscape
