Re^2: Brainstorming session: detecting plagiarism

By the way, do you have any information about calculating statistically improbable word pairs? I would be fascinated to see that. I'd like to create an architecture whereby people could, at the potential cost of performance, pick and choose which features they would like to use when comparing. This sounds like a great choice.

Cheers,
Ovid

New address of my CGI Course.

Re^3: Brainstorming session: detecting plagiarism by halley (Prior) on Jun 09, 2005 at 00:57 UTC
There are many lexicons out there, and they often include a ranking by word frequency as found in a large source such as the Bible or the New York Times. One popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there. One method for finding statistically improbable word pairs is trivial: take the product of the word frequencies for each consecutive pair of words, and look for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet this posting uses that phrase more than once. Pairs like that are a pretty good indicator of a work's overall topics and themes. -- `[ e d @ h a l l e y . c c ]`
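The product-of-frequencies idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `improbable_pairs`, the `default` frequency for words missing from the lexicon, and every frequency other than the two quoted in the post are made up for the example, not taken from the Moby Project or any real lexicon.

```python
import re

def improbable_pairs(text, freq, default=1e-6, top=5):
    """Rank consecutive word pairs by the product of their word frequencies.

    freq maps a word to its relative frequency in some lexicon.
    default is an assumed frequency for words the lexicon does not list.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for first, second in zip(words, words[1:]):
        # A smaller product means a less probable pair under this naive model.
        scores[(first, second)] = freq.get(first, default) * freq.get(second, default)
    # Smallest products first.
    return sorted(scores.items(), key=lambda item: item[1])[:top]

# Hypothetical frequencies; only the first two values come from the post.
freq = {
    "statistically": 0.0004,
    "improbable": 0.0003,
    "the": 0.05,
    "word": 0.002,
    "pairs": 0.001,
}
ranked = improbable_pairs("the statistically improbable word pairs", freq)
for pair, score in ranked:
    print(pair, score)
```

Note that this treats the two words as independent, so it really measures how rare the individual words are rather than how surprising the combination is.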
Re^3: Brainstorming session: detecting plagiarism by planetscape (Chancellor) on Jun 09, 2005 at 05:48 UTC
You might also want to check out Ted Pedersen's Ngram Statistics Package for the problem of improbable word pairs. Its output can easily be sorted to highlight the least likely occurrences. Of course, you would want to compare against a corpus (of written English, say) to get a fairly good idea of "normal" parameters. Good luck, and keep us posted, please! planetscape
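As a rough illustration of comparing against a reference corpus, here is a small Python sketch. It does not use the Ngram Statistics Package or mirror its interface; the function names, the crude add-one adjustment for unseen pairs, and the toy "corpus" are all assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def bigrams(words):
    return list(zip(words, words[1:]))

def rank_against_corpus(document, reference):
    """Sort the document's word pairs by how rare they are in the reference text."""
    ref_counts = Counter(bigrams(tokenize(reference)))
    total = sum(ref_counts.values())
    doc_pairs = set(bigrams(tokenize(document)))
    # Crude add-one adjustment so pairs unseen in the reference get a small,
    # nonzero score instead of zero.
    scored = [(pair, (ref_counts[pair] + 1) / (total + 1)) for pair in doc_pairs]
    # Least likely pairs first; ties are broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (item[1], item[0]))

# Toy stand-in for a real corpus of written English.
reference = "the cat sat on the mat the cat ran"
document = "the cat sat on the purple mat"
ranked_pairs = rank_against_corpus(document, reference)
for pair, score in ranked_pairs:
    print(pair, round(score, 3))
```

With a real corpus, the pairs at the top of this list are the ones the document uses that ordinary text rarely or never does.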

