PerlMonks  

Re^2: Brainstorming session: detecting plagiarism

by Ovid (Cardinal)
on Jun 08, 2005 at 22:01 UTC


in reply to Re: Brainstorming session: detecting plagiarism
in thread Brainstorming session: detecting plagiarism

By the way, do you have any information about calculating statistically improbable word pairs? I would be most fascinated by that. I'd like to create an architecture whereby people could, at the potential cost of performance, pick and choose which features they would like to use when comparing. This sounds like a great choice.

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re^3: Brainstorming session: detecting plagiarism
by halley (Prior) on Jun 09, 2005 at 00:57 UTC
    There are many lexicons out there, and they often include a ranking by frequency found in a large source such as the Bible or the New York Times. One such popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there.

    To find statistically improbable word pairs, one method is trivial: take the product of the word frequencies for each consecutive pair of words, and look for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet this posting uses that phrase more than once. Pairs that recur despite such a low expected frequency are a pretty good indicator of a work's overall topics and themes.
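The product-of-frequencies idea above can be sketched roughly as follows. This is a minimal illustration in Python, not anyone's actual implementation: the `WORD_FREQ` table, the `DEFAULT_FREQ` fallback, and the function names are all hypothetical, and real frequencies would have to come from a lexicon such as the ones mentioned above.

```python
import re

# Hypothetical unigram frequencies. Only the first two values come from the
# example in the post; the rest are made up for illustration.
WORD_FREQ = {
    "statistically": 0.0004,
    "improbable": 0.0003,
    "the": 0.05,
    "of": 0.03,
    "word": 0.001,
    "pairs": 0.0005,
}

# Assumed fallback for words missing from the lexicon. A very small value
# makes unknown words look improbable, which may or may not be desirable.
DEFAULT_FREQ = 1e-6


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def improbable_pairs(text, freq, top=5):
    """Return the `top` consecutive word pairs with the smallest frequency product."""
    words = tokenize(text)
    scores = {}
    for first, second in zip(words, words[1:]):
        scores[(first, second)] = (
            freq.get(first, DEFAULT_FREQ) * freq.get(second, DEFAULT_FREQ)
        )
    # Smallest product first: these are the least expected pairs.
    return sorted(scores.items(), key=lambda item: item[1])[:top]


if __name__ == "__main__":
    sample = "the word pairs of statistically improbable word pairs"
    for pair, score in improbable_pairs(sample, WORD_FREQ, top=3):
        print(pair, score)
```

Counting how often each low-scoring pair actually occurs in the text would be the natural next step, since it is the repeated improbable pairs that hint at a work's themes.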

    --
    [ e d @ h a l l e y . c c ]

Re^3: Brainstorming session: detecting plagiarism
by planetscape (Chancellor) on Jun 09, 2005 at 05:48 UTC

    You might also want to check out Ted Pedersen's Ngram Statistics Package, with regard to the problem of improbable word pairs. The output can be easily sorted to highlight least likely occurrences. Of course you would want to compare to a corpus (of written English, say), to get a fairly good idea of "normal" parameters.
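Comparing against a reference corpus could look something like the sketch below. It is written in Python and does not use or imitate the Ngram Statistics Package; the function names, the add-one smoothing, and the ratio used as a score are assumptions chosen to keep the example short.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count consecutive word pairs in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def unusual_bigrams(document, reference, top=5):
    """Rank the document's bigrams by how over-represented they are
    relative to a reference corpus (higher score = more unusual)."""
    doc = bigram_counts(document)
    ref = bigram_counts(reference)
    doc_total = sum(doc.values())
    ref_total = sum(ref.values())
    # Number of distinct bigrams seen in either text, used for smoothing.
    vocab = len(set(doc) | set(ref))

    scored = []
    for pair, count in doc.items():
        doc_rate = count / doc_total
        # Add-one smoothing so pairs unseen in the reference do not divide by zero.
        ref_rate = (ref[pair] + 1) / (ref_total + vocab)
        scored.append((pair, doc_rate / ref_rate))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


if __name__ == "__main__":
    reference = "the cat sat on the mat and the dog sat on the rug"
    document = "the quantum mat sat on the quantum mat"
    for pair, score in unusual_bigrams(document, reference, top=3):
        print(pair, round(score, 2))
```

A usable reference corpus would need to be far larger than this toy string for the "normal" rates to mean anything.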

    Good luck, and keep us posted, please!

    planetscape
