Re: Brainstorming session: detecting plagiarism
by halley (Prior) on Jun 08, 2005 at 19:35 UTC
|
Professors already use essay-comparing software. It's not new, but it does an effective job. It's straightforward to look for matching fragments, even tiny ones such as short Markov chains or statistically improbable word pairs. However, if you compare two news feeds, say Reuters against the Associated Press on the same day, even those criteria fall apart.
Separately, you might need to refine your personal definition of plagiarism. I'd say that the two examples below show that a fairly mechanical paraphrasing is going on, but it's not clear that it would rise to the definition of plagiarism. Also, how do you allow for proper quotations?
Macbeth is presented as a mature man of definitely established character.
Macbeth is shown as an empowered man of well-established character.
-- [ e d @ h a l l e y . c c ]
| [reply] [Watch: Dir/Any] [d/l] |
|
Eventually, this code will be released with full documentation. One of the first things the documentation will make clear is that matching text does not necessarily mean plagiarism. Instead, the person looking at the text will have to compare the two documents (with the HTML linking I hope to provide) and determine for themselves whether plagiarism took place. My software will not be able to tell whether or not someone gave proper credit for a particular passage.
If the above was the only sentence in a 10,000 word document, I wouldn't say it's plagiarism. If that and several other sentences grouped together in one paragraph have a decent match and there's no attribution, then that's something which merits further study. Deciding whether or not plagiarism has occurred is not something software can do. It can merely flag likely candidates and will always have false positives and negatives.
And I'm aware that professors already have software to do this. The free software I've seen is very limited. (One merely does a "longest substring" match.) I'd like to provide free tools for them.
| [reply] [Watch: Dir/Any] |
|
By the way, do you have any information about calculating statistically improbable word pairs? I would be fascinated by that. I'd like to create an architecture whereby people could, at some cost in performance, pick and choose which features to use when comparing documents. This sounds like a great candidate.
| [reply] [Watch: Dir/Any] |
|
There are many lexicons out there, and they often include a ranking by frequency found in a large source such as the Bible or the New York Times. One such popular lexicon for English is the Moby Project, and it includes two such rankings. Google will give you hints there.
To find statistically improbable word pairs, one method is trivial: you take the product of word frequencies for each consecutive pair of words, and search for the smallest results. For example, "statistically=0.0004" and "improbable=0.0003" would give a very statistically improbable 0.00000012, and yet, this posting uses that phrase more than once. It's a pretty good indicator of a work's overall topics and themes.
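To make the arithmetic concrete, here is a quick sketch in Python; the frequency table is a toy stand-in for a real lexicon such as the Moby rankings mentioned above, and the floor value for unseen words is my own assumption.

```python
def improbable_pairs(words, freq, top=5):
    """Rank consecutive word pairs by the product of their individual
    frequencies; the smallest products are the most improbable pairs."""
    scored = []
    for a, b in zip(words, words[1:]):
        # Words missing from the table get a small floor frequency.
        score = freq.get(a, 1e-6) * freq.get(b, 1e-6)
        scored.append(((a, b), score))
    scored.sort(key=lambda item: item[1])
    return scored[:top]

# Toy frequency table; a real one would come from a large corpus.
freq = {"the": 0.05, "of": 0.03, "text": 0.002, "pairs": 0.001,
        "statistically": 0.0004, "improbable": 0.0003}
words = "the statistically improbable pairs of the text".split()
pairs = improbable_pairs(words, freq, top=2)
# "statistically improbable" scores 0.0004 * 0.0003 = 0.00000012
```

The lowest-scoring pairs tend to be exactly the topic-bearing phrases described above.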
-- [ e d @ h a l l e y . c c ]
| [reply] [Watch: Dir/Any] |
|
You might also want to check out Ted Pedersen's Ngram Statistics Package, with regard to the problem of improbable word pairs. The output can be easily sorted to highlight least likely occurrences. Of course you would want to compare to a corpus (of written English, say), to get a fairly good idea of "normal" parameters.
Good luck, and keep us posted, please!
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by blokhead (Monsignor) on Jun 08, 2005 at 20:10 UTC
|
I like the idea of comparing each pair of sentences for similarity. There are several metrics for sentence similarity that come to mind:
Edit distance & longest-common subsequence. These two are pretty similar: look for an edit distance less than a certain percent of the sentence length, or an LCS larger than a certain percent.
As you are doing now, I would do this on a word-by-word basis, not character-by-character. However, these algorithms can be generalized a bit, so that instead of each pair of words either agreeing or disagreeing, each pair can have a fractional level of agreement between 0 and 1. If you implemented a "synonym measurer", you could easily plug it into such generalized LCS/Levenshtein algorithms. (This could also quite easily cover changes in word stemming as well as synonymy.)
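As a sketch of what such a generalized word-level edit distance might look like (the synonym table and its 0.8 agreement score are illustrative assumptions, not anything from the post):

```python
def word_similarity(a, b, synonyms=None):
    """Fractional agreement between two words: 1.0 for an exact match,
    a partial score for known synonym pairs, 0.0 otherwise.
    The synonym set is a stand-in for a real thesaurus lookup."""
    if a == b:
        return 1.0
    if synonyms and ((a, b) in synonyms or (b, a) in synonyms):
        return 0.8
    return 0.0

def generalized_edit_distance(s, t, synonyms=None):
    """Word-level Levenshtein distance where substituting two similar
    words costs (1 - similarity) instead of a flat 1."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 1.0 - word_similarity(s[i - 1], t[j - 1], synonyms)
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # (partial) substitution
    return d[m][n]

syn = {("presented", "shown"), ("mature", "empowered")}
a = "macbeth is presented as a mature man".split()
b = "macbeth is shown as an empowered man".split()
dist = generalized_edit_distance(a, b, syn)
```

With the toy synonym set, the two near-paraphrases score 1.4 (two cheap synonym substitutions plus one full substitution) instead of the plain word-level distance of 3.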
Another metric you can use for sentence similarity is Zaxo's favorite method: using compression & information theory, although you may not be able to pull out as much information about *how* the sentences are similar as in the algorithms above.
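For the compression idea, one common formulation is the normalized compression distance. A minimal sketch using zlib (my choice of compressor; the post doesn't name one):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar
    strings, approaching 1 for unrelated ones. Similar texts share
    substrings, so their concatenation compresses well."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

s1 = b"Macbeth is presented as a mature man of definitely established character."
s2 = b"Macbeth is shown as an empowered man of well-established character."
s3 = b"The quick brown fox jumps over the lazy dog near the river bank today."

close = ncd(s1, s2)  # paraphrased pair
far = ncd(s1, s3)    # unrelated pair
```

As the post notes, this gives a single similarity number but says nothing about *where* or *how* the sentences overlap.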
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by BrowserUk (Patriarch) on Jun 08, 2005 at 22:02 UTC
|
You might be interested in a technique I was playing with a year or so ago called I-Match signatures. It performs similarity-based duplicate detection by using rolling "shingles" to produce a single hash value for a document. The technique is claimed (and seemed so to me) to be much less sensitive to simple transpositions of word order than the distance-based values you are using.
Basically, it builds a lexicon and rates document closeness in terms of the ratio of rare terms relative to each term's frequency within the lexicon.
Anyway, there is a web page with an extensive bibliography, and one paper I had bookmarked as very interesting, that you might like to read.
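A rough illustration of the term-filtering idea behind I-Match; the thresholds, toy lexicon, and hash choice below are my own assumptions for the sketch, not the paper's exact parameters:

```python
import hashlib

def imatch_signature(text, lexicon, lo=0.05, hi=0.5):
    """Sketch of an I-Match-style signature: drop terms that are too
    common or too rare according to a lexicon of document frequencies,
    then hash the sorted set of surviving terms. Documents differing
    only in common words or word order get identical signatures."""
    terms = {w for w in text.lower().split()
             if lo <= lexicon.get(w, 0.0) <= hi}
    return hashlib.sha1(" ".join(sorted(terms)).encode()).hexdigest()

# Toy document-frequency lexicon; a real one comes from the corpus.
lexicon = {"the": 0.9, "a": 0.9, "macbeth": 0.1, "character": 0.2,
           "established": 0.15, "man": 0.4}
sig1 = imatch_signature("Macbeth the man of established character", lexicon)
sig2 = imatch_signature("established character of the man Macbeth a", lexicon)
# sig1 == sig2: reordering and stop-word changes don't affect it
```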
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by CountZero (Bishop) on Jun 08, 2005 at 21:13 UTC
|
An interesting approach! I particularly liked the simple and elegant way of "hashing" the sentences and calculating the "distance" between them.
Another approach I was thinking of is moving a "sliding window" with a length of 3 or 4 words over the texts to be compared. Every run of 3 or 4 words is stored in a hash (with the words concatenated into one string as the key and the number of occurrences as the value), and then the hashes of both texts are compared: every match gives plagiarism points and every non-match gives originality points. The ratio of the plagiarism score to the originality score could perhaps serve as another metric. It would be more abstract than your comparison, but perhaps less prone to small deliberate changes or use of synonyms. In that respect one could vary the "sliding window" to look at, e.g., the first, second and fourth words (skipping the third) so as to account for substitutions.
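The sliding-window scheme described above might be sketched like this in Python (the window counting is as described; the exact normalization of the ratio is my own guess):

```python
from collections import Counter

def window_hash(words, n=3):
    """Count every run of n consecutive words, keyed by the
    concatenated words, as in the sliding-window idea above."""
    return Counter(" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1))

def plagiarism_ratio(text_a, text_b, n=3):
    """Matching windows score plagiarism points, non-matching ones
    originality points; return the ratio of the two scores."""
    a = window_hash(text_a.lower().split(), n)
    b = window_hash(text_b.lower().split(), n)
    matches = sum(min(a[k], b[k]) for k in a if k in b)
    misses = sum(a.values()) - matches
    return matches / max(misses, 1)

ta = "the quick brown fox jumps over the lazy dog"
r_same = plagiarism_ratio(ta, ta)
r_diff = plagiarism_ratio(ta, "completely different words entirely here now")
```

The skip-gram variant (first, second and fourth words) only needs the key-building expression changed to pick non-adjacent positions.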
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by zentara (Archbishop) on Jun 08, 2005 at 20:05 UTC
|
Recently I discovered a couple of articles on a news Web site where one author had clearly plagiarized from another. After searching manually for other instances, I realized that this was something that needs to be automated :-) It sounds like you are saying the plagiarism needs automating. I think a lot of students would agree. ;-)
I'm not really a human, but I play one on earth.
flash japh
| [reply] [Watch: Dir/Any] |
|
Just what I was thinking! Great idea..!
No doubt, as more intelligent anti-plagiarising software is developed, there will be greater demand for auto-plagiarising software (which runs a thesaurus over every noun and verb and perhaps rearranges sentences grammatically).
There's a fine line, however, between detecting plagiarism and matching on anything approximate; being too clever can sometimes cause headaches.
At university I found it was mostly English-as-a-second-language students who cut and pasted from textbooks. So much so that it was painfully obvious to any native speaker of English (textbook language often differs from summary/report language).
I don't envy the exam marker's job. Imagine having to read essays anyway!
| [reply] [Watch: Dir/Any] |
|
You are right... improved "anti-plagiarism software" will result in advances in "plagiarism-detection avoidance". It's sort of like a "cold war" between teachers and students. I was watching one of those TV exposés on buying term papers, and I can just see them selling "plagiarism-detection insurance" as an extra add-on cost. :-) Plus, it's the better business to be in..... there is always a fresh supply of "student-customers", but a limited market of teachers. ;-)
I'm not really a human, but I play one on earth.
flash japh
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by rob_au (Abbot) on Jun 09, 2005 at 00:04 UTC
|
There was an excellent article on this very topic, using perl to detect plagiarism (specifically for perl code), in the March 2004 edition of The Perl Journal - If you have a login, it can be downloaded from http://www.tpj.com/documents/tpj0403b/
perl -le "print unpack'N', pack'B32', '00000000000000000000001000000000'"
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by ww (Archbishop) on Jun 09, 2005 at 03:49 UTC
|
A few more on-line references, FWTW (and I think the first is best, though it is NOT the news story I saw recently, which identified several commercial providers of anti-plagiarism services to the education business).
Much of this lines up nicely with your schema; perhaps some will be useful in terms of additional or alternate views. | [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by sfink (Deacon) on Jun 09, 2005 at 00:36 UTC
|
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by bluto (Curate) on Jun 08, 2005 at 22:00 UTC
|
++. FWIW, over ten years ago a former grad professor of mine talked about detecting plagiarized source code, and it looks like he's published papers about it in the meantime. I imagine there is still a lot of research interest in that area as well (i.e. mainly because success brings $$$ to lawyers :-). | [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
|
Todd Proebsting is a compiler guy. I'm pretty sure I found a reference to plagiarism detection there just the other day, but am having trouble finding it now. So either I'm going senile or Google is playing mind games with me (probably both). In any case, there is a paper at that link about recovering Java source from byte code, which is mildly on-topic. Also, check out his discussion of "Proebsting's Law" for compiler optimizations.
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by salva (Canon) on Jun 09, 2005 at 07:41 UTC
|
Maybe Google's "Similar pages" links could be used to get a first set of pages in which to look for some specific plagiarized content. | [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by samizdat (Vicar) on Jun 09, 2005 at 12:37 UTC
|
First off, Ovid, ++ for taking on a worthy challenge. I really like the thrust of the thread you've started and the code you've presented.
My impression of plagiarism is that it's usually localized, and that it is very worthwhile to look for hot paragraphs. I can see that this would be an easy step forward for you from where you've left off.
Kudos! :D | [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by Your Mother (Archbishop) on Jun 10, 2005 at 01:53 UTC
|
Vector space comparisons, like (I think) the doc BrowserUK pointed to describes, are pretty easy. I played with them a bit (two years back) and got really cool, easily tweaked results.
I learned about the technique and got going from a nice article on Perl.com, "Building a Vector Space Search Engine in Perl."
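The core of a vector space comparison is small enough to sketch here (a plain bag-of-words cosine in Python, without the term weighting a fuller implementation might add):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Bag-of-words vector space comparison: each document becomes a
    word-count vector, and similarity is the cosine of the angle
    between the two vectors (1.0 = identical direction)."""
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

s1 = "Macbeth is presented as a mature man of definitely established character."
s2 = "Macbeth is shown as an empowered man of well-established character."
s3 = "The quick brown fox jumps over the lazy dog near the river bank today."

sim_close = cosine_similarity(s1, s2)  # paraphrase: shares many words
sim_far = cosine_similarity(s1, s3)    # unrelated: shares none
```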
I really hope you'll keep us updated if you make progress into this. I was plagiarized at school once and it really stung. Go get 'em.
| [reply] [Watch: Dir/Any] |
Re: Brainstorming session: detecting plagiarism
by Anonymous Monk on Jun 11, 2005 at 01:32 UTC
|
There is a little experiment that is instructive in this case: take a text (any text), get the histogram of its characters (that is, how many of each there are), then of every pair (note that "abc" has the pairs "ab" and "bc"), then of every triple, and so on. Obviously, if you go all the way to the length of the text, it will be possible to reconstruct the text from the set of histograms. Now the real test begins: how large a set (how many histograms) do you need to reconstruct the text (approximately)?
To reconstruct the text use a random number generator to output letters checking that all statistic properties of the set of histograms are met by the constructed string.
The interesting result is that most texts need only 9 histograms. What if you only compare the histograms? | [reply] [Watch: Dir/Any] |
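A sketch of comparing texts by their character n-gram histograms alone, as the experiment suggests (the normalized distance formula is my own; the post doesn't specify one):

```python
from collections import Counter

def ngram_histograms(text, max_n=3):
    """Histograms of every run of 1..max_n characters, as in the
    experiment above ("abc" yields the pairs "ab" and "bc")."""
    return [Counter(text[i:i + n] for i in range(len(text) - n + 1))
            for n in range(1, max_n + 1)]

def histogram_distance(text_a, text_b, max_n=3):
    """Compare two texts by histograms alone: the sum of absolute
    count differences, normalized by the total counts, so that
    identical texts score 0.0 and disjoint texts score 1.0."""
    total_diff = total = 0
    for ha, hb in zip(ngram_histograms(text_a, max_n),
                      ngram_histograms(text_b, max_n)):
        keys = set(ha) | set(hb)
        total_diff += sum(abs(ha[k] - hb[k]) for k in keys)
        total += sum(ha.values()) + sum(hb.values())
    return total_diff / total if total else 0.0

d_same = histogram_distance("macbeth is presented", "macbeth is presented")
d_diff = histogram_distance("macbeth is presented", "quick brown fox jumps")
```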