in reply to Brainstorming session: detecting plagiarism
You might be interested in a technique I was playing with a year or so ago called I-Match signatures. It performs similarity-based duplicate detection by using rolling "shingles" to produce a single hash value for a document. The technique is claimed to be much less sensitive to simple transpositions of word order than the distance-based values you are using, and that matched what I saw.
Basically, it builds a lexicon and rates document closeness in terms of the ratio of rare terms relative to those terms' frequency within the lexicon.
Anyway, there is a web page with an extensive bibliography, and one paper that I had bookmarked as very interesting, that you might like to read.
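
To make the lexicon idea a bit more concrete, here is a rough Python sketch of how I understand an I-Match-style signature to work. It is not taken from the paper or from my old code: the names `tokenize`, `build_lexicon` and `imatch_signature`, the IDF cut-offs and the toy corpus are all made up for illustration. It also hashes the sorted set of surviving terms rather than rolling shingles, which is the simplest variant I know of and is what makes it indifferent to word order.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_lexicon(corpus, min_idf, max_idf):
    """Keep terms whose inverse document frequency lies in [min_idf, max_idf].

    The cut-offs are assumptions for this sketch; in practice they would
    be tuned against a real collection.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        # Count each term once per document.
        doc_freq.update(set(tokenize(text)))
    lexicon = set()
    for term, df in doc_freq.items():
        idf = math.log(n_docs / df)
        if min_idf <= idf <= max_idf:
            lexicon.add(term)
    return lexicon


def imatch_signature(text, lexicon):
    """Hash the sorted set of the document's terms that are in the lexicon.

    Returns None when no term survives the filter, so that empty
    documents are not all reported as duplicates of each other.
    """
    kept = sorted(set(tokenize(text)) & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


# Toy corpus, purely hypothetical.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog jumps over the quick brown fox",
    "the cat sat on the mat",
    "the stock market fell sharply today",
]

# "the" appears in every document, so its IDF is 0 and it is dropped.
lexicon = build_lexicon(corpus, min_idf=0.1, max_idf=2.0)
signatures = [imatch_signature(doc, lexicon) for doc in corpus]

# The first two documents differ only in word order.
print(signatures[0] == signatures[1])
```

Two documents are then treated as duplicates when their signatures are equal, so the comparison is a hash lookup rather than a pairwise distance calculation. The price is that it is all or nothing: a single extra lexicon term changes the hash completely.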