Re: Brainstorming session: detecting plagiarism

You might be interested in a technique I was playing with a year or so ago called I-Match signatures. It performs similarity-based duplicate detection by using rolling "shingles" to produce a single hash value for a document. The technique is claimed to be (and seemed to me to be) much less sensitive to simple transpositions of word order than the distance-based values you are using.

Basically, it builds a lexicon and rates document closeness in terms of the ratio of rare terms relative to each term's frequency within the lexicon.
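To make that a bit more concrete, here is a minimal Python sketch of a lexicon-filtered signature, as I understand the general idea. It is not the published I-Match algorithm verbatim: the function names (tokenize, build_idf, signature), the use of log(N/df) as the rarity score, the lo/hi cut-offs and the choice of SHA-1 are all my own assumptions for illustration.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Build the lexicon: map each term to log(N / document frequency)."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        # Count each term once per document, however often it occurs.
        df.update(set(tokenize(doc)))
    return {term: math.log(n / count) for term, count in df.items()}


def signature(text, idf, lo, hi):
    """Hash the sorted set of terms whose rarity score lies in [lo, hi].

    The cut-offs lo and hi are hypothetical tuning knobs. Terms missing
    from the lexicon are ignored. Returns None if no term survives.
    """
    kept = sorted({t for t in tokenize(text) if t in idf and lo <= idf[t] <= hi})
    if not kept:
        return None
    # Sorting the unique terms is what makes word order irrelevant.
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

Because only the sorted set of surviving terms is hashed, two documents that differ solely in word order (or in terms filtered out by the cut-offs) get the same signature, so candidate duplicates can be found with a plain hash lookup rather than pairwise distance calculations. The flip side is that a single differing term inside the kept band changes the signature completely, so the cut-offs need tuning against real data.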

Anyway, there is a web page with an extensive bibliography, and one paper that I had bookmarked as very interesting, that you might like to read.
