how to get rid of cut-and-paste sins?

by hexcoder (Curate)
on Feb 08, 2008 at 22:15 UTC ( [id://667084] : perlmeditation )

Dear monks,

as a fan of quality assurance modules like Perl::Critic, I am meditating on a plugin that detects cut-and-paste fragments in Perl code and suggests isolating each duplicate in a separate subroutine/method. That could be handy when I suddenly have to maintain a large chunk of foreign code. AFAIK Perl::Critic has no such plugin yet.

What would be the best strategy and infrastructure to look at code duplicates?

  • on the source code level? Then only verbatim copies (omitting whitespace) would qualify with a moderately complex recognizer.
  • or on the op tree level? Then non-verbatim copies sharing the same code structure would also qualify. Differently named variables and syntactically equivalent code fragments would not defeat the duplicate detection. (Reminder: I need to check what can be done within Perl::Critic and what needs B::* modules)
  • There should probably be a minimum length (?) as well as a minimum frequency (2) threshold for the duplicate to be considered.

    How to detect the duplicates? The plugin should be able to recognize duplicates at 'long' distances, even better across files/modules. I am thinking of attaching an MD4 checksum to clusters of statements for fast recognition (a hash of clusters keyed by their MD4 checksums).
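A minimal sketch of the hashing idea above: collapse whitespace, slide a fixed-size window over the non-blank lines of each file, and key every window by its digest. It uses the core Digest::MD5 module instead of MD4 (Digest::MD4 is a separate CPAN install). The name find_duplicates, its parameters, and the in-memory input format are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: %$sources maps a file name to its source text.
# Returns { digest => [ [file, first_line], ... ] } for every window of
# $min_lines non-blank lines that occurs at least $min_freq times.
sub find_duplicates {
    my ($sources, $min_lines, $min_freq) = @_;
    my %seen;
    for my $file (sort keys %$sources) {
        my @lines;
        my $n = 0;
        for my $raw (split /\n/, $sources->{$file}) {
            $n++;
            (my $text = $raw) =~ s/\s+/ /g;   # collapse whitespace runs
            $text =~ s/^ | $//g;              # trim both ends
            next if $text eq '';              # ignore blank lines
            push @lines, [ $n, $text ];
        }
        # Slide a window of $min_lines lines over the file.
        for my $i (0 .. @lines - $min_lines) {
            my @window = @lines[ $i .. $i + $min_lines - 1 ];
            my $key = md5_hex(join "\n", map { $_->[1] } @window);
            push @{ $seen{$key} }, [ $file, $window[0][0] ];
        }
    }
    # Keep only windows that reach the frequency threshold.
    delete $seen{$_} for grep { @{ $seen{$_} } < $min_freq } keys %seen;
    return \%seen;
}
```

Because every file feeds the same hash, duplicates across files fall out for free. Overlapping hits (a 10-line copy reported as several adjacent windows) would still need to be merged into one region.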

    Setting the right cluster size might be tricky. Natural places to make the cuts would have minimal 'statefulness', e.g. just before a scope is opened and just after it is closed. The duplicate's badness could be determined by its frequency and code length.
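To illustrate cutting at scope boundaries, here is a rough sketch that ends a cluster whenever the brace depth returns to zero, so each top-level statement or block becomes one cluster. The function split_clusters is hypothetical, and the brace counting assumes braces never occur in strings, regexes, comments, or heredocs. A real plugin would more likely walk the document tree from PPI, which Perl::Critic is built on.

```perl
use strict;
use warnings;

# Naive splitter (assumption: every brace in the source is a real
# block or hash delimiter). A cluster ends on the line that brings
# the brace depth back to zero.
sub split_clusters {
    my ($source) = @_;
    my (@clusters, @current);
    my $depth = 0;
    for my $line (split /\n/, $source) {
        next if $line =~ /^\s*$/ && !@current;   # blank line between clusters
        push @current, $line;
        my $opens  = () = $line =~ /\{/g;        # count of '{' on this line
        my $closes = () = $line =~ /\}/g;        # count of '}' on this line
        $depth += $opens - $closes;
        if ($depth <= 0) {
            push @clusters, join("\n", @current);
            @current = ();
            $depth   = 0;
        }
    }
    push @clusters, join("\n", @current) if @current;   # unterminated tail
    return @clusters;
}
```

Each cluster could then be normalized and hashed as one unit, with a badness score such as frequency times line count.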

    The problem then is to find all duplicate-subsets in a set. That sounds like autocorrelation or a diff of the whole with parts of itself. Any suggestions for good algorithms or even existing modules?

    I welcome your feedback, thanks.

    Replies are listed 'Best First'.
    Re: how to get rid of cut-and-paste sins?
    by moritz (Cardinal) on Feb 08, 2008 at 22:28 UTC
      To put your question in different terms: How do I detect plagiarism, even if it's done by me? ;-)

      There's a paper on that topic here, it's about a program called moss.

      There are other code similarity analyzers out there, it's surely worth a look.

      If you want to detect blatant copy & paste a simple similarity search should be enough, for anything more elaborate you need a parse tree or an AST on which you can perform similarity checks.
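As one possible reading of "a simple similarity search", the sketch below compares two fragments by the Jaccard similarity of their k-token shingles. This is a generic technique, not what moss does; the function names and the whitespace tokenizer are assumptions made for illustration.

```perl
use strict;
use warnings;

# Set of all $k-token shingles in $text, where a token is any run of
# non-whitespace characters.
sub shingles {
    my ($text, $k) = @_;
    my @tok = split ' ', $text;
    my %set;
    $set{ join ' ', @tok[ $_ .. $_ + $k - 1 ] } = 1 for 0 .. @tok - $k;
    return \%set;
}

# Jaccard similarity: size of the intersection divided by the size of
# the union of the two shingle sets, a value between 0 and 1.
sub similarity {
    my ($x, $y, $k) = @_;
    my ($sx, $sy) = (shingles($x, $k), shingles($y, $k));
    my $common = grep { $sy->{$_} } keys %$sx;
    my $union  = keys(%$sx) + keys(%$sy) - $common;
    return $union ? $common / $union : 0;
}
```

A score near 1 suggests a blatant copy. Renamed variables lower the score quickly, which is where a parse tree or AST comparison starts to pay off.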

      There's much research done on that topic, you should find some useful papers and implementations with your favorite search engine ;-)

    Re: how to get rid of cut-and-paste sins?
    by hossman (Prior) on Feb 08, 2008 at 23:58 UTC

      As noted, there has been some fairly extensive research into "Copy Paste Detection" (Side note: Alex Aiken was by far my favorite professor in College)

      The big problem with a lot of naive approaches to copy paste detection is that it's very rare for whole chunks of code to be duplicated verbatim ... frequently one version gets modified, variable names are changed, lines are inserted, etc.
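One common way to soften the renamed-variable problem is to normalize identifiers before comparing. The sketch below replaces every variable name with a placeholder numbered by first appearance. It is deliberately crude (regex-based, blind to strings, comments, and ${name} forms), and normalize_identifiers is a made-up name; inserted or deleted lines still need a fuzzier comparison.

```perl
use strict;
use warnings;

# Rename variables to v1, v2, ... in order of first appearance, keeping
# the sigil, then collapse whitespace. Two fragments that differ only
# in variable names and spacing normalize to the same string.
sub normalize_identifiers {
    my ($code) = @_;
    my %alias;
    my $next = 0;
    $code =~ s{([\$\@%])([A-Za-z_]\w*)}{
        my ($sigil, $name) = ($1, $2);
        $alias{$name} ||= 'v' . ++$next;   # same name, same placeholder
        $sigil . $alias{$name};
    }ge;
    $code =~ s/\s+/ /g;      # collapse whitespace runs
    $code =~ s/^ | $//g;     # trim both ends
    return $code;
}
```

For example, `my $total = 0; $total += $_ for @items;` and `my $sum = 0; $sum += $_ for @list;` both normalize to `my $v1 = 0; $v1 += $v2 for @v3;`, so a hash-based detector would treat them as the same fragment.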

      The PMD project (a Java counterpart to Perl::Critic) has a CPD sub-project that has gone through several iterations and algorithms. It's implemented in Java, and doesn't seem to currently support Perl, but it is free, and adding new language support is (in theory) really straightforward if you know some Java and implement a simple Tokenizer interface.

    Re: how to get rid of cut-and-paste sins?
    by planetscape (Chancellor) on Mar 22, 2008 at 20:54 UTC
    Re: how to get rid of cut-and-paste sins?
    by goibhniu (Hermit) on Feb 11, 2008 at 16:48 UTC

      Check out antirice's (Embed perl into your clipboard). Your goals seem to overlap with what he accomplished in some ways. You may even be able to use his infrastructure and reduce your dupe logic / Perl::Critic logic to a macro in his system.

      I had problems getting it going, but once I got IPC::Run working (on WINXP), it worked well and I've found it to be a much appreciated small convenience at times.

      #my sig used to say 'I humbly seek wisdom. '. Now it says:
      use strict;
      use warnings;
      I humbly seek wisdom.