rhesa has asked for the wisdom of the Perl Monks concerning the following question:

I'm analyzing our code base as an aid in refactoring. We incurred a lot of technical debt in the very early stages of our startup, and it's time we start paying it off.

The first step I took was to run everything through perltidy, so that I have consistent formatting throughout.

The second step was running Perl::Metrics::Simple on our code, which already reveals a lot of refactoring targets. The largest subs and the ones with the highest complexity will be first on my list.

The third step is the one I could use help with. The idea is to find duplicated code resulting from old-fashioned copy/paste programming. What I think I need for this is the inverse of a diff tool: something that finds blocks of code that are identical.

Do tools exist that can find blocks of code occurring more than once in a batch of files?

I've begun with a naive indexing of every line, marking the file name and line number of every occurrence. This allows me to query the result in a variety of interesting ways. For example, here are the six most common lines:
+----------------------------+-------+
| loc                        | refs  |
+----------------------------+-------+
|                            | 11381 |
| }                          |  4358 |
| {                          |  4121 |
| );                         |   746 |
| else                       |   475 |
| 1;                         |   404 |
+----------------------------+-------+
As you can see, we're using the BSD/Allman brace style.
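A stripped-down sketch of the indexing step looks roughly like this (the real script feeds a data model rather than printing, and the normalization is a bit smarter, but the idea is the same):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my %index;    # normalized line => list of [ file, line number ]

    find( sub {
        return unless /\.p[lm]$/;          # only .pl and .pm files
        my $file = $File::Find::name;
        open my $fh, '<', $_ or die "Can't open $file: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            $line =~ s/^\s+|\s+$//g;       # strip leading/trailing whitespace
            push @{ $index{$line} }, [ $file, $. ];
        }
    }, @ARGV );

    # Most common lines first, as in the table above
    for my $line ( sort { @{ $index{$b} } <=> @{ $index{$a} } } keys %index ) {
        printf "%6d  %s\n", scalar @{ $index{$line} }, $line;
    }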

If anyone is interested, I can show my data model, cross-referencing script, and the web-based browser I've made for this.

Replies are listed 'Best First'.
Re: Refactoring tools for copy/paste jobs
by Corion (Patriarch) on Oct 09, 2006 at 13:29 UTC

    Jarkko Hietaniemi used http://pmd.sourceforge.net/cpd.html to do duplicate code detection against the Perl source code, but I'm not sure that this tool will work well with Perl, as it does a lot of parsing. Google found me another paper on lightweight detection of duplicated code, which sounds easy to implement and use with Perl. The trick in the paper seems to be to use a sliding window over sections of code and then do a bag comparison, to protect against swapped lines and the like.
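
    Something along those lines could be prototyped in a few lines of Perl. A rough, untested sketch (window size and normalization are of course tunable):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        my $window = 5;               # lines per window
        my %seen;                     # bag fingerprint => list of "file:line"

        for my $file (@ARGV) {
            open my $fh, '<', $file or die "Can't open $file: $!";
            chomp( my @lines = <$fh> );
            s/^\s+|\s+$//g for @lines;        # normalize whitespace

            for my $i ( 0 .. $#lines - $window + 1 ) {
                # Sorting the window turns it into a bag, so swapped
                # lines still produce the same fingerprint
                my @bag = sort @lines[ $i .. $i + $window - 1 ];
                push @{ $seen{ md5_hex( join "\n", @bag ) } }, "$file:" . ( $i + 1 );
            }
        }

        # Any fingerprint seen more than once is a duplication candidate
        for my $locations ( grep { @$_ > 1 } values %seen ) {
            print "Possible duplicate block at: @$locations\n";
        }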

      Thanks, Corion. That paper certainly gives me a good starting point. For some reason, I could not come up with useful keywords to search for on http://search.cpan.org.
Re: Refactoring tools for copy/paste jobs
by philcrow (Priest) on Oct 09, 2006 at 14:31 UTC
    Though I haven't used it for this purpose, the docs for Algorithm::Diff suggest that you can call its Same method instead of the Diff method to get the similarities instead of the differences.
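
    Untested, but from the synopsis it would go something like this for two files (the module and its methods are real; the rest is just a guess at how you'd wire it up):

        use strict;
        use warnings;
        use Algorithm::Diff;

        my ( $file_a, $file_b ) = @ARGV;
        my @a = do { open my $fh, '<', $file_a or die $!; <$fh> };
        my @b = do { open my $fh, '<', $file_b or die $!; <$fh> };

        my $diff = Algorithm::Diff->new( \@a, \@b );
        while ( $diff->Next() ) {
            next unless $diff->Same();        # skip hunks where the files differ
            my @block = $diff->Items(1);      # the lines both files share here
            next if @block < 5;               # ignore trivially small matches
            printf "%d shared lines at %s:%d and %s:%d\n",
                scalar @block, $file_a, $diff->Min(1) + 1, $file_b, $diff->Min(2) + 1;
        }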

    Phil

Re: Refactoring tools for copy/paste jobs
by jbert (Priest) on Oct 09, 2006 at 15:10 UTC
    I would imagine that your line-based approach will - once you exclude the standard-for-your-coding-style lines above - show you most of your copy-and-paste.

    I'd actually start from the other end, looking for infrequently-occurring-but-non-unique lines. A line which is used in exactly 2 or 3 places is more likely to be part of a block of cut-and-paste than one used in 100.
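
    Assuming your index ends up as a hash of line => list of [ file, line number ] pairs (adjust to whatever your data model actually is), pulling those out is a short query, something like:

        # Lines that appear more than once, but not "everywhere"
        for my $line ( keys %index ) {
            my @places = @{ $index{$line} };
            next unless @places >= 2 && @places <= 3;
            print "$line\n";
            print "    $_->[0]:$_->[1]\n" for @places;
        }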

    When you are refactoring, a nice option you have in Perl is to make use of first-class functions/closures/anonymous subs.

    This allows you to have more "fuzzy matching" between blocks of similar-but-not-identical code. You can pull out the shared boilerplate as a new method/sub and pass into it a closure which does the 'bit which is different for each occurrence'.

    It's a lightweight alternative to putting together an inheritance hierarchy to share code, which you may see in design patterns etc. For example:

    sub print_tree {
        my $tree = shift;
        print_tree($tree->left);
        print $tree->node;
        print_tree($tree->right);
    }

    sub sum_tree {
        my $tree = shift;
        return sum_tree($tree->left) + $tree->node + sum_tree($tree->right);
    }
    could become (sorry, this is untested - also, this exact example might be better as a Tree class, which probably already exists on CPAN, but I hope the point survives):
    sub walk_tree {
        my $tree     = shift;
        my $per_node = shift;
        walk_tree($tree->left, $per_node);
        $per_node->($tree->node);
        walk_tree($tree->right, $per_node);
    }

    sub print_tree {
        my $tree = shift;
        walk_tree($tree, sub { print $_[0]; });
    }

    sub sum_tree {
        my $tree  = shift;
        my $total = 0;
        walk_tree($tree, sub { $total += $_[0]; });
        return $total;
    }
      I'd actually start from the other end, looking for infrequently-occurring-but-non-unique lines. A line which is used in exactly 2 or 3 places is more likely to be part of a block of cut-and-paste than one used in 100.

      That's a great suggestion, I hadn't thought of that yet. Thanks!

      I'm well aware of the various refactoring techniques, and I do use both OO and functional approaches where appropriate. Your example is illustrative nonetheless, although your walk_tree should probably read:

      sub walk_tree {
          my $tree     = shift;
          my $per_node = shift;
          walk_tree($tree->left, $per_node);
          $per_node->($tree->node);
          walk_tree($tree->right, $per_node);
      }
        Thanks for the correction (which I've incorporated into the original node - this reply is partly to record that fact).

        And I'm sorry if I seemed to imply that you might not be aware of such techniques - no offence intended.