As a fan of quality-assurance modules like Perl::Critic, I am dreaming of a plugin that
detects cut-and-paste fragments in Perl code and suggests
isolating each duplicate in a separate subroutine/method.
That could be handy when I suddenly have to maintain a large chunk of foreign code. AFAIK Perl::Critic has no such plugin yet.
What would be the best strategy and infrastructure for finding code duplicates?
There should probably be a minimum-length threshold (?) as well as a minimum-frequency threshold (2) for a duplicate to be considered.
How should the duplicates be detected? The plugin should be able to recognize duplicates at 'long' distances, ideally even across files/modules. I am thinking of attaching an MD4 checksum to each cluster of statements for fast recognition.
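A minimal sketch of that idea: slide a window of k consecutive statements over the code, hash each normalized window, and report start positions that share a digest. I use core Digest::MD5 here (Digest::MD4 from CPAN would plug in identically); the subroutine name, line-based "statements", and whitespace normalization are all illustrative — a real policy would window over PPI statements instead of raw lines.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hash every window of $k consecutive normalized statements; windows
# that share a digest are duplicate candidates.
sub duplicate_candidates {
    my ($lines, $k) = @_;
    my %seen;    # digest => [ window start indexes ]
    my @norm = map { my $s = $_; $s =~ s/\s+/ /g; $s =~ s/^ //; $s =~ s/ $//; $s }
               @$lines;
    for my $i ( 0 .. @norm - $k ) {
        my $digest = md5_hex( join "\n", @norm[ $i .. $i + $k - 1 ] );
        push @{ $seen{$digest} }, $i;
    }
    return grep { @$_ > 1 } values %seen;    # keep digests seen at least twice
}
```

This is essentially Rabin-Karp fingerprinting: one pass to hash, one hash lookup per window, so distance between the copies costs nothing.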
Setting the right cluster size might be tricky. Natural places to make the cuts would have minimal 'statefulness', that is, e.g. before scopes are opened and after they are closed. A duplicate's badness could be determined by its frequency and code length.
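One way to turn frequency and length into a badness score, as a sketch: count the statements a refactor would eliminate, (copies - 1) * window length, and apply the minimum-length and minimum-frequency thresholds mentioned above. The subroutine name and data layout are my own invention, not an existing API.

```perl
use strict;
use warnings;

# Rank duplicate groups by badness = (copies - 1) * length, after
# filtering by the minimum-length and minimum-frequency thresholds.
# $groups is a list of position lists, e.g. [ [0, 3], [5, 9, 12] ].
sub rank_duplicates {
    my ( $groups, $k, $min_len, $min_freq ) = @_;
    my @scored = map  { [ ( @$_ - 1 ) * $k, $_ ] }
                 grep { @$_ >= $min_freq && $k >= $min_len } @$groups;
    return sort { $b->[0] <=> $a->[0] } @scored;    # worst offender first
}
```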
The problem then is to find all duplicate subsets in a set. That sounds like autocorrelation, or a diff of the whole against parts of itself. Any suggestions for good algorithms, or even existing modules?
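With the hashing approach, the "duplicate subsets" question reduces to a group-by on digests, and it extends to the cross-file case by keying each window to a "file:line" position. A sketch under the same assumptions as above (line-based statements, invented subroutine name, core Digest::MD5 standing in for MD4):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Cross-file variant: record every window digest with its "file:line"
# position, so matches surface regardless of distance between copies.
sub cross_file_duplicates {
    my ( $files, $k ) = @_;    # $files: { filename => \@lines }
    my %where;                 # digest => [ "file:line", ... ]
    for my $name ( sort keys %$files ) {
        my @lines = @{ $files->{$name} };
        for my $i ( 0 .. @lines - $k ) {
            my $digest = md5_hex( join "\n", @lines[ $i .. $i + $k - 1 ] );
            push @{ $where{$digest} }, "$name:" . ( $i + 1 );
        }
    }
    return grep { @$_ > 1 } values %where;
}
```

Each group that comes back lists every location of one repeated fragment, which is exactly the set a "hoist this into a subroutine" suggestion would need.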
I welcome your feedback, thanks.
Replies are listed 'Best First'.
Re: how to get rid of cut-and-paste sins?
by moritz (Cardinal) on Feb 08, 2008 at 22:28 UTC

Re: how to get rid of cut-and-paste sins?
by hossman (Prior) on Feb 08, 2008 at 23:58 UTC

Re: how to get rid of cut-and-paste sins?
by planetscape (Chancellor) on Mar 22, 2008 at 20:54 UTC

Re: how to get rid of cut-and-paste sins?
by goibhniu (Hermit) on Feb 11, 2008 at 16:48 UTC