Finding duplicated code in Perl

by shushu (Scribe)
on Nov 26, 2003 at 12:49 UTC ( [id://310219] : perlquestion )

shushu has asked for the wisdom of the Perl Monks concerning the following question:

Hallo monks,
I am the new owner of a huge project that currently contains 2.5M lines of code.
The first version was written in an internal language we developed, which was VERY simple: no loops, no subroutines, no modules, no conditions. When we moved to the new version we created a converter, which converted the code to Perl.
The conversion was simple, so I had 3000 flat scripts, and today I still have 3000 flat scripts, just in Perl. My new task is to take those scripts, find code duplications, and produce modules and subroutines.
I am looking for a tool that will find the duplications for me.

Looking in the net I found several tools:
and some commercial ones.

The problem is - they do not support Perl.
Before I run and create a support for Perl in those projects (as suggested in, I would like to know whether there is another solution, or another tool that already has Perl support.


Replies are listed 'Best First'.
Re: Finding duplicated code in Perl
by dragonchild (Archbishop) on Nov 26, 2003 at 13:08 UTC
    As someone who's taken over a similar project in the past, I'm going to give you a piece of advice you're not going to want to hear. Do it by hand. Do not use a tool to do your work.
    • Tools don't understand architecture
    • Tools cannot give you proper naming
    • Tools cannot tell you how to optimize code
    • Tools cannot give you new features by noticing commonalities between dissimilar areas
    • Tools cannot comment confusing code
    • Tools cannot create the third and fourth levels of modules/objects - the basic infrastructure that the code users run depends on, but that users never see

    As if that isn't enough, you're probably going to also need a whole bunch of documentation. I'll bet you don't have most of the following:

    • Project description
    • Architecture document
    • Test plan / Test suite
    • Use cases (or other user-design tools)
    • Design documents (both high- and low-level)

    Those documents are at least as important as the code, because they tell you what the code is supposed to do. The code just tells you what it currently does. Are you sure that what it does right now is correct? How much are you willing to bet?

    Furthermore, most tools can't cope with many of the features that are the reason to use Perl in the first place. For example, I doubt a tool could reasonably handle

    • Array operations
    • Hashes in their non-trivial uses, especially as a way of passing named parameters
    • References (especially scalar, glob, and sub references)
    • Multiple return values from a subroutine
    • Context-aware subroutines
    • Complex data structures, especially dispatch tables and the like
    • Objects, in all their glory
    • The use of die, eval, and sub
    • Correct scoping of variables

    I hope you really choose to do this by hand. It will take about 3-6 man-months to do it. (I'm not kidding - it won't take that long.)

    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

      dragonchild, if I could ++ this node any more I would. So, consider this a virtual ++ x 100. :-)

      Although, you lose a few points because your signature is so long...*hint* *hint*

Re: Finding duplicated code in Perl
by l3nz (Friar) on Nov 26, 2003 at 13:56 UTC
    As far as I know, there is no way to reliably automate this process. My experience with automatic code reformatters and analyzers is - to date - rather disappointing. But maybe I'm wrong.

    You could think about building up a generalized library for your area of interest and then rebuilding each script by hand so that it consists mostly of a few library calls.

    On the other hand, it's probably way too messy to test and document all the existing code; and even if you can't know for sure, you can reasonably assume that most of what the existing scripts do is correct. Therefore I'd:

    • Write the new scripts as simply and fast as possible
    • Run them on the very same production cases old scripts are run upon
    • Find an automated way to keep track of script executions: the number of times the scripts executed correctly, and the number of times the behaviour was different, along with the inputs and outputs of those runs
    This way you'll:
    • save a lot of time testing
    • spot a lot of errors in both old and new scripts, not only in coding but more interestingly in the functional behaviour
    • create test cases (better: automatic test cases) for existing stuff, so you'll know if something modified here breaks something there
    At the point you are now, it's the right time to implement something like this. It won't cost so much now but will be very useful in the future.
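    The bookkeeping described above could be sketched roughly as follows. This is only an illustration: the function name `compare_runs` and the report layout are made up for the example, and the old and new implementations are passed in as code references (in practice each might wrap something like `qx{...}` around a script).

```perl
use strict;
use warnings;

# Hypothetical harness: feed each input to the old and the new implementation
# and record where their outputs differ. An undefined result on either side
# is treated as a difference.
sub compare_runs {
    my ($old, $new, @inputs) = @_;
    my %report = (same => 0, different => []);
    for my $input (@inputs) {
        my $expected = $old->($input);
        my $got      = $new->($input);
        if (defined $expected && defined $got && $expected eq $got) {
            $report{same}++;
        }
        else {
            # Keep the input and both outputs so the case can become a test.
            push @{ $report{different} },
                { input => $input, old => $expected, new => $got };
        }
    }
    return \%report;
}
```

    Every entry collected under `different` is a candidate test case: either the old script or the new one is wrong for that input.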
Re: Finding duplicated code in Perl
by talexb (Chancellor) on Nov 26, 2003 at 13:11 UTC

    Wow. Neat question.

    Without being able to look at a few of the scripts, it's really hard to tell. My guess is that it's probably going to be easier to eyeball the code and say, "Right, here's another x => y conversion". Following from that idea, you'd probably want to make sure that the Perl has pretty much the same standard formatting .. so perltidy would be a handy tool to use.

    Good luck.

    --t. alex
    Life is short: get busy!
      Regarding perltidy - thanks, we already use it on our code, and it is very useful.

      I don't see a way to attach files (or maybe I am wrong...), and I don't want to <code> them since they are huge.
      Any idea where to put them ? (I don't have an external web server available)
Re: Finding duplicated code in Perl
by adrianh (Chancellor) on Nov 26, 2003 at 13:11 UTC

    You might want to take a look at shred.

    I've a recollection that I read somebody was working on a Perl API/version - but for the life of me I can't remember where.

Re: Finding duplicated code in Perl
by Roger (Parson) on Nov 26, 2003 at 13:18 UTC
    Before going any further, can you post some sample code of your 'flat' scripts please. Just post one or two in the original language, along with the converted perl scripts. So we can get a feel of the complexity of the scripts.

    My gut feeling is that it might be easier to eliminate duplicates from your original scripts, which you have described as VERY simple. So the idea is to eliminate duplicates from your original scripts before converting them to Perl.

    Otherwise we could have a look at the complexity of the Perl code generated, and we could come up with a generic enough solution to your problem.

Re: Finding duplicated code in Perl
by diotalevi (Canon) on Nov 26, 2003 at 14:08 UTC

    My first thought is to run your code through B::Concise, perhaps use Idealized optrees from B::Concise to simplify the output and then look for subtrees that are equal. From there you are free to do something useful with the filename and line number hints scattered at every ';' node. In fact, you could automate this by writing a script to run B::Concise, find equal trees and then annotate the original code with comments that match.

      True, but I understand it means the answer to "do you know any existing tool" is "no"..
      From I got this:
      "CPD could be adapted to work with C, C++, PHP, Ruby, Perl, or any other language for which a tokenizer exists. There could be a runtime toggle to select which language to parse."

      If I can fit into the existing interface of CPD, I won't need to develop a tool of my own.

      I believe what they mean by tokenizer could be some kind of modified B::* module.
      Am I on the right track ?
        They mean something that can parse the language. You could adapt B::Deparse to output something that produces the output requested by CPD. The thing is though... the work I suggested in the previous node is very do-able. That's a relatively short script. I have no need for this and so won't write it but I could see this being a relatively minor thing.
        FWIW, PPI::Tokenizer?

        It's one part of the PPI project that's stable now.

        Or by tokenizer do they mean, "tokenizer written in our language to our interfaces"
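        To make the tokenizer idea concrete, here is a very rough sketch of why token-level comparison helps: once variable names and literals are replaced by placeholders, two statements that differ only in naming come out identical. The regex below is a crude stand-in written for this example, not PPI::Tokenizer or CPD's interface, and the function name `normalize_tokens` is made up.

```perl
use strict;
use warnings;

# Crude stand-in for a real tokenizer: it ignores heredocs, regex literals,
# POD and many other Perl constructs. Variables, strings and numbers are
# replaced by the placeholders VAR, STR and NUM; comments are dropped.
sub normalize_tokens {
    my ($code) = @_;
    my @tokens;
    while ($code =~ /\G\s*(?:
          (\#[^\n]*)                               # 1: comment
        | ([\$\@%]\w+)                             # 2: variable
        | ("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')    # 3: quoted string
        | (\d+(?:\.\d+)?)                          # 4: number
        | (\w+)                                    # 5: bareword or keyword
        | (\S)                                     # 6: any other character
    )/gcx) {
        next if defined $1;
        push @tokens,
              defined $2 ? 'VAR'
            : defined $3 ? 'STR'
            : defined $4 ? 'NUM'
            : defined $5 ? $5
            :              $6;
    }
    return join ' ', @tokens;
}
```

        Two lines such as `my $count = $machine->command("ls -l");` and `my $n=$box->command('dir');` normalize to the same token string, so a duplicate finder working on that string would treat them as the same code.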
Re: Finding duplicated code in Perl
by podian (Scribe) on Nov 26, 2003 at 15:47 UTC
    I am just curious. What kind of useful application can you write without loops and conditions? Not a single loop or condition in 2.5M lines of code?
    Maybe it is using some other packages like Expect or something?
    update: I am still waiting for a reply. But here are some more questions about this new language:

    a) can this language solve any real life applications?

    1) given two numbers, find the max

    2) sort a list of numbers

    Just curious mind wants an answer!

Re: Finding duplicated code in Perl
by TomDLux (Vicar) on Nov 27, 2003 at 01:01 UTC

    The code uses system to sleep 2 or 5 seconds. Is it waiting for something to happen? Giving some code a chance to complete?

    It uses a shell method to copy files and perform other system commands. Are these complicated? on a remote platform? Or simply not built-ins in the original language?

    I had a similar problem last winter, miles and miles of bad linear code, though in my case the problem was the original coders, rather than the original language. I keep thinking about writing an article like the ones MJD has, how 3000 lines of bad code can collapse into 137 lines of good code.

    Begin by reading a few files, getting the feel for what they do. Take one file, edit to reduce repetition, using conditions, loops, etc. Rename poorly named variables, to increase clarity. Refactor code into routines: Top level code should consist of built-in commands, object construction, and subroutine calls. Subroutines that actually do the work should be short, a dozen lines, two dozen at the most, and should do one thing. Those routines are used in more complicated routines. This way, it will be easier to share subroutines between files. Don't worry about making one file perfect. Just simplify it and go on to the next; you'll be back to share code, anyway.
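    As an illustration of the kind of collapse described above, here is a minimal sketch. The flat "before" code in the comment, the paths, and the helper name `copy_all` are invented for the example and are not taken from the original scripts.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Before (flat, generated style, shown only as a comment):
#   system("cp /tmp/in/a.log /tmp/out/a.log"); system("sleep 2");
#   system("cp /tmp/in/b.log /tmp/out/b.log"); system("sleep 2");
#   system("cp /tmp/in/c.log /tmp/out/c.log"); system("sleep 2");

# After: one short subroutine that does one thing, driven by a loop.
# Returns the list of file names that could not be copied.
sub copy_all {
    my ($from_dir, $to_dir, $pause, @names) = @_;
    my @failed;
    for my $name (@names) {
        copy("$from_dir/$name", "$to_dir/$name") or push @failed, $name;
        sleep $pause if $pause;    # built-in sleep replaces system("sleep N")
    }
    return @failed;
}
```

    The top-level code then shrinks to a single call such as `copy_all('/tmp/in', '/tmp/out', 2, qw(a.log b.log c.log))`, and the same routine can be shared by every script that repeats the pattern.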


      Hi all,
      Thanks for the interest in the problem and in the language.

      First, regarding the CPD tokenizer requirement - looking inside the CPD sources I see they need the tokenized data in a Java structure. Their own PHP and C++ tokenizers are written in Java, but that does not mean we cannot execute a Perl tokenizer and have them import the data (though it will take time to fit the information, I guess).

      Meanwhile I executed CPD over some C code and got very nice results. When I renamed my Perl files to .java I got some limited answers as well, so I might just use what I got. It depends on whether I get the resources (time and people) to work on it or not.

      Regarding the language, and the primary language -
    • It is called QTL, which stands for QA Testing Language
    • Using this language we were able to execute distributed tests over any number of machines in a multi platform environment. All, BTW, written in Perl.
    • The new version, already working for a year, reduced the number of special QTL commands, and gave the developer (almost) full Perl capabilities.
    • Two main objects are used in QTL/Perl - a machine, and a label
    • $machine->command() will execute the command on the remote machine, and will return a $label.
    • $label->attribute (such as result, state, exitcode) will check the executed command and give the developer updated status all of the time
    • The converter from the old language to the new syntax was rather simple, and was executed a year ago. This means we cannot go back to the old sources, since we had many changes already.

    • I cannot give more information without management approval (which I won't get, I believe) - this is not an open source project. On the other hand, I plan to give a lecture at YAPC::Israel::2004, which will take place in February 2004, so you are all welcome.

      Back to our business - the duplication detector - although it is very interesting stuff, unless I come up with a fully automatic tool that finds duplications over several files and gives the user proper information on what to do with them, I believe I will need to take some of your advice and work on it manually.

      Question is - what do you think will take longer, and what do you think will be more reliable ?

Re: Finding duplicated code in Perl
by warthurton (Sexton) on Nov 26, 2003 at 21:46 UTC
    I'm really interested in the primary language for this project. Are there any more details you can send us? I can't even imagine such a beast.

    Have you looked at the original language -> Perl conversion to see if you can trap the duplication there?

Re: Finding duplicated code in Perl
by planetscape (Chancellor) on Mar 22, 2008 at 20:56 UTC
Re: Finding duplicated code in Perl
by toma (Vicar) on Nov 27, 2003 at 04:28 UTC
    my %code;
    my @lines;
    while (<>) {
        push @lines, $_;
        $code{$_}++;
    }
    for (@lines) {
        print "$code{$_}:\t $_";
    }
    It should work perfectly the first time! - toma
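    The snippet above counts identical single lines. A minimal sketch that extends the same idea to runs of consecutive lines across several files follows; it assumes that whitespace-normalized text is a good enough comparison key, and the function name `find_duplicate_blocks` and the window size are illustrative, not taken from any existing tool.

```perl
use strict;
use warnings;

# Hypothetical helper: report every run of $window consecutive normalized
# lines that occurs in more than one place. %$sources maps a file name to a
# reference to that file's array of lines.
sub find_duplicate_blocks {
    my ($sources, $window) = @_;
    my %seen;    # normalized block text => list of "file:line" locations
    for my $file (sort keys %$sources) {
        my @lines;
        my $n = 0;
        for my $raw (@{ $sources->{$file} }) {
            $n++;
            # Trim, squeeze whitespace, skip blank and comment-only lines.
            (my $text = $raw) =~ s/^\s+|\s+$//g;
            $text =~ s/\s+/ /g;
            next if $text eq '' or $text =~ /^#/;
            push @lines, [ $n, $text ];
        }
        for my $i (0 .. $#lines - $window + 1) {
            my $key = join "\n",
                map { $_->[1] } @lines[ $i .. $i + $window - 1 ];
            push @{ $seen{$key} }, "$file:$lines[$i][0]";
        }
    }
    # Keep only blocks seen at two or more locations.
    return { map { $_ => $seen{$_} } grep { @{ $seen{$_} } > 1 } keys %seen };
}
```

    Each key of the returned hash is a duplicated block and each value lists where it starts, which gives a reading list for the manual refactoring rather than an automatic rewrite. Memory use grows with the total number of lines, so for 2.5M lines it would probably need to be run on subsets.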