grizzley has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!

Short introduction: at work I have to analyse a project so filled with macros (over 13,000) that no one can read it anymore, so I decided to write a simple Perl script that takes some C++ code and does a not-so-complicated find/replace of all the macros. It works pretty well, but slowly: about 30 seconds.

I got familiar with Devel::NYTProf and it reported that the function initDefinesList() is the bottleneck. I isolated it into the script below and started further checks. I copied the whole function body into the script body and it took only 1s to run; copied it back into the function, 30s. NYTProf showed that the push @$rresult_list, ... line was consuming 29 of the total 30s, so I commented it out. One more run, and surprisingly the my $balloon = genDivBalloon(...) line was indicated as guilty. So I removed everything inside the while($str =~ /.../) loop, and according to the next report the regexp itself now consumes all the time. I wrote a simple one-liner with $str filled in and the while loop, and it took less than a second... Another one-liner producing a huge hash of hashes ran in milliseconds. So every piece of code runs in the blink of an eye on its own; together they form a tortoise.

In the __DATA__ section I added a few example defines just to show you what they look like.

Does anyone have an idea what could be wrong, and where?

#!perl
use 5.10.0;
use re 'eval';
use strict;
use warnings;
use HTML::Entities qw/encode_entities decode_entities/;
use DefineAnalyser;

our $parens;
$parens = qr{
    (?:
        (?: [^()](?!<_MY_STRING_TO_REPLACE>) )+
        |
        \( (??{$parens}) \)
    )*
}x;

sub initDefinesList
{
    my %args = @_;
    my $filename = $args{-filename};
    my $rresult_list;

#    if(!open FH, $filename)
#    {
#        print STDERR "Couldn't open define base($filename): ", $!;
#        return ();
#    }
#
#    my $str = join "", <FH>;
#    close FH;
    my $str = "#define abc(def) /*sourcepath: abc.h*/\\\n def*2+5\n" x 10000;

    my $begintime = time;
    print STDERR "begin:", $begintime, "\n";
    while($str =~ /\#define \s+ (?<n>\w+) (?<p>\([^)]*\))? (?<c>(.*\\\n)*.*\n)/gx)
    {
        my ($name, $params, $content) = ($+{n}, $+{p}, $+{c});
        if(!defined($params)) { $params = '' }
        $content =~ s/\\\n/\n/g;
        $content =~ s/^\s+//;
        $content =~ s/\s+$//;
        $content =~ s!(/\*sourcepath: .*?\*/)!!;
        my $source = $1;
        my @params = $params =~ /\w+/g;

        my $macro_def = $name;
        my $pattern   = "\\b$name\\b";
        if(@params > 0)
        {
            $macro_def .= '(' . join(', ', @params) . ')';
            $pattern   .= qr{\((?<p>$parens)\)};
        }
        else
        {
            $pattern .= qr{(\(\s*\))?};
        }

        my $balloon = genDivBalloon(
            -divid           => $name,
            -sourcefilename  => $source,
            -macrodefinition => $macro_def,
            -balloontext     => $content
        );

        push @$rresult_list, {
            source    => $source,
            name      => $name,
            params    => [@params],
            content   => $content,
            macro_def => $macro_def,
            pattern   => $pattern,
            balloon   => $balloon
        };
    }
    my $endtime = time;
    print STDERR "end:", $endtime, "\n";
    print STDERR "total time: ", $endtime - $begintime, "\n";

    return $rresult_list
}

sub genDivBalloon
{
    my %args = @_;
    my ($divid, $sourcefilename, $macro_def, $balloontext) =
        @args{-divid, -sourcefilename, -macrodefinition, -balloontext};
    if(!defined $divid)          { say STDERR "divid not defined" }
    if(!defined $sourcefilename) { say STDERR "$divid: sourcefilename not defined" }
    if(!defined $macro_def)      { say STDERR "$divid: macro_def not defined" }
    if(!defined $balloontext)    { say STDERR "$divid: balloontext not defined" }

    return "<div id=\"$divid\" class=\"balloonstyle\">"
        . encode_entities($sourcefilename)
        . "<BR />#define <b>" . $macro_def . "</b> <pre>"
        . encode_entities($balloontext)
        . "</pre></div>";
}

initDefinesList(-filename => 'all_defines_tidy.txt');

__DATA__
#define _APS_NEXT_COMMAND_VALUE    32768 /*sourcepath: ./src/Resource.h*/
#define _APS_NEXT_CONTROL_VALUE    201 /*sourcepath: ./src/Resource.h*/
#define COUNT_STR_LEN_FROM_BSRELAYDATA(start, count) /*sourcepath: ./src/ImplXX.cpp*/\
    count = 0; \
    for (i = start; i < bsRelayData.data.length && \
                    i < bsRelayData.data.MAX_LENGTH_C && \
                    bsRelayData.data.user_data[i] != 0; i++) \
    { \
        count++; \
    } \
    dwCountStart += count+1;

Replies are listed 'Best First'.
Re: define analyser - performance problem
by jethro (Monsignor) on Jun 29, 2009 at 11:24 UTC

    gcc (which you could use even if the C++ source was written for a different compiler) called with -E will preprocess all your macros and write the resulting source code to standard output. And maybe the compiler you use has a similar parameter.
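    A minimal sketch of that suggestion, assuming gcc is installed; the file names and the SQUARE macro are made up for the demonstration, and -P additionally strips the #line markers so only code remains:

    ```shell
    # Throwaway header with a sample macro
    cat > defs.h <<'EOF'
    #define SQUARE(x) ((x)*(x))
    EOF

    # Throwaway source file that uses it
    cat > demo.c <<'EOF'
    #include "defs.h"
    int area = SQUARE(5);
    EOF

    # -E stops after preprocessing; every macro is expanded in the output
    gcc -E -P demo.c
    ```

    For the 13,000-macro project this turns the whole find/replace job into a single compiler invocation per translation unit.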

    UPDATE: About your profiling problem: you might try inserting a print "$name\n" statement into the loop and watching the output to see whether the time is wasted on every iteration or whether specific macros waste a lot of it. My (wild, unsubstantiated) guess would be the loop regex, because the profiler might have problems pinpointing time wasted in loop expressions. Also, regexes can waste a lot of time if they need to backtrack a lot. I can see that your loop regex has to backtrack when it gets to the last line of a macro, but that shouldn't be enough for the delay you are seeing.
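    A stand-alone sketch of that probe, running the OP's loop regex over a tiny made-up string; in the real script, a long pause after one particular name would finger that macro:

    ```shell
    perl -e '
        # Two sample defines, one with parameters and one without
        my $str = "#define abc(def) def*2+5\n#define xyz 42\n";
        # Same pattern as in initDefinesList(), printing each name as it is parsed
        while ($str =~ /\#define \s+ (?<n>\w+) (?<p>\([^)]*\))? (?<c>(.*\\\n)*.*\n)/gx) {
            print "$+{n}\n";
        }
    '
    ```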

      I like the idea of using the compiler itself! I use this -E option when working with macros and gcc. I think this is a standard feature of most C compilers (the option may of course be called something else on yours).

      Another thought... it sounds like you have a maintenance problem on your hands. I'm not sure what kind of application could generate that many macros!!! But anyway, you might consider converting a bunch of these things into "inline" functions. That yields the performance of a macro but with type checking etc. on the args. Lots of compilers (including gcc) support this. I presume you were driven to writing this code because it's hard to figure out what these macros are really doing! Maybe a bigger project that "fixes" the source code is in order? I have worked on projects before where Perl actually writes code and .h files as part of a pre-compile step - a weird idea, but in the right app it can work. 13,000 macros is a mind-boggling number - even on a big ASM project!
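      A sketch of that conversion, with a hypothetical SQUARE_MACRO standing in for one of the legacy macros:

      ```c
      #include <stdio.h>

      /* Legacy style: no type checking, and the argument
       * text is pasted in (and here evaluated) twice. */
      #define SQUARE_MACRO(x) ((x) * (x))

      /* Same thing as a static inline function: the compiler can still
       * expand it in place, but the argument is now typed and evaluated
       * exactly once. */
      static inline int square(int x)
      {
          return x * x;
      }

      int main(void)
      {
          printf("%d %d\n", SQUARE_MACRO(6), square(6));
          return 0;
      }
      ```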

      I have found that using some of the regex special variables like $+ can slow things down a lot - some of them introduce extra overhead on every match - sorry, I can't find the code of the last case where I ran into this.

      The most likely suspect is the regex-related code. I would use a subset of the test data to find out: a) whether execution time scales linearly or exponentially; b) whether, by hacking around, you can find some types of macros that take WAY longer than others. Sorry that I can't be of more help right now.
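      A rough way to run that scaling check from the shell; the synthetic define list and the counting one-liner below are stand-ins for the real input file and script. If the time roughly doubles when the input doubles, scaling is linear; if it explodes, something is backtracking:

      ```shell
      # Generate a synthetic define list
      perl -e 'print "#define abc$_(def) def*2+5\n" for 1..8000' > all_defines.txt

      # Time a stand-in workload on growing prefixes of it
      for n in 2000 4000 8000; do
          head -n "$n" all_defines.txt > subset.txt
          echo "== $n lines =="
          time perl -ne '$c++ if /^#define\s+(\w+)/; END { print "$c defines\n" }' subset.txt
      done
      ```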

        Yeah, the code is a nightmare. It was a Win32 COM application, and at some point people decided to port it to Linux. Instead of rewriting the code, they wrote some classes and macros to imitate COM behaviour (hiding Unix functions inside). It is now a real mess, and the company I work for is taking it over subsystem by subsystem and trying to clean it up...

      OMG! That option is really something! It makes my script useless... I wish I could ++ more than once.

      Regarding the regexp: I suspected it too. I tried //o, but the time was the same. Then I tried changing it to /#define\s+(\w+)(.)(.)/ but the results were still the same. I don't really know why that is. Maybe some buffering of the input data? No idea.