Mur has asked for the wisdom of the Perl Monks concerning the following question:

Wondering if anyone has a bit of magic, or knows of a CPAN module, to do something like this: A module does a bunch of regular-expression evaluations to alter a string. They are scattered around in the code. I'd really like a way to profile them, so that from code that looks like this:
... $a =~ s/foo/bar/g; ... $b =~ s/xxx/yyy/g; ...
might result in a report that looks like this:
Regex Count Time s/foo/bar/g 2,704 4.92 sec s/xxx/yyy/g 1,192 2.15 sec
I could possible wring this out of Devel::DProf, but that adds a lot of overhead.
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Replies are listed 'Best First'.
Re: Profiling regular expressions
by BrowserUk (Patriarch) on Jan 31, 2003 at 01:46 UTC

    This seems like an ideal application for Filter::Simple. Here's what I came up with.

    The filter package

    This requires that you add one line at the top of your program, and one at the bottom/end to generate the report. It also requires the installation of Filter::Simple which has a dependancy on Text::Balanced, but they are both pure-Perl modules and install easily. It also uses benchmark::Timer, but I think this is a component of most standard distributions.

    The test program

    Some sample output

    C:\test>test abcdefghijklmnopqrstuvwxyz ====================Timing of regexs in C:\test\test.pl=============== +===== 2000 trials of pqr (290.000ms total), 145us/trial 5000 trials of (\G(?:.{3})+?)(?<=...)(.)(.) (820.000ms total), 164us/t +rial

    Note: This is tested as far as you see it above. I intend to do much more testing and maybe package it up for CPAN if it proves usable and useful, but that may take some time.

    A couple of caveats.

    Filter::Simple seems to have trouble with s/// if you use two set of identical, balanced delimiters eg.  s[....][...]. If your using this style you may have to change your source slightly. Other limitations like this are bound to exists.

    The filter actually embeds the timer code at the front and back of the regex itself. Even though the code is embedded using zero-width code assertions, it's quite possible--even likely--that their presence may change the meaning of some regexes. I haven't encountered one yet, but it could. If the output of your code changes, wrap the line suspected in no My::Filter; use My::Filter;. I haven't had occasion to tested this work-around yet.

    It's worth pointing out that the code profiled is the regex itself. Not the statement it is a part of, nor even the whole s///. Only the left-hand side of these statements will be profiled. Hopefully, this is the most useful information anyway. It does seem to count and profile each iteration of those regexes using the /g modifier successfully.

    Relating the regex back to the source code is currently a manual effort. Unfortunately, when the code is evaluated, the __LINE__ macro is not set:(. I haven't thought of a work-around for this yet. This has the unfortunate side-effect that Lexically identical regexes in different lines of the source get counted and timed as the same thing. This is usually fairly easy to work around, wrapping one of the m[pqr]'s in non-capturing brackets for instance m[(?:pqr)] will in most if not all cases, make no difference to the function of the regex, but allow them to be distinguished in the timings.

    I haven't tested this with qr[...] style regexes yet.

    If anyone has any suggestions for determining the line numbers at which the regexes appear, I'd be pleased to hear of them. Or anything else for that matter.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Well, I finally got around to trying this. Unfortunately, it coughs and dies on things like this:
      $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$s +hort_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/ +$1<\/para><para>$2/gs;
      And of course I've no idea why ... Apparently the resulting source is syntactically incorrect.
      --
      Jeff Boes
      Database Engineer
      Nexcerpt, Inc.
      vox 269.226.9550 ext 24
      fax 269.349.9076
       http://www.nexcerpt.com
      ...Nexcerpt...Connecting People With Expertise

        An error msg, and a small sample of test data would have been nice.

        The problem appears to be caused by the fact that when using source filters, the regex is eval'd. As your regexes contain embedded vars that require interpolation, and interpolation in eval'd regexes is prohibited by default, we need to add

        use re 'eval';

        to the program under test. I hoped that I could add it to the filter module itself, but that doesn't work. (Obvious why once you tried it but...). Anyway, adding that line to the top of the program under test and the filter seems to work fine again without modification from the version presented above.

        A quick test prog

        I'd like to suggest using the /x option on your regexes to make them a little more readable, but I tried it and whilst they still work, it has a significant effect upon the performance. Which as that's presumably what your trying to improve.

        One minor improvement to the readablility of the output report can be obtained by changing

        $My::Filter::t->start('$_')

        to $My::Filter::t->start('$/$_$/')

        Make sure your make the same change to the stop() line as well.

        I also tried a version of the filter that used a simple numbering scheme for the start/stop labels which makes the output more readable, but makes relating the number in the report back to the individual regex in the code considerably harder. Post a reply if you want a copy of that version

        I still think that if I could find a way of using the __LINE__ macro as the timer label, it would be better option than the text of the regex itself, but that doesn't work for obvious reasons.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        You could try a less ambitious filter that timed lines containing an "=~", something like this:
        package Filt; use Filter::Simple; FILTER { my @set = split /\n/,$_; my $new = ''; foreach my $line (@set) { if ($line =~ /\=\~/) { $new .= "\$t->start('$line');\n"; $new .= $line . "\n"; $new .= "\$t->stop();\n"; } else { $new .= $line . "\n"; } } $_ = $new; }; 1 ;
        I tried it on the following simple case, and it worked.
        use strict; use Benchmark::Timer; use Filt; our $t = Benchmark::Timer->new(); my $short_line_threshold = 2; my $short_line_counter = 1; my $long_line_threshold = 7; my $tmp = "Abcdef"; $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$sho +rt_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1 +<\/para><para>$2/gs; print $t->report();
Re: Profiling regular expressions
by tall_man (Parson) on Jan 31, 2003 at 00:27 UTC
    Here's my suggestion: wrap the expressions in subroutines, like this:
    package Regs; use strict; require Exporter; our @ISA = qw(Exporter); our @EXPORT = (); our %EXPORT_TAGS = ( 'all' => [ qw( re1 re2 ) ] ); our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } ); our $VERSION = '0.10'; sub re1 { $_[0] =~ s/foo/bar/g; } sub re2 { $_[0] =~ s/xxx/yyy/g; }
    It's a bit ugly, but a Text::Template could generate it for you from a list of expressions extracted from your code.

    Then you could use Devel::AutoProfiler on a modified version of your script, like this:

    use strict; use Devel::AutoProfiler; use Regs qw(:all); my $a = "fooxxxfoo"; my $b = "foo foo foo"; re1($a); re2($a); print "a is $a\n"; re1($b); re2($b); print "b is $b\n";
    Devel::Autoprofiler will automatically generate profile wrappers (you should make sure Regs is the only module after it). Just run the program and it will generate a usage and timing report at the end of the output.