Profiling regular expressions

Mur has asked for the wisdom of the Perl Monks concerning the following question:

Re: Profiling regular expressions
by BrowserUk (Patriarch) on Jan 31, 2003 at 01:46 UTC

This seems like an ideal application for Filter::Simple. Here's what I came up with.

The filter package

Some sample output

C:\test>test
abcdefghijklmnopqrstuvwxyz
====================Timing of regexs in C:\test\test.pl====================
2000 trials of pqr (290.000ms total), 145us/trial

5000 trials of (\G(?:.{3})+?)(?<=...)(.)(.) (820.000ms total), 164us/trial

Note: This is tested as far as you see it above. I intend to do much more testing and maybe package it up for CPAN if it proves usable and useful, but that may take some time.

A couple of caveats.

Filter::Simple seems to have trouble with s/// if you use two sets of identical, balanced delimiters, e.g. s[....][...]. If you're using this style you may have to change your source slightly. Other limitations like this are bound to exist.

The filter actually embeds the timer code at the front and back of the regex itself. Even though the code is embedded using zero-width code assertions, it's quite possible, even likely, that their presence may change the meaning of some regexes. I haven't encountered one yet, but it could happen. If the output of your code changes, wrap the suspect line in no My::Filter; use My::Filter;. I haven't had occasion to test this work-around yet.
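To make the "front and back" idea concrete, here is a minimal hand-written sketch of a regex carrying its own timer in zero-width code assertions. It is not the filter's actual output: Time::HiRes stands in for the filter's timer object, and the names $started, $elapsed, $trials and count_pqr are made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Package variables, because (?{ ... }) blocks are safest with globals.
our ($started, $elapsed, $trials) = (0, 0, 0);

# Hypothetical helper: counts matches of "pqr" while timing the regex itself.
# The first code block runs when a match attempt starts, the second only
# when the pattern between them has matched.
sub count_pqr {
    my ($text) = @_;
    my $hits = 0;
    $hits++ while $text =~ /
        (?{ $started = Time::HiRes::time() })
        pqr
        (?{ $elapsed += Time::HiRes::time() - $started; $trials++ })
    /gx;
    return $hits;
}

my $hits = count_pqr('abcpqrxyzpqr');
printf "%d matches, %d timed trials, %.6fs inside the regex\n",
    $hits, $trials, $elapsed;
```

Note that in this sketch only successful matches reach the second block, so failed attempts are not counted as trials.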

It's worth pointing out that the code profiled is the regex itself, not the statement it is a part of, nor even the whole s///: only the left-hand (pattern) side of a substitution is profiled. Hopefully, this is the most useful information anyway. It does seem to count and profile each iteration of regexes that use the /g modifier successfully.

Relating the regex back to the source code is currently a manual effort. Unfortunately, when the code is evaluated, the __LINE__ macro is not set :(. I haven't thought of a work-around for this yet. This has the unfortunate side-effect that lexically identical regexes on different lines of the source get counted and timed as the same thing. This is usually fairly easy to work around: wrapping one of the m[pqr]'s in non-capturing brackets, for instance m[(?:pqr)], will in most if not all cases make no difference to the function of the regex, but allows them to be distinguished in the timings.
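A quick way to convince yourself that the non-capturing wrapper is harmless for a given regex is to compare the matches of both spellings on some sample input. This is only a sketch; same_matches and the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical check: do m[pqr] and m[(?:pqr)] find the same matches?
# In list context with /g and no capture groups, each returns the list
# of matched substrings.
sub same_matches {
    my ($text) = @_;
    my @plain   = $text =~ m[pqr]g;
    my @wrapped = $text =~ m[(?:pqr)]g;
    return "@plain" eq "@wrapped" ? 1 : 0;
}

for my $text ('abcpqrxyzpqr', 'pqpqr', 'nothing here') {
    printf "%-14s %s\n", $text, same_matches($text) ? 'same' : 'DIFFERENT';
}
```

The two spellings have different text, so a report keyed on the regex text will list them separately.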

I haven't tested this with qr[...] style regexes yet.

If anyone has any suggestions for determining the line numbers at which the regexes appear, I'd be pleased to hear of them. Or anything else for that matter.

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

[reply]
[d/l]
[select]

Re: Re: Profiling regular expressions

by Mur (Pilgrim) on Feb 04, 2003 at 20:40 UTC

  $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1<\/para><para>$2/gs;

Jeff Boes

Database Engineer

Nexcerpt, Inc.

vox 269.226.9550 ext 24

fax 269.349.9076

http://www.nexcerpt.com

...Nexcerpt...Connecting People With Expertise

Re: Re: Re: Profiling regular expressions

by BrowserUk (Patriarch) on Feb 05, 2003 at 07:31 UTC

An error msg, and a small sample of test data would have been nice.

The problem appears to be that the filtered regex contains embedded code, so it is eval'd at run time. Your regexes contain embedded variables that require interpolation, and interpolation into regexes with embedded code is prohibited by default, so we need to add

use re 'eval';

to the program under test. I hoped that I could add it to the filter module itself, but that doesn't work. (It's obvious why once you've tried it, but...) Anyway, with that line added to the top of the program under test, the filter seems to work fine again, without modification from the version presented above.
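The effect of the pragma can be seen in isolation with a small sketch. The assumption here is that the code block arrives through an interpolated string, which is the case where perl insists on use re 'eval'; the names $hook, $hits, try_without_pragma and try_with_pragma are invented for illustration and are not part of the filter.

```perl
use strict;
use warnings;

our $hits = 0;
my $hook = '(?{ $main::hits++ })';   # a code block held in a plain string

# No pragma in scope: perl refuses to run interpolated code at match
# time and dies, which the eval {} turns into 'refused'.
sub try_without_pragma {
    my ($text) = @_;
    return eval { $text =~ /${hook}a+/; 1 } ? 'ok' : 'refused';
}

# With the lexically scoped pragma the same pattern is allowed.
sub try_with_pragma {
    my ($text) = @_;
    use re 'eval';
    return eval { $text =~ /${hook}a+/; 1 } ? 'ok' : 'refused';
}

print "without: ", try_without_pragma('aaa'), "\n";
print "with:    ", try_with_pragma('aaa'), "\n";
```

Because the pragma is lexically scoped, it has to appear in the file (or block) where the regex is compiled, which fits the observation that putting it in the filter module does not help.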

A quick test prog

I'd like to suggest using the /x option on your regexes to make them a little more readable, but I tried it and, whilst they still work, it has a significant effect upon the performance, which is presumably what you're trying to improve.

One minor improvement to the readability of the output report can be obtained by changing

$My::Filter::t->start('$_')

to $My::Filter::t->start('$/$_$/')

Make sure you make the same change to the stop() line as well.

I also tried a version of the filter that used a simple numbering scheme for the start/stop labels, which makes the output more readable but makes relating the number in the report back to the individual regex in the code considerably harder. Post a reply if you want a copy of that version.

I still think that if I could find a way of using the __LINE__ macro as the timer label, it would be a better option than the text of the regex itself, but that doesn't work for obvious reasons.

Re^3: Profiling regular expressions

by tall_man (Parson) on Feb 04, 2003 at 22:37 UTC

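package Filt;
use strict;
use warnings;
use Filter::Simple;

# Wrap every source line containing a =~ binding in timer start/stop calls.
# The program under test must declare "our $t" (a Benchmark::Timer object).
FILTER {
   my $new = '';
   foreach my $line (split /\n/, $_) {
      if ($line =~ /=~/) {
         # Escape backslashes and single quotes so the line is a safe label.
         (my $label = $line) =~ s/([\\'])/\\$1/g;
         $new .= "\$t->start('$label');\n";
         $new .= "$line\n";
         $new .= "\$t->stop();\n";
      } else {
         $new .= "$line\n";
      }
   }
   $_ = $new;
};

1;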

use strict;
use Benchmark::Timer;
use Filt;

our $t = Benchmark::Timer->new();
my $short_line_threshold = 2;
my $short_line_counter = 1;
my $long_line_threshold = 7;
my $tmp = "Abcdef";
$tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1<\/para><para>$2/gs;

print $t->report();

Re: Profiling regular expressions
by tall_man (Parson) on Jan 31, 2003 at 00:27 UTC

package Regs;
use strict;
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT = ();
our %EXPORT_TAGS = ( 'all' => [ qw(
 re1
 re2
) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our $VERSION = '0.10';

sub re1 {
  $_[0] =~ s/foo/bar/g;
}

sub re2 {
  $_[0] =~ s/xxx/yyy/g;
}

1;  # a module must return a true value

Text::Template

Then you could use Devel::AutoProfiler on a modified version of your script, like this:

use strict;
use Devel::AutoProfiler;
use Regs qw(:all);

my $a = "fooxxxfoo";
my $b = "foo foo foo";

re1($a);
re2($a);
print "a is $a\n";

re1($b);
re2($b);
print "b is $b\n";
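
If a sub-level profiler isn't to hand, the same "one sub per regex" idea can be approximated by wrapping each sub in a timing closure. This is only a rough sketch using the core Time::HiRes module, not what Devel::AutoProfiler does internally; profiled, %stats and the wrapped subs are names invented for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

our %stats;   # per-name call counts and accumulated seconds

# Hypothetical wrapper: returns a sub that times each call to $code.
# Passing @_ straight through keeps the aliasing, so s/// inside $code
# still modifies the caller's variable.
sub profiled {
    my ($name, $code) = @_;
    return sub {
        my $t0     = [gettimeofday];
        my $result = $code->(@_);
        $stats{$name}{calls}++;
        $stats{$name}{secs} += tv_interval($t0);
        return $result;
    };
}

my $re1 = profiled(re1 => sub { $_[0] =~ s/foo/bar/g });
my $re2 = profiled(re2 => sub { $_[0] =~ s/xxx/yyy/g });

our $text = "fooxxxfoo";
$re1->($text);
$re2->($text);
print "text is $text\n";

printf "%s: %d call(s), %.6fs\n", $_, $stats{$_}{calls}, $stats{$_}{secs}
    for sort keys %stats;
```

The timings include the sub-call overhead, so they are coarser than timing the regex from the inside.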