
An error message and a small sample of test data would have been nice.

The problem appears to be caused by the fact that, when using the source filter, each regex ends up containing code that is eval'd at run time. As your regexes also contain embedded vars that require interpolation, and run-time interpolation into a regex that contains eval'd code is prohibited by default, we need to add

use re 'eval';

to the program under test. I hoped that I could add it to the filter module itself, but that doesn't work: the pragma is lexically scoped, so it only affects regexes compiled in the file where it appears. (Obvious why once you've tried it, but...) Anyway, with that line added at the top of the program under test, the filter works fine again, without modification from the version presented above.
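To see the scoping in isolation, here is a minimal sketch that has nothing to do with My::Filter itself. The pattern text and the $hits counter are made up for illustration; the only assumption is that the code block reaches the regex through an interpolated string, which is the case Perl refuses by default.

```perl
use strict;
use warnings;

our $hits = 0;

# Hypothetical pattern text: the code block sits inside a plain string,
# so Perl only sees it when the pattern is interpolated at run time.
my $pattern = 'a(?{ $main::hits++ })';

# Outside the pragma's scope the match dies at run time; eval {} traps it.
my $ok_without = eval { my $m = 'a' =~ /$pattern/; 1 } ? 1 : 0;
my $error      = $@;

my $ok_with;
{
    use re 'eval';    # lexically scoped: only matches in this block are affected
    $ok_with = eval { my $m = 'a' =~ /$pattern/; 1 } ? 1 : 0;
}

print "without pragma: ", ( $ok_without ? "ok" : "died" ), "\n";
print "error was: $error" unless $ok_without;
print "with pragma:    ", ( $ok_with ? "ok" : "died" ), "\n";
print "code block ran $hits time(s)\n";
```

The same pattern compiles inside the block and is refused outside it, which is why putting the pragma in a separate module file does nothing for the regexes in your program.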

A quick test prog

    #! perl -slw
    use strict;
    use re 'eval';   #! <<< ADD THIS LINE
    use My::Filter;

    my ($short_line_threshold, $short_line_counter, $long_line_threshold) = (40,2,50);

    my $data = q[
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    ];

    my $tmp;
    for (1..1000) {
        $tmp = $data;
        $tmp =~ s/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/$1<\/para><para>$2/gs;
    }

    print $tmp;
    print '=' x 20, 'Timing of regexs in ', $0, '=' x 20;
    print My::Filter::report();

    __END__
    C:\test>testmyfilter

    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    </para><para><line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    </para><para><line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>
    <line> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</line>

    ====================Timing of regexs in C:\test\testMyFilter.pl====================
    2000 trials of ((?:<line>\s*(?:.{1,$short_line_threshold})</line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)</line>) (460.000ms total), 230us/trial

I'd like to suggest using the /x option on your regexes to make them a little more readable, but I tried it and, whilst they still work, it has a significant effect upon the performance. As performance is presumably what you're trying to improve, that rather rules it out.
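For reference, this is roughly what the /x form of the regex from the test prog could look like. It is only a sketch: the layout and comments are my own choice, the sample data is cut down to three lines, and it runs without the filter, so it says nothing about timing. It only checks that both forms produce the same substitution.

```perl
use strict;
use warnings;

my ($short_line_threshold, $short_line_counter, $long_line_threshold) = (40, 2, 50);

# The pattern from the test program in its compact form.
my $compact = qr/((?:<line>\s*(?:.{1,$short_line_threshold})<\/line>\s*){$short_line_counter,})(<line>\s*(?:.{$long_line_threshold,}?)<\/line>)/s;

# The same pattern under /x. Whitespace and comments are ignored, but each
# quantifier is kept unbroken and attached to the thing it quantifies.
my $spaced = qr{
    (       # capture 1: a run of short lines
        (?: <line> \s* (?: .{1,$short_line_threshold} ) </line> \s* ){$short_line_counter,}
    )
    (       # capture 2: the long line that follows
        <line> \s* (?: .{$long_line_threshold,}? ) </line>
    )
}xs;

# Cut-down sample data (hypothetical): two short lines, then one long line.
my $short = '<line> ' . ( 'x' x 33 ) . "</line>\n";
my $long  = '<line> ' . ( 'x' x 70 ) . "</line>\n";
my $data  = $short . $short . $long;

( my $out_compact = $data ) =~ s/$compact/$1<\/para><para>$2/g;
( my $out_spaced  = $data ) =~ s/$spaced/$1<\/para><para>$2/g;

print $out_spaced;
print "same result\n" if $out_compact eq $out_spaced;
```

One thing to watch: comments inside an interpolating /x regex are still subject to interpolation, so keep $ and @ out of them.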

One minor improvement to the readability of the output report can be obtained by changing

$My::Filter::t->start('$_')

to $My::Filter::t->start('$/$_$/')

Make sure you make the same change to the stop() line as well.

I also tried a version of the filter that used a simple numbering scheme for the start/stop labels. That makes the output more readable, but makes relating the number in the report back to the individual regex in the code considerably harder. Post a reply if you want a copy of that version.

I still think that if I could find a way of using the __LINE__ macro as the timer label, it would be a better option than the text of the regex itself, but that doesn't work for obvious reasons.


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.