mobiusinversion has asked for the wisdom of the Perl Monks concerning the following question:

Brethren,

In a recent post it was asked how to split a string against an escaped delimiter.

The subject of unsupported variable length lookbehind was broached, and then solved in a novel way, using variable length lookahead and reverse. My own solution accomplished the task without lookaround assertions at all, but rather uses a straightforward application of unrolling the loop.

So I decided to Benchmark my solution against the clever one, with the usual module, and reported the results here. (I was sorta hoping that mine was faster since contemplating a daily habit of regex matching on reversed strings made me feel sea-sick).

The results were interesting! It turns out, at least with regard to this problem, that variable length lookahead on the reverse string is about 10% faster than unrolling the loop, despite the number of calls made to reverse.

Why is this? Does this have something to do with the internal optimizers? Does the Regex engine squeeze that much more juice out of variable width lookahead than alternation? Is it a matter of regex compilation? Is this question misguided by something else altogether?
  • Comment on Seeking Audience with the Perl Regex Droids...

Replies are listed 'Best First'.
Re: Seeking Audience with the Perl Regex Droids...
by BrowserUk (Patriarch) on Mar 29, 2008 at 01:29 UTC

    Firstly, you omitted the other contender:

    ... sub lookbehind { my $x = shift; split '(?:(?<=[^#]####)|(?<=[^#]##)|(?<!#))[@]', $x; } C:\test>junk8 Subroutine Benchmark::mytime redefined at c:/Perl/lib/Benchmark.pm lin +e 459. Benchmark: timing 10000 iterations of lookbehind, rollahead, unroll ... lookbehind: 2.34183 wallclock secs ( 2.19 usr + 0.00 sys = 2.19 CPU) + @ 4570.38/s (n=10000) rollahead: 3.46343 wallclock secs ( 3.23 usr + 0.00 sys = 3.23 CPU) + @ 3092.15/s (n=10000) unroll: 3.51775 wallclock secs ( 3.33 usr + 0.02 sys = 3.34 CPU) + @ 2990.43/s (n=10000) Rate unroll rollahead lookbehind unroll 2990/s -- -3% -35% rollahead 3092/s 3% -- -32% lookbehind 4570/s 53% 48% --

    Secondly, you didn't check your results. unroll and rollahead produce different results:

    rollahead => @# #@## ##@ ##@## #@# # @# #@## ##@@# #@## ##@ +@# #@ ... unroll => @# #@## ##@ ##@## #@# @# #@## ##@@# #@## ##@@ +# #@# ...

    And both sets are wrong! Any output that has an even number of #s preceding a @ (as with the 3rd, 4th etc. outputs above) is wrong. If you look back at the op, you'll see that any @s that persist into the results set are preceded by an odd number of ##s.

    The results set from the fastest algorithm:

    lookbehind => #@## ###@#### #####@ #####@#### ###@## # #@## + ###@#### ...

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Your solution is actually incorrect on parsing any string which has a consecutive stretch of pound signs (ie '#') that is a multiple of 4 and greater than or equal to 8.

      Try this:
      my $x = '@########@';
      it should parse as:
      '', '####', '',
      Your solution (as you've hard coded it for one example, and admittedly so in your post), returns:
      '', '###@'
      which would have been delimited as:
      '@#######@'
      That is, seven pound signs...

      I ask you to please consider the original post as I dont think you are making your assertions with an understanding of what was asked. For example, the escape character itself must be escaped when created a delimited line. Hence the list:
      my @x = qw(## @ #);
      would be delimited as:
      '####@#@@##'
      Finally, I would be glad to verify your second claim if you can provide the input string upon which rollahead and unroll return different results. I believe such a string does not exist.
        Finally, I would be glad to verify your second claim if you can provide the input string upon which rollahead and unroll return different results. I believe such a string does not exist.

        Actually, it was your own benchmark:

        #! perl -slw use strict; use Benchmark ':all', ':hireswallclock'; my $x = "#@##@###@####@#####@"; my $y = reverse $x; my $z = "$x$x$x$y$y$x$x$y$y$y$y$y$x$x$x$x"; our %results; cmpthese -3, { unroll => sub { my @x = ([unroll($x)],[unroll($y)],[unroll($z)]); $results{ unroll } = \@x; }, rollahead => sub { my @x = ([rollahead($x)],[rollahead($y)],[rollahead($z)]); $results{ rollahead } = \@x; }, lookbehind => sub { my @x = ([lookbehind($x)],[lookbehind($y)],[lookbehind($z)]); $results{ lookbehind } = \@x; }, }; printf "%20s => %s\n", $_, "@{[ map{ qq[@$_] } @{ $results{ $_ } } ]} +" for keys %results; sub unroll { my @x = $_[0] =~ /(?:^|@)((?:##|#@|[^#@])*)/g; for(@x){ $_ =~ s/##/#/g; $_ =~ s/#@/@/g; } @x } sub rollahead { my $x = shift; $x = reverse $x; my @x = reverse(split/\@(?=(?:##)*(?!#))/,$x,-1); for(@x){ $_ = reverse; $_ =~ s/##/#/g; $_ =~ s/#@/@/g; } @x } sub lookbehind { my $x = shift; split m[ (?: (?<=[^#]\#\#\#\#) | (?<=[^#]\#\#) | (?<!\#) ) [@] ]x, + $x; }

        Output (read'em and weep:):

        C:\test>junk8 Rate unroll rollahead lookbehind unroll 3096/s -- -3% -33% rollahead 3175/s 3% -- -32% lookbehind 4639/s 50% 46% -- rollahead => @# #@## ##@ ##@## #@# # @# #@## ##@@# #@## ## +@@# #@## ##@ ##@## #@# @##@## #@# # # #@## ##@@# #@## ##@ ##@## #@# @ +##@## #@# @##@## #@# @##@## #@# @##@## #@# # # #@## ##@@# #@## ##@@# +#@## ##@@# #@## ##@ unroll => @# #@## ##@ ##@## #@# @# #@## ##@@# #@## ##@ +@# #@## ##@ ##@## #@# @##@## #@# # # #@## ##@@# #@## ##@ ##@## #@# @# +#@## #@# @##@## #@# @##@## #@# @##@## #@# # # #@## ##@@# #@## ##@@# # +@## ##@@# #@## ##@ lookbehind => #@## ###@#### #####@ #####@#### ###@## # #@## + ###@#### #####@#@## ###@#### #####@#@## ###@#### #####@ #####@#### # +##@## #@#####@#### ###@## ## ## ###@#### #####@#@## ###@#### #####@ # +####@#### ###@## #@#####@#### ###@## #@#####@#### ###@## #@#####@#### + ###@## #@#####@#### ###@## ## ## ###@#### #####@#@## ###@#### #####@ +#@## ###@#### #####@#@## ###@#### #####@

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.