in reply to Splitting on escapable delimiter

I’d try reversing the string, splitting it, and then reversing all the pieces. That way you can use a variable-width look-ahead assertion instead of an (unsupported) variable-width look-behind assertion.
$_ = "#@##@###@####@#####@";
$_ = reverse;
my @pieces = reverse (split /\@(?=(?:##)*(?!#))/);
for (@pieces) { $_ = reverse; }
print "@pieces\n";
The regex is a little hairy; it has a negative look-ahead assertion inside the positive look-ahead assertion.
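
To make the nested assertions easier to follow, here is a minimal sketch of the same idea with the regex spelled out under /x. The helper name split_unescaped and the sample input are made up for illustration; they are not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): split on '@' unless
# the '@' is escaped, i.e. preceded by an odd number of '#' characters.
sub split_unescaped {
    my ($str) = @_;
    my $rev = reverse $str;    # scalar context: reverses the characters
    my @rev_pieces = split /
        \@              # a delimiter candidate
        (?=             # followed, in the reversed string, by ...
            (?:\#\#)*   #   zero or more pairs of hash marks
            (?!\#)      #   and no further hash mark: the run has even length
        )
    /x, $rev;
    # Restore the field order first, then each field's character order.
    return map { scalar reverse $_ } reverse @rev_pieces;
}

my @pieces = split_unescaped('a#@b@c##@d');
print join('|', @pieces), "\n";    # prints: a#@b|c##|d
```

In the sample, the first '@' has one '#' before it, so it is escaped and stays inside the field; the other two have zero and two '#' before them, so they act as delimiters.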

Re^2: Splitting on escapable delimiter
by mobiusinversion (Beadle) on Mar 28, 2008 at 22:25 UTC
    I have to say, that is clever!

    At first I smacked my forehead that after 2 years of daily Perl programming I had never thought "Duh! Variable width lookbehind is just variable width lookahead on the reverse string!".

    Bravo!

    Unfortunately, this solution does *not* recover empty fields delimited in this way. For example, try the example string above with two '@'s prepended (as you would find after delimiting empty fields).

    See my post below for the correct way to handle this using loop-unrolling (in one regex and no lookaround!).
      *reads documentation for split*

      Ah, I need to add a -1 as a third parameter to split. Good spot.
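
A small sketch of what the -1 changes, assuming a made-up input with two leading empty fields. After the reversal those empty fields sit at the end of the string, and split discards trailing empty fields unless the limit is negative.

```perl
use strict;
use warnings;

# Hypothetical input: two leading empty fields, then 'a' and 'b'.
my $rev = reverse '@@a@b';    # 'b@a@@': the empty fields are now trailing

# Default limit: split discards trailing empty fields.
my @dropped = split /\@(?=(?:##)*(?!#))/, $rev;

# Negative limit: trailing empty fields are kept.
my @kept = split /\@(?=(?:##)*(?!#))/, $rev, -1;

# Undo the reversal: field order first, then each field's characters.
my @fields = map { scalar reverse $_ } reverse @kept;

printf "without -1: %d fields; with -1: %d fields\n",
    scalar(@dropped), scalar(@fields);
# prints: without -1: 2 fields; with -1: 4 fields
```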

        Wow, I totally should have seen that!

        Okay, so now that mutual correctness has been established, it is time for optimality checking. It turns out that your method is about 15% faster.

        I called my method unroll, and your method rollahead.

        Here are the benchmark results:

        Benchmark: timing 100000 iterations of rollahead, unroll...
        rollahead: 18.3956 wallclock secs (17.94 usr + 0.00 sys = 17.94 CPU) @ 5575.07/s (n=100000)
        unroll: 22.1357 wallclock secs (20.58 usr + 0.00 sys = 20.58 CPU) @ 4859.56/s (n=100000)
                    Rate    unroll rollahead
        unroll    4860/s        --      -13%
        rollahead 5575/s       15%        --


        and the code:
        use strict;
        use Benchmark ':all', ':hireswallclock';

        my $x = "#@##@###@####@#####@";
        my $y = reverse $x;
        my $z = "$x$x$x$y$y$x$x$y$y$y$y$y$x$x$x$x";

        my $r = timethese( 100000, {
            unroll    => sub { my @x = ([unroll($x)],[unroll($y)],[unroll($z)]) },
            rollahead => sub { my @x = ([rollahead($x)],[rollahead($y)],[rollahead($z)]) },
        } );
        cmpthese($r);

        # One regex, no lookaround: each field is a run of escaped pairs
        # (## or #@) and ordinary characters; unescape afterwards.
        sub unroll {
            my @x = $_[0] =~ /(?:^|@)((?:##|#@|[^#@])*)/g;
            for (@x) {
                $_ =~ s/##/#/g;
                $_ =~ s/#@/@/g;
            }
            @x
        }

        # Reverse, split on unescaped @ with a look-ahead (keeping empty
        # fields via the -1 limit), then reverse back and unescape.
        sub rollahead {
            my $x = shift;
            $x = reverse $x;
            my @x = reverse(split /\@(?=(?:##)*(?!#))/, $x, -1);
            for (@x) {
                $_ = reverse;
                $_ =~ s/##/#/g;
                $_ =~ s/#@/@/g;
            }
            @x
        }
        Is there a monk who could explain why rollahead is faster? I was surprised considering the number of calls to reverse. I can only guess that somewhere deep inside the guts of the Perl-Regex-Beasty, that the optimizer droids are nasty hardcore with lookaround automata but can't be bothered with alternation.