Daryn has asked for the wisdom of the Perl Monks concerning the following question:

Hi gentle monks,

I am looking for an efficient and correct way to split a string on an escapable delimiter. Let's say the delimiter is @ and the escape char is #. The escape char cancels any special effect of the following character (including the escape itself). Using this encoding, "one@two@three" would be split into the three strings "one", "two" and "three"; "## is a hash and #@ is an arobace" would be split into a single string identical to the input; "#@##@###@####@#####@" would be split into "#@##", "###@####", "#####@".

In other words, @ is a real delimiter only when preceded by zero or an even number of #.

I am currently working in two steps, first splitting on @ irrespective of any preceding escapes, and then joining back consecutive strings as needed. There is certainly a better way.

Any takers? TIA.
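For reference, the two-step approach described in the question (split on every @, then glue pieces back together) might look roughly like the sketch below. The helper name split_then_rejoin is made up for illustration, and this is only one possible reading of "joining back consecutive strings as needed": a piece is re-joined with its successor whenever it ends in an odd number of #.

```perl
use strict;
use warnings;

# Hypothetical helper: split on every '@', then re-join a piece with its
# successor whenever it ends in an odd number of '#' (that '@' was escaped).
sub split_then_rejoin {
    my ($string) = @_;
    my @raw = split /\@/, $string, -1;    # -1 keeps trailing empty pieces
    my @fields;
    my $pending;                          # piece still waiting for its continuation
    for my $piece (@raw) {
        my $current = defined $pending ? $pending . '@' . $piece : $piece;
        my ($run) = $current =~ /(#*)\z/; # trailing run of escape characters
        if (length($run) % 2) {
            $pending = $current;          # odd run: the following '@' is escaped
        }
        else {
            push @fields, $current;
            undef $pending;
        }
    }
    push @fields, $pending if defined $pending;   # dangling escape at end of input
    return @fields;
}

print join(', ', split_then_rejoin($_)), "\n"
    for 'one@two@three', '#@##@###@####@#####@';
```

Fields are returned still escaped, as in the examples given in the question.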

Replies are listed 'Best First'.
Re: Splitting on escapable delimiter
by BrowserUk (Patriarch) on Mar 28, 2008 at 14:47 UTC

    Intuitively, you want to use split '(?<=(?:##)+)\@', $s;; but that gets you:

    [Variable length lookbehind not implemented in regex; ...

    So how to achieve a variable length lookbehind? Here's one way:

    print $s;;
    #@##@###@####@#####@

    print for split '(?:(?<=[^#]####)|(?<=[^#]##)|(?<!#))[@]', $s;;
    #@##
    ###@####
    #####@

    Of course the downside is that you need to include a case for each length of lookbehind, which quickly gets unwieldy:

    print for split '(?:(?<=[^#]########)|(?<=[^#]######)|(?<=[^#]####)|(?<=[^#]##)|(?<!#))[@]', $s;;
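One way to keep the list of lookbehinds manageable is to generate it. The following is only a sketch: delimiter_regex and its $max_pairs parameter are invented names, and runs of escapes longer than 2 * $max_pairs are still not handled. Like the hand-written pattern, it requires a non-# character before an even run, so an even run at the very start of the string is not recognised.

```perl
use strict;
use warnings;

# Hypothetical helper: build one fixed-width lookbehind per even run length,
# up to $max_pairs pairs of '#'. Longer runs of escapes are not handled.
sub delimiter_regex {
    my ($max_pairs) = @_;
    my @alternatives;
    for my $pairs (reverse 1 .. $max_pairs) {
        push @alternatives, '(?<=[^#]' . ('##' x $pairs) . ')';
    }
    push @alternatives, '(?<!#)';         # the "no escape at all" case
    my $pattern = '(?:' . join('|', @alternatives) . ')\@';
    return qr/$pattern/;
}

my $s  = '#@##@###@####@#####@';
my $re = delimiter_regex(4);
print "$_\n" for split $re, $s;
```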

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting on escapable delimiter
by almut (Canon) on Mar 28, 2008 at 16:09 UTC

    You could also implement a simple state machine with a binary state (escaped, unescaped), and then split on encountering '@' only when unescaped. E.g.

    sub mysplit {
        my $string = shift;
        my @parts;
        my $part = '';
        my $escaped = 0;
        for ($string =~ m/(.)/gs) {
            if ($_ eq '@') {
                unless ($escaped) {
                    push @parts, $part;
                    $part = '';
                    next;
                }
            }
            if ($_ eq "#") {
                $escaped ^= 1;  # toggle state
            } else {
                $escaped = 0;   # reset
            }
            $part .= $_;
        }
        push @parts, $part;
        return @parts;
    }

    my @tests = (
        'one@two@three',
        '## is a hash and #@ is an arobace',
        '#@##@###@####@#####@',
    );

    for my $s (@tests) {
        print join(', ', mysplit($s) ), "\n";
    }

    Output

    one, two, three
    ## is a hash and #@ is an arobace
    #@##, ###@####, #####@

    (Not well tested for edge cases... but you get the idea.)

Re: Splitting on escapable delimiter
by ikegami (Patriarch) on Mar 28, 2008 at 15:50 UTC
    # Extract fields
    my @fields = /((?:[^#@]+|#.)*)/sg;

    # Remove separators
    my $ff = 0;
    @fields = grep $ff^=1, @fields;

    # Unescape
    s/#(.)/$1/sg for @fields;

    or use Text::CSV

    Updated to remove empty elements that were being placed in @fields.

      If you consume the separator, you don't have to filter it out. And if you put the escape regex first, you don't have to mention the # twice.
      my @fields = /((?:#.|[^@])*)\@?/sg;

      Caution: Contents may have been coded under pressure.
        Nope, that returns an extra (empty) field most of the time, and there's no way to know when. For example, 'a@b' incorrectly returns 3 fields, although 'a@' correctly returns 2.
Re: Splitting on escapable delimiter
by Anonymous Monk on Mar 28, 2008 at 17:49 UTC
    I’d try reversing the string, splitting it, then reversing all the pieces. That way you can use a variable-width look-ahead assertion instead of a(n unsupported) variable-width look-behind assertion.

    $_ = "#@##@###@####@#####@";
    $_ = reverse;
    my @pieces = reverse (split /\@(?=(?:##)*(?!#))/);
    for (@pieces) { $_ = reverse; }
    print "@pieces\n";

    The regex is a little hairy; it has a negative look-ahead assertion inside the positive look-ahead assertion.
      I have to say, that is clever!

      At first I smacked my forehead that after 2 years of daily Perl programming I had never thought "Duh! Variable width lookbehind is just variable width lookahead on the reverse string!".

      Bravo!

      Unfortunately, this solution does *not* recover empty fields delimited in this way. For example, try the example string above with two '@'s prepended (as you would find after having delimited empty fields).

      See my post below for the correct way to handle this using loop-unrolling (in one regex and no lookaround!).
        *reads documentation for split*

        Ah, I need to add a -1 as a third parameter to split. Good spot.
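A minimal sketch of the reverse/split/reverse idea with the -1 limit added, so that split keeps the trailing empty fields of the reversed string (which are the leading empty fields of the original). The wrapper name reverse_split is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the reverse/split/reverse idea, with a
# limit of -1 so that split keeps trailing (originally leading) empty fields.
sub reverse_split {
    my ($string) = @_;
    my $reversed = reverse $string;       # scalar context: reverses the characters
    my @pieces   = split /\@(?=(?:##)*(?!#))/, $reversed, -1;
    return map { scalar reverse $_ } reverse @pieces;
}

# '|' is used as a visible separator so that empty fields show up.
print join('|', reverse_split('@@#@##@###@####@#####@')), "\n";
```

For this input the two leading empty fields are kept, followed by the same three fields as before.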

Re: Splitting on escapable delimiter
by apl (Monsignor) on Mar 28, 2008 at 15:37 UTC
    I'm not as clever as some, so I did it the long way...

    #!/usr/bin/perl
    use strict;
    use warnings;

    while( my $ln = <DATA> ) {
        chomp( $ln );
        my @flds = split( '@', $ln );
        foreach ( @flds ) {
            if ( /(#*)/ ) {
                print "/$ln/ --> /$1/\n" if ( length( $1 ) % 2 ) == 1;
            }
        }
    }

    __DATA__
    #@##@###@####@#####@
    #@##
    ###@####
    #####@

    This results in:

    /#@##@###@####@#####@/ --> /#/
    /#@##@###@####@#####@/ --> /###/
    /#@##@###@####@#####@/ --> /#####/
    /#@##/ --> /#/
    /###@####/ --> /###/
    /#####@/ --> /#####/

    Revised: This won't work for a string with no pound signs. You'd need to modify the length test to include || length( $1 ) == 0.
Re: Splitting on escapable delimiter
by mobiusinversion (Beadle) on Mar 28, 2008 at 22:18 UTC
    You can do it all in one regex:

    sub unroll {
        my @x = $_[0] =~ /(?:^|@)((?:##|#@|[^#@])*)/g;
        for (@x) {
            $_ =~ s/##/#/g;
            $_ =~ s/#@/@/g;
        }
        @x
    }

    so that:
    unroll("#@##@###@####@#####@")
    produces the following fully unescaped list:
    '@#', '#@##', '##@',
    This approach also has the benefit of handling empty sequences correctly, eg:
    unroll("@@#@##@###@####@#####@")
    produces:
    '', '', '@#', '#@##', '##@'
    as it probably should.

    In general, this technique is called 'unrolling the loop' and is described in the owl book (Mastering Regular Expressions).

    To escape and join data in your way, you could use the following:

    sub my_escape {
        my $x = shift;
        $x =~ s/#/##/g;
        $x =~ s/@/#@/g;
        $x
    }
    sub my_join { join('@',@_) }

    Apply my_escape to each element of the list and then call my_join on it, so that:
    my_join(map{my_escape($_)}('','','@#','#@##','##@',))
    produces:
    '@@#@##@###@####@#####@'
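As a quick sanity check, the three subs can be combined into a round trip: escape and join a list, split it again, and compare. This is only a sketch; the subs are repeated here (with @ backslashed inside the regexes) so that it runs on its own, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# unroll, my_escape and my_join as posted, repeated so the sketch runs alone.
sub unroll {
    my @x = $_[0] =~ /(?:^|\@)((?:##|#\@|[^#\@])*)/g;
    for (@x) { s/##/#/g; s/#\@/\@/g; }
    return @x;
}
sub my_escape { my $x = shift; $x =~ s/#/##/g; $x =~ s/\@/#\@/g; return $x }
sub my_join   { return join '@', @_ }

my @original = ('', '', '@#', '#@##', '##@');
my $encoded  = my_join(map { my_escape($_) } @original);
my @decoded  = unroll($encoded);

# '|' does not occur in the data, so it is safe as a comparison separator.
my $same = join('|', @decoded) eq join('|', @original);

print "encoded: $encoded\n";
print "round trip ", ($same ? "ok" : "FAILED"), "\n";
```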
Re: Splitting on escapable delimiter
by jfraire (Beadle) on Mar 28, 2008 at 17:57 UTC

    Well, here is my try (which does not work!). It is possible to use reverse and then lookahead assertions:

    use strict;
    use warnings;
    # use re 'debug';

    my $s = '#@##@###@####@#####@';
    my @list = reverse split '@(?=(##)*[^#])', reverse $s;
    print scalar reverse $_, "\n" for @list;

    I see the regexp only matches at the good @ signs, but I am getting a couple of ## in the output that I can't explain. I have tried with use re 'debug' and so I know the regexp is matching where I intended.

    Output:

    #@##
    ##
    ###@####
    ##
    #####@

    Julio

      I suspect this version will fail on "####@####", because your regex looks for a character that isn’t a '#', erroneously failing at end of string.
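The stray ## pieces are most likely explained by the capturing parentheses: when the split pattern contains a capturing group, split also returns the captured text for each separator. A small sketch comparing the posted pattern with a non-capturing variant that uses (?!#) instead of [^#], so that end of string is accepted too (variable names are arbitrary):

```perl
use strict;
use warnings;

my $s = '#@##@###@####@#####@';

# Capturing group: split also returns what (##)* captured for each separator.
my @with_capture = split /\@(?=(##)*[^#])/, scalar reverse $s;

# Non-capturing group, and (?!#) instead of [^#] so end of string is accepted.
my @without = split /\@(?=(?:##)*(?!#))/, scalar reverse $s;

print scalar(@with_capture), " pieces with the capturing group\n";
print scalar(@without), " pieces without it\n";
print join(', ', map { scalar reverse $_ } reverse @without), "\n";
```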
Re: Splitting on escapable delimiter
by Daryn (Sexton) on Mar 28, 2008 at 20:08 UTC
    Thank you all for your time and answers.

    I did use a state machine in an older, similar problem where I was reading the text instead of processing strings.

    The strings I deal with can get to the multi-megabyte size range so reversing them is not really attractive.

    I'll probably go with Roy Johnson's very neat solution unless benchmarking shows that a finite machine beats the regexp engine (which I doubt).

    Again, thanks to all for an instructive thread.

Re: Splitting on escapable delimiter
by wade (Pilgrim) on Mar 28, 2008 at 15:26 UTC
    So, this is more of a follow-up question than an answer. I tried:
    use strict; use warnings; { my $var1 = "####@#####@##@###@######@###"; print "START '$var1'\n"; my @foo = split /(?<=[^#]((##)+))[@]/, $var1; foreach (@foo) { print "HERE: '$_'\n"; } }
    But I got the error message: "Variable length lookbehind not implemented in regex;". Is this an ActivePerl thing (that's what I'm using), a Perl v5.8.8 thing, or did I do something boneheaded and just didn't see it?
    --
    Wade

      Did you notice the bit highlighted below in the post where you got that regex from?

      Intuitively, you want to use split '(?<=(?:##)+)\@', $s;; but that gets you:

      [Variable length lookbehind not implemented in regex; ...


        Doh! No, like an idiot, I looked at the problem and thought "I can solve that!". Thanks!
        --
        Wade