bsb has asked for the wisdom of the Perl Monks concerning the following question:

Is there a split regex, or other regex technique, to split on unescaped delimiters and ignore escaped ones? (backslash escaping, not quoting)

I'm doing it with two regexen but it feels clumsy, I suspect there's an easier way.

# delimiter is X
$escaped_str = <<'EOT';
Xa\X\\bXc\\XdX
EOT
chomp $escaped_str;
@a = split(/X/, $escaped_str, -1);
print "(", join(':', @a), ")";
# gives: (:a\:\\b:c\\:d:)
# want:  (:a\X\\b:c\\:d:)

Replies are listed 'Best First'.
Re: split on unescaped delimiters
by Abigail-II (Bishop) on Jan 08, 2004 at 10:09 UTC
    Assuming backslashes themselves could be escaped, you want to split on colons which are preceded by an even number of backslashes. You can't use lookbehind in this case, because you can't do variable-length lookbehind.

    But you can reverse the string, and look for an even number of trailing backslashes. After splitting, you need to do some reversing again:

    reverse map {scalar reverse} split /:(?=(?:\\\\)*(?!\\))/ => reverse $string;

    Abigail
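
A minimal sketch of the reverse-and-lookahead idea applied to the original poster's test string. The helper name `split_unescaped_rev` and the `-1` limit (to keep empty fields) are assumptions added for illustration, not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: split $str on unescaped occurrences of the single
# character $delim, where a backslash escapes the character that follows it.
sub split_unescaped_rev {
    my ($str, $delim) = @_;
    my $d = quotemeta $delim;

    # In the reversed string the escaping backslashes FOLLOW the delimiter,
    # so an ordinary lookahead can check that their number is even.
    my @rev = split /$d(?=(?:\\\\)*(?!\\))/, scalar reverse($str), -1;

    # Undo the reversal: each field, then the order of the fields.
    return reverse map { scalar reverse $_ } @rev;
}

# The actual characters of this string are: Xa\X\\bXc\\XdX
my @parts = split_unescaped_rev('Xa\X\\\\bXc\\\\XdX', 'X');
print "(", join(':', @parts), ")\n";
```

With that input the sketch prints `(:a\X\\b:c\\:d:)`, five fields with the leading and trailing empty ones kept.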

      Very, very nice.

      Even, eh? Reminds me of your prime-matching JAPH
      ...thinking music...

      No good. The split would still take the slashes and lookbehind is fixed length.

Re: split on unescaped delimiters
by Roger (Parson) on Jan 08, 2004 at 12:02 UTC
    TIMTOWTDI, A split (with capture) example...
    use strict;
    use warnings;
    use Data::Dumper;

    # delimiter is X
    my $escaped_str = 'Xa\Xdc\\bXc\\\\XdXe';
    my @a = ();
    my $i = 0;
    foreach (split /(\\.)|X/, $escaped_str) {
        defined $_ ? do { $a[$i] .= $_ } : do { $i++ }
    }
    print Dumper(\@a);

    and the output is as expected -
    $VAR1 = [
              '',
              'a\\Xdc\\b',
              'c\\\\',
              'd',
              'e'
            ];
      I like that. Trying to come up with a map() version, but
      my $scratch = '';
      my @a = (
          map(defined() ? ($scratch.=$_)[()] : substr($scratch,0,length($scratch),''),
              split /(\\.)|X/, $escaped_str),
          length($scratch) ? $scratch : ()
      );
      is the best I can do. And that's way too ugly.

      Maybe something based on @a = @{List::Util::reduce { ... } [], split... };

      Thank you for this nice, innovative approach. One detail: do not forget -1 as the 3rd parameter to split, or else trailing empty fields will be discarded, as explained in perlfunc for split.
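
A sketch of the capturing-split approach with the `-1` limit applied, so that a trailing empty field survives. The wrapper name `split_unescaped_cap`, the `/s` flag (so a backslash can also escape a newline) and the single-character delimiter are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the capturing split discussed above.
sub split_unescaped_cap {
    my ($str, $delim) = @_;
    my $d = quotemeta $delim;
    my @fields = ('');

    # split returns plain text pieces interleaved with the capture:
    # an escaped pair such as \X, or undef when a bare delimiter matched.
    for my $piece (split /(\\.)|$d/s, $str, -1) {
        if (defined $piece) { $fields[-1] .= $piece }   # still the same field
        else                { push @fields, '' }        # bare delimiter: new field
    }
    return @fields;
}

# The actual characters of this string are: Xa\X\\bXc\\XdX
my @a = split_unescaped_cap('Xa\X\\\\bXc\\\\XdX', 'X');
print "(", join(':', @a), ")\n";
```

This prints `(:a\X\\b:c\\:d:)`; the final empty field is there only because of the `-1`.
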
Re: split on unescaped delimiters
by Abigail-II (Bishop) on Jan 08, 2004 at 10:16 UTC
    Instead of splitting, you can also extract what you want:
    my @parts = $string =~ /([^:\\]*(?:\\.[^:\\]*)*)(?(?{length $^N})|(?!))/g;

    Abigail
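
A small usage sketch of the extraction regex, with a sample string chosen for illustration. The conditional at the end rejects zero-length matches, so as far as I can tell empty fields are dropped, unlike `split` with a `-1` limit.

```perl
use strict;
use warnings;

# Sample input (an assumption for this example); the characters are: a\:b::c
my $string = 'a\:b::c';

# Group 1 grabs a run of non-delimiter characters and escaped pairs;
# the conditional fails the match whenever that run is empty.
my @parts = $string =~ /([^:\\]*(?:\\.[^:\\]*)*)(?(?{length $^N})|(?!))/g;

print join('|', @parts), "\n";
```

Here the empty field between the two adjacent colons does not appear in `@parts`, which holds `a\:b` and `c`.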

Re: split on unescaped delimiters
by bsb (Priest) on Jan 08, 2004 at 09:42 UTC
    Here's my working solution, the clumsy one
    # in the real code '.' is my delimiter
    # Using 'X' above since it's not a regex metachar
    my $first = $1 if $name =~ m/^ ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx;
    my (@remainder) = $name =~ m/\G (?:\.) ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx;
Re: split on unescaped delimiters
by davido (Cardinal) on Jan 08, 2004 at 09:42 UTC
    Here is a use of a negative lookbehind zero-width assertion to prevent a split on a comma if it is preceded by a backslash.

    my @array = split /(?<!\\),/, $string;

    It looks like your code is using a colon as the delimiter. This solution can be easily adapted to whatever delimiter or escape sequence you desire.
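
A quick sketch of what the single-character lookbehind does and does not handle. The two sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $re = qr/(?<!\\),/;

# Escaped comma: it stays inside the first field (characters: a\,b,c).
my @simple = split $re, 'a\,b,c', -1;

# Escaped backslash followed by a real delimiter (characters: a\\,b).
# The lookbehind only sees one backslash, so this comma is not split on.
my @tricky = split $re, 'a\\\\,b', -1;

print scalar(@simple), " fields: ", join(' | ', @simple), "\n";
print scalar(@tricky), " fields: ", join(' | ', @tricky), "\n";
```

The first split yields two fields, as wanted. The second yields a single field even though its comma is unescaped, because the backslash before it is itself escaped.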

    For more elaborate things, like balanced quotes, you're better off going to the Text::Balanced module, or maybe the DBD::CSV module.

    Update: To answer the escaped escape problem that you've mentioned, you could resort to alternation within the split:

    my @array = split/(?:\\\\,)|(?:(?<!\\),)/, $string;

    You really end up with some ugly leaning toothpicks!


    Dave

      But the backslash might be backslashed
      I'm not having much luck with the alternation code.

      I think it'd have problems with 5 or 6 backslashes. What's more, since it matches the backslashes, it'd trim them off the end of the resulting split strings.

      I'd better try that out and update...

        DB<9> x @a= split /(?:\\\\X)|(?:(?<!\\)X)/, $escaped_str
      0  ''
      1  'a\\X\\\\b'
      2  'c'
      3  'd'
        DB<10> p $escaped_str
      Xa\X\\bXc\\XdX
      Yes, the c gets stripped of its trailing backslashes
Re: split on unescaped delimiters
by delirium (Chaplain) on Jan 08, 2004 at 13:14 UTC
    This is similar to Roger's.

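    $escaped_str = <<'EOT';
    Xa\X\\bXc\\XdX
    EOT
    chomp $escaped_str;

    my @array   = ('');   # start with one (possibly empty) field
    my $escaped = 0;
    my $count   = 0;
    for (split //, $escaped_str) {
        if (!$escaped && $_ eq 'X') {
            $array[++$count] = '';   # open a new field so empty ones are kept
            next;
        }
        $array[$count] .= $_;
        # a backslash escapes the next character unless it is itself escaped
        $escaped = ($_ eq "\\" && !$escaped) ? 1 : 0;
    }
    print "(",join(':',@array),")";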
Re: split on unescaped delimiters
by cLive ;-) (Prior) on Jan 08, 2004 at 12:54 UTC

    I suspect there's an easier way

    How about DBD::CSV ?

    .02

    cLive ;-)

      The Art of Unix Programming brainwashed me:

      In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.

      Although after this discussion I think placing the escape after the character being escaped might be better for regexes.
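
A minimal sketch of why a trailing escape would suit regexes, assuming a hypothetical convention in which the backslash comes after the character it escapes (so `X\` is a literal X and `\\` is a literal backslash). Counting the backslashes that follow a delimiter needs only a lookahead, which may be of variable length, so no string reversal is required.

```perl
use strict;
use warnings;

# Hypothetical postfix-escaped input; the characters are: aX\XbXc
# The first X is followed by a backslash, so it is literal data.
my $str = 'aX\XbXc';

# An X is a delimiter when an even number of backslashes follows it.
my @fields = split /X(?=(?:\\\\)*(?!\\))/, $str, -1;

print join('|', @fields), "\n";
```

This gives three fields, `aX\`, `b` and `c`, each still in its escaped form.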