in reply to split on delimiter unless escaped

A typical approach is not to split, but to parse the chunks you want to preserve.

A regex for that is:

    my $re = qr{
        (?>             # don't backtrack into this group
            !.          # either the escape character,
                        # followed by any other character
          |             # or
            [^!;\n]     # a character that is neither escape
                        # nor split character
        )+
    }x;

    while ($str =~ /($re)/g) {
        print "Chunk '$1'\n";
    }

This technique is fairly general. It works, for example, for quoted strings, where a backslash can escape the quote character so that it does not terminate the string.
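As a rough illustration of the quoted-string case, here is a minimal sketch. The sample text and the names `$text` and `$quoted` are made up for this example and are not from the original post; it assumes double-quoted strings in which a backslash escapes the following character.

```perl
use strict;
use warnings;

# Hypothetical input: \" inside a string must not end the string.
my $text = 'say "hello \"world\"" and "bye"';

my $quoted = qr{
    "               # opening quote
    (               # capture the string body
        (?>
            \\.     # a backslash followed by any character
          |         # or
            [^"\\]  # a character that is neither quote nor backslash
        )*
    )
    "               # closing quote
}xs;

my @strings;
while ($text =~ /$quoted/g) {
    push @strings, $1;
}

print "String: $_\n" for @strings;
```

The structure is the same as in the split case: one alternative consumes "escape plus whatever follows" as a unit, the other consumes any ordinary character.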

You can read more about it in Mastering Regular Expressions by Jeffrey E.F. Friedl, a book I can warmly recommend.

Update: Added \n to the negated character class; mr_mischief pointed out that it is probably closer to the desired output that way.


Re^2: split on delimiter unless escaped
by ikegami (Patriarch) on Nov 09, 2010 at 22:54 UTC
    Not quite. "+" means you can't have empty fields. And if you change it to "*", you can get one too many empty fields. That's why my solution is slightly different.
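A small sketch of the difference, using a made-up test string `a;;b`. The first pattern is the chunk regex from the parent post written inline; the second is the `\G`-anchored field-plus-terminator pattern that is tried later in this thread. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $str = 'a;;b';

# Chunk pattern with "+": every chunk needs at least one character,
# so the empty field between the two delimiters is never reported.
my @plus = $str =~ /((?>!.|[^!;\n])+)/g;

# Field pattern with "*", anchored with \G: a field may be empty, and each
# match also consumes the ";" or the end of the string that terminates it.
my @anchored = $str =~ /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;

print scalar(@plus), " chunks: ", join('|', @plus), "\n";
print scalar(@anchored), " fields: ", join('|', @anchored), "\n";
```

The first list has two entries and loses the empty field. The second has four: the three expected fields plus an extra empty one, because the pattern can match once more with zero length at the very end of the string.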

      Hi ikegami,

      Thanks for your example. I'm still trying to figure it all out. I'm running it as below, and it doesn't seem to quite do what I want. I only want the escape character to be treated specially if it's in !+; - i.e. a!!b should be a!!b, whereas a!!!;b should be a!;b.

      Also, I seem to be getting an empty field at the end. One or more semicolons at the end seem to be parsed properly, though.

      One test string returns a blank result, and I'm not sure why.

          sub dequote {
              my $x = $_[0];
              $x =~ s/!(.)/$1/sg;
              return $x;
          }

          while (<>) {
              chomp;
              my @fields = map dequote($_), /\G((?:[^!;]+|!.)*)(?:;|\z)/sg;
              print "$_ => " . join( '|', @fields ) . "\n";
              # print "$_ => @fields\n";
          }

      Sample results:

          aval!!!!;bval => aval!!|bval|
          aval!!!!!;bval => aval!!;bval|
          a!!val!!!!!;bval! => 
          !a!!!val!!!!!;bval!! => a!val!!;bval!|
          a!val!;bva!l; => aval;bval|
          a!!val!!;;bv!!al;; => a!val!||bv!al||

        I only want the escape character to be treated specially if it's in !+;

        Yuck! I hope you're being forced to deal with this format.

        It's not only trickier for a human to understand, it's trickier to code. In particular, the definition of a field varies based on whether it's the last field or not, and the function of the "!" varies based on its position in the field.

            sub unescape {
                my $x = $_[0];
                my ($base, $end) = $x =~ /^(.*)(!+)\z/s;
                return $base . ('!' x (length($end)/2));
            }

            my $last_field  = qr/ [^;]* /x;
            my $other_field = qr/ (?: [^!]+ | (?: ![^!] )+ )* (?:!!)* /x;

            # Validation
            my $record = qr/^ (?: $other_field ; )* $last_field \z/x;

            # Extraction
            my @fields = map unescape($_), /
                \G
                ( $other_field (?= ; ) | $last_field (?= \z ) )
                (?:;|\z)
            /xg;

        You are free to skip the validation.

Re^2: split on delimiter unless escaped
by yrp001 (Initiate) on Nov 09, 2010 at 22:21 UTC

    Ah, neat. I made a couple of small modifications, and now I'm very close. The trouble left now is how to capture an empty field - i.e. where I have two delimiter characters next to each other I should emit an empty chunk, instead of no chunk. Still stuck on that. What I have so far:

        my $re = qr{
            (?>             # don't backtrack into this group
                !!          # either the escape character,
                            # followed by an escape character
              |             # or
                !;          # escape followed by delimiter char
              |             # or
                [^;\n]      # a character that is neither delimiter
                            # character or a newline
            )+
        }x;

        while (<>) {
            chomp;
            $str = $_;
            print "$_\n";
            while ($str =~ /($re)/g) {
                print " Chunk '$1' => ";
                $s = $1;
                $s =~ s/!!(?=(!|;))/!/g;
                print "$s\n";
            }
        }

    Example of paired delimiters (;;)

        a!!val!!;;bv!!al;;
         Chunk 'a!!val!!' => a!!val!!
         Chunk 'bv!!al' => bv!!al

      So the following seems to do exactly what I want, but doesn't handle empty fields. It might not matter, because my input shouldn't have any empty fields. I'll probably just check that my input string doesn't begin or end with a delimiter, or have two consecutive delimiters in the middle anywhere. If it does, it's bad input, and I can just throw it out. Would still be fun to know how to handle empty fields, though...

          my $re = qr{
              (?>             # don't backtrack into this group
                  !!          # either the escape character,
                              # followed by an escape character
                |             # or
                  !;          # escape followed by delimiter char
                |             # or
                  [^;\n]      # a character that is neither delimiter
                              # character or a newline
              )+
          }x;

          while (<>) {
              chomp;
              my @aray;
              $str = $_;
              print "$_\n ";
              while ($str =~ /($re)/g) {
                  $s = $1;
                  $s =~ s/!!(?=(!|;|\z))/!/g;
                  push( @aray, $s );
              }
              print join(' | ', @aray) . "\n";
          }
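On the open question of empty fields, one possible sketch (not the approach posted elsewhere in this thread) is to keep the same three alternatives, repeat them with `*` so a field may be empty, and require every field to be introduced by either the start of the string or a delimiter. The names `$field` and `split_fields` are hypothetical, and unescaping is deliberately left out, so the fields come back raw.

```perl
use strict;
use warnings;

# Same alternatives as the pattern above, but with * so a field may be empty.
my $field = qr{
    (?>
        !!          # an escape character followed by an escape character
      |             # or
        !;          # escape followed by delimiter char
      |             # or
        [^;\n]      # any character that is neither delimiter nor newline
    )*
}x;

# Assumed helper: returns the raw (still escaped) fields of one record.
sub split_fields {
    my ($str) = @_;
    # Each field starts either at the beginning of the string or right
    # after an unescaped delimiter, so nothing can match past the end.
    return $str =~ / (?: \A | ; ) ($field) /xg;
}

my @fields = split_fields('a!!val!!;;bv!!al;;');
print scalar(@fields), " fields: ", join(' | ', map { "'$_'" } @fields), "\n";

# Rejecting records with empty fields, as suggested above, is then one grep.
my $has_empty = grep { $_ eq '' } @fields;
print "record has empty fields\n" if $has_empty;
```

Because the delimiter is consumed at the front of each match rather than at the back, the pattern cannot produce a zero-length match at the end of the string, so four delimiters yield exactly five fields here.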