split on unescaped delimiters

bsb has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: split on unescaped delimiters by Abigail-II (Bishop) on Jan 08, 2004 at 10:09 UTC
Assuming backslashes themselves can be escaped, you want to split on colons that are preceded by an even number of backslashes. You can't use lookbehind in this case, because you can't do variable-length lookbehind. But you can reverse the string and look for an even number of trailing backslashes. After splitting, you need to do some reversing again: `reverse map {scalar reverse} split /:(?=(?:\\\\)*(?!\\))/ => reverse $string;` Abigail
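A minimal sketch of the reverse-and-split idea above, run on a made-up sample string (the variable names and the input are assumptions, not part of the original post). It spells out `scalar reverse` on the input so the string reversal does not depend on how `split` imposes context on its second argument.

```perl
use strict;
use warnings;

# Hypothetical sample: ':' is the delimiter, '\' the escape character.
# In single quotes '\:' is kept as-is and each '\\' is one backslash.
my $string = 'a\:b:c\\\\:d';    # actual text: a\:b:c\\:d

# Reversing turns "backslashes before the colon" into "backslashes after
# the colon", which a lookahead can count in pairs.
my @fields = reverse
             map { scalar reverse $_ }
             split /:(?=(?:\\\\)*(?!\\))/, scalar reverse $string;

print join(' | ', @fields), "\n";    # a\:b | c\\ | d
```

Because `split` drops trailing empty fields by default, empty fields at the start of the original string are lost after the reversal unless a negative limit is passed.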
Re: Re: split on unescaped delimiters by bsb (Priest) on Jan 08, 2004 at 10:17 UTC
Very, very nice. Even, eh? Reminds me of your prime-matching JAPH ...thinking music... No good. The split would still take the slashes and lookbehind is fixed length.
Re: split on unescaped delimiters by Roger (Parson) on Jan 08, 2004 at 12:02 UTC
TIMTOWTDI. A split (with capture) example, using X as the delimiter: `use strict; use warnings; use Data::Dumper; my $escaped_str = 'Xa\Xdc\\bXc\\\\XdXe'; my @a = (); my $i = 0; foreach (split /(\\.)|X/, $escaped_str) { defined $_ ? do { $a[$i] .= $_ } : do { $i++ } } print Dumper(\@a);` and the output is as expected: `$VAR1 = [ '', 'a\\Xdc\\b', 'c\\\\', 'd', 'e' ];`
Re: Re: split on unescaped delimiters by ysth (Canon) on Jan 08, 2004 at 12:24 UTC
I like that. Trying to come up with a map() version, but `my $scratch = ''; my @a = (map(defined() ? ($scratch.=$_)[()] : substr($scratch,0,length($scratch),''), split /(\\.)|X/, $escaped_str), length($scratch) ? $scratch : ());` is the best I can do. And that's way too ugly. Maybe something based on `@a = @{List::Util::reduce { ... } [], split... };`
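One possible way to fill in the reduce idea, offered only as a sketch: the seed `['']`, the block body, and the `-1` limit are assumptions rather than anything ysth posted. It reuses Roger's sample string and relies on the capture group being undef whenever a bare X matched.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my $escaped_str = 'Xa\Xdc\\bXc\\\\XdXe';    # Roger's sample string

# Start with one empty field. An undef token means a bare X matched, so a
# new field is opened; any other token is appended to the current field.
my $fields = reduce {
    if (defined $b) { $a->[-1] .= $b }
    else            { push @$a, '' }
    $a;
} [''], split /(\\.)|X/, $escaped_str, -1;

print join(' | ', map { "'$_'" } @$fields), "\n";
```

For this input the result has five fields, the same ones Roger's loop produces.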
Re^2: split on unescaped delimiters by jdeguest (Beadle) on May 27, 2019 at 02:27 UTC
Thank you for this nice, innovative approach. One detail: do not forget -1 as the third parameter to split, or else trailing empty fields will be discarded, as explained in perlfunc for split.
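A short illustration of the point about the third parameter, using a made-up input string with two empty fields at the end:

```perl
use strict;
use warnings;

my $str = 'aXbXX';    # hypothetical input: 'a', 'b', then two empty fields

my @default = split /X/, $str;        # trailing empty fields are dropped
my @keep    = split /X/, $str, -1;    # a negative LIMIT keeps them

print scalar(@default), "\n";    # 2
print scalar(@keep), "\n";       # 4
```

Leading empty fields are kept either way; only trailing ones are affected by the limit.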
Re: split on unescaped delimiters by Abigail-II (Bishop) on Jan 08, 2004 at 10:16 UTC
Instead of splitting, you can also extract what you want: `my @parts = $string =~ /([^:\\]*(?:\\.[^:\\]*)*)(?(?{length $^N})|(?!))/g;` Abigail
Re: split on unescaped delimiters by bsb (Priest) on Jan 08, 2004 at 09:42 UTC
Here's my working solution, the clumsy one. In the real code '.' is my delimiter; I used 'X' above since it's not a regex metachar. `my $first = $1 if $name =~ m/^ ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx; my (@remainder) = $name =~ m/\G (?:\.) ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx;`
Re: split on unescaped delimiters by davido (Cardinal) on Jan 08, 2004 at 09:42 UTC
Here is a use of a negative lookbehind zero-width assertion to prevent a split on a comma if it is preceded by a backslash: `my @array = split /(?<!\\),/, $string;` It looks like your code is using a colon as the delimiter. This solution can be easily adapted to whatever delimiter or escape sequence you desire. For more elaborate things, like balanced quotes, you're better off going to the Text::Balanced module, or maybe the DBD::CSV module. Update: To answer the escaped escape problem that you've mentioned, you could resort to alternation within the split: `my @array = split /(?:\\\\,)|(?:(?<!\\),)/, $string;` You really end up with some ugly leaning toothpicks! Dave
Re: Re: split on unescaped delimiters by bsb (Priest) on Jan 08, 2004 at 09:43 UTC
But the backslash might be backslashed.
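To make the objection concrete, here is a small sketch with two made-up inputs (the variable names and strings are assumptions). The single-character lookbehind works when a comma is escaped, but it cannot tell an escaped backslash followed by a real delimiter from an escaped delimiter.

```perl
use strict;
use warnings;

# ',' is the delimiter and '\' the escape character.
my $simple  = 'a\,b,c';     # text: a\,b,c  (one escaped comma)
my $escaped = 'a\\\\,b';    # text: a\\,b   (escaped backslash, then a real comma)

my @ok  = split /(?<!\\),/, $simple;     # ('a\,b', 'c')
my @bad = split /(?<!\\),/, $escaped;    # ('a\\,b'): the real delimiter is missed

print scalar(@ok), "\n";     # 2
print scalar(@bad), "\n";    # 1
```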
Re: Re: split on unescaped delimiters by bsb (Priest) on Jan 08, 2004 at 10:11 UTC
I'm not having much luck with the alternation code. I think it'd have problems with 5 or 6 backslashes. What's more, since it matches the backslashes, it'd trim them off the end of the resulting split strings. I'd better try that out and update... `DB<9> x @a= split /(?:\\\\X)|(?:(?<!\\)X)/, $escaped_str 0 '' 1 'a\\X\\\\b' 2 'c' 3 'd' DB<10> p $escaped_str Xa\X\\bXc\\XdX` Yes, the backslashes after the c get stripped.
Re: split on unescaped delimiters by delirium (Chaplain) on Jan 08, 2004 at 13:14 UTC
This is similar to Roger's. `$escaped_str = <<'EOT'; Xa\X\\bXc\\XdX EOT chomp $escaped_str; my @array = (); my $escaped = 0; my $count = 0; for (split //, $escaped_str) { if (!$escaped && $_ eq 'X') { $count++; next; } $array[$count] .= $_; $escaped = ($_ eq "\\" && !$escaped) ? 1 : 0; } print "(",join(':',@array),")";`
Re: split on unescaped delimiters by cLive ;-) (Prior) on Jan 08, 2004 at 12:54 UTC
I suspect there's an easier way. How about DBD::CSV? .02 cLive ;-)
Re: Re: split on unescaped delimiters by bsb (Priest) on Jan 09, 2004 at 01:22 UTC
The Art of Unix Programming brainwashed me: "In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field." Although after this discussion I think placing the escape after the character being escaped might be better for regexes.
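The "single special case, single action" rule in the quoted passage can be sketched as a character-by-character scanner. This is only an illustration: `split_escaped` is a hypothetical helper name, the sample line is made up, and unlike the solutions earlier in the thread it also removes the escape characters from the returned fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split $line on $delim, honouring backslash escapes,
# and return the fields with the escapes removed.
sub split_escaped {
    my ($line, $delim) = @_;
    my @fields  = ('');
    my $escaped = 0;
    for my $ch (split //, $line) {
        if ($escaped) {              # previous char was the escape: take literally
            $fields[-1] .= $ch;
            $escaped = 0;
        }
        elsif ($ch eq '\\') {        # escape character: remember it, do not store it
            $escaped = 1;
        }
        elsif ($ch eq $delim) {      # unescaped delimiter: start a new field
            push @fields, '';
        }
        else {
            $fields[-1] .= $ch;
        }
    }
    return @fields;
}

# Sample text: a\,b,c\\,d
my @fields = split_escaped('a\,b,c\\\\,d', ',');
print join('|', @fields), "\n";    # a,b|c\|d
```

A lone backslash at the very end of the line is silently dropped in this sketch; a real parser would have to decide whether that is an error.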