Splitting on escapable delimiter

Daryn has asked for the wisdom of the Perl Monks concerning the following question:

Re: Splitting on escapable delimiter by BrowserUk (Patriarch) on Mar 28, 2008 at 14:47 UTC
Intuitively, you want to use `split '(?<=(?:##)+)\@', $s;;` but that gets you: `[Variable length lookbehind not implemented in regex; ...` So how do you achieve a variable-length lookbehind? Here's one way: `print $s;; #@##@###@####@#####@ print for split '(?:(?<=[^#]####)|(?<=[^#]##)|(?<!#))[@]', $s;; #@## ###@#### #####@` Of course, the downside is that you need to include a case for each length of lookbehind, which quickly gets unwieldy: `print for split '(?:(?<=[^#]########)|(?<=[^#]######)|(?<=[^#]####)|(?<=[^#]##)|(?<!#))[@]', $s;;` Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re: Splitting on escapable delimiter by almut (Canon) on Mar 28, 2008 at 16:09 UTC
You could also implement a simple state machine with a binary state (escaped, unescaped), and then split on encountering '@' only when unescaped. E.g. `sub mysplit { my $string = shift; my @parts; my $part = ''; my $escaped = 0; for ($string =~ m/(.)/gs) { if ($_ eq '@') { unless ($escaped) { push @parts, $part; $part = ''; next; } } if ($_ eq "#") { $escaped ^= 1; # toggle state } else { $escaped = 0; # reset } $part .= $_; } push @parts, $part; return @parts; } my @tests = ( 'one@two@three', '## is a hash and #@ is an arobace', '#@##@###@####@#####@', ); for my $s (@tests) { print join(', ', mysplit($s) ), "\n"; }` Output: `one, two, three ## is a hash and #@ is an arobace #@##, ###@####, #####@` (Not well tested for edge cases... but you get the idea.)
Re: Splitting on escapable delimiter by ikegami (Patriarch) on Mar 28, 2008 at 15:50 UTC
`# Extract fields my @fields = /((?:[^#@]+|#.)*)/sg; # Remove separators my $ff = 0; @fields = grep $ff^=1, @fields; # Unescape s/#(.)/$1/sg for @fields;` or use Text::CSV. Updated to remove empty elements that were being placed in `@fields`.
Re^2: Splitting on escapable delimiter by Roy Johnson (Monsignor) on Mar 28, 2008 at 18:24 UTC
If you consume the separator, you don't have to filter it out. And if you put the escape regex first, you don't have to mention the # twice. `my @fields = /((?:#.|[^@])*)\@?/sg;` Caution: Contents may have been coded under pressure.
Re^3: Splitting on escapable delimiter by ikegami (Patriarch) on Mar 28, 2008 at 21:10 UTC
Nope, that returns an extra (empty) field most of the time, and there's no way to know when. For example, 'a@b' incorrectly returns 3 fields, although 'a@' correctly returns 2.
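The counterexample is easy to reproduce. A minimal sketch, assuming the inner group of the pattern carries a `*` quantifier (an assumption about the posted regex) and using a hypothetical helper named `fields` that is not part of the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the consume-the-separator pattern,
# assuming a `*` quantifier on the inner group.
sub fields {
    my ($s) = @_;
    return $s =~ /((?:#.|[^@])*)\@?/sg;
}

# 'a@b': after 'a@' and 'b', a final zero-length match at end of
# string contributes one more (empty) capture.
my @ab = fields('a@b');

# 'a@': the same zero-length match happens to produce the empty
# second field that is actually wanted.
my @a = fields('a@');

print scalar(@ab), " fields for 'a\@b'\n";
print scalar(@a),  " fields for 'a\@'\n";
```

The extra element comes from the global match succeeding once more with an empty string at the end of input, which is why the result is only right when the string ends with a separator.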
Re^4: Splitting on escapable delimiter by Roy Johnson (Monsignor) on Mar 31, 2008 at 19:55 UTC
Re^5: Splitting on escapable delimiter by ikegami (Patriarch) on Mar 31, 2008 at 21:49 UTC
Re: Splitting on escapable delimiter by Anonymous Monk on Mar 28, 2008 at 17:49 UTC
I’d try reversing the string, splitting it, then reversing all the pieces. That way you can use a variable-width look-ahead assertion instead of an (unsupported) variable-width look-behind assertion. `$_ = "#@##@###@####@#####@"; $_ = reverse; my @pieces = reverse (split /\@(?=(?:##)*(?!#))/); for (@pieces) { $_ = reverse; } print "@pieces\n";` The regex is a little hairy; it has a negative look-ahead assertion inside the positive look-ahead assertion.
Re^2: Splitting on escapable delimiter by mobiusinversion (Beadle) on Mar 28, 2008 at 22:25 UTC
I have to say, that is clever! At first I smacked my forehead that after 2 years of daily Perl programming I had never thought "Duh! Variable width lookbehind is just variable width lookahead on the reverse string!". Bravo! Unfortunately, this solution does not recover empty fields delimited in this way. For example, try the example string above with two '@'s prepended (as you would find after having delimited empty fields). See my post below for the correct way to handle this using loop-unrolling (in one regex and no lookaround!).
Re^3: Splitting on escapable delimiter by Anonymous Monk on Mar 28, 2008 at 22:57 UTC
(reads documentation for split) Ah, I need to add a -1 as a third parameter to split. Good spot.
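A minimal sketch of that fix, assuming the same reversed-string split as in the parent post. The variable names are illustrative only. After reversal the two empty leading fields become trailing ones, which `split` discards unless given a negative limit:

```perl
use strict;
use warnings;

my $s   = '@@#@##@###@####@#####@';   # two empty leading fields
my $rev = reverse $s;                 # scalar context: reverses the string

# In the reversed string, a real separator is an '@' followed by an
# even number of '#' characters.
my $sep = qr/\@(?=(?:##)*(?!#))/;

# Without a limit, trailing empty fields of the reversed string are dropped.
my @without = reverse map { scalar reverse $_ } split $sep, $rev;

# With a limit of -1 they are kept.
my @with = reverse map { scalar reverse $_ } split $sep, $rev, -1;

print scalar(@without), " fields without a limit\n";
print scalar(@with),    " fields with -1\n";
```

The pieces are still in escaped form here; unescaping would be a separate step.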
Re^4: Splitting on escapable delimiter by mobiusinversion (Beadle) on Mar 28, 2008 at 23:40 UTC
Re^5: Splitting on escapable delimiter by Anonymous Monk on Mar 29, 2008 at 01:30 UTC
Some notes below your chosen depth have not been shown here
Re: Splitting on escapable delimiter by apl (Monsignor) on Mar 28, 2008 at 15:37 UTC
I'm not as clever as some, so I did it the long way... `#!/usr/bin/perl use strict; use warnings; while( my $ln = <DATA> ) { chomp( $ln ); my @flds = split( '@', $ln ); foreach ( @flds ) { if ( /(#*)/ ) { print "/$ln/ --> /$1/\n" if ( length( $1 ) % 2 ) == 1; } } } __DATA__ #@##@###@####@#####@ #@## ###@#### #####@` This results in: `/#@##@###@####@#####@/ --> /#/ /#@##@###@####@#####@/ --> /###/ /#@##@###@####@#####@/ --> /#####/ /#@##/ --> /#/ /###@####/ --> /###/ /#####@/ --> /#####/` Revised: This won't work for a string with no pound signs. You'd need to modify the length test to include `|| length( $1 ) == 0`.
Re: Splitting on escapable delimiter by mobiusinversion (Beadle) on Mar 28, 2008 at 22:18 UTC
You can do it all in one regex: `sub unroll { my @x = $_[0] =~ /(?:^|@)((?:##|#@|[^#@])*)/g; for(@x){ $_ =~ s/##/#/g; $_ =~ s/#@/@/g; } @x }` so that: `unroll("#@##@###@####@#####@")` produces the following fully unescaped list: `'@#', '#@##', '##@',` This approach also has the benefit of handling empty sequences correctly, e.g.: `unroll("@@#@##@###@####@#####@")` produces: `'', '', '@#', '#@##', '##@'` as it probably should. In general, this technique is called 'unrolling the loop' and can be found in the owl book (Mastering Regular Expressions). To escape and join data in your way, you could use the following: `sub my_escape { my $x = shift; $x =~ s/#/##/g; $x =~ s/@/#@/g; $x } sub my_join { join('@',@_) }` Apply my_escape to each element of the list and then call my_join on it, so that: `my_join(map{my_escape($_)}('','','@#','#@##','##@',))` produces: `'@@#@##@###@####@#####@'`
Re: Splitting on escapable delimiter by jfraire (Beadle) on Mar 28, 2008 at 17:57 UTC
Well, here is my try (which does not work!). It is possible to use `reverse` and then lookahead assertions: `use strict; use warnings; # use re 'debug'; my $s = '#@##@###@####@#####@'; my @list = reverse split '@(?=(##)*[^#])', reverse $s; print scalar reverse $_, "\n" for @list;` I see the regexp only matches at the good @ signs, but I am getting a couple of `##` in the output that I can't explain. I have tried with `use re 'debug'` and so I know the regexp is matching where I intended. `Output: #@## ## ###@#### ## #####@` Julio
Re^2: Splitting on escapable delimiter by BrowserUk (Patriarch) on Mar 28, 2008 at 18:05 UTC
Switch your capturing parens for non-capturing ones: `'@(?=(?:##)*[^#])' #.....^^`
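The stray `##` entries come from `split` itself: when the pattern contains capturing groups, the captured text is returned as extra list elements between the fields. A small sketch of the difference on the same test string (variable names are illustrative):

```perl
use strict;
use warnings;

my $rev = reverse '#@##@###@####@#####@';

# Capturing group: the last '##' matched by (##)* is returned as an
# extra element after each split point.
my @capturing = split /\@(?=(##)*[^#])/, $rev;

# Non-capturing group: only the fields come back.
my @non_capturing = split /\@(?=(?:##)*[^#])/, $rev;

# Undo the reversal: reverse the list order and each piece.
my @fixed = reverse map { scalar reverse $_ } @non_capturing;

print scalar(@capturing),     " elements with capturing parens\n";
print scalar(@non_capturing), " elements with non-capturing parens\n";
print join(' ', @fixed), "\n";
```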
Re^2: Splitting on escapable delimiter by Anonymous Monk on Mar 28, 2008 at 20:05 UTC
I suspect this version will fail on `"####@####"`, because your regex looks for a character that isn’t a `'#'`, erroneously failing at end of string.
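That suspicion can be checked directly. In this sketch the test string is a palindrome, so the reversal step is skipped; the two patterns differ only in how they end the run of `#` characters:

```perl
use strict;
use warnings;

my $s = '####@####';   # reads the same reversed, so no reverse is needed

# [^#] must consume a character, so it cannot match at end of string.
my @char_class = split /\@(?=(?:##)*[^#])/, $s;

# (?!#) is zero-width and also succeeds at end of string.
my @neg_look = split /\@(?=(?:##)*(?!#))/, $s;

print scalar(@char_class), " piece(s) with [^#]\n";
print scalar(@neg_look),   " piece(s) with (?!#)\n";
```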
Re: Splitting on escapable delimiter by Daryn (Sexton) on Mar 28, 2008 at 20:08 UTC
Thank you all for your time and answers. I did use a state machine in an older, similar problem where I was reading the text instead of processing strings. The strings I deal with can get to the multi-megabyte size range, so reversing them is not really attractive. I'll probably go with Roy Johnson's very neat solution unless benchmarking shows that a finite state machine beats the regexp engine (which I doubt). Again, thanks to all for an instructive thread.
Re: Splitting on escapable delimiter by wade (Pilgrim) on Mar 28, 2008 at 15:26 UTC
So, this is more of a follow-up question than an answer. I tried: `use strict; use warnings; { my $var1 = "####@#####@##@###@######@###"; print "START '$var1'\n"; my @foo = split /(?<=[^#]((##)+))[@]/, $var1; foreach (@foo) { print "HERE: '$_'\n"; } }` But I got the error message: "Variable length lookbehind not implemented in regex;". Is this an ActivePerl thing (that's what I'm using), a Perl v5.8.8 thing, or did I do something boneheaded and just didn't see it? -- Wade
Re^2: Splitting on escapable delimiter by BrowserUk (Patriarch) on Mar 28, 2008 at 15:31 UTC
Did you notice the bit highlighted below in the post where you got that regex from? Intuitively, you want to use split '(?<=(?:##)+)\@', $s;; but that gets you: [Variable length lookbehind not implemented in regex; ...
Re^3: Splitting on escapable delimiter by wade (Pilgrim) on Mar 28, 2008 at 16:11 UTC
Doh! No, like an idiot, I looked at the problem and thought "I can solve that!". Thanks! -- Wade