Re: split on unescaped delimiters
by Abigail-II (Bishop) on Jan 08, 2004 at 10:09 UTC
Assuming backslashes themselves could be escaped, you want
to split on colons which are preceded by an even number
of backslashes. You can't use lookbehind in this case, because
you can't do variable-length lookbehind.
But you can reverse the string, and look for an even number
of trailing backslashes. After splitting, you need to do some
reversing again:
reverse map {scalar reverse} split /:(?=(?:\\\\)*(?!\\))/ => reverse $string;
Abigail
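
A worked run of the reverse-and-split idea above, as a minimal sketch. The sample string and variable names are made up for illustration, and the explicit scalar calls are only there to make the string reversal obvious:

```perl
use strict;
use warnings;

# Hypothetical input: "\:" is an escaped colon, "\\" an escaped backslash.
my $string = 'a\:b:c\\\\:d';    # the actual text is  a\:b:c\\:d

# Reverse the string so the escapes follow each colon, split on colons
# followed by an even number of backslashes, then undo both reversals.
my @parts = reverse
            map { scalar reverse $_ }
            split /:(?=(?:\\\\)*(?!\\))/, scalar reverse $string;

print join('|', @parts), "\n";    # prints a\:b|c\\|d
```

Because the split runs on the reversed string, split's default of dropping trailing empty fields would drop leading empty fields of the original here; passing -1 as a third argument keeps them.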
Re: split on unescaped delimiters
by Roger (Parson) on Jan 08, 2004 at 12:02 UTC
TIMTOWTDI. A split (with capture) example:
use strict;
use warnings;
use Data::Dumper;
# delimiter is X
my $escaped_str = 'Xa\Xdc\\bXc\\\\XdXe';
my @a = ();
my $i = 0;
foreach (split /(\\.)|X/, $escaped_str) {
defined $_ ? do { $a[$i] .= $_ } : do {$i++ }
}
print Dumper(\@a);
and the output is as expected -
$VAR1 = [
'',
'a\\Xdc\\b',
'c\\\\',
'd',
'e'
];
I like that. Trying to come up with a map() version, but
my $scratch = '';
my @a = (map(defined() ? ($scratch.=$_)[()]
: substr($scratch,0,length($scratch),''),
split /(\\.)|X/, $escaped_str),
length($scratch) ? $scratch : ());
is the best I can do. And that's way too ugly.
Maybe something based on @a = @{List::Util::reduce { ... } [], split... };
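
One way the reduce idea could look, as a rough sketch rather than the version the poster had in mind. It assumes the accumulator starts with one empty field instead of an empty list, and passes -1 to split so trailing empty fields survive:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my $escaped_str = 'Xa\Xdc\\bXc\\\\XdXe';    # same sample as above; delimiter is X

# Start from a list holding one empty field. A defined piece (plain text or
# an escaped pair) is appended to the last field; an undef piece marks a
# delimiter and opens a new field. The block returns the accumulator.
my $fields = reduce {
    defined $b ? ($a->[-1] .= $b) : push(@$a, '');
    $a;
} [''], split /(\\.)|X/, $escaped_str, -1;

print join('|', @$fields), "\n";    # prints |a\Xdc\b|c\\|d|e
```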
Thank you for this nice, innovative approach. One detail: do not forget to pass -1 as the third parameter to split, or else trailing empty fields will be discarded, as explained in perlfunc for split.
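
A small illustration of that detail, using a made-up comma-separated record rather than the data from this thread:

```perl
use strict;
use warnings;

my $line = 'a,b,,';    # hypothetical record with two empty trailing fields

my @default  = split /,/, $line;        # trailing empty fields are dropped
my @keep_all = split /,/, $line, -1;    # a negative limit keeps them

print scalar(@default),  "\n";    # prints 2
print scalar(@keep_all), "\n";    # prints 4
```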
Re: split on unescaped delimiters
by Abigail-II (Bishop) on Jan 08, 2004 at 10:16 UTC
Instead of splitting, you can also extract what you want:
my @parts = $string =~ /([^:\\]*(?:\\.[^:\\]*)*)(?(?{length $^N})|(?!))/g;
Abigail
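
For comparison, a plainer take on extracting rather than splitting, sketched as a \G tokenizer loop. This is a swapped-in technique, not the regex above: it avoids the embedded code assertion by making every match non-empty. The sample string and names are hypothetical:

```perl
use strict;
use warnings;

my $string = 'a\:b:c\\\\:d';    # hypothetical input; the text is  a\:b:c\\:d

my @parts = ('');
# Each match is either one unescaped colon ($1) or a non-empty run of
# ordinary characters and backslash-escaped pairs ($2).
while ($string =~ /\G(?:(:)|((?:[^:\\]|\\.)+))/gs) {
    if (defined $1) { push @parts, '' }      # delimiter: start a new field
    else            { $parts[-1] .= $2 }     # content: extend the current field
}

print join('|', @parts), "\n";    # prints a\:b|c\\|d
```

A dangling backslash at the very end of the input matches neither alternative, so the loop simply stops there; whether that should be an error is left open.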
Re: split on unescaped delimiters
by bsb (Priest) on Jan 08, 2004 at 09:42 UTC
Here's my working solution, the clumsy one
# in the real code '.' is my delimiter
# Using 'X' above since it's not a regex metachar
my $first = $1
if $name =~ m/^ ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx;
my (@remainder) =
$name =~ m/\G (?:\.) ( [^\\.]* (?: \\(?:.|$) [^\\.]* )* ) /gx;
Re: split on unescaped delimiters
by davido (Cardinal) on Jan 08, 2004 at 09:42 UTC
Here is a use of a negative lookbehind zero-width assertion to prevent a split on a comma if it is preceded by a backslash.
my @array = split /(?<!\\),/, $string;
It looks like your code is using a colon as the delimiter. This solution can be easily adapted to whatever delimiter or escape sequence you desire.
For more elaborate things, like balanced quotes, you're better off going to the Text::Balanced module, or maybe the DBD::CSV module.
Update: To answer the escaped escape problem that you've mentioned, you could resort to alternation within the split:
my @array = split/(?:\\\\,)|(?:(?<!\\),)/, $string;
You really end up with some ugly leaning toothpicks!
But the backslash might be backslashed
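
A quick check of that objection against the simple lookbehind split, with a made-up input in which the backslash before the comma is itself escaped:

```perl
use strict;
use warnings;

# Hypothetical input: an escaped backslash followed by a real delimiter.
my $string = 'a\\\\,b';    # the text is  a\\,b

my @array = split /(?<!\\),/, $string;

# The comma is preceded by a backslash, so the lookbehind rejects it even
# though that backslash is itself escaped; nothing is split.
print scalar(@array), "\n";    # prints 1
```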
I'm not having much luck with the alternation code.
I think it'd have problems with 5 or 6 backslashes.
What's more, since it matches the backslashes,
it'd trim them off
the end of the resulting split strings.
I'd better try that out and update...
DB<9> x @a= split /(?:\\\\X)|(?:(?<!\\)X)/, $escaped_str
0 ''
1 'a\\X\\\\b'
2 'c'
3 'd'
DB<10> p $escaped_str
Xa\X\\bXc\\XdX
Yes, the c gets stripped of its trailing backslashes
Re: split on unescaped delimiters
by delirium (Chaplain) on Jan 08, 2004 at 13:14 UTC
This is similar to Roger's.
$escaped_str = <<'EOT';
Xa\X\\bXc\\XdX
EOT
chomp $escaped_str;
my @array = ();
my $escaped = 0;
my $count = 0;
for (split //, $escaped_str) {
if (!$escaped && $_ eq 'X') { $count++; next; }
$array[$count] .= $_;
$escaped = ($_ eq "\\" && !$escaped) ? 1 : 0;
}
print "(",join(':',@array),")";
Re: split on unescaped delimiters
by cLive ;-) (Prior) on Jan 08, 2004 at 12:54 UTC
The Art of Unix Programming brainwashed me:
In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.
Although after this discussion I think placing the escape
after the character being escaped might be
better for regexes
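
The "single action" rule from the quoted passage (treat the character after an escape as a literal) can be sketched as one substitution applied to fields that have already been split. The field values and names below are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical fields, already split on unescaped delimiters.
my @fields = ('a\:b', 'c\\\\', 'd');    # a\:b   c\\   d

# One rule: a backslash makes the next character literal. /s lets "."
# match an escaped newline too.
my @plain = map { (my $f = $_) =~ s/\\(.)/$1/gs; $f } @fields;

print join('|', @plain), "\n";    # prints a:b|c\|d
```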