Re: Help with regex for complicated key=value string
by ikegami (Patriarch) on Oct 29, 2008 at 18:51 UTC
use strict;
use warnings;

# Strip surrounding double quotes, or else unescape \\ and \, in a bare value.
sub dequote {
    my ($s) = @_;
    if ($s =~ /^"/) {
        $s =~ s/^"//;
        $s =~ s/"$//;
    } else {
        $s =~ s/\\([\\,])/$1/g;
    }
    return $s;
}

# Parse "key=value,key=value,..." into a list of [ key, value ] pairs.
# The key is undef for a key-less value.
sub parse {
    my @terms;
    for ($_[0]) {
        my $key = /\G ( [^=,]+ ) = /xgc ? $1 : undef;
        my $val = /\G
            (
                " [^"]+ "
            |
                (?: [^\\,] | \\[\\,] )*
            )
        /xgc && dequote("$1");
        push @terms, [ $key, $val ];
        /\G $ /xgc
            and last;
        /\G , /xgc
            or die("Expected comma at pos ", pos(), "\n");
        redo;
    }
    return @terms;
}

# ---
use Data::Dumper qw( Dumper );

my $str = join ',', <<'__EOI__' =~ /.+/g;
key=value\,value
key=\\
key="value,value"
value
__EOI__

my @terms = parse($str);
local $Data::Dumper::Indent = 1;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Useqq = 1;
print(Dumper(\@terms), "\n");
The above assumes backslashes can also be escaped.
The above otherwise assumes everything you didn't specify is disallowed.
The following might be better to get the value, but differs from what you asked:
sub dequote {
    my ($s) = @_;
    if ($s =~ /^"/) {
        $s =~ s/^"//;
        $s =~ s/"$//;
    }
    $s =~ s/\\(.)/$1/sg;
    return $s;
}
my $val = /\G
    (
        " (?: [^\\"] | \\. )* "
    |
        (?: [^\\,] | \\. )*
    )
/xsgc && dequote("$1");
Update: Fixed to use double quotes instead of single quotes. Misread.
Update: Removed "or die("Expected value at pos ", pos(), "\n")" after the value extraction match. It would never be executed, since the pattern can match zero characters.
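To illustrate that last update: a pattern whose alternatives can all match the empty string always succeeds at \G, so an "or die" attached to it is dead code. A minimal sketch, using the bare-value half of the pattern above; the helper name value_at and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: try the bare-value pattern at offset $pos of $str
# and return (success flag, captured text).
sub value_at {
    my ($str, $pos) = @_;
    pos($str) = $pos;
    my $ok       = $str =~ /\G ( (?: [^\\,] | \\[\\,] )* ) /xgc;
    my $captured = $ok ? "$1" : undef;
    return ($ok ? 1 : 0, $captured);
}

# Offset 4 is right after "key=", where the next character is a comma.
my ($ok1, $val1) = value_at('key=,next', 4);
my ($ok2, $val2) = value_at('abc,def', 0);

print "ok=$ok1 captured='$val1'\n";   # prints: ok=1 captured=''
print "ok=$ok2 captured='$val2'\n";   # prints: ok=1 captured='abc'
```

Even with nothing to capture, the match reports success, which is why an empty value has to be detected by inspecting the captured text rather than by testing the match result.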
Thanks ikegami,
using \G is a good idea. Thanks for all the effort.
I also coded a simple parser in the meantime, which seems to do the job for me but without strict error checking like yours. It also only splits the string on the commas, not yet into keys and values. Because only alphanumeric keys are allowed anyway (I didn't mention that before, sorry), there is no need to look for escaped '=', etc.; i.e., everything that doesn't start with /\s*[\w_]+=/ is taken as a single key-less value.
#!/usr/bin/perl
use strict;
use warnings;

sub cskwparser {
    my $string = shift;
    my $esc    = 0;
    my $quote  = 0;
    my @args;
    my $narg = 0;
  CHAR:
    foreach my $char ( split //, $string ) {
        if ($esc) {
            $esc = 0;
        }
        elsif ( $char eq '"' ) {
            $quote = !$quote;
        }
        elsif ( $char eq '\\' ) {
            $esc = 1;
        }
        elsif ( $char eq ',' and not $quote ) {
            $narg++;
            next CHAR;
        }
        $args[$narg] .= $char;
    }
    return @args;
}

# Test loop:
local $, = '|';
while (<>) {
    print cskwparser $_;
}
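The parser above only separates the fields. Splitting each field into key and value with the rule just described (a field is keyed only if it starts with word characters followed by '=') might look like the sketch below. split_field is a hypothetical helper, not part of the posted code, and it leaves quotes and backslash escapes in the value untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: split one comma-separated field into (key, value).
# A field counts as keyed only if it starts with optional whitespace,
# word characters, and '='; anything else is a key-less value.
sub split_field {
    my ($field) = @_;
    if ($field =~ /^\s*(\w+)=(.*)\z/s) {
        return ($1, $2);
    }
    return (undef, $field);
}

for my $field ('key=value', 'key="a=b"', 'just a value', '"x=y"') {
    my ($key, $val) = split_field($field);
    printf "%-15s key=%s value=%s\n",
        $field, defined $key ? $key : '(none)', $val;
}
```

Note that \w already includes the underscore, so [\w_] and \w are equivalent here.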
$esc is never used in your code, so that means "\\," is not handled yet.
but without doing strict error-checking like yours
Mine isn't really strict. It only checks for 'key=,...' (key with no value) and 'v"alu"e' (invalid quoting).
Because only alphanumeric keys are allowed anyway (I didn't mention that before, sorry), there is no need to look for escaped '=', etc.; i.e., everything that doesn't start with /\s*[\w_]+=/ is taken as a single key-less value.
That's the same strategy I used, but I didn't limit keys to alphanumeric characters. To do so, change
my $key = /\G ( [^=,]+ ) = /xgc ? $1 : undef;
to
my $key = /\G ( [a-zA-Z0-9]+ ) = /xgc ? $1 : undef;
Re: Help with regex for complicated key=value string
by apl (Monsignor) on Oct 29, 2008 at 18:15 UTC
I'm a coward; I'd write a parser rather than try to fit the logic into a regex.
Yeah, a parser was also one of my first thoughts when I was confronted with this problem. It would definitely be a cleaner solution. But I still have some hope of finding a regex.
Re: Help with regex for complicated key=value string
by mr_mischief (Monsignor) on Oct 29, 2008 at 18:41 UTC
If this is an ongoing project that will need quite a bit of maintenance, I'd consider writing the lexer with lex and the parser with yacc. That'd give you a C parser you could then use via a wrapper from Perl or another wrapper from Python. This solution of course assumes a few things.
Another thing I'd consider is that, unless I had some specific requirement for two complete single-language solutions, I might write the parser in Perl only. The parser could read and parse the input format and write out an intermediate representation suited to the application. That format could be made trivial for both the Python and Perl back-ends to access. Perhaps a file delimited with colons, tabs or nulls would work. Maybe a properly formatted CSV file would be suitable. Perhaps XML, YAML (does Python do YAML?), JSON, or maybe even a database would be good.
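As a rough sketch of the intermediate-representation idea, the Perl side could serialize the parsed terms as JSON, which Python can read with its standard json module. The sample data is made up, the [ key, value ] pair layout is just one possible choice (it mirrors what the parse() routine posted elsewhere in this thread returns), and JSON::PP is used only because it ships with Perl since 5.14. terms_to_json is a hypothetical helper name.

```perl
use strict;
use warnings;
use JSON::PP ();

# Made-up sample of already-parsed terms: [ key, value ] pairs,
# with undef as the key of a key-less value.
my @terms = (
    [ 'key', 'value,value' ],
    [ 'key', '\\' ],
    [ undef, 'value' ],
);

# Hypothetical helper: serialize the pairs to one line of JSON.
# undef is written as JSON null.
sub terms_to_json {
    my (@t) = @_;
    return JSON::PP->new->canonical->encode(\@t);
}

print terms_to_json(@terms), "\n";
```

The back-ends then never see the original quoting and escaping rules; only the one Perl parser has to get them right.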
Re: Help with regex for complicated key=value string
by ikegami (Patriarch) on Oct 29, 2008 at 18:32 UTC
Points 1.-3. alone can be simply done with some negative look-behind,
Point 2 can't be done using just a look-behind if you can escape backslashes as well as commas.
key1=\\\\\\value\\\\\\,key2=///value///
Point 2 can be done with:
split /(?:^|[^\\])(?:\\\\)*\K,/ # 5.10 required
But that's just point 2, it doesn't consider commas inside quotes.
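A small sketch of how that split behaves: a comma is a separator only when it is preceded by an even number (possibly zero) of backslashes. The wrapper name split_unescaped and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical wrapper around the split shown above: split on commas that
# are preceded by an even number (possibly zero) of backslashes.
sub split_unescaped {
    my ($str) = @_;
    return split /(?:^|[^\\])(?:\\\\)*\K,/, $str;
}

# In single quotes, \, stays as is and \\\\ is two literal backslashes,
# so the string is:  key1=a\,b,key2=c\\,key3=d
my @fields = split_unescaped('key1=a\,b,key2=c\\\\,key3=d');
print "$_\n" for @fields;
```

The escaped comma in the first field is kept, while the comma after the two backslashes (an escaped backslash) splits, so this should give three fields.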
split /(?:$key_pat=)?$val_pat\K,/
Using split means everything has to be parsed twice. Once to find the commas on which to split and once to find the composing terms.
Update: Simplified pattern slightly.
Re: Help with regex for complicated key=value string
by kvale (Monsignor) on Oct 29, 2008 at 18:49 UTC
I would split on everything that is not a bare comma (untested):
my $q_str      = qr("[^"]*");    # quoted string
my $unq_nc_str = qr([^",\\]+);   # run with no quotes, commas, or backslashes
my $esc_chr    = qr(\\.);        # backslash-escaped character, such as \,
my @bits = grep { length($_) && $_ ne ',' }
    split /((?:$q_str|$unq_nc_str|$esc_chr)+)/, $line;
By grouping the split regex, we get split bits as well as bare commas. Then we filter bare commas. This assumes that quotes are not nested.
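For anyone unfamiliar with that split behaviour, here is a minimal illustration on a simpler pattern (the sample string is made up): when the pattern contains a capturing group, split returns the captured separators along with the text between them. In the code above the roles are reversed, so the "separators" are the terms and the text between them is the bare commas.

```perl
use strict;
use warnings;

# Without a capturing group, the separators are discarded.
my @plain = split /,/, 'a,b,c';       # ('a', 'b', 'c')

# With a capturing group, each captured separator is returned too.
my @captured = split /(,)/, 'a,b,c';  # ('a', ',', 'b', ',', 'c')

print scalar(@plain), " vs ", scalar(@captured), " elements\n";   # 3 vs 5 elements
```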
Re: Help with regex for complicated key=value string
by JavaFan (Canon) on Oct 29, 2008 at 18:19 UTC
There's no easy way to do this with split. Split looks "locally", but you need to look much more globally - in fact, you may have to look at the entire string to determine whether there's a surrounding quote pair or not.
What you have sounds very much like a CSV file. If you can't (or don't want to) use a readily available CPAN module, I suggest writing a small parser: either from scratch, with Parse::RecDescent, or with a nifty 5.10 regexp.
Thanks, but you missed the part of my OP where I say that I also need that solution in Python. I tried to work with CSV modules, but they only support fully quoted values, not partially quoted ones, e.g. only "key=value","key=value", not key="value",key=value1:"value2":value3.
Thanks, but you missed the part of my OP where I say that I also need that solution in Python.
Frankly, I can't understand your insistence on this point: why does the fact that you must solve the problem in both Perl and Python imply that you have to use a (pre-5.10) regex? Because it's the lowest common denominator, maybe? If so, then I see the point: code once, use twice. But then I would consider yours bad laziness, since coding it twice, each time with the best tools the respective language provides, may well turn out to be less work overall than the other way round. Or else I haven't understood your concerns at all.
Well, then I suggest you write a parser in Python.