Help with regex for complicated key=value string

mscharrer has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help with regex for complicated key=value string by ikegami (Patriarch) on Oct 29, 2008 at 18:51 UTC
```perl
use strict;
use warnings;

# Strip surrounding quotes, or unescape \\ and \, in an unquoted value.
sub dequote {
    my ($s) = @_;
    if ($s =~ /^"/) {
        $s =~ s/^"//;
        $s =~ s/"$//;
    } else {
        $s =~ s/\\([\\,])/$1/g;
    }
    return $s;
}

# Tokenize the string with \G so each match resumes where the last one ended.
sub parse {
    my @terms;
    for ($_[0]) {
        my $key = /\G ( [^=,]+ ) = /xgc ? $1 : undef;
        my $val = /\G ( " [^"]+ " | (?: [^\\,] | \\[\\,] )* ) /xgc && dequote("$1");
        push @terms, [ $key, $val ];
        /\G $ /xgc and last;
        /\G , /xgc or die("Expected comma at pos ", pos(), "\n");
        redo;
    }
    return @terms;
}

# ---

use Data::Dumper qw( Dumper );

my $str = join ',', <<'__EOI__' =~ /.+/g;
key=value\,value
key=\\
key="value,value"
value
__EOI__

my @terms = parse($str);

local $Data::Dumper::Indent = 1;
local $Data::Dumper::Terse  = 1;
local $Data::Dumper::Useqq  = 1;
print(Dumper(\@terms), "\n");
```

The above assumes backslashes can also be escaped. Otherwise, it assumes that everything you didn't specify is disallowed.

The following might be a better way to get the value, but it differs from what you asked:

```perl
sub dequote {
    my ($s) = @_;
    if ($s =~ /^"/) {
        $s =~ s/^"//;
        $s =~ s/"$//;
    }
    $s =~ s/\\(.)/$1/sg;
    return $s;
}

my $val = /\G ( " (?: [^\\"] | \\. )* " | (?: [^\\,] | \\. )* ) /xsgc && dequote("$1");
```

Update: Fixed to use double quotes instead of single quotes. I had misread.

Update: Removed "`or die("Expected value at pos ", pos(), "\n")`" after the value extraction match. It would never be executed, since the pattern can match zero characters.
Re^2: Help with regex for complicated key=value string by mscharrer (Hermit) on Oct 29, 2008 at 19:22 UTC
Thanks ikegami, using `\G` is a good idea. Thanks for all the effort.

I also coded a simple parser in the meantime, which seems to do the job for me but without doing strict error-checking like yours. It also only splits the string on the commas, not yet the keys and values. Because only alphanumeric keys are allowed anyway (I didn't mention that before, sorry), there is no need to look for escaped '=', etc., i.e. everything which doesn't start with `/\s*[\w_]+=/` is taken as a single key-less value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub cskwparser {
    my $string = shift;
    my $esc    = 0;    # true if the previous character was a backslash
    my $quote  = 0;    # true while inside a double-quoted section
    my @args;
    my $narg = 0;

  CHAR:
    foreach my $char ( split //, $string ) {
        if ($esc) {
            $esc = 0;
        }
        elsif ( $char eq '"' ) {
            $quote = !$quote;
        }
        elsif ( $char eq '\\' ) {
            $esc = 1;
        }
        elsif ( $char eq ',' and not $quote ) {
            # Bare comma: start the next argument and drop the comma itself.
            $narg++;
            next CHAR;
        }
        $args[$narg] .= $char;
    }

    return @args;
}

# Test loop:
local $, = '|';
while (<>) {
    print cskwparser $_;
}
```
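For the step the parser above does not yet cover, separating each term into key and value, a minimal sketch might look like the following. The helper name `split_term` is hypothetical. It assumes the rule stated in the post: a term that does not start with optional whitespace, word characters and `=` is a key-less value. Removing quotes and escapes is deliberately left out.

```perl
use strict;
use warnings;
use 5.010;    # for the defined-or operator (//)

# Hypothetical helper: split one already-separated term into (key, value).
# A term without a leading "word=" prefix is returned with an undefined key.
sub split_term {
    my ($term) = @_;
    if ( $term =~ /^\s*(\w+)=(.*)\z/s ) {
        return ( $1, $2 );
    }
    return ( undef, $term );
}

for my $term ( 'key=value', 'key="a,b"', 'just a value', 'x=y=z' ) {
    my ( $key, $value ) = split_term($term);
    printf "%-14s key=%s value=%s\n", $term, $key // '(none)', $value;
}
```

Only the first `=` ends the key, so `x=y=z` is read as key `x` with value `y=z`. Note that `[\w_]` in the post is equivalent to plain `\w`, which already includes the underscore.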
Re^3: Help with regex for complicated key=value string by ikegami (Patriarch) on Oct 29, 2008 at 19:36 UTC
~~`$esc` is never used in your code, so that means "`\\,`" is not handled yet.~~

> but without doing strict error-checking like yours

It's not really strict. It only checks for ~~'`key=,...`' (key with no value) and~~ '`v"alu"e`' (invalid quoting).

> Because only alphanumeric keys are allowed anyway (I didn't mention that before, sorry), there is no need to look for escaped '=', etc., i.e. everything which doesn't start with `/\s*[\w_]+=/` is taken as a single key-less value.

That's the same strategy I used, but I didn't limit keys to alphanumeric characters. To do so, change `/\G ( [^=,]+ ) = /xgc && ( $key = $1 );` to `/\G ( [a-zA-Z0-9]+ ) = /xgc && ( $key = $1 );`
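To illustrate the effect of the stricter key pattern, here is a reduced sketch of the `\G` with `/gc` approach. The function name `tokenize` is made up for this example, values are simplified to "anything up to the next comma" (no quoting or escaping), and the end of input is detected with `pos` rather than with a pattern.

```perl
use strict;
use warnings;

# Hypothetical, simplified tokenizer: returns a list of [key, value] pairs.
# Keys are restricted to [a-zA-Z0-9]+; anything else becomes a key-less value.
sub tokenize {
    my ($str) = @_;
    my @pairs;
    pos($str) = 0;
    while (1) {
        # With /c, a failed match leaves pos() untouched, so the value
        # pattern is tried at the same position.
        my $key = $str =~ /\G ( [a-zA-Z0-9]+ ) = /xgc ? $1 : undef;

        # The value pattern can match the empty string, so it always succeeds.
        $str =~ /\G ( [^,]* ) /xgc;
        push @pairs, [ $key, $1 ];

        last if pos($str) == length $str;
        $str =~ /\G , /xgc
            or die "Expected comma at pos ", pos($str), "\n";
    }
    return @pairs;
}

for my $pair ( tokenize('a=1,b-c=2') ) {
    my ( $key, $value ) = @$pair;
    print defined $key ? "key '$key'" : "no key", ", value '$value'\n";
}
```

Because `b-c` contains a character outside `[a-zA-Z0-9]`, the key match fails for the second term and the whole of `b-c=2` is taken as a value without a key.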
Re^4: Help with regex for complicated key=value string by mscharrer (Hermit) on Oct 29, 2008 at 19:43 UTC
Re^5: Help with regex for complicated key=value string by ikegami (Patriarch) on Oct 29, 2008 at 19:45 UTC
Re: Help with regex for complicated key=value string by apl (Monsignor) on Oct 29, 2008 at 18:15 UTC
I'm a coward; I'd write a parser rather than try to fit the logic into a regex.
Re^2: Help with regex for complicated key=value string by mscharrer (Hermit) on Oct 29, 2008 at 18:20 UTC
Yeah, a parser was also one of my first thoughts when I was confronted with this problem. It would definitely be a cleaner solution. But I still have some hope of finding a regex.
Re^3: Help with regex for complicated key=value string by JavaFan (Canon) on Oct 29, 2008 at 18:31 UTC
Well, I know that I could write a regexp, but if I were to spend the half hour to do so, I'm afraid your reply would be "but Python doesn't support 5.10 regexes yet". Frankly, I think you're better off asking Python questions in a Python forum.
Re: Help with regex for complicated key=value string by mr_mischief (Monsignor) on Oct 29, 2008 at 18:41 UTC
If this is an ongoing project that will need quite a bit of maintenance, I'd consider writing the lexer with lex and the parser with yacc. That'd give you a C parser you could then use via a wrapper from Perl or another wrapper from Python. This solution of course assumes a few things.

Another thing I'd consider is that unless I had some specific requirement for two complete single-language solutions, I might write the parser in only Perl. The parser could read and parse the input format and write out an intermediate representation suited to the application. That format could be made to be trivial for both the Python and Perl back-ends to access. Perhaps a file delimited with colons, tabs or nulls would work. Maybe a properly formatted CSV file would be suitable. Perhaps XML, YAML (does Python do YAML?), JSON, or maybe even a database would be good.
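As a rough sketch of the intermediate-representation idea, the Perl side could dump its parsed terms as JSON. Everything specific here is an assumption for illustration: the choice of JSON, the `[key, value]` shape of `@terms`, and the field names `key` and `value`. `JSON::PP` is used because it ships with Perl from 5.14 onwards.

```perl
use strict;
use warnings;
use JSON::PP;    # core module since Perl 5.14

# Hypothetical parser output: [key, value] pairs, with an undefined key
# for values that had no key.
my @terms = ( [ 'key', 'value,value' ], [ undef, 'value' ] );

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode(
    [ map { +{ key => $_->[0], value => $_->[1] } } @terms ]
);

print "$json\n";
```

An undefined key is written as JSON `null`, which a Python consumer reading the file with its standard `json` module would see as `None`.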
Re: Help with regex for complicated key=value string by ikegami (Patriarch) on Oct 29, 2008 at 18:32 UTC
> Points 1.-3. alone can be simply done with some negative look-behind,

Point 2 can't be done using just a look-behind if you can escape backslashes as well as commas.

`key1=\\\\\\value\\\\\\,key2=///value///`
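A small sketch of the failure case, assuming (as in the post) that a backslash can itself be escaped. The sample data is invented for the demonstration.

```perl
use strict;
use warnings;

# In the data, key1's value ends with an escaped backslash (two backslash
# characters), so the comma that follows is a genuine separator.
my $str = 'key1=a\\\\,key2=b';    # data: key1=a\\,key2=b

# Naive approach: split on any comma not preceded by a backslash.
my @naive = split /(?<!\\),/, $str;

print scalar(@naive), " term(s): @naive\n";
```

The look-behind only sees the single character before the comma, so it cannot tell an escaping backslash from an escaped one, and the string is returned as a single term instead of two.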
Re^2: Help with regex for complicated key=value string by JavaFan (Canon) on Oct 29, 2008 at 18:52 UTC
Point 2 can be done with: `split /(?:^|[^\\])(?:\\\\)*\K,/ # 5.10 required` But that's just point 2; it doesn't consider commas inside quotes.
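A quick way to try the pattern above on invented sample data. The sketch covers only escapes, not quotes, as the post says.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Data: an escaped backslash before the first comma, an escaped comma
# inside key2's value, and a plain separator before key3.
my $str = 'key1=a\\\\,key2=b\\,c,key3=d';    # data: key1=a\\,key2=b\,c,key3=d

# A comma separates terms only when it follows an even number of backslashes.
my @parts = split /(?:^|[^\\])(?:\\\\)*\K,/, $str;

say for @parts;
```

The `\K` keeps everything matched before the comma out of the separator, so the backslashes stay with the term they belong to and only the comma itself is removed.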
Re^3: Help with regex for complicated key=value string by ikegami (Patriarch) on Oct 29, 2008 at 19:09 UTC
That would be: `split /(?:$key_pat=)?$val_pat\K,/`

Using `split` means everything has to be parsed twice: once to find the commas on which to split, and once to find the constituent terms.

Update: Simplified pattern slightly.
Re: Help with regex for complicated key=value string by kvale (Monsignor) on Oct 29, 2008 at 18:49 UTC
I would split on everything that is not a bare comma (untested):

```perl
my $q_str      = qr("[^"]*");    # quoted string
my $unq_nc_str = qr([^",]+);     # string with no quotes or commas
my $esc_com    = qr(\\\\,);      # escaped comma
my @bits = grep { $_ ne ',' } split /((?:$q_str|$unq_nc_str|$esc_com)+)/, $line;
```

By grouping the split regex, we get split bits as well as bare commas. Then we filter bare commas. This assumes that quotes are not nested.

-Mark
Re: Help with regex for complicated key=value string by JavaFan (Canon) on Oct 29, 2008 at 18:19 UTC
There's no easy way to do this with split. Split looks "locally", but you need to look much more globally; in fact, you may have to look at the entire string to determine whether there's a surrounding quote pair or not.

What you have sounds very much like a CSV file. If you can't (or don't want to) use a readily available CPAN module, I suggest writing a small parser: either from scratch, by using Parse::RecDescent, or by using a nifty 5.10 regexp.
Re^2: Help with regex for complicated key=value string by mscharrer (Hermit) on Oct 29, 2008 at 18:25 UTC
Thanks, but you missed the part of my OP where I say that I also need that solution in Python. I tried to work with CSV modules, but they only support fully quoted values, not partially quoted ones, e.g. only `"key=value","key=value"`, not `key="value",key=value1:"value2":value3`.
Re^3: Help with regex for complicated key=value string by blazar (Canon) on Oct 29, 2008 at 21:11 UTC
> Thanks, but you missed the part of my OP where I say that I also need that solution in Python.

Frankly, I can't understand your insistence on this point: why does the fact that you must solve the problem both in Perl and in Python imply that you have to use a (pre-5.10) regex? Because it's the lowest common denominator, maybe? If so, then I see the point: code once, use twice. But then I would consider yours bad laziness, since there is no a priori guarantee that coding it twice, each time with the best tools the respective language provides, will not be less work overall than the other way round. Or else I've not understood your concerns at all.
Re^4: Help with regex for complicated key=value string by mscharrer (Hermit) on Oct 30, 2008 at 14:04 UTC
Re^3: Help with regex for complicated key=value string by JavaFan (Canon) on Oct 29, 2008 at 18:29 UTC
Well, then I suggest you write a parser in Python.