mscharrer has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm looking for a regex which can be used together with split. I need this solution in Perl and unfortunately also in Python, so I need a regex and can't use some special Perl module. My input string has the following properties:
  1. Comma separated 'words'
  2. Escaped commas ("\\,") should be ignored
  3. Words can have the forms 'value' or 'key=value'
  4. The value can be quoted, which should protect included commas: 'key="value, with, commas"'
  5. The value can contain a list whose elements are separated by another character, e.g. ':'
  6. Single elements of such a list can themselves be quoted, while the whole list isn't quoted: 'key=Elem1:Elem2:"Element, with, comma":Elem4'
I only need help to split the string at the commas. The additional steps to split the key-value pairs and the lists are easy then.

Points 1.-3. alone can be simply done with some negative look-behind, but then it gets complicated. The main problem is how to track if the comma is in quotes or not. My RegEx Kung-Fu isn't bad but just not strong enough here, so I ask your help.
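
A minimal sketch of the look-behind approach for points 1.-3., assuming a comma is escaped by a single preceding backslash and that backslashes themselves are never escaped. The helper name `split_unescaped` and the sample strings are made up for illustration. The second call shows where this stops working: a comma inside quotes is still treated as a separator.

```perl
use strict;
use warnings;

# Split on commas that are not directly preceded by a backslash.
# Assumption: backslashes themselves are never escaped.
sub split_unescaped {
    my ($str) = @_;
    return split /(?<!\\),/, $str;
}

# Hypothetical sample input covering points 1.-3.
my @ok = split_unescaped('value1,key=value2,key=with\,comma');
print scalar(@ok), " fields: @ok\n";

# Point 4 breaks it: the comma inside the quotes is split as well,
# so this yields three pieces instead of the two intended fields.
my @broken = split_unescaped('key="a,b",value');
print scalar(@broken), " fields: @broken\n";
```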

Is there a RegEx possible for this or should I try to enforce some easier format for the input?

Thanks in advance, and no, it's not some form of homework.
Martin

Replies are listed 'Best First'.
Re: Help with regex for complicated key=value string
by ikegami (Patriarch) on Oct 29, 2008 at 18:51 UTC
    use strict;
    use warnings;

    sub dequote {
        my ($s) = @_;
        if ($s =~ /^"/) {
            $s =~ s/^"//;
            $s =~ s/"$//;
        }
        else {
            $s =~ s/\\([\\,])/$1/g;
        }
        return $s;
    }

    sub parse {
        my @terms;
        for ($_[0]) {
            my $key = /\G ( [^=,]+ ) = /xgc ? $1 : undef;
            my $val = /\G ( " [^"]+ " | (?: [^\\,] | \\[\\,] )* ) /xgc && dequote("$1");
            push @terms, [ $key, $val ];
            /\G $ /xgc and last;
            /\G , /xgc or die("Expected comma at pos ", pos(), "\n");
            redo;
        }
        return @terms;
    }

    # ---

    use Data::Dumper qw( Dumper );

    my $str = join ',', <<'__EOI__' =~ /.+/g;
    key=value\,value
    key=\\
    key="value,value"
    value
    __EOI__

    my @terms = parse($str);

    local $Data::Dumper::Indent = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Useqq  = 1;
    print(Dumper(\@terms), "\n");

    The above assumes slashes can also be escaped.
    The above otherwise assumes everything you didn't specify is disallowed.

    The following might be better to get the value, but differs from what you asked:

    sub dequote {
        my ($s) = @_;
        if ($s =~ /^"/) {
            $s =~ s/^"//;
            $s =~ s/"$//;
        }
        $s =~ s/\\(.)/$1/sg;
        return $s;
    }

    my $val = /\G ( " (?: [^\\"] | \\. )* " | (?: [^\\,] | \\. )* ) /xsgc && dequote("$1");

    Update: Fixed to use double quotes instead of single quotes. Misread.
    Update: Remove "or die("Expected value at pos ", pos(), "\n")" after the value extraction match. It'll never be executed since the pattern can match zero characters.
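
The value patterns above treat a value as either entirely quoted or entirely unquoted, while point 6 of the question mixes both within one value. As a rough sketch of how the same \G technique might cover that case, the following only splits the string into raw fields and leaves quotes and escapes untouched. The name `split_fields` is made up, and the sketch assumes that a backslash escapes any single character outside quotes and that a quoted chunk cannot contain a double quote.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the raw fields (quotes and escapes left intact).
sub split_fields {
    my ($str) = @_;
    my @fields;
    my $done = 0;

    # A field is any run of: a quoted chunk, an escaped character,
    # or a character that is neither backslash, quote nor comma.
    # It must be followed by a comma or by the end of the string.
    while ($str =~ /\G ( (?: " [^"]* " | \\. | [^\\",] )* ) ( , | \z ) /xsgc) {
        push @fields, $1;
        if ($2 eq '') {    # the terminator was \z: last field reached
            $done = 1;
            last;
        }
    }
    die "Malformed input near position ", pos($str) // 0, "\n" unless $done;
    return @fields;
}

print "$_\n" for split_fields('a,key=b\,c,key="x, y",k=E1:"p, q":E4');
```

Matching the field together with its terminator in one pattern keeps empty fields and avoids a separate zero-length end-of-string match. An unbalanced quote or a trailing backslash makes the function die instead of returning a partial result.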

      Thanks ikegami, using \G is a good idea. Thanks for all the effort.

      I also coded a simple parser in the meantime, which seems to do the job for me but without doing strict error-checking like yours. It also only splits the string on the commas, not yet the keys and values. Because only alphanumerical keys are allowed anyway (I didn't mention that before, sorry) there is no need to look for escaped '=', etc., i.e. everything which doesn't start with /\s*[\w_]+=/ is taken as a single key-less value.

      #!/usr/bin/perl
      use strict;
      use warnings;

      sub cskwparser {
          my $string = shift;
          my $esc    = 0;
          my $quote  = 0;
          my @args;
          my $narg = 0;

          CHAR:
          foreach my $char ( split //, $string ) {
              if ($esc) {
                  $esc = 0;
              }
              elsif ( $char eq '"' ) {
                  $quote = !$quote;
              }
              elsif ( $char eq '\\' ) {
                  $esc = 1;
              }
              elsif ( $char eq ',' and not $quote ) {
                  $narg++;
                  next CHAR;
              }
              $args[$narg] .= $char;
          }
          return @args;
      }

      # Test loop:
      local $, = '|';
      while (<>) {
          print cskwparser $_;
      }

        $esc is never used in your code, so that means "\\," is not handled yet.

        but without doing strict error-checking like yours

        It's not really. It only checks for 'key=,...' (key with no value) and 'v"alu"e' (invalid quoting).

        Because only alphanumerical keys are allowed anyway (I didn't mention that before, sorry) there is no need to look for escaped '=', etc., i.e. everything which doesn't start with /\s*[\w_]+=/ is taken as a single key-less value.

        That's the same strategy I used, but I didn't limit to alphanum chars. To do so, change
        /\G ( [^=,]+ ) = /xgc && ( $key = $1 );
        to
        /\G ( [a-zA-Z0-9]+ ) = /xgc && ( $key = $1 );
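
To illustrate the key restriction discussed above, here is a small sketch that separates an already-split field into key and value, assuming keys consist of ASCII letters and digits only. The helper name `split_key_value` and the sample fields are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, value), with undef as key
# when the field has no alphanumeric "key=" prefix.
sub split_key_value {
    my ($field) = @_;
    # Only an alphanumeric prefix directly followed by '=' counts as a key.
    if ($field =~ /\A ( [a-zA-Z0-9]+ ) = (.*) \z/xs) {
        return ($1, $2);
    }
    return (undef, $field);
}

for my $field ('key=value', 'a=b=c', '"x=y"', 'plain') {
    my ($key, $val) = split_key_value($field);
    printf "%-10s key=%s value=%s\n",
        $field, defined $key ? $key : '(none)', $val;
}
```

Because the key pattern cannot match a quote, a quoted value that happens to contain '=' is treated as a key-less value rather than being cut at the '='.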

Re: Help with regex for complicated key=value string
by apl (Monsignor) on Oct 29, 2008 at 18:15 UTC
    I'm a coward; I'd write a parser rather than try to fit the logic into a regex.
      Yeah, a parser was also one of my first thoughts when I got confronted with this problem. It would definitely be a cleaner solution. But I still have some hope of finding a RegEx.
        Well, I know that I could write a regexp, but if I were to spend the half hour to do so, I'm afraid your reply would be "but Python doesn't support 5.10 regexes yet".

        Frankly, I think you're better off asking Python questions in a Python forum.

Re: Help with regex for complicated key=value string
by mr_mischief (Monsignor) on Oct 29, 2008 at 18:41 UTC
    If this is an ongoing project that will need quite a bit of maintenance, I'd consider writing the lexer with lex and the parser with yacc. That'd give you a C parser you could then use via a wrapper from Perl or another wrapper from Python. This solution of course assumes a few things.

    Another thing I'd consider is that unless I had some specific requirement for two complete single-language solutions, I might write the parser in only Perl. The parser could read and parse the input format and write out an intermediate representation suited to the application. That format could be made to be trivial for both the Python and Perl back-ends to access. Perhaps a file delimited with colons, tabs or nulls would work. Maybe a properly formatted CSV file would be suitable. Perhaps XML, YAML (does Python do YAML?), JSON, or maybe even a database would be good.
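
As a sketch of the intermediate-representation idea, the Perl side could emit JSON, which Python can read with its standard `json` module. This assumes JSON::PP is available (it ships with Perl since 5.14) and that the parser produces [ key, value ] pairs. The `@terms` data below is made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical result of the Perl parser: [ key, value ] pairs,
# with undef as the key of key-less values.
my @terms = (
    [ 'key', 'value,value' ],
    [ undef, 'value' ],
);

# canonical() only affects hash key order; it is kept here so the
# output stays stable if the structure later gains hashes.
my $json = JSON::PP->new->canonical->encode(\@terms);
print "$json\n";
```

An undef key is written as JSON null, so the distinction between 'value' and 'key=value' survives the round trip.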

Re: Help with regex for complicated key=value string
by ikegami (Patriarch) on Oct 29, 2008 at 18:32 UTC

    Points 1.-3. alone can be simply done with some negative look-behind,

    Point 2 can't be done using just a look-behind if you can escape slashes as well as commas.

    key1=\\\\\\value\\\\\\,key2=///value///
      Point 2 can be done with:
      split /(?:^|[^\\])(?:\\\\)*\K,/ # 5.10 required
      But that's just point 2, it doesn't consider commas inside quotes.
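
A small comparison of the two patterns, as a sketch only. The input string is made up: key1 contains an escaped comma, and key2 ends in an escaped backslash that is followed by a real separator. The plain look-behind misses that separator, while the \K pattern counts the backslashes in pairs.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Made-up input; inside single quotes each \\ is one literal backslash,
# so key2's value is c followed by two backslashes.
my $str = 'key1=a\,b,key2=c\\\\,key3=d';

my @lookbehind = split /(?<!\\),/, $str;
my @with_k     = split /(?:^|[^\\])(?:\\\\)*\K,/, $str;

print "look-behind: ", scalar(@lookbehind), " fields\n";
print "\\K pattern:  ", scalar(@with_k), " fields\n";
```

Note that the \K pattern needs a non-backslash character (or the start of the string) in front of the comma, so this sketch does not try to cover empty fields.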
        That would be:
        split /(?:$key_pat=)?$val_pat\K,/

        Using split means everything has to be parsed twice. Once to find the commas on which to split and once to find the composing terms.

        Update: Simplified pattern slightly.

Re: Help with regex for complicated key=value string
by kvale (Monsignor) on Oct 29, 2008 at 18:49 UTC
    I would split on everything that is not a bare comma (untested):
    my $q_str      = qr("[^"]*");   # quoted string
    my $unq_nc_str = qr([^",]+);    # string with no quotes or commas
    my $esc_com    = qr(\\\\,);     # escaped comma
    my @bits = grep {$_ ne ','} split /((?:$q_str|$unq_nc_str|$esc_com)+)/, $line;
    By grouping the split regex, we get split bits as well as bare commas. Then we filter bare commas. This assumes that quotes are not nested.

    -Mark
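
A related sketch that collects the fields with a global match instead of splitting on what lies between them. It is not the code above: it assumes that a backslash escapes any single character and that quoted chunks contain no double quotes, and the name `match_fields` is made up. The pattern uses no 5.10-only constructs, so it should carry over to Python's `re.findall` with the DOTALL flag.

```perl
use strict;
use warnings;

# Collect the fields instead of splitting on the separators.
# A field is a non-empty run of quoted chunks, escaped characters,
# and characters that are neither backslash, quote nor comma.
sub match_fields {
    my ($line) = @_;
    return $line =~ /( (?: " [^"]* " | \\. | [^\\",] )+ )/xsg;
}

print "$_\n" for match_fields('a,key=b\,c,key="x, y",k=E1:"p, q":E4');
```

The trade-off is that empty fields are dropped and characters the pattern cannot match, such as an unbalanced quote, are skipped silently rather than reported.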

Re: Help with regex for complicated key=value string
by JavaFan (Canon) on Oct 29, 2008 at 18:19 UTC
    There's no easy way to do this with split. Split looks "locally", but you need to look much more globally - in fact, you may have to look at the entire string to determine whether there's a surrounding quote pair or not.

    What you have sounds very much like a CSV file. If you can't (or don't want to) use a readily available CPAN module, I suggest writing a small parser. Either from scratch, by using Parse::RecDescent, or by using a nifty 5.10 regexp.

      Thanks, but you missed the part of my OP where I say that I also need that solution in Python. I tried to work with CSV modules, but they only support fully quoted values, not partially quoted ones, e.g. only "key=value","key=value" not key="value",key=value1:"value2":value3.
        Thanks, but you missed the part of my OP where I say that I also need that solution in Python.

        Frankly, I can't understand your insistence on this point: why does the fact that you must solve the problem both in Perl and in Python imply that you have to use a (pre-5.10) regex? Because it's the least common functionality, maybe? If so, then I see the point: code once, use twice. But then I would consider yours bad laziness, since it's not guaranteed a priori that coding twice, each time with the best tools the respective language provides, will be more work overall than the other way round. Or else I've not understood your concerns at all.

        --
        If you can't understand the incipit, then please check the IPB Campaign.
        Well, then I suggest you write a parser in Python.