in reply to Parsing arguments

Here you go: not much duplication, supports escaped quotes, and fails properly if there's garbage in the input (you can replace the die with more elegant error handling).

# skip initial whitespace, if any
$tag =~ /^\s*/gcxs;

# look for optional key, and a bareword or the start of a quoted string
while ($tag =~ / \G (?: (\w+) \s* = \s*)? ( ['"](?=.) | \w+ ) /gcxs) {
    my ($k, $v) = ($1, $2);

    # extract quoted string
    if ($v eq "'" || $v eq '"') {
        $tag =~ / \G ( (?: \\. | [^$v] )* ) $v /gcxs or last;
        $v = $1;
        $v =~ s/\\(.)/$1/g;  # unescape characters
    }

    # skip optional separator
    $tag =~ / \G \s* :? \s* /gcxs;

    # save value (defined() so that a key of "0" is not dropped)
    push @args, defined $k ? [$k, $v] : $v;
}

# check if parsing was successful
die if pos $tag != length $tag;
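
The garbage check at the end relies on how \G and /gc interact: a failed match with /c leaves pos() where the last successful match ended, so whatever the loop could not consume shows up as pos() falling short of the string length. A minimal sketch of just that mechanism, using a made-up input string rather than the real tag syntax:

```perl
use strict;
use warnings;

my $s = "abc ???";                 # made-up input: a word followed by garbage

$s =~ /\G \w+ \s*/gcx;             # consumes "abc " and leaves pos() at 4
my $after_word = pos $s;

my $matched = $s =~ /\G \w+/gcx;   # fails on "?", but /c keeps pos() at 4
my $after_fail = pos $s;

# Without /c the same failure resets pos() to undef.
$s =~ /\G \w+/gx;
my $after_reset = pos $s;

print "after word: $after_word, after failed match: $after_fail\n";
print "unparsed input left\n" if $after_fail != length $s;
```

Dropping the /c from the loop patterns would make the final comparison unreliable, since pos() would be undef after the first failed match.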

•Update: If you like, you can of course precompile the quote patterns so you don't have any variable patterns, like:

my %quotes;
$quotes{$_} = qr/ \G ( (?: \\. | [^$_] )* ) $_ /xs for qw(' ");
and then replace the if-block with:
if (my $pat = $quotes{$v}) {
    $tag =~ /$pat/gc or last;
    $v = $1;
    $v =~ s/\\(.)/$1/g;
}
But I don't know if that results in any significant increase in speed.
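
One way to find out is to time both variants with the core Benchmark module. The following is only a sketch: the sample string and the sub names interpolated and precompiled are made up for illustration, and results will depend on the perl version and on the input. Note also that perl skips recompiling an interpolated pattern when the interpolated value has not changed since the previous match, so the gap may well be small.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input is an assumption: the text that follows an opening double quote.
my $body = q{a \"quoted\" word" trailing};

my %quotes;
$quotes{$_} = qr/ \G ( (?: \\. | [^$_] )* ) $_ /xs for qw(' ");

# Variant 1: the pattern interpolates $v each time it is used.
sub interpolated {
    my $v = '"';
    pos($body) = 0;
    return $body =~ / \G ( (?: \\. | [^$v] )* ) $v /gcxs ? $1 : undef;
}

# Variant 2: the pattern was compiled once with qr//.
sub precompiled {
    my $pat = $quotes{'"'};
    pos($body) = 0;
    return $body =~ /$pat/gc ? $1 : undef;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    interpolated => \&interpolated,
    precompiled  => \&precompiled,
});
```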

Re: Re: Parsing arguments
by hv (Prior) on Feb 20, 2003 at 19:15 UTC

    Thanks, that's an interesting approach. I find the logic still rather complex, though, and I think if I were going in this direction I'd separate it out a bit differently:

    while (pos($_) < length($_)) {
        if (m{ \G (\w+) \s* = }gcx) {
            push @args, [ $1 ];
        } elsif (m{ \G (\w+) (?= \s* | \z ) \s* }gcx) {
            push @args, $1;
        # the inner group is non-capturing so that $2 holds the whole quoted body
        } elsif (m{ \G (['"]) ( (?: \\. | [^\\] )*? ) \1 \s* }gcx) {
            (my $quoted = $2) =~ s/\\(.)/$1/g;
            push @args, $quoted;
        } else {
            die "parsing error\n";
        }
    }
    # attach the value that follows each [ key ] placeholder
    for (my $i = 0; $i < @args; ++$i) {
        $args[$i][1] = splice(@args, $i + 1, 1) if ref $args[$i];
    }

    Hugo
      There are many variations possible. I have to admit yours is simpler, although I should note it parses a different language than your original request ("foo=foo=foo=foo=" is considered valid in this version).

      BTW, benchmarks have shown using .*?D is slower than [^D]*D (where D is the delimiter).
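
      The difference is easy to measure on your own perl with the core Benchmark module. This is only a sketch under assumed conditions: the test string (200 non-delimiter characters followed by a double quote) and the sub names lazy and negated are made up, and the outcome may vary between perl versions.

      ```perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Assumed test data: 200 non-delimiter characters, then the delimiter.
      my $text = ('x' x 200) . '"';

      # Both subs return the length of the text before the delimiter.
      sub lazy    { return $text =~ /^(.*?)"/s  ? length $1 : -1 }
      sub negated { return $text =~ /^([^"]*)"/ ? length $1 : -1 }

      # Run each variant for at least one CPU second and print a comparison table.
      cmpthese(-1, { lazy => \&lazy, negated => \&negated });
      ```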

      Also note using [^\\] is not necessary, though harmless (since if the char is a backslash, the \\. will match unless the backslash is at the end, in which case there's also no delimiter and the whole pattern will not match).

      And finally, in your original you used the /s modifier while you're not using it here. I don't know if that's deliberate or a mistake, but I thought I'd note it.

      •Update: and I just noticed you completely forgot support for the colon-delimiter (although that's not hard to add). Also that (?= \s* | \z ) zero-width assertion is completely futile since it also matches 0 chars (due to the \s*).
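
      A small sketch of why that assertion never rejects anything, next to a stricter variant that demands at least one whitespace character. The input string and the variable names $loose and $strict are made up for the illustration:

      ```perl
      use strict;
      use warnings;

      my $input = "foo=bar";    # a bareword followed directly by "="

      # \s* can match the empty string, so this lookahead succeeds everywhere.
      my ($loose) = $input =~ m{ \A (\w+) (?= \s* | \z ) }x;

      # Requiring real whitespace (or end of string) turns it into an actual test.
      my ($strict) = $input =~ m{ \A (\w+) (?= \s | \z ) }x;

      print "loose:  ", (defined $loose  ? $loose  : "no match"), "\n";
      print "strict: ", (defined $strict ? $strict : "no match"), "\n";
      ```

      The first pattern captures "foo" even though "=" follows it; the second finds no match at all, because no prefix of "foo" is followed by whitespace or the end of the string.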

        Bother, lots of problems. foo=foo=foo isn't supposed to be allowed, but the post-processing phase could check for that; .*?D really shouldn't be slower than [^D]*D - I'll have to take a look why that happens and see if it can be fixed (the minimal matching support was added to perl relatively recently, and it hasn't had the same degree of optimisation that the older codepaths have had).

        I think the [^\\] is marginally clearer about its intent than . would be, though I accept the point; the missing /s and support for the colon delimiter were simply oversights; and the tail assertion should have been (?= \s | \z ).

        So let's try again:

        while (pos($_) < length($_)) {
            if (m{ \G (\w+) \s* = \s* (?=\S) }gcx) {
                # key-value pair is fixed-up in post-processing
                push @args, [ $1 ];
            } elsif (m{ \G (\w+) (?= \s | \z ) \s* }gcx) {
                push @args, $1;
            # the inner group is non-capturing so that $2 holds the whole quoted body
            } elsif (m{ \G (['"]) ( (?: \\. | [^\\] )*? ) \1 (?= \s | \z ) \s* }gcxs) {
                (my $quoted = $2) =~ s/\\(.)/$1/g;
                push @args, $quoted;
            } elsif (m{ \G (:) \s+ }gcx) {
                push @args, $1;
            } else {
                die "parsing error\n";
            }
        }
        for (my $i = 0; $i < @args; ++$i) {
            next unless ref $args[$i];
            my $value = splice @args, $i+1, 1;
            # a key must be followed by a plain value, not by another key
            die "parsing error\n" if !defined($value) || ref $value;
            $args[$i] = [ $args[$i][0], $value ];
        }

        Update: string inconsistently used as $tag and $_, fixed up to use $_ throughout.

        Hugo