I am writing a simple scripting language that needs to parse arguments passed as keyword=value pairs within the same string.
There are a few constraints, which account for some additional complexity:
Example source strings are:
my @all_keywords = (qw(one two three ));
my @mandatory_keywords = (qw(two three ));
my $source1 = q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' };
my $source2 = <<'END';
two=a_34
three = 'name="O\'Hara"'
ONE="xyz\t"
END
The desired output, from both sources, is a hash containing
my %statement = (
    one   => 'xyz\t',
    two   => 'a_34',
    three => q{name="O\'Hara"},
);
In addition, I need to make sure that all the keywords are valid ones, and that the mandatory keywords are defined. Meeting all the requirements is not extremely difficult.
Please have a look at my test code. (The real code is a full-fledged module.)
#!/usr/bin/perl -w
use strict;

my @all_keywords = (qw(one two three four five));
my @mandatory_keywords = (qw(two three four ));

my $RE_value = qr/
    (\w+)              # (1) a keyword
    \s* = \s*          # an equal sign with optional spaces
    (?:                # quoted keyword ...
        (              #
          [\'\"\`]     # (2) a quoting character
        )
        (              # (3) the quoted value:
          (?:          # either
            \\\2       # an escaped quote
            |          # or
            [^\2]      # any non-quote character
          ) +?         # repeat (non-greedily)
        )
        \2             # until the initial quote shows up again
        |
        (\S+)          # (4) ... bare word value
    )
    /x;

sub set_value {
    my ($stat, $kw, $value) = (@_);
    # case insensitive keyword
    return 0 unless exists $stat->{lc $kw};
    $stat->{lc $kw} = $value;
    return 1
}

sub parse_pairs {
    my $src = shift;
    my %statement = map {$_, undef} @all_keywords;
    for ($src) {
        while ( ! m/ \G \s* \z /gcx ) {
            my $result = 0;
            if ( /\G \s* $RE_value \s* /xgc ) {
                $result = set_value( \%statement, $1,
                    $4 ? $4 : $3 );
            }
            else {
                die "syntax error >" . substr($_, pos) ."\n";
            }
            die "invalid keyword $1 \n" unless $result;
        }
    }
    return \%statement;
}

sub check_pairs {
    my $statement = shift;
    for my $kw (@all_keywords) {
        if (defined $statement->{$kw}) {
            print "$kw \t -> <$statement->{$kw}>\n"
        }
        else {
            warn "- missing keyword <$kw>!\n"
                if grep {$kw eq $_} @mandatory_keywords;
        }
    }
}

my @sources = (
    q{ ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! },
    q{ five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` });

for (@sources) {
    print "\n>>Source: //$_//\n\n";
    my $stat = parse_pairs($_);
    check_pairs($stat);
}

__END__
output:

>>Source: // ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! //

one      -> <xyz\t>
two      -> <a_34>
three    -> <name="O\'Hara">
four     -> <'one' two>
five     -> <ah!>

>>Source: // five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` //

one      -> <xyz\t>
- missing keyword <two>!
three    -> <name="O\'Hara" two=a_34>
four     -> <'one' two>
five     -> <ah!>
This regex correctly captures both the bare words and the quoted strings, taking care of embedded quotes and the escaped quote in the name.

Questions:
(1) Could I have achieved the same result using any standard module?
(2) Also, does anyone spot any weakness where the paradigm may break?
So far, it is strong enough to handle correctly sources like
q{one="two=xyz" two=abc}
#      ^embedded keyword pattern
q{one="xyz two=abc three= efg"}
#         ^missing quotes^
In the first case, the whole quoted string is consumed by the engine as the value for one, so the search for the next match starts after the closing quote, thus rightly assigning "abc" to two and "two=xyz" to one.
The second case is an input mistake, and the error is found during the check at the end of the loop.
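Both cases can be reproduced in isolation with a minimal sketch like the one below. It assumes the same pattern as $RE_value in the test code; the names $re and scan_pairs are hypothetical and used only for illustration, and the keyword validation is left out on purpose.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same shape as $RE_value in the test code (assumption: copied by hand).
# Note: inside a character class, \2 is the character with octal code 2,
# not a backreference, so [^\2] accepts almost any character. It is the
# non-greedy +? followed by \2 that stops at the closing quote.
my $re = qr/
    (\w+) \s* = \s*          # (1) keyword and equal sign
    (?:
        ([\'\"\`])           # (2) opening quote
        ( (?: \\\2 | [^\2] )+? )   # (3) quoted value, non-greedy
        \2                   # closing quote, same as the opening one
      |
        (\S+)                # (4) bare word value
    )
/x;

# Hypothetical helper: walk the string with \G, one pair at a time,
# and return a hash reference of lower-cased keyword => value.
sub scan_pairs {
    my ($src) = @_;
    my %got;
    while ( $src =~ /\G \s* $re \s*/xgc ) {
        my ($kw, $quoted, $bare) = ($1, $3, $4);
        $got{lc $kw} = defined $bare ? $bare : $quoted;
    }
    return \%got;
}

my $got = scan_pairs(q{one="two=xyz" two=abc});
print "$_ -> <$got->{$_}>\n" for sort keys %$got;
# prints:
# one -> <two=xyz>
# two -> <abc>
```

Feeding the second string, q{one="xyz two=abc three= efg"}, to the same helper should yield a single key, one, holding everything between the quotes, which is why the missing keywords only show up in the final check.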
Also, about the preparation work, I had a look at Text::Balanced, which can deal with all the quotes, but it is not clear to me if and how it can also deal with barewords at the same time, and how it could fit in the engine.
TIA
 _  _ _  _  
(_|| | |(_|><
 _|   

In reply to Regex capturing either quoted strings or bare words by gmax
