I am writing a simple script language that needs to parse arguments passed as keyword=value pairs within the same string.
There are a few constraints, which account for some additional complexity:
  • Each pair can be either on a separate line or combined with others on a single line
  • Spaces are allowed before and after the equal (=) sign
  • Values can be either barewords or quoted strings. The quote character may be a single quote, a double quote, or a backquote (`).
  • Values may contain spaces and the equal sign
  • Values may contain escaped quotes.
Example source strings are:
my @all_keywords       = (qw(one two three ));
my @mandatory_keywords = (qw(two three ));

my $source1 = q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' };

my $source2 = <<'END';
two=a_34
three = 'name="O\'Hara"'
ONE="xyz\t"
END
The desired output, from both sources, is a hash containing
my %statement = (
    one   => 'xyz\t',
    two   => 'a_34',
    three => q{name="O\'Hara"},
);
In addition, I need to make sure that all the keywords are valid ones, and that the mandatory keywords are defined. Meeting all the requirements is not extremely difficult.
Please have a look at my test code. (The real code is a full-fledged module).
#!/usr/bin/perl -w
use strict;

my @all_keywords       = (qw(one two three four five));
my @mandatory_keywords = (qw(two three four ));

my $RE_value = qr/
    (\w+)               # (1) a keyword
    \s* = \s*           # an equal sign with optional spaces
    (?:                 # quoted keyword ...
        (               #
            [\'\"\`]    # (2) a quoting character
        )
        (               # (3) the quoted value:
            (?:         # either
                \\\2    # an escaped quote
                |       # or
                [^\2]   # any non-quote character
            ) +?        # repeat (non-greedily)
        )
        \2              # until the initial quote shows up again
        |
        (\S+)           # (4) ... bare word value
    )
    /x;

sub set_value {
    my ($stat, $kw, $value) = (@_);
    # case insensitive keyword
    return 0 unless exists $stat->{lc $kw};
    $stat->{lc $kw} = $value;
    return 1
}

sub parse_pairs {
    my $src = shift;
    my %statement = map {$_, undef} @all_keywords;
    for ($src) {
        while ( ! m/ \G \s* \z /gcx ) {
            my $result = 0;
            if ( /\G \s* $RE_value \s* /xgc ) {
                $result = set_value( \%statement, $1, $4 ? $4 : $3 );
            }
            else {
                die "syntax error >" . substr($_, pos) ."\n";
            }
            die "invalid keyword $1 \n" unless $result;
        }
    }
    return \%statement;
}

sub check_pairs {
    my $statement = shift;
    for my $kw (@all_keywords) {
        if (defined $statement->{$kw}) {
            print "$kw \t -> <$statement->{$kw}>\n"
        }
        else {
            warn "- missing keyword <$kw>!\n"
                if grep {$kw eq $_} @mandatory_keywords;
        }
    }
}

my @sources = (
    q{ ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! },
    q{ five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` });

for (@sources) {
    print "\n>>Source: //$_//\n\n";
    my $stat = parse_pairs($_);
    check_pairs($stat);
}

__END__
output:

>>Source: // ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! //

one      -> <xyz\t>
two      -> <a_34>
three    -> <name="O\'Hara">
four     -> <'one' two>
five     -> <ah!>

>>Source: // five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` //

one      -> <xyz\t>
- missing keyword <two>!
three    -> <name="O\'Hara" two=a_34>
four     -> <'one' two>
five     -> <ah!>
This regex correctly captures both the barewords and the quoted strings, taking care of embedded quotes of a different kind and of the escaped quote in the name.

Questions:
(1) Could I have achieved the same result using any standard module?
(2) Also, does anyone spot any weakness where the paradigm may break?
So far, it is strong enough to handle correctly sources like
q{one="two=xyz" two=abc}
#      ^embedded keyword pattern
q{one="xyz two=abc three= efg"}
#         ^missing quotes^
In the first case, the embedded two=xyz is consumed by the engine as part of the quoted value, so it starts looking for a new match after the quoted string, thus correctly assigning "abc" to two and "two=xyz" to one.
The second case is an input mistake, and the error is found during the check at the end of the loop.
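The two cases can be checked in isolation with a small sketch. It reuses the pattern from the test code, compacted onto fewer lines. pairs_of is a hypothetical helper that skips keyword validation, and it tests defined $4 rather than the truth of $4, which is a deviation from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the test code, compacted onto fewer lines.
my $RE_value = qr/
    (\w+) \s* = \s*                 # (1) keyword and equal sign
    (?:
        ([\'\"\`])                  # (2) opening quote character
        ( (?: \\\2 | [^\2] )+? )    # (3) quoted value, non-greedy
        \2                          # same quote character closes it
      |
        (\S+)                       # (4) bareword value
    )
/x;

# Hypothetical helper: collect keyword => value pairs, no validation.
sub pairs_of {
    my $src = shift;
    my %pair;
    for ($src) {
        while ( ! m/ \G \s* \z /gcx ) {
            if ( / \G \s* $RE_value \s* /gcx ) {
                # Assumption: test definedness so a bareword "0" is kept.
                $pair{ lc $1 } = defined $4 ? $4 : $3;
            }
            else {
                die "syntax error >" . substr( $_, pos() || 0 ) . "\n";
            }
        }
    }
    return \%pair;
}

for my $src ( q{one="two=xyz" two=abc}, q{one="xyz two=abc three= efg"} ) {
    my $pair = pairs_of($src);
    print "$_ -> <$pair->{$_}>\n" for sort keys %$pair;
    print "--\n";
}
```

For the first source this prints one and two with the values <two=xyz> and <abc>. For the second it prints only one, holding the whole quoted text, so two and three stay undefined and a mandatory-keyword check would flag them.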
Also, about the preparation work, I had a look at Text::Balanced, which can deal with all the quotes, but it is not clear to me if and how it can also deal with barewords at the same time, and how it could fit in the engine.
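For reference, here is a minimal sketch of one way Text::Balanced might be combined with a bareword fallback. It is an untested idea, not a replacement for the engine above: parse_with_balanced is a hypothetical name, the keyword and the bareword are still matched with plain regexes, and only the quoted value is handed to extract_delimited (called in list context, with its default backslash escape).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical helper: consume the string from the front, one pair at a time.
sub parse_with_balanced {
    my $text = shift;
    my %pair;
    while ( $text =~ /\S/ ) {
        # Strip "keyword =" from the front of the remaining text.
        $text =~ s/\A\s*(\w+)\s*=\s*//
            or die "syntax error >$text\n";
        my $kw = lc $1;
        if ( $text =~ /\A['"`]/ ) {
            # Quoted value: let Text::Balanced find the matching quote.
            my ( $quoted, $rest ) = extract_delimited( $text, q{'"`} );
            die "unterminated quote >$text\n" unless defined $quoted;
            $pair{$kw} = substr( $quoted, 1, -1 );    # drop the two quotes
            $text = $rest;
        }
        else {
            # Bareword value: everything up to the next whitespace.
            $text =~ s/\A(\S+)// or die "missing value for $kw\n";
            $pair{$kw} = $1;
        }
    }
    return \%pair;
}

my $pair = parse_with_balanced(
    q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' } );
print "$_ -> <$pair->{$_}>\n" for sort keys %$pair;
```

Unlike the regex above, this sketch accepts an empty quoted value, and it does no keyword validation.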
TIA
 _  _ _  _  
(_|| | |(_|><
 _|   

In reply to Regex capturing either quoted strings or bare words by gmax
