I am writing a simple script language that needs to parse arguments passed as
keyword=value pairs within the same string.
There are a few constraints, which account for some additional complexity:
- Each pair can be either on a separate line or merged in a single line
- Spaces are allowed before and after the equal (=) sign
- Values can be either barewords or quoted strings. The quote character may be a single quote ('), a double quote (") or a backtick (`).
- Values may contain spaces and the equal sign
- Values may contain escaped quotes.
Example source strings are:
my @all_keywords = (qw(one two three ));
my @mandatory_keywords = (qw(two three ));
my $source1 =
q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' };
my $source2 = <<'END';
two=a_34
three = 'name="O\'Hara"'
ONE="xyz\t"
END
The desired output, from both sources, is a hash containing
my %statement = (
    one   => 'xyz\t',
    two   => 'a_34',
    three => q{name="O\'Hara"},
);
In addition, I need to make sure that all the keywords are valid ones, and that the mandatory keywords are defined.
Meeting all the requirements is not extremely difficult.
Please have a look at my test code. (The real code is a full-fledged module).
#!/usr/bin/perl -w
use strict;
my @all_keywords = (qw(one two three four five));
my @mandatory_keywords = (qw(two three four ));
my $RE_value = qr/
    (\w+)                 # (1) a keyword
    \s* = \s*             # an equal sign with optional spaces
    (?:                   # quoted value ...
        (                 #
            [\'\"\`]      # (2) a quoting character
        )
        (                 # (3) the quoted value:
            (?:           # either
                \\\2      # an escaped quote
              |           # or
                (?!\2) [\s\S]   # any character that is not the opening quote
                                # (a backreference does not work inside [...])
            )
            +?            # repeat (non-greedily)
        )
        \2                # until the initial quote shows up again
      |
        (\S+)             # (4) ... bare word value
    )
/x;
sub set_value {
    my ($stat, $kw, $value) = (@_);
    # case insensitive keyword
    return 0 unless exists $stat->{lc $kw};
    $stat->{lc $kw} = $value;
    return 1
}
sub parse_pairs {
    my $src = shift;
    my %statement = map {$_, undef} @all_keywords;
    for ($src) {
        while ( ! m/ \G \s* \z /gcx ) {
            my $result = 0;
            if ( /\G \s* $RE_value \s* /xgc ) {
                # test with defined() so that a bareword value of "0" is not lost
                $result = set_value( \%statement, $1, defined $4 ? $4 : $3 );
            }
            else {
                die "syntax error >" . substr($_, pos) ."\n";
            }
            die "invalid keyword $1 \n" unless $result;
        }
    }
    return \%statement;
}
sub check_pairs {
    my $statement = shift;
    for my $kw (@all_keywords) {
        if (defined $statement->{$kw}) {
            print "$kw \t -> <$statement->{$kw}>\n"
        }
        else {
            warn "- missing keyword <$kw>!\n"
                if grep {$kw eq $_} @mandatory_keywords;
        }
    }
}
my @sources = (
q{ ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! },
q{
five = ah!
ONE="xyz\t"
three = 'name="O\'Hara" two=a_34'
four=`'one' two`
});
for (@sources) {
    print "\n>>Source: //$_//\n\n";
    my $stat = parse_pairs($_);
    check_pairs($stat);
}
__END__
output:
>>Source: // ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! //
one -> <xyz\t>
two -> <a_34>
three -> <name="O\'Hara">
four -> <'one' two>
five -> <ah!>
>>Source: //
five = ah!
ONE="xyz\t"
three = 'name="O\'Hara" two=a_34'
four=`'one' two`
//
one -> <xyz\t>
- missing keyword <two>!
three -> <name="O\'Hara" two=a_34>
four -> <'one' two>
five -> <ah!>
This regex correctly captures both the barewords and the quoted strings, taking care of embedded quotes and of the escaped quote in the name.
Questions:
(1) Could I have achieved the same result using any standard module?
(2) Also, does anyone spot any weakness where the paradigm may break?
So far, it is robust enough to correctly handle sources like
q{one="two=xyz" two=abc}
# ^embedded keyword pattern
q{one="xyz two=abc three= efg"}
# ^missing quotes^
In the first case, the embedded two=xyz is consumed by the engine as part of the quoted string, so the next match attempt starts after the closing quote, thus rightly assigning "abc" to two and "two=xyz" to one.
The second case is an input mistake, and the error is found during the check at the end of the loop.
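To make the two cases easier to reproduce in isolation, here is a minimal sketch of the same \G-anchored scan. It assumes a deliberately simplified pattern (double quotes and barewords only, no escapes), and the names $pair and scan_pairs are made up for this illustration; they are not part of the test script above.

```perl
use strict;
use warnings;

# Simplified pair pattern, for illustration only:
# (1) keyword, (2) double-quoted value, (3) bareword value.
my $pair = qr/ (\w+) \s* = \s* (?: "([^"]*)" | (\S+) ) /x;

# Hypothetical helper: returns a list of [keyword, value] pairs.
sub scan_pairs {
    my ($src) = @_;
    my @found;
    # \G anchors each attempt where the previous match ended, so text
    # already consumed inside a quoted value is never scanned again.
    while ( $src =~ / \G \s* $pair \s* /gcx ) {
        push @found, [ lc $1, defined $3 ? $3 : $2 ];
    }
    return @found;
}

for my $src ( q{one="two=xyz" two=abc}, q{one="xyz two=abc three= efg"} ) {
    print join( ', ', map { "$_->[0] => <$_->[1]>" } scan_pairs($src) ), "\n";
}
```

The first source yields two pairs, the second only one, so in the second case the keywords two and three are simply never set and show up as missing afterwards.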
Also, about the preparation work, I had a look at Text::Balanced, which can deal with all the quotes, but it is not clear to me if and how it can also deal with barewords at the same time, and how it could fit in the engine.
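For what it is worth, here is a rough sketch of one way the two might be combined: strip the keyword and the equal sign by hand, try extract_delimited for a quoted value, and fall back to a plain \S+ match for a bareword. The sub name parse_with_balanced is hypothetical, keyword validation is left out, and this is only my guess at how the pieces could fit, not a tested replacement for the regex above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical helper: returns a hash ref of lower-cased keyword => value.
sub parse_with_balanced {
    my ($src) = @_;
    my %pairs;
    while ( $src =~ /\S/ ) {
        # Strip the keyword and the equal sign by hand.
        $src =~ s/\A \s* (\w+) \s* = \s*//x
            or die "syntax error >$src\n";
        my $kw = lc $1;

        # Try a quoted value first. In list context extract_delimited
        # returns the extracted text (delimiters included) and the
        # remainder; the extracted text is undef when nothing matched.
        my ( $quoted, $rest ) = extract_delimited( $src, q{'"`} );
        if ( defined $quoted && length $quoted ) {
            $pairs{$kw} = substr( $quoted, 1, -1 );    # drop the quotes
            $src = $rest;
        }
        elsif ( $src =~ s/\A (\S+)//x ) {
            $pairs{$kw} = $1;                          # bareword fallback
        }
        else {
            die "missing value for $kw\n";
        }
    }
    return \%pairs;
}

my $stat = parse_with_balanced(
    q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' });
print "$_ -> <$stat->{$_}>\n" for sort keys %$stat;
```

Escaped quotes are left as they are in the value (backslash included), which matches the desired output stated at the top.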
TIA
_ _ _ _
(_|| | |(_|><
_|