Re: nested reg ex over multiple lines
by holli (Abbot) on Jun 20, 2005 at 13:19 UTC
|
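use strict;
use warnings;
use Data::Dumper;
my $key;
my %data;
while (<DATA>)
{
    # A CALCON(name) line starts a new record; remember its name.
    $key = $1, next if /^CALCON\((\w+)\)/;
    # Any other KEY(value) line is stored under the current record.
    $data{$key}->{$1} = $2 if /^\s*(\w+)\(([\w\s]+)\)/;
}
print Dumper(\%data);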
__DATA__
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
Output:
$VAR1 = {
'test1' => {
'MIN' => '0',
'UNITS' => 'ms',
'FEATURE' => 'DCOM',
'NAM' => 'stmin',
'MAX' => '127',
'TYPE' => 'U8'
},
'test2' => {
'MIN' => '0',
'UNITS' => 'ms',
'FEATURE' => 'DCOM',
'NAM' => 'dcomc_sestmr_timeout',
'MAX' => '65535',
'TYPE' => 'U16'
}
};
|
|
I liked this solution, but since it felt a little "idiomatic" to me, I thought it might also feel that way to someone even newer to Perl. So I rewrote it in a way that was a bit easier for me to understand. Mainly I just put the conditionals in parens, bracketed off the results, and filled in the default variables where they were being assumed.
use strict;
use warnings;
use Data::Dumper;
my ($key, %data);
while (<DATA>)
{
    if ( $_ =~ /^CALCON\((\w+)\)/ ) {
        $key = $1;
    } else {
        if ( $_ =~ /^\s*(\w+)\(([\w\s]+)\)/ ) {
            $data{$key}->{$1} = $2;
        }
    }
}
print Dumper(\%data);
__DATA__
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
|
|
Idiomatic? The only difference in your code is the use of if/else instead of the next statement. And next is not really a Perl idiom; other languages have a similar construct.
|
|
Maybe it's just me and my love for default variables, but all of the extra punctuation and explicit default variables make your version harder for me to read. There's more code I have to read, and ignore, before I can understand exactly what is happening.
Re: nested reg ex over multiple lines
by tlm (Prior) on Jun 20, 2005 at 13:28 UTC
|
Does this do what you want?
Note the two ? modifiers in the regex (after + and *) that prevent greedy matching. Without them, your matches would be longer than you want.
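The snippet that went with this reply is not reproduced here. As a rough sketch of the idea only (not tlm's original code), a pattern along these lines uses `+?` and `*?` so that each match stops at the first closing parenthesis and the first closing brace. The variable names and the trimmed sample data are illustrative assumptions.

```perl
use strict;
use warnings;

# Illustrative data; the whole file is assumed to be in one string.
my $content = <<'END';
CALCON(test1)
{
TYPE(U8)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
UNITS(ms)
}
END

my @found;
# .+? stops at the first ")" and .*? stops at the first "}",
# so each iteration captures exactly one CALCON element.
while ( $content =~ /^(CALCON\((.+?)\)\s*\{.*?\})/msg ) {
    push @found, { element => $1, name => $2 };
    print "name: $2\n$1\n\n";
}
```

With greedy `.+` and `.*` instead, the same pattern would run on to the last ")" and the last "}" in the string.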
Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 13:25 UTC
|
PS, this might be a case where you're better off using Parse::RecDescent than a regex. Sorry, my P::RD fu is weak, but maybe one of the gods can whip something out...
|
|
my $state = 'declaration';
my( $name, %data );
while( <> ) {
    if( $state eq 'declaration' ) {
        if( /CALCON\((.*?)\)/ ) {
            $name = $1;
            $state = 'opencurly';   # next, expect the opening brace
            next;
        }
    }
    if( $state eq 'opencurly' ) {
        $state = 'body' if /\{/;
    }
    if( $state eq 'body' ) {
        # Capture both the key and the parenthesised value.
        if( /(\S+)\((.*?)\)\s*$/ ) {
            $data{ $name }->{ $1 } = $2;
        }
        if( /\s*\}\s*$/ ) {
            $state = 'declaration';
        }
    }
}
--
We're looking for people in ATL
Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 13:19 UTC
|
Maybe you're reading the data in wrong somehow? The following seems to do what you want, I think. (Not 100% sure if I followed you, but hope this helps.)
use strict;
use warnings;
my $content = "";
while (<DATA>) {
    $content .= $_;
}
print "content: $content"; # sanity check
while ($content =~ m/^(CAL.+\((\w+)\))/mg) {
    print "\n1= $1";
    print "\n2= $2";
}
__DATA__
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
This outputs:
1= CALCON(test1)
2= test1
1= CALCON(test2)
2= test2
|
|
I have the entire file in one variable. I can get the result you have by using m//mg rather than m//smg (/s is single-line mode, where . also matches newlines).
My problem is that I want $1 to equal the entire element, i.e.
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
but $2 to equal "test2".
Perl seems to be greedy: if I add /s to the match, it runs all the way to the last "ms" in brackets.
Does this explain things better?
Cheers.
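The behaviour described here can be reproduced with a small sketch. The data is trimmed and the variable names are illustrative assumptions; the pattern is the one from the parent post, run once without /s and once with it.

```perl
use strict;
use warnings;

my $content = <<'END';
CALCON(test1)
{
TYPE(U8)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
UNITS(ms)
}
END

# Without /s, "." cannot cross a newline, so the match stays on the CALCON line.
my ($line_only, $name) = $content =~ /^(CAL.+\((\w+)\))/m;
print "without /s: \$2 = $name\n";    # $2 is "test1"

# With /s, the greedy ".+" runs to the end of the string and backtracks
# only as far as the last "(word)", which is the final UNITS(ms).
my ($greedy, $last) = $content =~ /^(CAL.+\((\w+)\))/ms;
print "with /s:    \$2 = $last\n";    # $2 is "ms"
```

So with /s the first capture spans both elements, which is why a non-greedy quantifier or a negated character class is needed to stop at the first closing brace.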
|
|
I have the entire file in one variable.
Maybe you should have mentioned that. Anyway here's a working solution:
use strict;
use warnings;
$_ = qq"CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}";
while ( /(CALCON\(\w+\)\n\{\n[^}]+\})/msg )
{
    print "****\n$1\n";
}
Output:
****
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
****
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
|
|
I came up with something below that I think does what you want, using an "inch along with negative lookahead" strategy.
Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 14:01 UTC
|
On reflection, there is a way to do this kind of parsing using regexes. I think of it as the "inch along" with negative lookahead strategy described by Merlyn (sort of) at Death to Dot Star. Something like this does what you described you needed above, I believe.
use strict;
use warnings;
my $content = "";
while (<DATA>) {
    $content .= $_;
}
#print "content: $content"; # sanity check
while ($content =~ m/(
    CALCON\([^)]*?\)[\r\n]*\{[^}]*?\}          # entire match. Same as in negative lookahead on next line.
    ((?!CALCON\([^)]*?\)[\r\n]*\{[^}]*?\}).)*  # inch along with negative lookahead
    )/xsmg) {
    my $entire_match = $1;
    if ($entire_match =~ /CALCON\((.*?)\)/) {
        my $test_number = $1;
        print "entire match: $entire_match\n";
        print "test number: $test_number\n";
        print "\n\n";
    }
}
__DATA__
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
CALCON(test3)
{
TYPE(U16)
FEATURE(CALCON)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(CALCON)
MAX(65535)
UNITS(ms)
}
This may be a case of killing a mosquito with a flamethrower, but... well... TIMTOWTDI. Maybe you like it :)
But seriously, an internal rule of thumb for me is that when I start having to inch along, it may be time to stop thinking regexes and start thinking something else.
Disclaimer: this works for your input data, but it makes me a little uneasy. There may be edge cases I haven't thought of. That's why the gut still says, uh oh, reach for P::RD.
UPDATE: Replaced the $& with $1 per holli below.
UPDATE 2: Made the "inch along" more thorough, so it doesn't fail on "CALCON" in the data area, as in the third test case. Originally this was just
$content =~m/(CALCON((?!CALCON).)* )/xsmg
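To see why the simpler pattern needed tightening, here is a small sketch with trimmed, illustrative data that counts the matches of the original `CALCON((?!CALCON).)*` form. Because the lookahead rejects any occurrence of the word CALCON, a value such as FEATURE(CALCON) ends one match early and starts a spurious new one. The inner group is made non-capturing here only so that the list-context match returns one item per chunk.

```perl
use strict;
use warnings;

my $content = <<'END';
CALCON(test1)
{
TYPE(U8)
}
CALCON(test3)
{
FEATURE(CALCON)
UNITS(ms)
}
END

# Tempered dot: consume one character at a time, as long as the
# text at that position does not start with "CALCON".
my @chunks = $content =~ /(CALCON(?:(?!CALCON).)*)/sg;

print scalar(@chunks), " chunks\n";   # 3 chunks for only 2 elements
print "[$_]\n" for @chunks;
```

The second chunk stops at "FEATURE(" and the third begins at the CALCON inside the parentheses, which is the failure the more thorough lookahead avoids.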
|
|
Are you aware of the runtime drawbacks that $& (and its brethren $' and $`) impose?
|
|
Yeah, but to be honest, I had kind of forgotten about them when I posted the above. I was just all into the inch along with negative lookahead thing.
Basically, the $& construct is slow, and might not be supported into the future. (Right?) What's the "right" way to do this again?
UPDATE: Changed above code to use $1 instead.
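As a sketch of the usual alternatives to $&: wrap the whole pattern in capturing parentheses and read $1, or (on Perl 5.10 and later) add the /p modifier and read ${^MATCH}. Neither mentions $& anywhere in the program. The sample string and array names are illustrative assumptions.

```perl
use strict;
use warnings;

my $text = "CALCON(test1) CALCON(test2)";

# Option 1: capture the whole match explicitly and read $1.
my @via_capture;
while ( $text =~ /(CALCON\(\w+\))/g ) {
    push @via_capture, $1;
}

# Option 2 (Perl 5.10+): /p makes the matched text available as ${^MATCH}.
my @via_p;
while ( $text =~ /CALCON\(\w+\)/gp ) {
    push @via_p, ${^MATCH};
}

print "@via_capture\n";   # CALCON(test1) CALCON(test2)
print "@via_p\n";         # CALCON(test1) CALCON(test2)
```

As far as I recall from perlvar, Perl 5.20 and later largely remove the penalty through copy-on-write, so this mostly matters on older perls.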
|
|
Re: nested reg ex over multiple lines
by TedPride (Priest) on Jun 20, 2005 at 17:46 UTC
|
use strict;
use warnings;
use Data::Dumper;
my ($key1, $key2, $val, %hash);
while (<DATA>) {
    if (($key2, $val) = m/(\w+)\((.*)\)/) {
        if ($key2 =~ /^CAL/) {
            $key1 = $val;
        }
        else {
            $hash{$key1}{$key2} = $val;
        }
    }
}
print Dumper(\%hash);
__DATA__
CALCON(test1)
{
TYPE(U8)
FEATURE(DCOM)
NAM(stmin)
LABEL(Min seperation time between CFs)
MIN(0)
MAX(127)
UNITS(ms)
}
CALCON(test2)
{
TYPE(U16)
FEATURE(DCOM)
NAM(dcomc_sestmr_timeout)
LABEL(DCOM Session Timer Timeout)
MIN(0)
MAX(65535)
UNITS(ms)
}
|
|
That's basically the same code as mine, just yours misses the values with spaces in them (LABEL).