eg8rds has asked for the wisdom of the Perl Monks concerning the following question:

I'm pretty new to perl, and have a problem which I just can't figure out. I have a file which is autogenerated, and has some entries in a specific layout. I want to suck these into a hash, which I can then sort into alphabetical order according to the entry title. The file entries look like this:
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}
and my loop in perl looks like this:
while ($content =~ m/^(CAL.+\((\w+)\))/mg) {
    print "\n1= $1";
    print "\n2= $2";
}
so I'm trying to get the variable name (for example "test1") then also the data entry:
CALCON(test1)
{
    TYPE(U8)
    ...
    ...
}
so as to make the hash. Only I can't get it to work. Adding /s to the exp match means I only get the last name in brackets. Can anyone help?? Many thanks, Rob.

Replies are listed 'Best First'.
Re: nested reg ex over multiple lines
by holli (Abbot) on Jun 20, 2005 at 13:19 UTC
    Split your regex in two:
    use strict;
    use warnings;
    use Data::Dumper;

    my $key;
    my %data;

    while (<DATA>) {
        # A CALCON line starts a new entry; remember its name.
        $key = $1, next if /^CALCON\((\w+)\)/;
        # An indented FIELD(value) line belongs to the current entry.
        $data{$key}->{$1} = $2 if /^\s+(\w+)\(([\w\d\s]+)\)/;
    }

    print Dumper (\%data);

    __DATA__
    CALCON(test1)
    {
        TYPE(U8)
        FEATURE(DCOM)
        NAM(stmin)
        LABEL(Min seperation time between CFs)
        MIN(0)
        MAX(127)
        UNITS(ms)
    }
    CALCON(test2)
    {
        TYPE(U16)
        FEATURE(DCOM)
        NAM(dcomc_sestmr_timeout)
        LABEL(DCOM Session Timer Timeout)
        MIN(0)
        MAX(65535)
        UNITS(ms)
    }
    Output:
    $VAR1 = {
              'test1' => {
                           'MIN' => '0',
                           'UNITS' => 'ms',
                           'FEATURE' => 'DCOM',
                           'NAM' => 'stmin',
                           'LABEL' => 'Min seperation time between CFs',
                           'MAX' => '127',
                           'TYPE' => 'U8'
                         },
              'test2' => {
                           'MIN' => '0',
                           'UNITS' => 'ms',
                           'FEATURE' => 'DCOM',
                           'NAM' => 'dcomc_sestmr_timeout',
                           'LABEL' => 'DCOM Session Timer Timeout',
                           'MAX' => '65535',
                           'TYPE' => 'U16'
                         }
            };


    holli, /regexed monk/
      I liked this solution, but since it felt a little "idiomatic" to me, I thought it might also be idiomatic to someone even newer to perl. So, I rewrote it in a way that was a bit easier for me to understand. Mainly I just put conditionals in parens, bracketed off the results, and filled in the default variables where they were being assumed.
      use strict;
      use warnings;
      use Data::Dumper;

      my $content = "";
      my ($key, %data);

      while (<DATA>) {
          if ( $_ =~ /^CALCON\((\w+)\)/ ) {
              $key = $1;
          }
          else {
              if ( $_ =~ /^\s+(\w+)\(([\w\d\s]+)\)/ ) {
                  $data{$key}->{$1} = $2;
              }
          }
      }

      print Dumper(\%data);

      __DATA__
      CALCON(test1)
      {
          TYPE(U8)
          FEATURE(DCOM)
          NAM(stmin)
          LABEL(Min seperation time between CFs)
          MIN(0)
          MAX(127)
          UNITS(ms)
      }
      CALCON(test2)
      {
          TYPE(U16)
          FEATURE(DCOM)
          NAM(dcomc_sestmr_timeout)
          LABEL(DCOM Session Timer Timeout)
          MIN(0)
          MAX(65535)
          UNITS(ms)
      }
        idiomatic? The only difference with your code is the use of if/else instead of the next statement. And next is really not very idiomatic. Other languages have a similar construct.


        holli, /regexed monk/
        Maybe it's just me and my love for default variables, but all of your extraneous punctuation and explicit default variable usage makes it harder to read, at least to me. There's more code I have to read, and ignore, before I can understand exactly what is happening. But maybe it's just me.
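
The original question also asks for the entries in alphabetical order by title. Perl hashes are unordered, so one option is to sort the keys at output time. The following is a minimal sketch, assuming a hash of hashes shaped like the %data dumped above; the sample values are trimmed for brevity and the variable names @titles, $title and $field are only illustrative.

```perl
use strict;
use warnings;

# Illustrative sample in the same shape as the %data built by the loops above.
my %data = (
    test2 => { TYPE => 'U16', MIN => 0, MAX => 65535, UNITS => 'ms' },
    test1 => { TYPE => 'U8',  MIN => 0, MAX => 127,   UNITS => 'ms' },
);

# Hashes have no inherent order, so sort the titles when they are needed.
my @titles = sort keys %data;

for my $title (@titles) {
    print "CALCON($title)\n";
    # Sort the field names too, so the output is stable from run to run.
    for my $field ( sort keys %{ $data{$title} } ) {
        print "    $field($data{$title}{$field})\n";
    }
}
```

If only a readable dump is needed, setting $Data::Dumper::Sortkeys = 1 before calling Dumper gives sorted keys as well.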
Re: nested reg ex over multiple lines
by tlm (Prior) on Jun 20, 2005 at 13:28 UTC

    Does this do what you want?

    Note the two ? modifiers in the regex (after + and *) that prevent greedy matching. Without them, your matches would be longer than you want.
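
A minimal sketch of the non-greedy idea, not tlm's original code: the pattern below is an assumption that simply places a ? after the + and the * as the note describes, and the %entry hash and trimmed sample data are illustrative only.

```perl
use strict;
use warnings;

# Trimmed sample in the layout from the question.
my $content = <<'END';
CALCON(test1)
{
    TYPE(U8)
    LABEL(Min seperation time between CFs)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    LABEL(DCOM Session Timer Timeout)
    UNITS(ms)
}
END

my %entry;    # name => full text of the entry

# .+? stops at the first "(name)" and .*? stops at the first "}",
# so each match covers exactly one entry even though /s lets "." cross newlines.
while ( $content =~ m/^(CAL.+?\((\w+)\).*?\})/smg ) {
    $entry{$2} = $1;
}

print "$_ =>\n$entry{$_}\n\n" for sort keys %entry;
```

This relies on the values never containing a closing curly brace; if they can, a line-by-line or stateful parser is the safer choice.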

    the lowliest monk

Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 13:25 UTC
    PS, this might be a case where you're better off using Parse::RecDescent than a regex. Sorry my P::RD fu is weak but maybe one of the gods can whip something out...

      If not P::RD, at least a more stateful parser rather than trying to do everything with just a single regex. Something like (and this is really rough and presumes curlies won't be on the same line as a declaration):

      my $state = 'declaration';
      my( $name, %data );

      while( <> ) {
          # Looking for the start of an entry.
          if( $state eq 'declaration' ) {
              if( /CALCON\((.*?)\)/ ) {
                  $name  = $1;
                  $state = 'opencurly';
                  next;
              }
          }
          # Waiting for the opening curly of the entry body.
          if( $state eq 'opencurly' ) {
              $state = 'body' if /\{/;
          }
          # Inside the body: collect FIELD(value) lines until the closing curly.
          if( $state eq 'body' ) {
              if( /(\S+)\((.*?)\)\s*$/ ) {
                  $data{ $name }->{ $1 } = $2;
              }
              if( /\s*}\s*$/ ) {
                  $state = 'declaration';
              }
          }
      }

      --
      We're looking for people in ATL

Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 13:19 UTC
    Maybe you're reading the data in wrong somehow? The following seems to do what you want, I think. (Not 100% sure if I followed you, but hope this helps.)
    use strict;
    use warnings;

    my $content = "";
    while (<DATA>) {
        $content = $content . $_;
    }
    print "content: $content";    # sanity check

    while ($content =~ m/^(CAL.+\((\w+)\))/mg) {
        print "\n1= $1";
        print "\n2= $2";
    }

    __DATA__
    CALCON(test1)
    {
        TYPE(U8)
        FEATURE(DCOM)
        NAM(stmin)
        LABEL(Min seperation time between CFs)
        MIN(0)
        MAX(127)
        UNITS(ms)
    }
    CALCON(test2)
    {
        TYPE(U16)
        FEATURE(DCOM)
        NAM(dcomc_sestmr_timeout)
        LABEL(DCOM Session Timer Timeout)
        MIN(0)
        MAX(65535)
        UNITS(ms)
    }
    This outputs:
    1= CALCON(test1)
    2= test1
    1= CALCON(test2)
    2= test2
      I have the entire file in one variable. I can get the result you have by leaving the ~m//mg rather than ~m//smg (single line mode.) My problem is that I want $1 to equal the entire element, i.e.
      CALCON(test2)
      {
          TYPE(U16)
          FEATURE(DCOM)
          NAM(dcomc_sestmr_timeout)
          LABEL(DCOM Session Timer Timeout)
          MIN(0)
          MAX(65535)
          UNITS(ms)
      }
      but $2 to equal "test2". If I add /s to the match, perl is greedy and runs all the way to the "ms" in brackets. Does this explain things better? Cheers.
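
The behaviour described here can be reproduced with a small sketch. It uses the pattern from the question on a trimmed copy of the data; the array names @per_line and @whole are illustrative. Without /s the dot stops at each newline, while with /s the greedy .+ swallows the rest of the file and backtracks only to the last bracketed word.

```perl
use strict;
use warnings;

# Trimmed sample in the layout from the question.
my $content = <<'END';
CALCON(test1)
{
    TYPE(U8)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    UNITS(ms)
}
END

# In list context with /g, a match returns every captured group of every match.

# Without /s, "." cannot cross a newline, so each CALCON line matches by itself.
my @per_line = $content =~ m/^(CAL.+\((\w+)\))/mg;

# With /s, "." also matches newlines: the greedy ".+" runs to the end of the
# string and backtracks only as far as the last "(word)".
my @whole = $content =~ m/^(CAL.+\((\w+)\))/smg;

print "without /s: ", scalar(@per_line) / 2, " matches, names: @per_line[1,3]\n";
print "with /s:    ", scalar(@whole) / 2, " match, second capture: $whole[1]\n";
```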
        I have the entire file in one variable.
        Maybe you should have mentioned that. Anyway here's a working solution:
        use strict;
        use warnings;

        $_ = qq"CALCON(test1)
        {
            TYPE(U8)
            FEATURE(DCOM)
            NAM(stmin)
            LABEL(Min seperation time between CFs)
            MIN(0)
            MAX(127)
            UNITS(ms)
        }
        CALCON(test2)
        {
            TYPE(U16)
            FEATURE(DCOM)
            NAM(dcomc_sestmr_timeout)
            LABEL(DCOM Session Timer Timeout)
            MIN(0)
            MAX(65535)
            UNITS(ms)
        }";

        # [^}]+ cannot run past the closing curly, so each match is one entry.
        while ( /(CALCON\(\w+\)\n\{\n[^}]+})/msg ) {
            print "****\n$1\n";
        }
        Output:
        ****
        CALCON(test1)
        {
            TYPE(U8)
            FEATURE(DCOM)
            NAM(stmin)
            LABEL(Min seperation time between CFs)
            MIN(0)
            MAX(127)
            UNITS(ms)
        }
        ****
        CALCON(test2)
        {
            TYPE(U16)
            FEATURE(DCOM)
            NAM(dcomc_sestmr_timeout)
            LABEL(DCOM Session Timer Timeout)
            MIN(0)
            MAX(65535)
            UNITS(ms)
        }


        holli, /regexed monk/
        I came up with something below that I think does what you want, using the "inch along with negative lookahead" strategy.
Re: nested reg ex over multiple lines
by tphyahoo (Vicar) on Jun 20, 2005 at 14:01 UTC
    On reflection, there is a way to do this kind of parsing using regexes. I think of it as the "inch along" with negative lookahead strategy described by Merlyn (sort of) at Death to Dot Star. Something like this does what you described you needed above, I believe.
    use strict;
    use warnings;

    my $content = "";
    while (<DATA>) {
        $content = $content . $_;
    }
    #print "content: $content"; # sanity check

    while ($content =~ m/(
            CALCON\([^)]*?\)[\r\n]*\{[^}]*?}             #entire match. Same as in negative lookahead on next line.
            ((?!CALCON\([^)]*?\)[\r\n]*\{[^}]*?}).)*     #inch along with negative lookahead
        )/xsmg) {
        my $entire_match = $1;
        if ($entire_match =~ /CALCON\((.*?)\)/) {
            my $test_number = $1;
            print "entire match: $entire_match\n";
            print "test number: $test_number\n";
            print "\n\n";
        }
    }

    __DATA__
    CALCON(test1)
    {
        TYPE(U8)
        FEATURE(DCOM)
        NAM(stmin)
        LABEL(Min seperation time between CFs)
        MIN(0)
        MAX(127)
        UNITS(ms)
    }
    CALCON(test2)
    {
        TYPE(U16)
        FEATURE(DCOM)
        NAM(dcomc_sestmr_timeout)
        LABEL(DCOM Session Timer Timeout)
        MIN(0)
        MAX(65535)
        UNITS(ms)
    }
    CALCON(test3)
    {
        TYPE(U16)
        FEATURE(CALCON)
        NAM(dcomc_sestmr_timeout)
        LABEL(DCOM Session Timer Timeout)
        MIN(CALCON)
        MAX(65535)
        UNITS(ms)
    }
    This may be a case of killing a mosquito with a flamethrower, but... well... TIMTOWTDI. Maybe you like it :)

    But seriously, an internal rule of thumb for me is that when I start having to inch along, it may be time to stop thinking regexes and start thinking something else.

    Disclaimer: this works for your input data, but it makes me a little uneasy. There may be edge cases I haven't thought of. That's why the gut still says, uh oh, reach for P::RD.

    UPDATE: Replaced the $& with $1 per holli below.

    UPDATE 2: Made the "inch ahead" a bit more thorough, so it doesn't fail on "CALCON" in the data area, as in the third test case. Originally this was just

    $content =~m/(CALCON((?!CALCON).)* )/xsmg
      Are you aware of the runtime drawbacks that $& (and his brethren $' and $`) impose?
        Yeah, but to be honest, I had kind of forgotten about them when I posted the above. I was just all into the inch along with negative lookahead thing.

        Basically, the $& construct is slow, and might not be supported into the future. (Right?) What's the "right" way to do this again?

        UPDATE: Changed above code to use $1 instead.
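
On the question of the "right" way: two common alternatives to $& are sketched below, under the assumption of Perl 5.10 or later for the second one. The variable names are illustrative. As far as the perlvar documentation describes it, the slowdown from $&, $` and $' affected every regex in the program on older perls and was largely removed in Perl 5.20, so this matters mostly for code that must run on older versions.

```perl
use strict;
use warnings;

my $text = "CALCON(test1) TYPE(U8)";

# Option 1: wrap the part you need in capture parentheses and use $1, $2, ...
my ( $whole, $name ) = $text =~ /(CALCON\((\w+)\))/;

# Option 2 (Perl 5.10 or later): the /p flag makes the matched text
# available as ${^MATCH} for this one pattern only.
my $matched;
if ( $text =~ /CALCON\(\w+\)/p ) {
    $matched = ${^MATCH};
}

print "capture: $whole ($name)\n";
print "/p:      $matched\n";
```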

Re: nested reg ex over multiple lines
by TedPride (Priest) on Jun 20, 2005 at 17:46 UTC
    EDIT: Oops, I should have had that as .* instead of \w+. Fixed. And yes, I missed the fact that your solution was line by line instead of straight regex, holli. Shouldn't have posted a (less pretty!) duplicate, though I did arrive at it independently.

    You can go ahead and ding me if you like.

    ----------------

    Line by line processing:

    use strict;
    use warnings;
    use Data::Dumper;

    my ($key1, $key2, $val, %hash);

    while (<DATA>) {
        if (($key2, $val) = m/(\w+)\((.*)\)/) {
            if ($key2 =~ /^CAL/) {
                # A CALCON line: its value is the name of the entry.
                $key1 = $val;
            }
            else {
                $hash{$key1}{$key2} = $val;
            }
        }
    }

    print Dumper(\%hash);

    __DATA__
    CALCON(test1)
    {
        TYPE(U8)
        FEATURE(DCOM)
        NAM(stmin)
        LABEL(Min seperation time between CFs)
        MIN(0)
        MAX(127)
        UNITS(ms)
    }
    CALCON(test2)
    {
        TYPE(U16)
        FEATURE(DCOM)
        NAM(dcomc_sestmr_timeout)
        LABEL(DCOM Session Timer Timeout)
        MIN(0)
        MAX(65535)
        UNITS(ms)
    }
      That's basically the same code as mine, just yours misses the values with spaces in them (LABEL).


      holli, /regexed monk/
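
The difference between the value patterns discussed in this sub-thread can be checked with a short sketch. The pattern labels and the %matched hash are illustrative; the three patterns are the \w+ version mentioned in the EDIT note, a character class that allows whitespace, and the .* version.

```perl
use strict;
use warnings;

my @lines = (
    '    TYPE(U8)',
    '    LABEL(Min seperation time between CFs)',
);

my %patterns = (
    'word-only'  => qr/(\w+)\((\w+)\)/,        # value may not contain spaces
    'with-space' => qr/(\w+)\(([\w\s]+)\)/,    # \w already covers digits
    'anything'   => qr/(\w+)\((.*)\)/,         # greedy up to the last ")" on the line
);

my %matched;    # pattern label => field names that pattern picked up
for my $label ( sort keys %patterns ) {
    for my $line (@lines) {
        push @{ $matched{$label} }, $1 if $line =~ $patterns{$label};
    }
    print "$label: @{ $matched{$label} }\n";
}
```

The word-only pattern skips the LABEL line because its value contains spaces, which is the omission pointed out above.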