nested reg ex over multiple lines

eg8rds has asked for the wisdom of the Perl Monks concerning the following question:

Re: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 13:19 UTC
Split your regex in two:
`use strict;
use warnings;
use Data::Dumper;

my $key;
my %data;
while (<DATA>) {
    $key = $1, next if /^CALCON\((\w+)\)/;
    $data{$key}->{$1} = $2 if /^\s+(\w+)\(([\w\d\s]+)\)/;
}
print Dumper (\%data);

__DATA__
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
Output:
`$VAR1 = {
    'test1' => {
        'MIN' => '0',
        'UNITS' => 'ms',
        'FEATURE' => 'DCOM',
        'NAM' => 'stmin',
        'MAX' => '127',
        'TYPE' => 'U8'
    },
    'test2' => {
        'MIN' => '0',
        'UNITS' => 'ms',
        'FEATURE' => 'DCOM',
        'NAM' => 'dcomc_sestmr_timeout',
        'MAX' => '65535',
        'TYPE' => 'U16'
    }
};`
holli, /regexed monk/
Re^2: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 13:40 UTC
I liked this solution, but since it felt a little "idiomatic" to me, I thought it might also feel idiomatic to someone even newer to Perl. So I rewrote it in a way that was a bit easier for me to understand. Mainly I just put the conditionals in parens, bracketed off the results, and filled in the default variables where they were being assumed.
`use strict;
use warnings;
use Data::Dumper;

my $content = "";
my ($key, %data);
while (<DATA>) {
    if ( $_ =~ /^CALCON\((\w+)\)/ ) {
        $key = $1;
    }
    else {
        if ( $_ =~ /^\s+(\w+)\(([\w\d\s]+)\)/ ) {
            $data{$key}->{$1} = $2 ;
        }
    }
}
print Dumper(\%data);

__DATA__
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
Re^3: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 13:52 UTC
Idiomatic? The only difference with your code is the use of `if/else` instead of the `next` statement. And `next` is really not very idiomatic. Other languages have a similar construct. holli, /regexed monk/
Re^3: nested reg ex over multiple lines by BUU (Prior) on Jun 20, 2005 at 17:33 UTC
Maybe it's just me and my love for default variables, but all of your extraneous punctuation and explicit default variable usage makes it harder for me to read. There's more code I have to read, and ignore, before I can understand exactly what is happening. But maybe it's just me.
Re: nested reg ex over multiple lines by tlm (Prior) on Jun 20, 2005 at 13:28 UTC
Does this do what you want? Read more... (993 Bytes) Note the two `?` modifiers in the regex (after `+` and `*`) that prevent greedy matching. Without them, your matches would be longer than you want. the lowliest monk
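The code from this reply is not preserved here, so the following is only an illustrative sketch of the point about `?` after `+` and `*`, not tlm's regex. The sample text and the helper name `first_element` are assumptions made for the example. With `/s` the dot also matches newlines, which is what lets the greedy version run across elements.

```perl
use strict;
use warnings;

# Sample input is an assumption, modelled on the data in this thread.
my $text = "CALCON(test1) { TYPE(U8) }\nCALCON(test2) { TYPE(U16) }\n";

# Hypothetical helper: returns (whole match, name) for the first element,
# using lazy quantifiers when $lazy is true and greedy ones otherwise.
sub first_element {
    my ($input, $lazy) = @_;
    my $re = $lazy
        ? qr/(CALCON\((.+?)\).*?\})/s    # +? and *? stop at the first ')' and '}'
        : qr/(CALCON\((.+)\).*\})/s;     # + and * run on to the last ')' and '}'
    return $input =~ $re ? ($1, $2) : ();
}

my ($lazy_match,   $lazy_name)   = first_element($text, 1);
my ($greedy_match, $greedy_name) = first_element($text, 0);

print "lazy name:   $lazy_name\n";
print "greedy name: $greedy_name\n";
```

The lazy version yields the name `test1` and stops at the first closing brace. The greedy version swallows the first element's body and most of the second element into the "name" capture.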
Re: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 13:25 UTC
PS, this might be a case where you're better off using Parse::RecDescent than a regex. Sorry, my P::RD fu is weak, but maybe one of the gods can whip something out...
Re^2: nested reg ex over multiple lines by Fletch (Bishop) on Jun 20, 2005 at 13:34 UTC
If not P::RD, at least a more stateful parser rather than trying to do everything with just a single regex. Something like (and this is really rough and presumes curlies won't be on the same line as a declaration):
`my $state = 'declaration';
my( $name, %data );
while( <> ) {
    if( $state eq 'declaration' ) {
        if( /CALCON\((.*?)\)/ ) {
            $name = $1;
            $state = 'opencurly';
            next;
        }
    }
    if( $state eq 'opencurly' ) {
        $state = 'body' if /\{/;
    }
    if( $state eq 'body' ) {
        if( /(\S+)\((.*?)\)\s*$/ ) {
            $data{ $name }->{ $1 } = $2;
        }
        if( /\s*\}\s*$/ ) {
            $state = 'declaration';
        }
    }
}`
-- We're looking for people in ATL
Re: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 13:19 UTC
Maybe you're reading the data in wrong somehow? The following seems to do what you want, I think. (Not 100% sure if I followed you, but hope this helps.)
`use strict;
use warnings;

my $content = "";
while (<DATA>) {
    $content = $content . $_;
}
print "content: $content"; # sanity check

while ($content =~m/^(CAL.+\((\w+)\))/mg){
    print "\n1= $1";
    print "\n2= $2";
}
__DATA__
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
This outputs:
`1= CALCON(test1)
2= test1
1= CALCON(test2)
2= test2`
Re^2: nested reg ex over multiple lines by eg8rds (Acolyte) on Jun 20, 2005 at 13:30 UTC
I have the entire file in one variable. I can get the result you have by using `=~ m//mg` rather than `=~ m//smg` (single-line mode). My problem is that I want $1 to equal the entire element, i.e.
`CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
but $2 to equal "test2". If I add /s to the match, Perl is greedy and runs all the way to the last "ms" in brackets. Does this explain things better? Cheers.
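For what is being asked here ($1 as the whole element, $2 as the name, with the file held in one variable), a minimal sketch might look like the following. It is not code from the thread. The helper name `calcon_elements` and the shortened sample data are assumptions, and the pattern assumes an element body never contains a nested `}`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [ whole_element, name ] pairs.
sub calcon_elements {
    my ($content) = @_;
    my @found;
    # [^}]* matches newlines but cannot run past the first closing brace,
    # so there is no need for /s and no greedy overshoot into the next element.
    while ($content =~ /^(CALCON\((\w+)\)\s*\{[^}]*\})/mg) {
        push @found, [ $1, $2 ];
    }
    return @found;
}

# Shortened sample data (an assumption), in the layout used in this thread.
my $content = <<'END';
CALCON(test1)
{
    TYPE(U8)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    UNITS(ms)
}
END

for my $pair (calcon_elements($content)) {
    my ($element, $name) = @$pair;
    print "name: $name\n$element\n\n";
}
```

Using a negated character class instead of `.` is the design choice that matters here: the match is bounded by the data itself rather than by a lazy quantifier.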
Re^3: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 13:49 UTC
"I have the entire file in one variable." Maybe you should have mentioned that. Anyway, here's a working solution:
`use strict;
use warnings;

$_ = qq"CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}";

while ( /(CALCON\(\w+\)\n\{\n[^}]+\})/msg ) {
    print "**\n$1\n";
}`
Output:
`**
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
**
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
holli, /regexed monk/
Re^3: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 14:18 UTC
I came up with something below that I think does what you want, using the "inch along with negative lookahead" strategy.
Re: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 14:01 UTC
On reflection, there is a way to do this kind of parsing, sort of, using regexes. I think of it as the "inch along" with negative lookahead strategy described by Merlyn (sort of) at Death to Dot Star. Something like this does what you described you needed above, I believe.
`use strict;
use warnings;

my $content = "";
while (<DATA>) {
    $content = $content . $_;
}
#print "content: $content"; # sanity check

while ($content =~m/(
        CALCON\([^)]*?\)[\r\n]*\{[^}]*?\}           #entire match. Same as in negative lookahead on next line.
        ((?!CALCON\([^)]*?\)[\r\n]*\{[^}]*?\}).)*   #inch along with negative lookahead
    )/xsmg){
    my $entire_match = $1;
    if ($entire_match =~ /CALCON\((.*?)\)/) {
        my $test_number = $1;
        print "entire match: $entire_match\n";
        print "test number: $test_number\n";
        print "\n\n";
    }
}
__DATA__
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}
CALCON(test3)
{
    TYPE(U16)
    FEATURE(CALCON)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(CALCON)
    MAX(65535)
    UNITS(ms)
}`
This may be a case of killing a mosquito with a flamethrower, but... well... TIMTOWTDI. Maybe you like it :) But seriously, an internal rule of thumb for me is that when I start having to inch along, it may be time to stop thinking regexes and start thinking something else.

Disclaimer: this works for your input data, but it makes me a little uneasy. Are there maybe edge cases I haven't thought of? That's why the gut still says, uh oh, reach for P::RD.

UPDATE: Replaced the $& with $1 per holli below.

UPDATE 2: Made the "inch ahead" more thorough, so it doesn't fail on "CALCON" in the data area, as in the third test case. Originally this was just `$content =~m/(CALCON((?!CALCON).)* )/xsmg`
Re^2: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 14:07 UTC
Are you aware of the runtime drawbacks that $& (and its brethren $' and $`) impose?
Re^3: nested reg ex over multiple lines by tphyahoo (Vicar) on Jun 20, 2005 at 14:10 UTC
Yeah, but to be honest, I had kind of forgotten about them when I posted the above. I was just all into the inch along with negative lookahead thing. Basically, the $& construct is slow, and might not be supported in the future. (Right?) What's the "right" way to do this again? UPDATE: Changed above code to use $1 instead.
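As a hedged sketch of the usual alternatives to `$&` (none of this is from the thread): wrap the pattern in a capture group, use the `/p` modifier with `${^MATCH}`, or slice the string with the `@-` and `@+` offset arrays. Note that `/p` and `${^MATCH}` need Perl 5.10 or later, which postdates this discussion, and that as far as I know the perlvar documentation says the penalty was largely removed in Perl 5.20. The variable names below are made up for the example.

```perl
use strict;
use warnings;
use 5.010;    # /p and ${^MATCH} need Perl 5.10 or later

my $line = "CALCON(test1)";

# Option 1: wrap the whole pattern in a capture group and read $1.
my $via_capture;
if ($line =~ /(CALCON\(\w+\))/) {
    $via_capture = $1;
}

# Option 2: the /p modifier makes ${^MATCH} available for this match only.
my $via_p;
if ($line =~ /CALCON\(\w+\)/p) {
    $via_p = ${^MATCH};
}

# Option 3: @- and @+ hold the start and end offsets of the last match.
my $via_offsets;
if ($line =~ /CALCON\(\w+\)/) {
    $via_offsets = substr($line, $-[0], $+[0] - $-[0]);
}

print "$via_capture\n$via_p\n$via_offsets\n";
```

All three give the same matched text; the capture group is the one that works on any Perl version.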
Re^4: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 14:15 UTC
Re: nested reg ex over multiple lines by TedPride (Priest) on Jun 20, 2005 at 17:46 UTC
EDIT: Oops, I should have had that as .* instead of \w+. Fixed. And yes, I missed the fact that your solution was line by line instead of straight regex, holli. Shouldn't have posted a (less pretty!) duplicate, though I did arrive at it independently. You can go ahead and ding me if you like.
----------------
Line by line processing:
`use strict;
use warnings;
use Data::Dumper;

my ($key1, $key2, $val, %hash);
while (<DATA>) {
    if (($key2, $val) = m/(\w+)\((.*)\)/) {
        if ($key2 =~ /^CAL/) {
            $key1 = $val;
        }
        else {
            $hash{$key1}{$key2} = $val;
        }
    }
}
print Dumper(\%hash);

__DATA__
CALCON(test1)
{
    TYPE(U8)
    FEATURE(DCOM)
    NAM(stmin)
    LABEL(Min seperation time between CFs)
    MIN(0)
    MAX(127)
    UNITS(ms)
}
CALCON(test2)
{
    TYPE(U16)
    FEATURE(DCOM)
    NAM(dcomc_sestmr_timeout)
    LABEL(DCOM Session Timer Timeout)
    MIN(0)
    MAX(65535)
    UNITS(ms)
}`
Re^2: nested reg ex over multiple lines by holli (Abbot) on Jun 20, 2005 at 19:10 UTC
That's basically the same code as mine, just yours misses the values with spaces in them (LABEL). holli, /regexed monk/